PetFMM--A dynamically load-balancing parallel fast multipole library
Felipe A. Cruz (a), Matthew G. Knepley (b), L. A. Barba (c,∗)

(a) Department of Mathematics, University of Bristol, England BS8 1TW
(b) Computation Institute, Argonne National Laboratory and University of Chicago
(c) Department of Mechanical Engineering, Boston University, Boston MA 02215

∗ Corresponding author.
Abstract
Fast algorithms for the computation of N-body problems can be broadly classified into mesh-based interpolation methods, and hierarchical or multiresolution methods. To this last class belongs the well-known fast multipole method (FMM), which offers O(N) complexity. The FMM is a complex algorithm, and the programming difficulty associated with it has arguably diminished its impact, being a barrier for adoption. This paper presents an extensible parallel library for N-body interactions utilizing the FMM algorithm, built on the framework of PETSc. A prominent feature of this library is that it is designed to be extensible, with a view to unifying efforts involving many algorithms based on the same principles as the FMM and enabling easy development of scientific application codes. The paper also details an exhaustive model for the computation of tree-based N-body algorithms in parallel, including both work estimates and communication estimates. With this model, we are able to implement a method to provide automatic, a priori load balancing of the parallel execution, achieving optimal distribution of the computational work among processors and minimal inter-processor communications. Using a client application that performs the calculation of the velocity induced by N vortex particles, ample verification and testing of the library was performed. Strong scaling results are presented with close to a million particles on up to 64 processors, including both speedup and parallel efficiency. The largest problem size that has been run with the PetFMM library at this point was 64 million particles on 64 processors. The library is currently able to achieve over 90% parallel efficiency for 32 processors, and over 85% parallel efficiency for 64 processors. The software library is open source under the PETSc license, even less restrictive than the BSD license; this guarantees the maximum impact to the scientific community and encourages peer-based collaboration for extensions and applications.

Key words: fast multipole method, order-N algorithms, particle methods, vortex method, hierarchical algorithms, parallel computing, dynamic load balancing

Preprint submitted to Elsevier, 15 May 2009

1 Introduction
So-called N-body problems arise in many areas (e.g., astrophysics, molecular dynamics, vortex methods, electrostatics). In these problems, the system is described by a set of N particles, and the dynamics of the system is the result of the interactions that occur for every pair of particles. To calculate all such interactions, the total number of operations required normally scales as N². One useful way to mathematically express an N-body problem is by means of a matrix-vector multiplication, where the matrix is dense and represents the particle interactions, and the vector corresponds to the weights of the particles. Thus, the mat-vec product corresponds to the evaluation of all pairwise interactions. In a naive implementation, we would directly create the matrix in computer memory and then perform the multiplication with the vector. This naive approach would prove feasible only for small N, as the computational requirements in processing power and memory both grow as N². For this reason, many efforts have been directed at producing efficient implementations, capable of performing the mat-vec operation at reduced memory requirements and operation counts.

The various methods for more efficient calculation of the particle interaction problem can be broadly classified in two types: mesh-based interpolation methods, and hierarchical or multiresolution methods. The basic mesh-based method is the particle-mesh (PM) approach, in which particle information is interpolated onto a lattice, the field of interest is solved on the mesh by means of some grid-based method such as finite differences, and the field information is finally interpolated back to the particle locations [13]. (This method is also called particle-in-cell, PIC.) In some applications, such as molecular dynamics, the smoothing at short range introduced by the interpolations in the PM method is unacceptable. An alternative mesh-based method can then be used in which all near-field potentials are calculated directly, while far-field effects are calculated with the PM method; this is called the particle-particle/particle-mesh method (P³M).

The second class of methods provides efficient computation of the interactions by means of a hierarchical or multilevel approach. The main subset of this class performs a hierarchical subdivision of the computational domain, which is used to group particles at different length scales, and then approximates the interactions of clusters of particles using series expansions. The approximation is applied to far-field interactions, while near-field interactions are summed directly. This type of method can be considered meshfree, and includes: treecodes [5,2], the fast multipole method (FMM) [11] and its variations [7,1,23,9]. An alternative multilevel method has been presented by Skeel [21], which is akin to P³M, but unlike P³M it provides acceleration via multiple grid levels, achieving O(N) complexity. Thus, this method applies the multilevel approach via field decomposition, rather than spatial decomposition; it can perhaps be viewed as a hybrid of the mesh-based and hierarchical methods.

In this paper, we present the theory and development of a parallel fast multipole library, PetFMM, belonging to the meshfree type of methods described above. (The name PetFMM stands for "portable, extensible toolkit for FMM", following PETSc, the "portable, extensible toolkit for scientific computing".)
The overarching goal is to unify the efforts in the development of FMM-like methods into an open-source library that provides a framework with the capacity to accommodate memory efficiency, parallelism, and the data structures used for spatial decomposition. The software is implemented utilizing PETSc, the parallel library for scientific computing developed over more than 17 years at Argonne National Laboratory [3]. At this point, we have a complete implementation of the FMM in parallel, with dynamic load balancing provided by means of an optimization approach: minimizing inter-node communications and per-node computational work. But a prominent feature of this implementation is that it is designed to be extensible, so that it can effectively unify efforts involving many algorithms which are based on the same principles as the FMM. The perspectives for extensibility are described in this paper as well.

The development of this extensible parallel library for N-body interactions is important due to the programming difficulty associated with the FMM algorithm, which has been a barrier for adoption and arguably diminished its potential impact. A critical stage in the maturation of a computational field begins with the widespread availability of community software for the central algorithms. A good example is molecular dynamics, which blossomed after the introduction of freely available packages, such as CHARMM [6] and NAMD [18]. Such game-changing community software does not exist for particle methods or for N-body computations in general.

The present paper does not merely describe a new parallel strategy and implementation; it also details an exhaustive model for the computation of tree-based N-body algorithms in parallel. Our model is a significant extension of the time model developed in [10], which assumed a uniform distribution of particles.

An early parallel implementation of the FMM was that of Greengard and Gropp [10], on a shared-memory computer. They also presented a timing model for a perfectly balanced implementation, of which we say more in §5.
More recent versions of the parallel FMM have been presented in [24,12,17,16]. Many of these codes produce a partition of the data among processors based upon a space-filling curve, as previously introduced for the case of parallel treecodes in [22]. Only one of these parallel FMM codes represents a supported, open source code available for community use. The kifmm3d code can be downloaded and modified under the GPL license. It is not an implementation of the classic FMM algorithm, however, but rather a kernel-independent version developed by the authors. This algorithm does not utilize series expansions, but instead uses values on a bounding surface of each subdomain obtained through an iterative solution method; these ideas are based on [1]. The code does not appear to allow easy extension to the traditional FMM, or to the other related algorithms alluded to above. Moreover, it does not appear to be designed as an embedded library component, part of a larger multiphysics simulation.

This paper is organized as follows. We present first a brief overview of the FMM algorithm; this presentation is necessarily cursory, as a large body of literature has been written about this method. A basic description, however, is necessary to agree on a common terminology for the rest of the paper. Our approach is to illustrate the method using graphical representations.
The following section (§3) describes our client application code, the vortex particle method for simulation of incompressible flow at high Reynolds numbers. In this method, the FMM is one of two approaches commonly used to obtain the velocity of particles from the vorticity field information; the second approach (as in other N-body problems) is to interpolate information back and forth from a mesh, while solving for the field of interest on the mesh. Although equally efficient (both can be O(N)), the extensive use of interpolations may introduce numerical diffusion, which is undesirable in certain applications. Next, §4 presents our parallelization strategy, and §5 develops the model used to estimate computational work and communications. The subsequent section (§6) presents details of our software design, implementation, and the verification carried out. Results of computational experiments with the parallel software are presented in §7, and we end with some conclusions and remarks about future work.
Note that a very simple and flexible, yet efficient and scalable code has been produced. The entire PetFMM implementation of the FMM is only a few thousand lines of C++, including comments and blank lines. It is already freely downloadable from http://petsc.cs.iit.edu/petsc/ and we welcome correspondence with potential users or those who wish to extend it for specific purposes.

2 Overview of the fast multipole method

The fast multipole method (FMM) is an algorithm which accelerates computations of the form

    f(y_j) = \sum_{i=1}^{N} c_i K(y_j, x_i),    (1)

representing a field value evaluated at point y_j, where the field is generated by the influence of sources located at the set of centers {x_i}. The sources are often associated with particle-type objects, such as stellar masses or charged particles. The evaluation of the field at the centers themselves, therefore, represents the well-known N-body problem. In summary: {y_j} is a set of evaluation points, {x_i} is a set of source points with weights given by c_i, and K(y, x) is the kernel that governs the interactions between evaluation and source particles. The objective is to obtain the field f at all the evaluation points, which requires in principle O(N²) operations if both sets of points have N elements. Fast algorithms aim at obtaining f approximately with a reduced operation count, ideally O(N).

The FMM works by approximating the influence of a cluster of particles by a single collective representation, under the assumptions that the influence of particles becomes weaker as the evaluation point is further away, i.e., the kernel K(y, x) decays as |x − y| increases, and that the approximations are used to evaluate far-distance interactions. To accomplish this, the FMM hierarchically decomposes the computational domain and then represents the influence of sets of particles by a single approximate value. The hierarchical decomposition breaks up the domain at increasing levels of refinement, and for each level it identifies a near and a far sub-domain. By using the hierarchical decomposition, the far field can be reconstructed as shown in Figure 1.

Using the computational domain decomposition, the sum in Equation (1) is decomposed as

    f(y_j) = \sum_{l=1}^{N_{near}} c_l K(y_j, x_l) + \sum_{k=1}^{N_{far}} c_k K(y_j, x_k),    (2)

where the right-most sum of (2), representing the far field, is evaluated approximately and efficiently.

We now need to introduce the following terminology with respect to the mathematical tools used to agglomerate the influence of clusters of particles:

Multipole Expansion (ME): a series expansion truncated after p terms which represents the influence of a cluster of particles, and is valid at distances large with respect to the cluster radius.

Local Expansion (LE): a truncated series expansion, valid only inside a sub-domain, which is used to efficiently evaluate a group of MEs.

In other words, the MEs and LEs are series (e.g., Taylor series) that converge in different sub-domains of space. The center of the series for an ME is the center of the cluster of source particles, and it only converges outside the cluster of particles. In the case of an LE, the series is centered near an evaluation point and converges locally.

As an example, consider a particle interaction problem with decaying kernels, where a cluster of particles far away from an evaluation point is 'seen' at the evaluation point as a 'pseudo-particle', and thus its influence can be represented by a single expression.
For example, the gravitational potential of a galaxy far away can be expressed by a single quantity locally. Thus, by using the ME representing a cluster, the influence of that cluster can be rapidly evaluated at a point located far away, as only the single influence of the ME needs to be evaluated, instead of the multiple influences of all the particles in the cluster. Moreover, for clusters of particles that are farther from the evaluation point, the pseudo-particle representing that cluster can be larger. This idea, illustrated in Figure 1(b), permits increased efficiency in the computation.

The introduction of an aggregated representation of a cluster of particles, via the multipole expansion, effectively permits a decoupling of the influence of the source particles from the evaluation points. This is a key idea, resulting in the factorization of the computations of MEs that are centered at the same point, so that the kernel can be written as

    K(x_i, y_j) = \sum_{m=0}^{p} a_m(x_i) f_m(y_j).    (3)

This factorization allows pre-computation of terms that can be re-used many times, thus increasing the efficiency of the overall computation. Similarly, the local expansion is used to decouple the influence of an ME from the evaluation points. A group of MEs can be factorized into a single LE so that one single evaluation can be used at multiple points locally. By representing MEs as LEs, one can efficiently evaluate a group of clusters at a group of evaluation points.

2.1 Hierarchical space decomposition

In order to utilize the tools of MEs and LEs, a spatial decomposition scheme needs to be provided. In other words, for a complete set of particles, we need to find the clusters that will be used in conjunction with the MEs to approximate the far field, and the sub-domains where the LEs are going to be used to efficiently evaluate groups of MEs. This spatial decomposition is accomplished by a hierarchical subdivision of space associated to a tree structure (a quadtree in two dimensions, or an octree in three dimensions) to represent each subdivision. The nodes of the tree structure are used to define the spatial decomposition, and different scales are obtained by looking at different levels of the tree. A tree representation of the space decomposition allows us to express the decomposition independently of the number of dimensions of the space. Consider Figure 1(a), where a quadtree decomposition of the space is illustrated. The nodes of the tree at any given level cover the whole domain. The relations between nodes of the tree represent spatial refinement: the domain covered by a parent box is further decomposed into smaller sub-domains by its child nodes. Thus, in the FMM the tree structure is used to hierarchically decompose the space, and the hierarchical space decomposition is used to represent the near-field and far-field domains. As an example, consider Figure 1(b), where the near field for the black colored box is represented by the light colored boxes, and the far field is composed by the dark colored boxes.
After the spatial decomposition stage, the FMM can be roughly summarized in three stages: the upward sweep, the downward sweep, and the evaluation step. In the upward sweep, the objective is to build the MEs for each node of the tree. The MEs are built first at the deepest level, level L, and then translated to the center of the parent nodes. Thus, the MEs at the higher levels do not have to be computed from the particles; they are computed from the MEs of the child nodes. In the downward sweep of the tree, the MEs are translated into LEs for all the boxes in the interaction list. At each level, the interaction list corresponds to the cells of the same level that are in the far field for a given cell. Once the MEs have been translated into LEs, the LEs of upper levels are translated and added up to obtain the complete far-domain influence for each box at the leaf level of the tree. At the end of the downward sweep, each box will have an LE that represents the complete far field for the box. Finally, at the evaluation step, for every particle at every node at the deepest level of the tree, the final field is evaluated by adding the near-field and far-field contributions. The near field of the particles at a given box is obtained by directly computing the interaction between all the particles in the near domain of the box. The far field of the particles is obtained by evaluating the LE of the box at each particle location.

Fig. 1. Quadtree decomposition of a two-dimensional domain. (a) Domain decomposition: the hierarchical tree related to the full spatial decomposition of the domain. (b) Near and far field: a colored two-dimensional spatial decomposition for the black box and its equivalence on the tree. The near field is composed by the bright boxes and the black box itself, while the far field is composed by the dark colored boxes. Notice that the far field is composed of boxes of different levels of the tree structure. The relations between the nodes of the tree simplify the process of composing the near and far domains.

These ideas can be visualized with an illustration, as shown in Figure 2, which we call the "bird's eye view" of the complete algorithm. The importance of this bird's eye view is that it relates the algorithm computations to the data structure used by the FMM. This will prove to be very useful when we discuss the parallel version that we have developed.

This overview is only intended to introduce the main components of the FMM algorithm, so that we can discuss in the forthcoming sections our strategy for parallelization. Therefore, it is not a detailed presentation (we have left out all the mathematics), and readers interested in understanding all the details should consult the original reference [11], and selections of the abundant literature published since.
Fig. 2. Bird's eye view of the FMM algorithm. The sketch illustrates the upward sweep and the downward sweep stages on the tree. Each stage has been further decomposed into the following substages: P2M, transformation of particles into MEs (particle-to-multipole); M2M, translation of MEs (multipole-to-multipole); M2L, transformation of an ME into an LE (multipole-to-local); L2L, translation of an LE (local-to-local); L2P, evaluation of an LE at particle locations (local-to-particle).
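To make the stages of Figure 2 concrete, the following is a minimal C++ sketch of the serial driver loop over a complete quadtree stored level by level. All type and function names here (Box, interactionList, the p2m/m2m/m2l/l2l/l2p stubs) are hypothetical illustrations, not the PetFMM interface, and the expansion operators are left as empty stubs.

    #include <cstddef>
    #include <vector>

    // Hypothetical box type (illustration only): particles plus truncated
    // multipole and local expansions.
    struct Box {
      std::vector<double> x, y, w;      // particle coordinates and weights
      std::vector<double> multipole;    // p multipole coefficients
      std::vector<double> local;        // p local-expansion coefficients
      std::vector<double> phi;          // field evaluated at each particle
    };

    // One level of a complete quadtree: level l holds 4^l boxes, and the
    // children of box b at level l are boxes 4b .. 4b+3 at level l+1.
    using Level = std::vector<Box>;

    // Expansion operators, left as empty stubs; a real code supplies the series.
    void p2m(Box&) {}                    // particles -> multipole expansion
    void m2m(const Box&, Box&) {}        // shift child ME to the parent center
    void m2l(const Box&, Box&) {}        // ME of a source box -> LE of a target box
    void l2l(const Box&, Box&) {}        // shift parent LE to a child center
    void l2p(Box&) {}                    // evaluate LE at particle locations
    void direct(const Box&, Box&) {}     // near-field direct summation

    // Geometric queries, stubbed here; real code derives them from box indices.
    std::vector<std::size_t> interactionList(int, std::size_t) { return {}; }
    std::vector<std::size_t> nearNeighbors(int, std::size_t) { return {}; }  // includes the box itself

    void fmm(std::vector<Level>& tree) {
      if (tree.empty()) return;
      const int L = static_cast<int>(tree.size()) - 1;   // leaf level

      // Upward sweep: P2M at the leaves, then M2M towards the root.
      for (Box& b : tree[L]) p2m(b);
      for (int l = L - 1; l >= 0; --l)
        for (std::size_t b = 0; b < tree[l].size(); ++b)
          for (std::size_t c = 4 * b; c < 4 * b + 4; ++c)
            m2m(tree[l + 1][c], tree[l][b]);

      // Downward sweep: M2L over each box's interaction list, then L2L to children.
      for (int l = 2; l <= L; ++l)
        for (std::size_t b = 0; b < tree[l].size(); ++b) {
          for (std::size_t s : interactionList(l, b)) m2l(tree[l][s], tree[l][b]);
          if (l < L)
            for (std::size_t c = 4 * b; c < 4 * b + 4; ++c)
              l2l(tree[l][b], tree[l + 1][c]);
        }

      // Evaluation: far field from each leaf's LE, near field by direct summation.
      for (std::size_t b = 0; b < tree[L].size(); ++b) {
        l2p(tree[L][b]);
        for (std::size_t n : nearNeighbors(L, b)) direct(tree[L][n], tree[L][b]);
      }
    }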
3 Client application: the vortex particle method

The fast multipole method has many applications, such as calculation of the gravitational field of many stellar bodies, or the electrostatic forces obtained from many charged particles. In fluid dynamics, one application is found in the calculation of the velocity induced by many vortex particles, which in turn is used as a method of simulation for either the Euler or Navier-Stokes equations. The vortex particle method starts by discretizing the continuous vorticity field, defined as the curl of the velocity (ω = ∇ × u), over a set of moving nodes located at x_i, as follows:

    \omega(x, t) \approx \omega_\sigma(x, t) = \sum_{i=1}^{N} \gamma_i \, \zeta_\sigma(x, x_i),    (4)

where a common choice for the basis function is a normalized Gaussian such as:

    \zeta_\sigma(x, y) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{|x - y|^2}{2\sigma^2} \right).    (5)

The discretized vorticity field in (4) is used in conjunction with the vorticity transport equation. For ideal flows in two dimensions, the vorticity equation simply expresses that the vorticity is a preserved quantity over particle trajectories:

    \frac{\partial \omega}{\partial t} + u \cdot \nabla \omega = \frac{D\omega}{Dt} = 0.    (6)

Therefore, the moving nodes can translate with their local velocity and carry their vorticity value to automatically satisfy the transport equation. The only missing ingredient is obtaining the velocity from the discretized vorticity, which is accomplished using the Biot-Savart law of vorticity dynamics:

    u(x, t) = \int (\nabla \times G)(x - x')\, \omega(x', t)\, dx' = \int K(x - x')\, \omega(x', t)\, dx' = (K \ast \omega)(x, t),

where K = ∇ × G is the Biot-Savart kernel, with G the Green's function for the Poisson equation, and ∗ representing convolution. For example, in 2D the Biot-Savart law is written explicitly as

    u(x, t) = -\frac{1}{2\pi} \int \frac{(x - x') \times \omega(x', t)\,\hat{k}}{|x - x'|^2}\, dx'.    (7)

When the vorticity is expressed as a radial basis function expansion, one can always find an analytic integral for the Biot-Savart velocity, resulting in an expression for the velocity at each node which is a sum over all particles. Using the Gaussian basis function (5), we have:

    K_\sigma(x) = \frac{1}{2\pi |x|^2}\, (-x_2, x_1) \left( 1 - \exp\left( -\frac{|x|^2}{2\sigma^2} \right) \right),    (8)

where |x|^2 = x_1^2 + x_2^2. Thus, the formula for the discrete Biot-Savart law in two dimensions gives the velocity as follows:

    u_\sigma(x, t) = \sum_{j=1}^{N} \gamma_j K_\sigma(x - x_j).    (9)

Therefore, the calculation of the velocity of N vortex particles is an N-body problem, where the kernel K(x) decays with distance, which makes it a candidate for acceleration using the FMM. Also note that as |x| becomes large, the kernel K(x) approaches 1/|x|. We take advantage of this fact to use the multipole expansions of the 1/|x| kernel as an approximation, while the near-field direct interactions are obtained with the exact kernel K. In a separate work, we have investigated the errors of the FMM for the vortex method, and have demonstrated that using the expansions for 1/|x| does not impact accuracy, as long as the local interaction boxes are not too small [8].
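As a point of reference for this test problem, the non-accelerated O(N²) evaluation of (9) with the regularized kernel (8) can be sketched in a few lines of C++. This is an illustrative stand-in for the direct summation that the FMM replaces; the Particle type and function name are assumptions, not PetFMM code.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Particle { double x, y, gamma; };  // position and circulation strength

    // Direct O(N^2) evaluation of Eq. (9) with the Gaussian-regularized
    // Biot-Savart kernel of Eq. (8); sigma is the (uniform) core size.
    void biotSavartDirect(const std::vector<Particle>& p, double sigma,
                          std::vector<double>& ux, std::vector<double>& uy) {
      const double kPi = 3.141592653589793;
      const std::size_t n = p.size();
      ux.assign(n, 0.0);
      uy.assign(n, 0.0);
      for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
          if (j == i) continue;
          const double dx = p[i].x - p[j].x;
          const double dy = p[i].y - p[j].y;
          const double r2 = dx * dx + dy * dy;
          // K_sigma(x) = (1 / (2 pi |x|^2)) (-x2, x1) (1 - exp(-|x|^2 / (2 sigma^2)))
          const double factor =
              (1.0 - std::exp(-r2 / (2.0 * sigma * sigma))) / (2.0 * kPi * r2);
          ux[i] += p[j].gamma * (-dy) * factor;
          uy[i] += p[j].gamma * ( dx) * factor;
        }
      }
    }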
4 Parallelization strategy

The goal of our parallel strategy is to achieve an optimal distribution of the computational work among processors and a minimal communication requirement. We achieve this goal by means of a two-step process: in the first step, we decompose the complete FMM into basic algorithmic elements; in the second step, we distribute those algorithmic elements across the available processing elements (processors) in an optimal way. By optimal, we mean that when distributing the basic algorithmic elements we optimize their assignment to the processing elements by:

(1) Load balance: the work performed by each processing element is adequate to the processor's capabilities.
(2) Minimization of communications: the total amount of communication between different processing elements is minimized.

An important ingredient is that both steps of the above-mentioned process are carried out automatically by an optimization tool, without intervention of the user, with the objective of using the available resources with maximum efficiency. The load balancing provided by our method is applied dynamically, a priori, before the actual computation.

We are aware of one previous attempt to provide automatic load balancing in an FMM implementation. The DPMTA code [19] (no longer maintained or supported) is a parallel implementation of the FMM algorithm aimed at molecular dynamics applications. In this code, an initial uniform partitioning of the data at the deepest level of the tree is carried out. Then, elapsed wall clock times are obtained for each of the processors for the equal partitioning, and a data migration scheme is applied to counter load imbalance. Experiments were reported in [20] for N = 200,000 on 12 processors, showing for the original equal partition elapsed wall clock times for each processor between 60 and 140 seconds. After load balancing, elapsed times were between 80 and 100 seconds, indicating a significant improvement. This approach, however, requires multiple calculations with the same set of particles; it does not seem to be possible to use this approach in an ever-evolving particle configuration, where each configuration is calculated only once. There was also no theory or computational model of work and communications provided in relation to this code. The experiments cited, however, do provide clear evidence that a straightforward uniform data partition (accomplished using a space-filling curve indexing scheme) can result in considerable load imbalance.

In our implementation, we utilize the tree structure associated with the hierarchical decomposition of the domain in order to decompose the FMM into basic algorithmic elements. The tree structure has many roles: it is used as a
space partitioner for the particles, it organizes the storage for the multipole expansions and local expansions, and it indicates the relations between nodes in the same level of the tree (whether two nodes are in the local domain list or the interaction list). But, most importantly, it can be used to represent the complete algorithm, as mentioned in §2 with the bird's eye view of the algorithm; one can see that in Figure 2 all parts of the algorithm are represented. This view is the basis for our parallelization strategy, as described below.

The sub-division of the whole algorithm occurs by "cutting" the d-dimensional tree at a certain level k, as shown in Figure 3. This procedure produces a root tree, that contains the first k levels of the original tree, and 2^{dk} local trees, each corresponding to one of the lower branches of the original tree. The objective of this partitioning strategy is to obtain more subtrees than the number of available processes, so that the subtrees can be optimally distributed across the processes.

Fig. 3. Illustration of the data usage patterns and partitioning of a binary tree. In this figure, the tree has been cut at level k = 3. All data usage patterns that take place between nodes of the tree (M2M and L2L translations, M2L transformations, local-domain exchange) are illustrated by arrows. The subtrees generated after the partition of the tree are represented by the boxes. Communications between subtrees are generated by data usage patterns that take place between two different subtrees.

In the bird's eye view of the whole algorithm, Figure 2, data and computations are related to nodes of the tree structure. When partitioning the tree representation into subtrees, computations that require data from different nodes of the partitioned tree might access data from several different partitions. If this is the case, communication between partitions will happen, as illustrated in Figure 3. By relating computations to nodes of the tree, the work carried out by each partition and the communication between different partitions can be estimated, which is then used to optimally distribute the partitions over available processors.

In order to assign subtrees to processors, we build a graph representation from the partitioned tree, as illustrated in Figure 4. The graph is assembled such that the vertices of the graph correspond to the subtrees and the edges correspond to communication between subtrees. Using the graph representation, we can assign weights to the vertices which are proportional to the amount of computational work performed by each subtree, and assign weights to the edges which are proportional to the amount of communication that happens between two subtrees.

Fig. 4. Sketch of an undirected graph representation of the original hierarchical tree. The w_i weights represent the work performed by subtree i and the c_ij weights represent the communication between subtrees i and j. The graph is then partitioned into a number of parts equal to the available processes; the red lines in the figure show a possible partitioning of the graph. The partitions are optimized so that the weights w_i assigned to a partition are balanced with respect to other partitions, and the cost of the cut edges is minimal.

The load balancing in the parallel algorithm is done by partitioning the weighted graph into as many parts as the number of available processors, and then assigning the subtrees to the processors according to the graph partitions.
The problem of obtaining partitions such that they are well balanced and minimize the communication is solved by a graph partitioning tool such as ParMETIS [14]. ParMETIS is an open source graph partitioning tool, and is used by PetFMM to create near-optimal partitions. Figure 5 demonstrates the load balancing scheme at work. In an example computation, particles have been placed inside a square-shaped domain that is then hierarchically decomposed into a tree representation. The tree is then cut at level k = 4, resulting in 256 parallel subtrees that are then distributed among 16 processors. In the next section we develop the model of the parallel algorithm, obtaining the necessary weights for computational work and communication.

Fig. 5. Illustration of the outcome of the automatic load balancing for a uniform spatial distribution of particles in a square domain. The cells represent the subtrees obtained after the partitioning of the FMM tree. The cells have been labeled and colored to represent the partition to which they have been assigned after the load balancing. In total, 256 subtrees have been distributed among 16 partitions.
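Before developing those weight models, it is worth showing concretely what such a weighted-graph partitioning call looks like. The sketch below uses ParMETIS_V3_PartKway on a toy distributed graph; the graph, the weight values and all names are placeholders chosen for illustration and do not reproduce PetFMM's actual partitioning code.

    #include <mpi.h>
    #include <parmetis.h>
    #include <vector>

    // Toy example: each rank owns `nlocal` subtree-vertices of a global ring
    // graph. Vertex weights stand in for the work estimates, edge weights for
    // the communication estimates developed in the next section.
    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      MPI_Comm comm = MPI_COMM_WORLD;
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      const idx_t nlocal = 4;
      const idx_t nglobal = nlocal * size;

      // vtxdist[r] .. vtxdist[r+1]-1 are the global vertices owned by rank r.
      std::vector<idx_t> vtxdist(size + 1);
      for (int r = 0; r <= size; ++r) vtxdist[r] = r * nlocal;

      // Local CSR adjacency: vertex v is connected to v-1 and v+1 (mod nglobal).
      std::vector<idx_t> xadj(nlocal + 1), adjncy, vwgt(nlocal), adjwgt;
      xadj[0] = 0;
      for (idx_t v = 0; v < nlocal; ++v) {
        const idx_t g = vtxdist[rank] + v;
        adjncy.push_back((g + nglobal - 1) % nglobal);
        adjncy.push_back((g + 1) % nglobal);
        adjwgt.push_back(1);                  // communication estimate (placeholder)
        adjwgt.push_back(1);
        vwgt[v] = 1 + g % 3;                  // work estimate (placeholder)
        xadj[v + 1] = static_cast<idx_t>(adjncy.size());
      }

      idx_t wgtflag = 3;                      // weights on both vertices and edges
      idx_t numflag = 0;                      // C-style numbering
      idx_t ncon = 1;                         // one balance constraint
      idx_t nparts = size;                    // one partition per process
      std::vector<real_t> tpwgts(ncon * nparts, static_cast<real_t>(1.0) / nparts);
      std::vector<real_t> ubvec(ncon, static_cast<real_t>(1.05));  // 5% imbalance tolerance
      idx_t options[3] = {0, 0, 0};           // default options
      idx_t edgecut = 0;
      std::vector<idx_t> part(nlocal);

      ParMETIS_V3_PartKway(vtxdist.data(), xadj.data(), adjncy.data(), vwgt.data(),
                           adjwgt.data(), &wgtflag, &numflag, &ncon, &nparts,
                           tpwgts.data(), ubvec.data(), options, &edgecut,
                           part.data(), &comm);

      // part[v] now holds the process assigned to local vertex v.
      MPI_Finalize();
      return 0;
    }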
5 Model of the parallel algorithm

The seminal analysis of the parallel complexity of the FMM was given by Greengard and Gropp in 1990 [10]. They develop a model for the running time T in terms of the number of particles N, the number of processors P, and the number of boxes at the finest level B:

    T = a\,\frac{N}{P} + b \log P + c\,\frac{N}{BP} + d\,\frac{NB}{P} + e(N, P).    (10)

They account here for perfectly parallel work, represented by a, such as the multipole expansion initialization and local expansion evaluation; the reduction bottleneck, b, such as that associated to multipole-to-multipole translations; multipole-to-local transformations and translations, c; direct interactions, d; and lower order terms, e.

In some respects, however, the analysis in [10] was limited. All the specific cases analyzed assumed a uniform distribution of particles at the finest level. As a consequence, no strategy is offered for dealing with non-uniform distributions, and computational imbalance is not addressed. The volume of communication for a certain partition of work is also not estimated in this model. Below, we extend the model by giving estimates for both communication volume and computational load. The new, extended model will allow us to generate an unstructured data distribution which is optimal in terms of load balance and communication overhead.

5.1 Estimation of communications

In Section 4, we discussed the parallelization strategy, where the tree representation of the FMM algorithm is decomposed into a set of subtrees. We now discuss in more detail the data communication that takes place between subtrees. Figure 3 presents a simplified illustration of the communications, showing all the data usage patterns across different parts of the algorithm. The illustration helps us in the identification of three types of communications that take place between two subtrees: multipole-to-multipole translations (M2M), multipole-to-local transformations (M2L), and local-to-local translations (L2L). Figure 4 is an equivalent graph representation of the partitioned tree from Figure 3, translating graphically the hierarchical tree into a set of vertices and edges: the subtrees are represented as vertices, and the communications between subtrees correspond to the edges of the graph.

In order to determine the edge weights of the graph, it is necessary to identify all the communications required between nodes of the FMM tree associated to different vertices of the equivalent graph. Communications between subtrees at different stages of the FMM algorithm can be classified into two types: communication of particles in the local domain, and communication of expansion terms. This last type occurs at M2M and L2L translations, and M2L transformations. A complete picture of the communication patterns for a binary tree can be seen in Figure 3. We note that the communication patterns for the M2M and L2L translations occur only from subtrees to the root tree and vice versa, while no communication between subtrees takes place for these operations.

Quantification of the amount of communication between each pair of vertices is required. For each communication pattern, we define a communication matrix, where each element c_ij of the matrix indicates the amount of communication from vertex i to vertex j. We now consider the two-dimensional case, where a quadtree is used. By studying the communication patterns between nodes in the quadtree, we arrive at estimates of the communications between subtrees of the truncated tree.
Thus, we produce the following estimate for the amount of communication between two laterally neighboring subtrees of a quadtree:

    \sum_{n=k+1}^{L} \alpha_{comm}\, 2^{\,n-k},    (11)

where L is the depth of the FMM tree, k is the cut level of the quadtree, and \alpha_{comm} is a constant depending on the expansion order p and the size of the floating point numbers used. In the same way, we obtain the following estimate for the amount of communication between two diagonally neighboring subtrees:

    \alpha_{comm}\, (L - k).    (12)

A z-order numbering of the nodes is used to discover the neighbor sets for every vertex of the graph without any communication between processes. All communication occurs between neighboring local trees, so we can fill the communication matrix as follows:

    for all nodes j at level k:
        for all nodes i from the neighbor set of j:
            if i is a lateral box:
                c[i][j] += lateral node estimate
            else:
                c[i][j] += diagonal node estimate
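A hedged C++ rendering of this procedure for a quadtree cut at level k is sketched below: the 2^k by 2^k subtrees are visited on a grid, lateral and diagonal neighbors are distinguished, and the matrix is filled with the estimates (11) and (12) as reconstructed above. Function and variable names are illustrative only.

    #include <cmath>
    #include <vector>

    // Estimate of Eq. (11): communication between laterally adjacent subtrees,
    // summed over the levels below the cut level k of a tree of depth L.
    double lateralEstimate(int L, int k, double alphaComm) {
      double total = 0.0;
      for (int n = k + 1; n <= L; ++n) total += alphaComm * std::pow(2.0, n - k);
      return total;
    }

    // Estimate of Eq. (12): diagonally adjacent subtrees share only a corner
    // box per level below the cut.
    double diagonalEstimate(int L, int k, double alphaComm) {
      return alphaComm * (L - k);
    }

    // Fill the communication matrix for the 2^k x 2^k subtrees of a quadtree.
    std::vector<std::vector<double>> communicationMatrix(int L, int k,
                                                         double alphaComm) {
      const int side = 1 << k;                  // 2^k subtrees in each direction
      const int n = side * side;
      std::vector<std::vector<double>> c(n, std::vector<double>(n, 0.0));
      for (int i = 0; i < side; ++i)
        for (int j = 0; j < side; ++j)
          for (int di = -1; di <= 1; ++di)
            for (int dj = -1; dj <= 1; ++dj) {
              if (di == 0 && dj == 0) continue;
              const int ni = i + di, nj = j + dj;
              if (ni < 0 || ni >= side || nj < 0 || nj >= side) continue;
              const int a = i * side + j, b = ni * side + nj;
              c[a][b] += (di == 0 || dj == 0)
                             ? lateralEstimate(L, k, alphaComm)    // shares an edge
                             : diagonalEstimate(L, k, alphaComm);  // shares a corner
            }
      return c;
    }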
5.2 Estimation of computational work

In Section 2, we decomposed the FMM algorithm into three stages: the upward sweep, the downward sweep, and the evaluation step. In this section we discuss work estimates following the same decomposition.

In the upward sweep, the MEs for all the nodes of the FMM tree are built. It takes O(N_i p) operations to build the ME for a node that is in the maximum level of refinement (the leaf level of the FMM tree), where N_i corresponds to the number of particles associated to box i, and p is the number of expansion terms retained. To obtain the MEs for the upper levels of the FMM tree, each node of the upper levels translates into its center the MEs of its child nodes. The translation of a single ME has cost O(p²), and if a node has n_c child boxes the work performed by a node is O(n_c p²).

In the downward sweep, the LEs that represent the complete far field of a leaf node are built. For each local cluster, LEs are built for all clusters in its interaction list. If the interaction list (IL) has n_IL members and the cost to transform an ME into an LE is O(p²), then, for each node of the FMM tree, the total work to obtain the LEs for its interaction list is O(p² n_IL). After all the MEs are converted into LEs, the LEs are propagated from the parent box to the child boxes. For each non-leaf node of the FMM tree, the LE is translated from a node's center into its n_c child boxes. The work to translate a single LE is O(p²), thus the work performed by a node is O(n_c p²).

In the evaluation step, the complete field is obtained for all the particles of the system, where the near field is computed from the local direct interactions and the far field is obtained from the LEs. The near domain is given by the leaf node that contains the particle. If, for a box, the near domain is given by n_nd nodes, each with N_i particles, the cost of computing all pairwise interactions for a leaf node is O(n_nd N_i²). The far field of a particle is obtained from the evaluation of the LE of the box that contains the particle. If a box has N_i particles, the cost of evaluating the far field for all the N_i particles is O(N_i p), where p is the number of terms in the LE.

In summary, the amount of work done by a non-leaf node of the FMM tree is:

    O\!\left( p^2 (2 n_c + n_{IL}) \right),    (13)

and the amount of work performed by a leaf node is:

    O\!\left( 2 N_i p + p^2 n_{IL} + n_{nd} N_i^2 \right),    (14)

where n_c corresponds to the number of children of a box, p is the number of expansion coefficients retained, n_IL is the estimate of the number of members in the interaction list, and N_i is the number of particles in box i.

When partitioning the FMM tree, T subtrees are created, each of which has the same number of levels L_st. The estimate for the total amount of work performed by each subtree depends on its number of nodes and its number of particles at the leaf level.
By means of equations (13) and (14), we can estimate the amount of work performed by any subtree and use this as the weight for the corresponding vertex of the graph:

    \left( \sum_{l=0}^{L_{st}-2} 2^{dl}\, p^2 (2 n_c + n_{IL}) \right) + 2^{d(L_{st}-1)} \left( 2 N_i p + p^2 n_{IL} + n_{nd} N_i^2 \right).    (15)

5.3 Estimation of memory usage

We will first estimate memory usage in the serial case, and then extend these results to the parallel code, since most of the computational structures are preserved. The quadtree structure stores data of three types: O(1) storage, such as sizes; data across boxes of the finest level; and data across all nodes of the tree. Over the entire tree, we store the box centers, the interaction list and associated values, and three expansion coefficients. We store local and neighbor particles only at the finest level of the tree. Let d be the space dimension, L be the maximum level, p the number of expansion terms, N the number of particles, B = 28 the size of a particle in bytes, s the maximum number of particles per box, and let Λ be the total number of boxes in the tree, given by

    \Lambda = \sum_{l=0}^{L} 2^{dl} = \frac{2^{d(L+1)} - 1}{2^d - 1}.

The maximum memory usage is given in Table 1. Notice that even for very nonuniform distributions, the memory usage for neighbor particles is bounded above by a constant multiplied by the total number of particles. Thus the memory usage is linear in the number of boxes at the finest level and the number of particles.

Table 1. Memory usage for serial quadtree structures.

    Type                     Bookkeeping memory (bytes)    Data memory (bytes)
    Box centers              0                             8dΛ
    Interaction boxes        (2 ∗ …)Λ                      (… d + 16p)Λ
    Multipole coefficients   0                             16pΛ
    Temporary coefficients   0                             16pΛ
    Local coefficients       0                             16pΛ
    Local particles          (2 ∗ …)                       BN
    Neighbor particles       (2 ∗ …)                       Bs 2^{dL}

In the parallel case, we reuse our serial data structures completely, and thus we only need to estimate the memory usage from explicitly parallel constructs. We maintain both a partition of boxes, and its inverse. We also have overlap [15] structures, essentially box-to-box maps between processes, for both neighbor particles and interaction list values. Let P be the number of processes, N_lt the maximum number of local trees, N_bd the maximum number of boundary boxes for a process, and A = 108 the size of an arrow in the overlap structure. The maximum memory usage per process is given in Table 2. The neighbor overlap structure is bounded by the total number of particles exchanged, whereas the interaction list overlap is bounded by the maximum cut size, a constant multiplied by N_lt.

Table 2. Memory usage for parallel quadtree structures.

    Type                       Bookkeeping memory (bytes)    Data memory (bytes)
    Partition                  (2 ∗ …)P                      N_lt
    Inverse partition          0                             4 N_lt
    Neighbor send overlap      N/A                           N_bd s A
    Neighbor recv overlap      N/A                           N_bd s A
    Interaction send overlap   N/A                           27 N_bd A
    Interaction recv overlap   N/A                           27 N_bd A

Future improvements may include memory savings through computing values dynamically. For instance, we need not store the interaction list boxes, as they can be generated from the global box number. Also, for the case of uniform spacing, the box centers can be determined on the fly. Neighboring particles, except for those from other processes, could be retrieved from the local particle storage. The inverse partition of boxes could be calculated on the fly, albeit with communication.
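As an illustration, the work estimates (13)-(15) translate directly into a small routine that could produce the vertex weights of the graph; the names and the floating-point scaling below are assumptions for illustration, not PetFMM internals.

    #include <cmath>

    // Work estimate of Eq. (13) for a non-leaf node and Eq. (14) for a leaf node.
    double nonLeafWork(int p, int nc, int nIL) {
      return static_cast<double>(p) * p * (2 * nc + nIL);
    }
    double leafWork(int p, int nIL, int nnd, double Ni) {
      return 2.0 * Ni * p + static_cast<double>(p) * p * nIL + nnd * Ni * Ni;
    }

    // Eq. (15): total estimate for a subtree with Lst levels in d dimensions,
    // assuming Ni particles per leaf box; used as the vertex weight.
    double subtreeWork(int Lst, int d, int p, int nc, int nIL, int nnd, double Ni) {
      double w = 0.0;
      for (int l = 0; l <= Lst - 2; ++l)
        w += std::pow(2.0, d * l) * nonLeafWork(p, nc, nIL);
      w += std::pow(2.0, d * (Lst - 1)) * leafWork(p, nIL, nnd, Ni);
      return w;
    }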
6 Software design and implementation

The PetFMM library was designed to offer both high serial performance and scalability, but also to be easily integrated into existing codes and maintained. The serial code is completely reused in the parallel setting, so that we are never required to maintain two versions of the same algorithm. PetFMM leverages existing packages to keep its own code base small and clear. Parallel data movement is handled by the Sieve package [15], while load and communication are balanced using ParMETIS, as described in the previous sections.

The problem geometry is encapsulated in the Quadtree class. All relations, such as neighbors and interaction lists, can be dynamically generated, so that we need only store data across the cells. Data is stored in Sieve Section objects, which allow easy communication across unstructured boundaries with arbitrary layouts. The ParallelQuadtree class presents an identical interface, but manages a set of local Quadtrees and a root Quadtree to mimic these methods.

All serial computation is encapsulated in the Evaluator class, which operates over a Quadtree. It is templated over a Kernel object, which produces the required expansions and translation operators, so that we can easily replace one equation with another. The ParallelEvaluator, which inherits from Evaluator, overrides some methods in order to operate over a ParallelQuadtree. All computation in the parallel code is still done by serial structures, and only data partitioning and movement is accomplished by new code, mainly derived from the Sieve library.
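The kernel-templating design described above can be sketched as follows; the member functions and the LaplaceKernel2D type are hypothetical illustrations of the pattern, not the actual PetFMM class interfaces.

    #include <complex>
    #include <vector>

    // A hypothetical kernel supplying expansions and translation operators.
    // Swapping the equation being solved means swapping this one type.
    struct LaplaceKernel2D {
      using Coeff = std::complex<double>;
      static constexpr int terms = 17;                    // truncation order p
      // Real code implements the series here; stubs keep the sketch short.
      static void particleToMultipole(const std::vector<double>& /*x*/,
                                      std::vector<Coeff>& /*me*/) {}
      static void multipoleToLocal(const std::vector<Coeff>& /*me*/,
                                   std::vector<Coeff>& /*le*/) {}
    };

    // The evaluator is templated over the kernel, so the traversal logic is
    // written once and reused for any kernel.
    template <class Kernel>
    class Evaluator {
     public:
      void upwardSweep()   { /* P2M at the leaves, then M2M towards the root */ }
      void downwardSweep() { /* M2L over interaction lists, then L2L */ }
      void evaluate()      { /* L2P plus near-field direct interactions */ }
    };

    // Usage: Evaluator<LaplaceKernel2D> eval; eval.upwardSweep(); ...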
From an initial, serial Python implementation of the FMM algorithm (available for free download at http://code.google.com/p/pyfmm/), we produced a parallel C++ version in little more than two weeks (with subsequent debugging and verification over several more weeks). The PETSc Sieve framework [15] greatly aided parallelization. We were also careful at each stage to introduce unit testing and verify our results against those of the well-studied Python code [8]. We developed a file format for result verification, consisting of:

- the number of levels, terms, particles, and tree coordinates;
- particle assignment to quadtree cells;
- center, number of particles, children, and neighbors for each box;
- interaction list and coefficients for each box;
- the direct and FMM solutions.

All domain boxes were labeled with global numbers in the output, which enables the parallel code to be compared with the serial implementations. Moreover, the box output may come in any order, making parallel output trivial. A small Python script then compares two output files, noting any discrepancies. In this way we compared the initial Python code with the C++ version, both serial and parallel, and even among parallel runs with different numbers of processes. Confidence in our results enabled us to make rapid changes, introducing new algorithms such as the improved partitioning scheme without lengthy debugging.

7 Computational experiments with the parallel software
The results presented here were obtained on the BlueCrystal-I system at the Advanced Computing Research Centre, University of Bristol. The BlueCrystal-I system is composed of 416 nodes, each with two quad-core Intel Harpertown processors and 8 GB of RAM. The system is interconnected with a QLogic InfiniPath high-speed network.

As an experimental setup, we tested the strong scaling capabilities of PetFMM up to 64 nodes, with only one process per node in order to concentrate on network communications. The test case corresponds to the use of the FMM in the context of an application of the vortex method to a viscous flow problem; see §3. We use the Lamb-Oseen vortex, a known analytical solution of the Navier-Stokes equations, to initialize the strength (representing the amount of circulation γ_i) of each of the particles of the system, as in [4]. The core sizes σ of the particles are uniform for all the particles; the inter-particle spacing h is obtained from the relation h/σ = 0.8, as in [4].

The analytical solution for the vorticity field of the Lamb-Oseen vortex is:

    \omega(r, t) = \frac{\Gamma}{4\pi\nu t}\, e^{-r^2/(4\nu t)},    (16)

where r² = x² + y², Γ represents the total circulation of the vortex, ν is the viscosity of the fluid, and t is time. The velocity field, in turn, is given by:

    u(r, t) = \frac{\Gamma}{2\pi r} \left( 1 - e^{-r^2/(4\nu t)} \right).    (17)

The analytical solution of the velocity field for the Lamb-Oseen vortex is used to compare against the results obtained when solving for the Biot-Savart velocity, accelerated via the FMM. Extensive studies of the accuracy of the FMM for this problem, with respect to the algorithm parameters, were reported in [8]. Many of the experiments reported there were repeated with PetFMM, as part of the verification stage of the parallel program. For the strong scaling experiments reported below, we designed a problem setup with parameters which were appropriate to test the parallel performance and strain communications, but which may not be optimal from the point of view of accuracy of the computed field. In particular, the scaling study used many levels in the tree, which introduces errors of Type I, as reported in [8], related to kernel substitution.
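For verification purposes, the analytical velocity (17) is trivial to evaluate; a small helper of the following form (not part of PetFMM) can be used to compare against the computed Biot-Savart velocities.

    #include <cmath>

    // Tangential velocity of the Lamb-Oseen vortex, Eq. (17), at radius r and
    // time t, for total circulation Gamma and kinematic viscosity nu.
    double lambOseenVelocity(double r, double t, double Gamma, double nu) {
      const double kPi = 3.141592653589793;
      return Gamma / (2.0 * kPi * r) * (1.0 - std::exp(-r * r / (4.0 * nu * t)));
    }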
Fig. 6. Measured times for experimental results obtained with PetFMM, presented on a base-10 logarithmic scale, for an increasing number of processors, presented on a base-2 logarithmic scale. The timings for the total execution of PetFMM (labeled Total time) and for the more important stages of the algorithm are shown.
We now present the results of numerical experiments with PetFMM. In order to present the scalability of PetFMM, we report strong scaling results, i.e., we compare how the total execution time of PetFMM varies with the number of processors for a fixed problem size. The same set of parameters was used for all the experimental results: N ≈ 765,000 particles and p = 17 expansion terms. Therefore, the total work remains the same, but we vary the number of processors used, up to P = 64. Figure 6 shows the measured times obtained with PetFMM for an increasing number of processors. Measured times for the more important stages of the FMM algorithm are also given, effectively depicting the fraction of time consumed by each stage of the algorithm.

For the analysis of the performance of PetFMM, we present the speedup, the parallel efficiency, and a load balancing metric. For a definition of speedup (S) we use
Fig. 7. Speedup of PetFMM for an increasing number of processors. The speedup and the number of processors are presented on a base-2 logarithmic scale. The deviation of PetFMM from perfect speedup indicates that there are still some serial stages within the algorithm, as well as communication overheads.
Fig. 8. Parallel efficiency of p et fmm for an increasing number of processors. Hereonly the relevant range of efficiency is plotted against the number of processors ina base 2 logarithm scale. perfectly with the serial implementation; this is equivalent to perfect speedup,but efficiency is a more revealing metric. Both of these performance metricsassume that the instructions executed by the parallel application are the sameas the instructions executed by the serial application. Of course, this is not thecase. In general, the differences between serial and parallel implementationsare due to: different workloads between the serial and parallel processes, andthe existence in the parallel implementation of data communication amongprocesses and synchronization of processes. As a consequence of these differ-ences, perfect speedup or perfect efficiency ( E = 1) are generally not obtained.In fact, as seen in Figure 8, it can happen that an efficiency value greater thanunity is obtained, which reveals the shortcomings of the metric. Figure 8 showsthe parallel efficiency of p et fmm for the main stages of the fmm algorithm.Figure 9 shows a load balance metric, which we define as LB ( P ) = minimum execution time for one of P processorsmaximum execution time for one of P processors (20)A load balance metric of unity would indicate equal execution times amongall processors in a parallel run, and thus perfect load balancing. Figure 9 alsoincludes the total parallel efficiency, which follows a similar trend, indicating24 Fig. 9. Load balance metric of p et fmm for an increasing number of processors. Hereonly the relevant range of efficiency is plotted against the number of processors in abase 2 logarithm scale. The load balance metric is compared with the total efficiencyof the runs. that parallel overheads and communications overheads are correlated witha degradation of the load balance metric. Nevertheless, it can be seen thatprocessor execution times are within 5% of each other for 32 processors andwithin 7% of each other for 64 processors, demonstrating the success of ourload balancing strategy.The largest problem size that has been run with p et fmm at this point solvedfor 64 million particles on 64 processors. The total execution time was 115 . .
The largest problem size that has been run with PetFMM at this point solved for 64 million particles on 64 processors. The total execution time was 115 seconds, with a memory footprint of roughly 1 GB per processor. PetFMM is currently able to achieve over 90% parallel efficiency for 32 processors, and over 85% parallel efficiency for 64 processors.

8 Conclusions

We have presented an extension of the Greengard-Gropp parallel complexity analysis for the FMM which is suitable for distributed computing. We thus consider partition quality in terms of both communication volume and load balance, absent from the original analysis, in order to derive an optimally distributed algorithm. This algorithm can maintain good performance with highly non-uniform particle distributions, heterogeneous computing environments, and low bandwidth connections.

We have also delivered an open source, extensible implementation of this algorithm, intended as a community resource. The software library is truly open source, and distributed under the PETSc license, even less restrictive than the BSD license. This guarantees the maximum impact to the scientific community and encourages peer-based collaboration for extensions and applications.

Our original approach to parallelization relies on recasting the hierarchical space division, represented by a tree structure, into a weighted graph generated from a model of the computational work and communication estimates. The optimal partitioning of the graph produces dynamic load balancing. Experiments showed that processor execution times were within 5% of each other for 32 processors and within 7% of each other for 64 processors, demonstrating the success of our load balancing strategy. The parallelization strategy permitted almost complete reuse of the serial implementation and, due to its simplicity, was effected with very little additional code. It has shown very good strong scaling on small clusters. The library is currently able to achieve over 90% parallel efficiency for 32 processors, and over 85% parallel efficiency for 64 processors. The largest problem size that has been run with the PetFMM library at this point was 64 million particles on 64 processors.

At present, we are working on exposing more concurrency in the algorithm to exploit multi-core architectures; our real goal here is to extend these results to large clusters, and also to heterogeneous systems, including programmable graphics processing units (GPUs). With respect to delivering scalability to larger numbers of processors, we are investigating a strategy based on a recursive tree-cutting approach, where the methodology described here can generate large subtrees which are further cut to obtain sub-subtrees.

In terms of the extensibility of PetFMM, the first and most straightforward extension is to introduce 3D capability. Our methodology applies without modification to 3D, as our tree and particle objects are templated over the dimension. Moreover, the parallel model involving work and communication estimates, and the setup for optimization of the partition, all apply to 3D. The main modification here involves the implementation of a new kernel, incorporating the 3D series expansions and translation/transformation operators.

Future extensions of PetFMM include the implementation of different kernels, for example the Lennard-Jones potential for molecular dynamics applications, and we expect users to find PetFMM a good starting point for their scientific applications. To encourage peer-based collaboration, we are working to deliver an official release by the end of the year, consisting of a User's Manual, example code and verification methods.

Acknowledgments
Computing time was provided by the Advanced Computing Research Centre, University of Bristol. FAC acknowledges financial support from Airbus and BAE Systems under contract ACAD 01478. LAB acknowledges partial support from EPSRC under grant contract EP/E033083/1, and from Boston University College of Engineering.
References

[1] Christopher R. Anderson. An implementation of the fast multipole method without multipoles. SIAM J. Sci. Stat. Comput., 13(4):923-947, 1992.
[2] Andrew W. Appel. An efficient program for many-body simulation. SIAM J. Sci. Stat. Comput., 6(1):85-103, 1985.
[3] S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, M. Knepley, L. Curfman-McInnes, B. F. Smith, and H. Zhang. PETSc User's Manual. Technical Report ANL-95/11 - Revision 2.1.5, Argonne National Laboratory, 2002.
[4] L. A. Barba, A. Leonard, and C. B. Allen. Advances in viscous vortex methods - meshless spatial adaption based on radial basis function interpolation. Int. J. Num. Meth. Fluids, 47(5):387-421, 2005.
[5] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446-449, December 1986.
[6] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J. Comp. Chem., 4:187-217, 1983.
[7] J. Carrier, L. Greengard, and V. Rokhlin. A fast adaptive multipole algorithm for particle simulations. SIAM J. Sci. Stat. Comput., 9(4):669-686, 1988.
[8] Felipe A. Cruz and L. A. Barba. Characterization of the accuracy of the fast multipole method in particle simulations. Int. J. Num. Meth. Eng., 2009. Published online May 5, 2009.
[9] Z. Gimbutas and V. Rokhlin. A generalized fast multipole method for nonoscillatory kernels. SIAM J. Sci. Comput., 24(3):796-817, 2002.
[10] L. Greengard and W. D. Gropp. A parallel version of the fast multipole method. Comp. Math. Appl., 20(7):63-71, 1990.
[11] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comput. Phys., 73(2):325-348, 1987.
[12] Pascal Havé. A parallel implementation of the fast multipole method for Maxwell's equations. Int. J. Numer. Meth. Fluids, 43(8):839-864, 2003.
[13] R. W. Hockney and J. W. Eastwood. Computer Simulation Using Particles. McGraw-Hill, New York, 1981.
[14] George Karypis and Vipin Kumar. A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J. Par. Dist. Comp., 48:71-85, 1998.
[15] Matthew G. Knepley and Dmitry A. Karpeev. Mesh algorithms for PDE with Sieve I: Mesh distribution. Sci. Prog., 2009. To appear.
[16] J. Kurzak and B. M. Pettitt. Massively parallel implementation of a fast multipole method for distributed memory machines. J. Par. Dist. Comp., 65(7):870-881, 2005.
[17] Shuji Ogata, Timothy J. Campbell, Rajiv K. Kalia, Aiichiro Nakano, Priya Vashishta, and Satyavani Vemparala. Scalable and portable implementation of the fast multipole method on parallel computers. Comp. Phys. Comm., 153(3):445-461, 2003.
[18] James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel, Laxmikant Kale, and Klaus Schulten. Scalable molecular dynamics with NAMD. J. Comp. Chem., 26:1781-1802, 2005.
[19] W. T. Rankin. DPMTA: distributed parallel multipole tree algorithm. http://people.ee.duke.edu/~wrankin/Dpmta/, checked May 2009.
[20] W. T. Rankin. Efficient parallel implementations of multipole-based N-body algorithms. PhD thesis, Department of Electrical and Computer Engineering, Duke University, 1999.
[21] R. D. Skeel, I. Tezcan, and D. J. Hardy. Multiple grid methods for classical molecular dynamics. J. Comp. Chem., 23(6):673-684, 2002.
[22] Michael S. Warren and John K. Salmon. A parallel hashed oct-tree N-body algorithm. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, pages 12-21, New York, 1993. ACM.
[23] Lexing Ying, George Biros, and Denis Zorin. A kernel-independent adaptive fast multipole algorithm in two and three dimensions. J. Comput. Phys., 196(2):591-626, 2004.
[24] Lexing Ying, George Biros, Denis Zorin, and Harper Langston. A new parallel kernel-independent fast multipole method. In Supercomputing, 2003 ACM/IEEE Conference, page 14. IEEE Computer Society, 2003.