PetFMM--A dynamically load-balancing parallel fast multipole library
Felipe A. Cruz (a), Matthew G. Knepley (b), L. A. Barba (c,∗)

(a) Department of Mathematics, University of Bristol, England BS8 1TW
(b) Computation Institute, Argonne National Laboratory and University of Chicago
(c) Department of Mechanical Engineering, Boston University, Boston MA 02215

∗ Corresponding author.
Abstract
Fast algorithms for the computation of N-body problems can be broadly classified into mesh-based interpolation methods, and hierarchical or multiresolution methods. To this last class belongs the well-known fast multipole method (FMM), which offers O(N) complexity. The FMM is a complex algorithm, and the programming difficulty associated with it has arguably diminished its impact, being a barrier for adoption. This paper presents an extensible parallel library for N-body interactions utilizing the FMM algorithm, built on the framework of PETSc. A prominent feature of this library is that it is designed to be extensible, with a view to unifying efforts involving many algorithms based on the same principles as the FMM and enabling easy development of scientific application codes. The paper also details an exhaustive model for the computation of tree-based N-body algorithms in parallel, including both work estimates and communication estimates. With this model, we are able to implement a method to provide automatic, a priori load balancing of the parallel execution, achieving optimal distribution of the computational work among processors and minimal inter-processor communications. Using a client application that performs the calculation of the velocity induced by N vortex particles, ample verification and testing of the library was performed. Strong scaling results are presented with close to a million particles on up to 64 processors, including both speedup and parallel efficiency. The largest problem size that has been run with the PetFMM library at this point was 64 million particles on 64 processors. The library is currently able to achieve over 90% parallel efficiency for 32 processors, and over 85% parallel efficiency for 64 processors. The software library is open source under the PETSc license, even less restrictive than the BSD license; this guarantees the maximum impact to the scientific community and encourages peer-based collaboration for extensions and applications.

Key words: fast multipole method, order-N algorithms, particle methods, vortex method, hierarchical algorithms, parallel computing, dynamic load balancing

Preprint submitted to Elsevier, 15 May 2009

1 Introduction
So-called N-body problems arise in many areas (e.g., astrophysics, molecular dynamics, vortex methods, electrostatics). In these problems, the system is described by a set of N particles, and the dynamics of the system is the result of the interactions that occur for every pair of particles. To calculate all such interactions, the total number of operations required normally scales as N². One useful way to mathematically express an N-body problem is by means of a matrix-vector multiplication, where the matrix is dense and represents the particle interactions, and the vector corresponds to the weights of the particles. Thus, the mat-vec product corresponds to the evaluation of all pairwise interactions. In a naive implementation, we would directly create the matrix in computer memory and then perform the multiplication with the vector. This naive approach would prove feasible only for small N, as the computational requirements in processing power and memory both grow as N². For this reason, many efforts have been directed at producing efficient implementations, capable of performing the mat-vec operation at reduced memory requirements and operation counts.

The various methods for more efficient calculation of the particle interaction problem can be broadly classified in two types: mesh-based interpolation methods, and hierarchical or multiresolution methods. The basic mesh-based method is the particle-mesh (PM) approach, in which particle information is interpolated onto a lattice, the field of interest is solved on the mesh by means of some grid-based method such as finite differences, and the field information is finally interpolated back to the particle locations [13]. (This method is also called particle-in-cell, PIC.) In some applications, such as molecular dynamics, the smoothing at short range introduced by the interpolations in the PM method is unacceptable. An alternative mesh-based method can then be used in which all near-field potentials are calculated directly, while far-field effects are calculated with the PM method; this is called the particle-particle/particle-mesh method (P³M).

The second class of methods provides efficient computation of the interactions by means of a hierarchical or multilevel approach. The main subset of this class performs a hierarchical subdivision of the computational domain, which is used to group particles at different length scales, and then approximates the interactions of clusters of particles using series expansions. The approximation is applied to far-field interactions, while near-field interactions are summed directly. This type of method can be considered meshfree, and includes: treecodes [5,2], the fast multipole method (FMM) [11] and its variations [7,1,23,9]. An alternative multilevel method has been presented by Skeel [21], which is akin to P³M, but unlike P³M it provides acceleration via multiple grid levels, achieving O(N) complexity. Thus, this method applies the multilevel approach via field decomposition, rather than spatial decomposition; it can perhaps be viewed as a hybrid of the mesh-based and hierarchical methods.

In this paper, we present the theory and development of a parallel fast multipole library, PetFMM, belonging to the meshfree type of methods described above. (The name PetFMM stands for "portable, extensible toolkit for FMM", following PETSc, the "portable, extensible toolkit for scientific computing".)
The overarching goal is to unify the efforts in the development of FMM-like methods into an open-source library that provides a framework with the capacity to accommodate memory efficiency, parallelism, and the data structures used for spatial decomposition. The software is implemented utilizing PETSc, the parallel library for scientific computing developed over more than 17 years at Argonne National Laboratory [3]. At this point, we have a complete implementation of the FMM in parallel, with dynamic load balancing provided by means of an optimization approach: minimizing inter-node communications and per-node computational work. But a prominent feature of this implementation is that it is designed to be extensible, so that it can effectively unify efforts involving many algorithms which are based on the same principles as the FMM. The perspectives for extensibility are described in this paper as well.

The development of this extensible parallel library for N-body interactions is important due to the programming difficulty associated with the FMM algorithm, which has been a barrier for adoption and arguably diminished its potential impact. A critical stage in the maturation of a computational field begins with the widespread availability of community software for the central algorithms. A good example is molecular dynamics, which blossomed after the introduction of freely available packages, such as CHARMM [6] and NAMD [18]. Such game-changing community software does not exist for particle methods or for N-body computations in general.

The present paper does not merely describe a new parallel strategy and implementation; it also details an exhaustive model for the computation of tree-based N-body algorithms in parallel. Our model is a significant extension of the time model developed in [10], which assumed a uniform distribution of particles.

An early parallel implementation of the FMM was that of Greengard and Gropp [10], on a shared-memory computer. They also presented a timing model for a perfectly balanced implementation, of which we say more in §5.
More recent versions of the parallel FMM have been presented in [24,12,17,16]. Many of these codes produce a partition of the data among processors based upon a space-filling curve, as previously introduced for the case of parallel treecodes in [22]. Only one of these parallel FMM codes represents a supported, open source code available for community use. The kifmm3d code can be downloaded and modified under the GPL license. It is not an implementation of the classic FMM algorithm, however, but rather a kernel-independent version developed by the authors. This algorithm does not utilize series expansions, but instead uses values on a bounding surface of each subdomain obtained through an iterative solution method; these ideas are based on [1]. The code does not appear to allow easy extension to the traditional FMM, or to the other related algorithms alluded to above. Moreover, it does not appear to be designed as an embedded library component, part of a larger multiphysics simulation.

This paper is organized as follows. We present first a brief overview of the FMM algorithm; this presentation is necessarily cursory, as a large body of literature has been written about this method. A basic description, however, is necessary to agree on a common terminology for the rest of the paper. Our approach is to illustrate the method using graphical representations.
The following section (§3) describes our client application code, the vortex particle method for simulation of incompressible flow at high Reynolds numbers. In this method, the FMM is one of two approaches commonly used to obtain the velocity of particles from the vorticity field information; the second approach (as in other N-body problems) is to interpolate information back and forth from a mesh, while solving for the field of interest on the mesh. Although equally efficient (both can be O(N)), the extensive use of interpolations may introduce numerical diffusion, which is undesirable in certain applications. Next, §4 presents our parallelization strategy, and §5 develops the model used to estimate computational work and communications. The subsequent section (§6) presents details of our software design, implementation, and the verification carried out. Results of computational experiments with the parallel software are presented in §7, and we end with some conclusions and remarks about future work.
Note that a very simple and flexible, yet efficient and scalable code has been produced. The entire PetFMM implementation of the FMM is only a few thousand lines of C++, including comments and blank lines. It is already freely downloadable from http://petsc.cs.iit.edu/petsc/ and we welcome correspondence with potential users or those who wish to extend it for specific purposes.

2 Overview of the fast multipole method

The fast multipole method (FMM) is an algorithm which accelerates computations of the form

    f(y_j) = \sum_{i=1}^{N} c_i K(y_j, x_i),    (1)

representing a field value evaluated at point y_j, where the field is generated by the influence of sources located at the set of centers {x_i}. The sources are often associated with particle-type objects, such as stellar masses or charged particles. The evaluation of the field at the centers themselves, therefore, represents the well-known N-body problem. In summary: {y_j} is a set of evaluation points, {x_i} is a set of source points with weights given by c_i, and K(y, x) is the kernel that governs the interactions between evaluation and source particles. The objective is to obtain the field f at all the evaluation points, which requires in principle O(N²) operations if both sets of points have N elements. Fast algorithms aim at obtaining f approximately with a reduced operation count, ideally O(N).

The FMM works by approximating the influence of a cluster of particles by a single collective representation, under the assumptions that the influence of particles becomes weaker as the evaluation point is further away, i.e., the kernel K(y, x) decays as |x − y| increases, and that the approximations are used to evaluate far-distance interactions. To accomplish this, the FMM hierarchically decomposes the computational domain and then represents the influence of sets of particles by a single approximate value. The hierarchical decomposition breaks up the domain at increasing levels of refinement, and for each level it identifies a near and a far sub-domain. By using the hierarchical decomposition, the far field can be reconstructed as shown in Figure 1.

Using the computational domain decomposition, the sum in Equation (1) is decomposed as

    f(y_j) = \sum_{l=1}^{N_{near}} c_l K(y_j, x_l) + \sum_{k=1}^{N_{far}} c_k K(y_j, x_k),    (2)

where the right-most sum of (2), representing the far field, is evaluated approximately and efficiently.

We now need to introduce the following terminology with respect to the mathematical tools used to agglomerate the influence of clusters of particles:

Multipole Expansion (ME): a series expansion truncated after p terms which represents the influence of a cluster of particles, and is valid at distances large with respect to the cluster radius.

Local Expansion (LE): a truncated series expansion, valid only inside a sub-domain, which is used to efficiently evaluate a group of MEs.

In other words, the MEs and LEs are series (e.g., Taylor series) that converge in different sub-domains of space. The center of the series for an ME is the center of the cluster of source particles, and it only converges outside the cluster of particles. In the case of an LE, the series is centered near an evaluation point and converges locally.

As an example, consider a particle interaction problem with decaying kernels, where a cluster of particles far away from an evaluation point is 'seen' at the evaluation point as a 'pseudo-particle', and thus its influence can be represented by a single expression.
For example, the gravitational potential of a galaxy far away can be expressed by a single quantity locally. Thus, by using the ME representing a cluster, the influence of that cluster can be rapidly evaluated at a point located far away, as only the single influence of the ME needs to be evaluated, instead of the multiple influences of all the particles in the cluster. Moreover, for clusters of particles that are farther from the evaluation point, the pseudo-particle representing that cluster can be larger. This idea, illustrated in Figure 1(b), permits increased efficiency in the computation.

The introduction of an aggregated representation of a cluster of particles, via the multipole expansion, effectively permits a decoupling of the influence of the source particles from the evaluation points. This is a key idea, resulting in the factorization of the computations of MEs that are centered at the same point, so that the kernel can be written as

    K(x_i, y_j) = \sum_{m=0}^{p} a_m(x_i) f_m(y_j).    (3)

This factorization allows pre-computation of terms that can be re-used many times, thus increasing the efficiency of the overall computation. Similarly, the local expansion is used to decouple the influence of an ME from the evaluation points. A group of MEs can be factorized into a single LE so that one single evaluation can be used at multiple points locally. By representing MEs as LEs, one can efficiently evaluate a group of clusters at a group of evaluation points.

2.1 Hierarchical space decomposition

In order to utilize the tools of MEs and LEs, a spatial decomposition scheme needs to be provided. In other words, for a complete set of particles, we need to find the clusters that will be used in conjunction with the MEs to approximate the far field, and the sub-domains where the LEs are going to be used to efficiently evaluate groups of MEs. This spatial decomposition is accomplished by a hierarchical subdivision of space associated to a tree structure (a quadtree in two dimensions, or an octree in three dimensions) to represent each subdivision. The nodes of the tree structure are used to define the spatial decomposition, and different scales are obtained by looking at different levels of the tree. A tree representation of the space decomposition allows us to express the decomposition independently of the number of dimensions of the space. Consider Figure 1(a), where a quadtree decomposition of the space is illustrated. The nodes of the tree at any given level cover the whole domain. The relations between nodes of the tree represent spatial refinement: the domain covered by a parent box is further decomposed into smaller sub-domains by its child nodes. Thus, in the FMM the tree structure is used to hierarchically decompose the space, and the hierarchical space decomposition is used to represent the near-field and far-field domains. As an example, consider Figure 1(b), where the near field for the black colored box is represented by the light colored boxes, and the far field is composed by the dark colored boxes.
After the spatial decomposition stage, the FMM can be roughly summarized in three stages: the upward sweep, the downward sweep, and the evaluation step. In the upward sweep, the objective is to build the MEs for each node of the tree. The MEs are built first at the deepest level, level L, and then translated to the center of the parent nodes. Thus, the MEs at the higher levels do not have to be computed from the particles; they are computed from the MEs of the child nodes. In the downward sweep of the tree, the MEs are translated into LEs for all the boxes in the interaction list. At each level, the interaction list corresponds to the cells of the same level that are in the far field for a given cell. Once the MEs have been translated into LEs, the LEs of upper levels are translated and added up to obtain the complete far-domain influence for each box at the leaf level of the tree. At the end of the downward sweep, each box will have an LE that represents the complete far field for the box. Finally, at the evaluation step, for every particle at every node at the deepest level of the tree, the final field is evaluated by adding the near-field and far-field contributions. The near field of the particles at a given box is obtained by directly computing the interaction between all the particles in the near domain of the box. The far field of the particles is obtained by evaluating the LE of the box at each particle location.

Fig. 1. Quadtree decomposition of a two-dimensional domain. (a) Domain decomposition: the hierarchical tree related to the full spatial decomposition of the domain. (b) Near and far field: a colored two-dimensional spatial decomposition for the black box and its equivalence on the tree. The near field is composed by the bright boxes and the black box itself, while the far field is composed by the dark colored boxes. Notice that the far field is composed of boxes of different levels of the tree structure. The relations between the nodes of the tree simplify the process of composing the near and far domains.

These ideas can be visualized with an illustration, as shown in Figure 2, which we call the "bird's eye view" of the complete algorithm. The importance of this bird's eye view is that it relates the algorithm computations to the data structure used by the FMM. This will prove to be very useful when we discuss the parallel version that we have developed.

This overview is only intended to introduce the main components of the FMM algorithm, so that we can discuss in the forthcoming sections our strategy for parallelization. Therefore, it is not a detailed presentation (we have left out all the mathematics), and readers interested in understanding all the details should consult the original reference [11], and selections of the abundant literature published since.
Fig. 2. Bird's eye view of the FMM algorithm. The sketch illustrates the upward sweep and the downward sweep stages on the tree. Each stage has been further decomposed into the following substages: P2M, transformation of particles into MEs (particle-to-multipole); M2M, translation of MEs (multipole-to-multipole); M2L, transformation of an ME into an LE (multipole-to-local); L2L, translation of an LE (local-to-local); L2P, evaluation of an LE at particle locations (local-to-particle).
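To make the stages of Figure 2 concrete, the following is a minimal C++ sketch of the serial driver loop over a complete quadtree stored level by level. All type and function names here (Box, interactionList, the p2m/m2m/m2l/l2l/l2p stubs) are hypothetical illustrations, not the PetFMM interface, and the expansion operators are left as empty stubs.

    #include <cstddef>
    #include <vector>

    // Hypothetical box type (illustration only): particles plus truncated
    // multipole and local expansions.
    struct Box {
      std::vector<double> x, y, w;      // particle coordinates and weights
      std::vector<double> multipole;    // p multipole coefficients
      std::vector<double> local;        // p local-expansion coefficients
      std::vector<double> phi;          // field evaluated at each particle
    };

    // One level of a complete quadtree: level l holds 4^l boxes, and the
    // children of box b at level l are boxes 4b .. 4b+3 at level l+1.
    using Level = std::vector<Box>;

    // Expansion operators, left as empty stubs; a real code supplies the series.
    void p2m(Box&) {}                    // particles -> multipole expansion
    void m2m(const Box&, Box&) {}        // shift child ME to the parent center
    void m2l(const Box&, Box&) {}        // ME of a source box -> LE of a target box
    void l2l(const Box&, Box&) {}        // shift parent LE to a child center
    void l2p(Box&) {}                    // evaluate LE at particle locations
    void direct(const Box&, Box&) {}     // near-field direct summation

    // Geometric queries, stubbed here; real code derives them from box indices.
    std::vector<std::size_t> interactionList(int, std::size_t) { return {}; }
    std::vector<std::size_t> nearNeighbors(int, std::size_t) { return {}; }  // includes the box itself

    void fmm(std::vector<Level>& tree) {
      if (tree.empty()) return;
      const int L = static_cast<int>(tree.size()) - 1;   // leaf level

      // Upward sweep: P2M at the leaves, then M2M towards the root.
      for (Box& b : tree[L]) p2m(b);
      for (int l = L - 1; l >= 0; --l)
        for (std::size_t b = 0; b < tree[l].size(); ++b)
          for (std::size_t c = 4 * b; c < 4 * b + 4; ++c)
            m2m(tree[l + 1][c], tree[l][b]);

      // Downward sweep: M2L over each box's interaction list, then L2L to children.
      for (int l = 2; l <= L; ++l)
        for (std::size_t b = 0; b < tree[l].size(); ++b) {
          for (std::size_t s : interactionList(l, b)) m2l(tree[l][s], tree[l][b]);
          if (l < L)
            for (std::size_t c = 4 * b; c < 4 * b + 4; ++c)
              l2l(tree[l][b], tree[l + 1][c]);
        }

      // Evaluation: far field from each leaf's LE, near field by direct summation.
      for (std::size_t b = 0; b < tree[L].size(); ++b) {
        l2p(tree[L][b]);
        for (std::size_t n : nearNeighbors(L, b)) direct(tree[L][n], tree[L][b]);
      }
    }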
3 Client application: the vortex particle method

The fast multipole method has many applications, such as calculation of the gravitational field of many stellar bodies, or the electrostatic forces obtained from many charged particles. In fluid dynamics, one application is found in the calculation of the velocity induced by many vortex particles, which in turn is used as a method of simulation for either the Euler or Navier-Stokes equations. The vortex particle method starts by discretizing the continuous vorticity field, defined as the curl of the velocity (ω = ∇ × u), over a set of moving nodes located at x_i, as follows:

    \omega(x, t) \approx \omega_\sigma(x, t) = \sum_{i=1}^{N} \gamma_i \, \zeta_\sigma(x, x_i),    (4)

where a common choice for the basis function is a normalized Gaussian such as:

    \zeta_\sigma(x, y) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{|x - y|^2}{2\sigma^2} \right).    (5)

The discretized vorticity field in (4) is used in conjunction with the vorticity transport equation. For ideal flows in two dimensions, the vorticity equation simply expresses that the vorticity is a preserved quantity over particle trajectories:

    \frac{\partial \omega}{\partial t} + u \cdot \nabla \omega = \frac{D\omega}{Dt} = 0.    (6)

Therefore, the moving nodes can translate with their local velocity and carry their vorticity value to automatically satisfy the transport equation. The only missing ingredient is obtaining the velocity from the discretized vorticity, which is accomplished using the Biot-Savart law of vorticity dynamics:

    u(x, t) = \int (\nabla \times G)(x - x')\, \omega(x', t)\, dx' = \int K(x - x')\, \omega(x', t)\, dx' = (K \ast \omega)(x, t),

where K = ∇ × G is the Biot-Savart kernel, with G the Green's function for the Poisson equation, and ∗ representing convolution. For example, in 2D the Biot-Savart law is written explicitly as

    u(x, t) = -\frac{1}{2\pi} \int \frac{(x - x') \times \omega(x', t)\,\hat{k}}{|x - x'|^2}\, dx'.    (7)

When the vorticity is expressed as a radial basis function expansion, one can always find an analytic integral for the Biot-Savart velocity, resulting in an expression for the velocity at each node which is a sum over all particles. Using the Gaussian basis function (5), we have:

    K_\sigma(x) = \frac{1}{2\pi |x|^2}\, (-x_2, x_1) \left( 1 - \exp\left( -\frac{|x|^2}{2\sigma^2} \right) \right),    (8)

where |x|^2 = x_1^2 + x_2^2. Thus, the formula for the discrete Biot-Savart law in two dimensions gives the velocity as follows:

    u_\sigma(x, t) = \sum_{j=1}^{N} \gamma_j K_\sigma(x - x_j).    (9)

Therefore, the calculation of the velocity of N vortex particles is an N-body problem, where the kernel K(x) decays with distance, which makes it a candidate for acceleration using the FMM. Also note that as |x| becomes large, the kernel K(x) approaches 1/|x|. We take advantage of this fact to use the multipole expansions of the 1/|x| kernel as an approximation, while the near-field direct interactions are obtained with the exact kernel K. In a separate work, we have investigated the errors of the FMM for the vortex method, and have demonstrated that using the expansions for 1/|x| does not impact accuracy, as long as the local interaction boxes are not too small [8].
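As a point of reference for this test problem, the non-accelerated O(N²) evaluation of (9) with the regularized kernel (8) can be sketched in a few lines of C++. This is an illustrative stand-in for the direct summation that the FMM replaces; the Particle type and function name are assumptions, not PetFMM code.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Particle { double x, y, gamma; };  // position and circulation strength

    // Direct O(N^2) evaluation of Eq. (9) with the Gaussian-regularized
    // Biot-Savart kernel of Eq. (8); sigma is the (uniform) core size.
    void biotSavartDirect(const std::vector<Particle>& p, double sigma,
                          std::vector<double>& ux, std::vector<double>& uy) {
      const double kPi = 3.141592653589793;
      const std::size_t n = p.size();
      ux.assign(n, 0.0);
      uy.assign(n, 0.0);
      for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
          if (j == i) continue;
          const double dx = p[i].x - p[j].x;
          const double dy = p[i].y - p[j].y;
          const double r2 = dx * dx + dy * dy;
          // K_sigma(x) = (1 / (2 pi |x|^2)) (-x2, x1) (1 - exp(-|x|^2 / (2 sigma^2)))
          const double factor =
              (1.0 - std::exp(-r2 / (2.0 * sigma * sigma))) / (2.0 * kPi * r2);
          ux[i] += p[j].gamma * (-dy) * factor;
          uy[i] += p[j].gamma * ( dx) * factor;
        }
      }
    }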
4 Parallelization strategy

The goal of our parallel strategy is to achieve an optimal distribution of the computational work among processors and a minimal communication requirement. We achieve this goal by means of a two-step process: in the first step, we decompose the complete FMM into basic algorithmic elements; in the second step, we distribute those algorithmic elements across the available processing elements (processors) in an optimal way. By optimal, we mean that when distributing the basic algorithmic elements we optimize their assignment to the processing elements by:

(1) Load balance: the work performed by each processing element is adequate to the processor's capabilities.
(2) Minimization of communications: the total amount of communication between different processing elements is minimized.

An important ingredient is that both steps of the above-mentioned process are carried out automatically by an optimization tool, without intervention of the user, with the objective of using the available resources with maximum efficiency. The load balancing provided by our method is applied dynamically, a priori, before the actual computation.

We are aware of one previous attempt to provide automatic load balancing in an FMM implementation. The DPMTA code [19] (no longer maintained or supported) is a parallel implementation of the FMM algorithm aimed at molecular dynamics applications. In this code, an initial uniform partitioning of the data at the deepest level of the tree is carried out. Then, elapsed wall clock times are obtained for each of the processors for the equal partitioning, and a data migration scheme is applied to counter load imbalance. Experiments were reported in [20] for N = 200,000 on 12 processors, showing for the original equal partition elapsed wall clock times for each processor between 60 and 140 seconds. After load balancing, elapsed times were between 80 and 100 seconds, indicating a significant improvement. This approach, however, requires multiple calculations with the same set of particles; it does not seem to be possible to use this approach in an ever-evolving particle configuration, where each configuration is calculated only once. There was also no theory or computational model of work and communications provided in relation to this code. The experiments cited, however, do provide clear evidence that a straightforward uniform data partition (accomplished using a space-filling curve indexing scheme) can result in considerable load imbalance.

In our implementation, we utilize the tree structure associated with the hierarchical decomposition of the domain in order to decompose the FMM into basic algorithmic elements. The tree structure has many roles: it is used as a
space partitioner for the particles, it organizes the storage for the multipole expansions and local expansions, and it indicates the relations between nodes in the same level of the tree (whether two nodes are in the local domain list or the interaction list). But, most importantly, it can be used to represent the complete algorithm, as mentioned in §2 with the bird's eye view of the algorithm; one can see that in Figure 2 all parts of the algorithm are represented. This view is the basis for our parallelization strategy, as described below.

The sub-division of the whole algorithm occurs by "cutting" the d-dimensional tree at a certain level k, as shown in Figure 3. This procedure produces a root tree, that contains the first k levels of the original tree, and 2^{dk} local trees, each corresponding to one of the lower branches of the original tree. The objective of this partitioning strategy is to obtain more subtrees than the number of available processes, so that the subtrees can be optimally distributed across the processes.

Fig. 3. Illustration of the data usage patterns and partitioning of a binary tree. In this figure, the tree has been cut at level k = 3. All data usage patterns that take place between nodes of the tree (M2M and L2L translations, M2L transformations, local-domain exchange) are illustrated by arrows. The subtrees generated after the partition of the tree are represented by the boxes. Communications between subtrees are generated by data usage patterns that take place between two different subtrees.

In the bird's eye view of the whole algorithm, Figure 2, data and computations are related to nodes of the tree structure. When partitioning the tree representation into subtrees, computations that require data from different nodes of the partitioned tree might access data from several different partitions. If this is the case, communication between partitions will happen, as illustrated in Figure 3. By relating computations to nodes of the tree, the work carried out by each partition and the communication between different partitions can be estimated, which is then used to optimally distribute the partitions over available processors.

In order to assign subtrees to processors, we build a graph representation from the partitioned tree, as illustrated in Figure 4. The graph is assembled such that the vertices of the graph correspond to the subtrees and the edges correspond to communication between subtrees. Using the graph representation, we can assign weights to the vertices which are proportional to the amount of computational work performed by each subtree, and assign weights to the edges which are proportional to the amount of communication that happens between two subtrees.

Fig. 4. Sketch of an undirected graph representation of the original hierarchical tree. The w_i weights represent the work performed by subtree i and the c_ij weights represent the communication between subtrees i and j. The graph is then partitioned into a number of parts equal to the available processes; the red lines in the figure show a possible partitioning of the graph. The partitions are optimized so that the weights w_i assigned to a partition are balanced with respect to other partitions, and the cost of the cut edges is minimal.

The load balancing in the parallel algorithm is done by partitioning the weighted graph into as many parts as the number of available processors, and then assigning the subtrees to the processors according to the graph partitions.
The problem of obtaining partitions such that they are well balanced and minimize the communication is solved by a graph partitioning tool such as ParMETIS [14]. ParMETIS is an open source graph partitioning tool, and is used by PetFMM to create near-optimal partitions. Figure 5 demonstrates the load balancing scheme at work. In an example computation, particles have been placed inside a square-shaped domain that is then hierarchically decomposed into a tree representation. The tree is then cut at level k = 4, resulting in 256 parallel subtrees that are then distributed among 16 processors. In the next section we develop the model of the parallel algorithm, obtaining the necessary weights for computational work and communication.

Fig. 5. Illustration of the outcome of the automatic load balancing for a uniform spatial distribution of particles in a square domain. The cells represent the subtrees obtained after the partitioning of the FMM tree. The cells have been labeled and colored to represent the partition to which they have been assigned after the load balancing. In total, 256 subtrees have been distributed among 16 partitions.
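Before developing those weight models, it is worth showing concretely what such a weighted-graph partitioning call looks like. The sketch below uses ParMETIS_V3_PartKway on a toy distributed graph; the graph, the weight values and all names are placeholders chosen for illustration and do not reproduce PetFMM's actual partitioning code.

    #include <mpi.h>
    #include <parmetis.h>
    #include <vector>

    // Toy example: each rank owns `nlocal` subtree-vertices of a global ring
    // graph. Vertex weights stand in for the work estimates, edge weights for
    // the communication estimates developed in the next section.
    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      MPI_Comm comm = MPI_COMM_WORLD;
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      const idx_t nlocal = 4;
      const idx_t nglobal = nlocal * size;

      // vtxdist[r] .. vtxdist[r+1]-1 are the global vertices owned by rank r.
      std::vector<idx_t> vtxdist(size + 1);
      for (int r = 0; r <= size; ++r) vtxdist[r] = r * nlocal;

      // Local CSR adjacency: vertex v is connected to v-1 and v+1 (mod nglobal).
      std::vector<idx_t> xadj(nlocal + 1), adjncy, vwgt(nlocal), adjwgt;
      xadj[0] = 0;
      for (idx_t v = 0; v < nlocal; ++v) {
        const idx_t g = vtxdist[rank] + v;
        adjncy.push_back((g + nglobal - 1) % nglobal);
        adjncy.push_back((g + 1) % nglobal);
        adjwgt.push_back(1);                  // communication estimate (placeholder)
        adjwgt.push_back(1);
        vwgt[v] = 1 + g % 3;                  // work estimate (placeholder)
        xadj[v + 1] = static_cast<idx_t>(adjncy.size());
      }

      idx_t wgtflag = 3;                      // weights on both vertices and edges
      idx_t numflag = 0;                      // C-style numbering
      idx_t ncon = 1;                         // one balance constraint
      idx_t nparts = size;                    // one partition per process
      std::vector<real_t> tpwgts(ncon * nparts, static_cast<real_t>(1.0) / nparts);
      std::vector<real_t> ubvec(ncon, static_cast<real_t>(1.05));  // 5% imbalance tolerance
      idx_t options[3] = {0, 0, 0};           // default options
      idx_t edgecut = 0;
      std::vector<idx_t> part(nlocal);

      ParMETIS_V3_PartKway(vtxdist.data(), xadj.data(), adjncy.data(), vwgt.data(),
                           adjwgt.data(), &wgtflag, &numflag, &ncon, &nparts,
                           tpwgts.data(), ubvec.data(), options, &edgecut,
                           part.data(), &comm);

      // part[v] now holds the process assigned to local vertex v.
      MPI_Finalize();
      return 0;
    }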
5 Model of the parallel algorithm

The seminal analysis of the parallel complexity of the FMM was given by Greengard and Gropp in 1990 [10]. They develop a model for the running time T in terms of the number of particles N, the number of processors P, and the number of boxes at the finest level B:

    T = a\,\frac{N}{P} + b \log P + c\,\frac{N}{BP} + d\,\frac{NB}{P} + e(N, P).    (10)

They account here for perfectly parallel work, represented by a, such as the multipole expansion initialization and local expansion evaluation; the reduction bottleneck, b, such as that associated to multipole-to-multipole translations; multipole-to-local transformations and translations, c; direct interactions, d; and lower order terms, e.

In some respects, however, the analysis in [10] was limited. All the specific cases analyzed assumed a uniform distribution of particles at the finest level. As a consequence, no strategy is offered for dealing with non-uniform distributions, and computational imbalance is not addressed. The volume of communication for a certain partition of work is also not estimated in this model. Below, we extend the model by giving estimates for both communication volume and computational load. The new, extended model will allow us to generate an unstructured data distribution which is optimal in terms of load balance and communication overhead.

5.1 Estimation of communications

In Section 4, we discussed the parallelization strategy, where the tree representation of the FMM algorithm is decomposed into a set of subtrees. We now discuss in more detail the data communication that takes place between subtrees. Figure 3 presents a simplified illustration of the communications, showing all the data usage patterns across different parts of the algorithm. The illustration helps us in the identification of three types of communications that take place between two subtrees: multipole-to-multipole translations (M2M), multipole-to-local transformations (M2L), and local-to-local translations (L2L). Figure 4 is an equivalent graph representation of the partitioned tree from Figure 3, translating graphically the hierarchical tree into a set of vertices and edges: the subtrees are represented as vertices, and the communications between subtrees correspond to the edges of the graph.

In order to determine the edge weights of the graph, it is necessary to identify all the communications required between nodes of the FMM tree associated to different vertices of the equivalent graph. Communications between subtrees at different stages of the FMM algorithm can be classified into two types: communication of particles in the local domain, and communication of expansion terms. This last type occurs at M2M and L2L translations, and M2L transformations. A complete picture of the communication patterns for a binary tree can be seen in Figure 3. We note that the communication patterns for the M2M and L2L translations occur only from subtrees to the root tree and vice versa, while no communication between subtrees takes place for these operations.

Quantification of the amount of communication between each pair of vertices is required. For each communication pattern, we define a communication matrix, where each element c_ij of the matrix indicates the amount of communication from vertex i to vertex j. We now consider the two-dimensional case, where a quadtree is used. By studying the communication patterns between nodes in the quadtree, we arrive at estimates of the communications between subtrees of the truncated tree.
Thus, we produce the following estimate for the amount of communication between two laterally neighboring subtrees of a quadtree:

    \sum_{n=k+1}^{L} \alpha_{comm}\, 2^{\,n-k},    (11)

where L is the depth of the FMM tree, k is the cut level of the quadtree, and \alpha_{comm} is a constant depending on the expansion order p and the size of the floating point numbers used. In the same way, we obtain the following estimate for the amount of communication between two diagonally neighboring subtrees:

    \alpha_{comm}\, (L - k).    (12)

A z-order numbering of the nodes is used to discover the neighbor sets for every vertex of the graph without any communication between processes. All communication occurs between neighboring local trees, so we can fill the communication matrix as follows:

    for all nodes j at level k:
        for all nodes i from the neighbor set of j:
            if i is a lateral box:
                c[i][j] += lateral node estimate
            else:
                c[i][j] += diagonal node estimate
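A hedged C++ rendering of this procedure for a quadtree cut at level k is sketched below: the 2^k by 2^k subtrees are visited on a grid, lateral and diagonal neighbors are distinguished, and the matrix is filled with the estimates (11) and (12) as reconstructed above. Function and variable names are illustrative only.

    #include <cmath>
    #include <vector>

    // Estimate of Eq. (11): communication between laterally adjacent subtrees,
    // summed over the levels below the cut level k of a tree of depth L.
    double lateralEstimate(int L, int k, double alphaComm) {
      double total = 0.0;
      for (int n = k + 1; n <= L; ++n) total += alphaComm * std::pow(2.0, n - k);
      return total;
    }

    // Estimate of Eq. (12): diagonally adjacent subtrees share only a corner
    // box per level below the cut.
    double diagonalEstimate(int L, int k, double alphaComm) {
      return alphaComm * (L - k);
    }

    // Fill the communication matrix for the 2^k x 2^k subtrees of a quadtree.
    std::vector<std::vector<double>> communicationMatrix(int L, int k,
                                                         double alphaComm) {
      const int side = 1 << k;                  // 2^k subtrees in each direction
      const int n = side * side;
      std::vector<std::vector<double>> c(n, std::vector<double>(n, 0.0));
      for (int i = 0; i < side; ++i)
        for (int j = 0; j < side; ++j)
          for (int di = -1; di <= 1; ++di)
            for (int dj = -1; dj <= 1; ++dj) {
              if (di == 0 && dj == 0) continue;
              const int ni = i + di, nj = j + dj;
              if (ni < 0 || ni >= side || nj < 0 || nj >= side) continue;
              const int a = i * side + j, b = ni * side + nj;
              c[a][b] += (di == 0 || dj == 0)
                             ? lateralEstimate(L, k, alphaComm)    // shares an edge
                             : diagonalEstimate(L, k, alphaComm);  // shares a corner
            }
      return c;
    }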
5.2 Estimation of computational work

In Section 2, we decomposed the FMM algorithm into three stages: the upward sweep, the downward sweep, and the evaluation step. In this section we discuss work estimates following the same decomposition.

In the upward sweep, the MEs for all the nodes of the FMM tree are built. It takes O(N_i p) operations to build the ME for a node that is in the maximum level of refinement (the leaf level of the FMM tree), where N_i corresponds to the number of particles associated to box i, and p is the number of expansion terms retained. To obtain the MEs for the upper levels of the FMM tree, each node of the upper levels translates into its center the MEs of its child nodes. The translation of a single ME has cost O(p²), and if a node has n_c child boxes the work performed by a node is O(n_c p²).

In the downward sweep, the LEs that represent the complete far field of a leaf node are built. For each local cluster, LEs are built for all clusters in its interaction list. If the interaction list (IL) has n_IL members and the cost to transform an ME into an LE is O(p²), then, for each node of the FMM tree, the total work to obtain the LEs for its interaction list is O(p² n_IL). After all the MEs are converted into LEs, the LEs are propagated from the parent box to the child boxes. For each non-leaf node of the FMM tree, the LE is translated from a node's center into its n_c child boxes. The work to translate a single LE is O(p²), thus the work performed by a node is O(n_c p²).

In the evaluation step, the complete field is obtained for all the particles of the system, where the near field is computed from the local direct interactions and the far field is obtained from the LEs. The near domain is given by the leaf node that contains the particle. If, for a box, the near domain is given by n_nd nodes, each with N_i particles, the cost of computing all pairwise interactions for a leaf node is O(n_nd N_i²). The far field of a particle is obtained from the evaluation of the LE of the box that contains the particle. If a box has N_i particles, the cost of evaluating the far field for all the N_i particles is O(N_i p), where p is the number of terms in the LE.

In summary, the amount of work done by a non-leaf node of the FMM tree is:

    O\!\left( p^2 (2 n_c + n_{IL}) \right),    (13)

and the amount of work performed by a leaf node is:

    O\!\left( 2 N_i p + p^2 n_{IL} + n_{nd} N_i^2 \right),    (14)

where n_c corresponds to the number of children of a box, p is the number of expansion coefficients retained, n_IL is the estimate of the number of members in the interaction list, and N_i is the number of particles in box i.

When partitioning the FMM tree, T subtrees are created, each of which has the same number of levels L_st. The estimate for the total amount of work performed by each subtree depends on its number of nodes and its number of particles at the leaf level.
By means of equations (13) and (14), we can estimate the amount of work performed by any subtree and use this as the weight for the corresponding vertex of the graph:

    \left( \sum_{l=0}^{L_{st}-2} 2^{dl}\, p^2 (2 n_c + n_{IL}) \right) + 2^{d(L_{st}-1)} \left( 2 N_i p + p^2 n_{IL} + n_{nd} N_i^2 \right).    (15)

5.3 Estimation of memory usage

We will first estimate memory usage in the serial case, and then extend these results to the parallel code, since most of the computational structures are preserved. The quadtree structure stores data of three types: O(1) storage, such as sizes; data across boxes of the finest level; and data across all nodes of the tree. Over the entire tree, we store the box centers, the interaction list and associated values, and three expansion coefficients. We store local and neighbor particles only at the finest level of the tree. Let d be the space dimension, L be the maximum level, p the number of expansion terms, N the number of particles, B = 28 the size of a particle in bytes, s the maximum number of particles per box, and let Λ be the total number of boxes in the tree, given by

    \Lambda = \sum_{l=0}^{L} 2^{dl} = \frac{2^{d(L+1)} - 1}{2^d - 1}.

The maximum memory usage is given in Table 1. Notice that even for very nonuniform distributions, the memory usage for neighbor particles is bounded above by a constant multiplied by the total number of particles. Thus the memory usage is linear in the number of boxes at the finest level and the number of particles.

Table 1. Memory usage for serial quadtree structures.

    Type                     Bookkeeping memory (bytes)    Data memory (bytes)
    Box centers              0                             8dΛ
    Interaction boxes        (2 ∗ …)Λ                      (… d + 16p)Λ
    Multipole coefficients   0                             16pΛ
    Temporary coefficients   0                             16pΛ
    Local coefficients       0                             16pΛ
    Local particles          (2 ∗ …)                       BN
    Neighbor particles       (2 ∗ …)                       Bs 2^{dL}

In the parallel case, we reuse our serial data structures completely, and thus we only need to estimate the memory usage from explicitly parallel constructs. We maintain both a partition of boxes, and its inverse. We also have overlap [15] structures, essentially box-to-box maps between processes, for both neighbor particles and interaction list values. Let P be the number of processes, N_lt the maximum number of local trees, N_bd the maximum number of boundary boxes for a process, and A = 108 the size of an arrow in the overlap structure. The maximum memory usage per process is given in Table 2. The neighbor overlap structure is bounded by the total number of particles exchanged, whereas the interaction list overlap is bounded by the maximum cut size, a constant multiplied by N_lt.

Table 2. Memory usage for parallel quadtree structures.

    Type                       Bookkeeping memory (bytes)    Data memory (bytes)
    Partition                  (2 ∗ …)P                      N_lt
    Inverse partition          0                             4 N_lt
    Neighbor send overlap      N/A                           N_bd s A
    Neighbor recv overlap      N/A                           N_bd s A
    Interaction send overlap   N/A                           27 N_bd A
    Interaction recv overlap   N/A                           27 N_bd A

Future improvements may include memory savings through computing values dynamically. For instance, we need not store the interaction list boxes, as they can be generated from the global box number. Also, for the case of uniform spacing, the box centers can be determined on the fly. Neighboring particles, except for those from other processes, could be retrieved from the local particle storage. The inverse partition of boxes could be calculated on the fly, albeit with communication.
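As an illustration, the work estimates (13)-(15) translate directly into a small routine that could produce the vertex weights of the graph; the names and the floating-point scaling below are assumptions for illustration, not PetFMM internals.

    #include <cmath>

    // Work estimate of Eq. (13) for a non-leaf node and Eq. (14) for a leaf node.
    double nonLeafWork(int p, int nc, int nIL) {
      return static_cast<double>(p) * p * (2 * nc + nIL);
    }
    double leafWork(int p, int nIL, int nnd, double Ni) {
      return 2.0 * Ni * p + static_cast<double>(p) * p * nIL + nnd * Ni * Ni;
    }

    // Eq. (15): total estimate for a subtree with Lst levels in d dimensions,
    // assuming Ni particles per leaf box; used as the vertex weight.
    double subtreeWork(int Lst, int d, int p, int nc, int nIL, int nnd, double Ni) {
      double w = 0.0;
      for (int l = 0; l <= Lst - 2; ++l)
        w += std::pow(2.0, d * l) * nonLeafWork(p, nc, nIL);
      w += std::pow(2.0, d * (Lst - 1)) * leafWork(p, nIL, nnd, Ni);
      return w;
    }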
6 Software design and implementation

The PetFMM library was designed to offer both high serial performance and scalability, but also to be easily integrated into existing codes and maintained. The serial code is completely reused in the parallel setting, so that we are never required to maintain two versions of the same algorithm. PetFMM leverages existing packages to keep its own code base small and clear. Parallel data movement is handled by the Sieve package [15], while load and communication are balanced using ParMETIS, as described in the previous sections.

The problem geometry is encapsulated in the Quadtree class. All relations, such as neighbors and interaction lists, can be dynamically generated, so that we need only store data across the cells. Data is stored in Sieve Section objects, which allow easy communication across unstructured boundaries with arbitrary layouts. The ParallelQuadtree class presents an identical interface, but manages a set of local Quadtrees and a root Quadtree to mimic these methods.

All serial computation is encapsulated in the Evaluator class, which operates over a Quadtree. It is templated over a Kernel object, which produces the required expansions and translation operators, so that we can easily replace one equation with another. The ParallelEvaluator, which inherits from Evaluator, overrides some methods in order to operate over a ParallelQuadtree. All computation in the parallel code is still done by serial structures, and only data partitioning and movement is accomplished by new code, mainly derived from the Sieve library.
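The kernel-templating design described above can be sketched as follows; the member functions and the LaplaceKernel2D type are hypothetical illustrations of the pattern, not the actual PetFMM class interfaces.

    #include <complex>
    #include <vector>

    // A hypothetical kernel supplying expansions and translation operators.
    // Swapping the equation being solved means swapping this one type.
    struct LaplaceKernel2D {
      using Coeff = std::complex<double>;
      static constexpr int terms = 17;                    // truncation order p
      // Real code implements the series here; stubs keep the sketch short.
      static void particleToMultipole(const std::vector<double>& /*x*/,
                                      std::vector<Coeff>& /*me*/) {}
      static void multipoleToLocal(const std::vector<Coeff>& /*me*/,
                                   std::vector<Coeff>& /*le*/) {}
    };

    // The evaluator is templated over the kernel, so the traversal logic is
    // written once and reused for any kernel.
    template <class Kernel>
    class Evaluator {
     public:
      void upwardSweep()   { /* P2M at the leaves, then M2M towards the root */ }
      void downwardSweep() { /* M2L over interaction lists, then L2L */ }
      void evaluate()      { /* L2P plus near-field direct interactions */ }
    };

    // Usage: Evaluator<LaplaceKernel2D> eval; eval.upwardSweep(); ...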
From an initial, serial Python implementation of the FMM algorithm (available for free download at http://code.google.com/p/pyfmm/), we produced a parallel C++ version in little more than two weeks (with subsequent debugging and verification over several more weeks). The PETSc Sieve framework [15] greatly aided parallelization. We were also careful at each stage to introduce unit testing and verify our results against those of the well-studied Python code [8]. We developed a file format for result verification, consisting of:

- the number of levels, terms, particles, and tree coordinates;
- particle assignment to quadtree cells;
- center, number of particles, children, and neighbors for each box;
- interaction list and coefficients for each box;
- the direct and FMM solutions.

All domain boxes were labeled with global numbers in the output, which enables the parallel code to be compared with the serial implementations. Moreover, the box output may come in any order, making parallel output trivial. A small Python script then compares two output files, noting any discrepancies. In this way we compared the initial Python code with the C++ version, both serial and parallel, and even among parallel runs with different numbers of processes. Confidence in our results enabled us to make rapid changes, introducing new algorithms such as the improved partitioning scheme without lengthy debugging.

7 Computational experiments with the parallel software
The results presented here were obtained on the BlueCrystal-I system at the Advanced Computing Research Centre, University of Bristol. The BlueCrystal-I system is composed of 416 nodes, each with two quad-core Intel Harpertown processors and 8 GB of RAM. The system is interconnected with a QLogic InfiniPath high-speed network.

As an experimental setup, we tested the strong scaling capabilities of PetFMM up to 64 nodes, with only one process per node in order to concentrate on network communications. The test case corresponds to the use of the FMM in the context of an application of the vortex method to a viscous flow problem; see §3. We use the Lamb-Oseen vortex, a known analytical solution of the Navier-Stokes equations, to initialize the strength (representing the amount of circulation γ_i) of each of the particles of the system, as in [4]. The core sizes σ of the particles are uniform for all the particles; the inter-particle spacing h is obtained from the relation h/σ = 0.8, as in [4].

The analytical solution for the vorticity field of the Lamb-Oseen vortex is:

    \omega(r, t) = \frac{\Gamma}{4\pi\nu t}\, e^{-r^2/(4\nu t)},    (16)

where r² = x² + y², Γ represents the total circulation of the vortex, ν is the viscosity of the fluid, and t is time. The velocity field, in turn, is given by:

    u(r, t) = \frac{\Gamma}{2\pi r} \left( 1 - e^{-r^2/(4\nu t)} \right).    (17)

The analytical solution of the velocity field for the Lamb-Oseen vortex is used to compare against the results obtained when solving for the Biot-Savart velocity, accelerated via the FMM. Extensive studies of the accuracy of the FMM for this problem, with respect to the algorithm parameters, were reported in [8]. Many of the experiments reported there were repeated with PetFMM, as part of the verification stage of the parallel program. For the strong scaling experiments reported below, we designed a problem setup with parameters which were appropriate to test the parallel performance and strain communications, but which may not be optimal from the point of view of accuracy of the computed field. In particular, the scaling study used many levels in the tree, which introduces errors of Type I, as reported in [8], related to kernel substitution.
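For verification purposes, the analytical velocity (17) is trivial to evaluate; a small helper of the following form (not part of PetFMM) can be used to compare against the computed Biot-Savart velocities.

    #include <cmath>

    // Tangential velocity of the Lamb-Oseen vortex, Eq. (17), at radius r and
    // time t, for total circulation Gamma and kinematic viscosity nu.
    double lambOseenVelocity(double r, double t, double Gamma, double nu) {
      const double kPi = 3.141592653589793;
      return Gamma / (2.0 * kPi * r) * (1.0 - std::exp(-r * r / (4.0 * nu * t)));
    }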
Fig. 6. Measured times for experimental results obtained with PetFMM, presented on a base-10 logarithmic scale, for an increasing number of processors, presented on a base-2 logarithmic scale. The timings for the total execution of PetFMM (labeled Total time) and for the more important stages of the algorithm are shown.
We now present the results of numerical experiments with PetFMM. In order to present the scalability of PetFMM, we report strong scaling results, i.e., we compare how the total execution time of PetFMM varies with the number of processors for a fixed problem size. The same set of parameters was used for all the experimental results: N ≈ 765,000 particles and p = 17 expansion terms. Therefore, the total work remains the same, but we vary the number of processors used, up to P = 64. Figure 6 shows the measured times obtained with PetFMM for an increasing number of processors. Measured times for the more important stages of the FMM algorithm are also given, effectively depicting the fraction of time consumed by each stage of the algorithm.

For the analysis of the performance of PetFMM, we present the speedup, the parallel efficiency, and a load balancing metric. For a definition of speedup (S) we use
Fig. 7. Speedup of PetFMM for an increasing number of processors. The speedup and the number of processors are presented on a base-2 logarithmic scale. The deviation of PetFMM from perfect speedup indicates that there are still some serial stages within the algorithm, as well as communication overheads.
Fig. 8. Parallel efficiency of p et fmm for an increasing number of processors. Hereonly the relevant range of efficiency is plotted against the number of processors ina base 2 logarithm scale. perfectly with the serial implementation; this is equivalent to perfect speedup,but efficiency is a more revealing metric. Both of these performance metricsassume that the instructions executed by the parallel application are the sameas the instructions executed by the serial application. Of course, this is not thecase. In general, the differences between serial and parallel implementationsare due to: different workloads between the serial and parallel processes, andthe existence in the parallel implementation of data communication amongprocesses and synchronization of processes. As a consequence of these differ-ences, perfect speedup or perfect efficiency ( E = 1) are generally not obtained.In fact, as seen in Figure 8, it can happen that an efficiency value greater thanunity is obtained, which reveals the shortcomings of the metric. Figure 8 showsthe parallel efficiency of p et fmm for the main stages of the fmm algorithm.Figure 9 shows a load balance metric, which we define as LB ( P ) = minimum execution time for one of P processorsmaximum execution time for one of P processors (20)A load balance metric of unity would indicate equal execution times amongall processors in a parallel run, and thus perfect load balancing. Figure 9 alsoincludes the total parallel efficiency, which follows a similar trend, indicating24 Fig. 9. Load balance metric of p et fmm for an increasing number of processors. Hereonly the relevant range of efficiency is plotted against the number of processors in abase 2 logarithm scale. The load balance metric is compared with the total efficiencyof the runs. that parallel overheads and communications overheads are correlated witha degradation of the load balance metric. Nevertheless, it can be seen thatprocessor execution times are within 5% of each other for 32 processors andwithin 7% of each other for 64 processors, demonstrating the success of ourload balancing strategy.The largest problem size that has been run with p et fmm at this point solvedfor 64 million particles on 64 processors. The total execution time was 115 . .
The largest problem size that has been run with PetFMM at this point solved for 64 million particles on 64 processors. The total execution time was 115 seconds, with a memory footprint of roughly 1 GB per processor. PetFMM is currently able to achieve over 90% parallel efficiency for 32 processors, and over 85% parallel efficiency for 64 processors.

8 Conclusions

We have presented an extension of the Greengard-Gropp parallel complexity analysis for the FMM which is suitable for distributed computing. We thus consider partition quality in terms of both communication volume and load balance, absent from the original analysis, in order to derive an optimally distributed algorithm. This algorithm can maintain good performance with highly non-uniform particle distributions, heterogeneous computing environments, and low bandwidth connections.

We have also delivered an open source, extensible implementation of this algorithm, intended as a community resource. The software library is truly open source, and distributed under the PETSc license, even less restrictive than the BSD license. This guarantees the maximum impact to the scientific community and encourages peer-based collaboration for extensions and applications.

Our original approach to parallelization relies on recasting the hierarchical space division, represented by a tree structure, into a weighted graph generated from a model of the computational work and communication estimates. The optimal partitioning of the graph produces dynamic load balancing. Experiments showed that processor execution times were within 5% of each other for 32 processors and within 7% of each other for 64 processors, demonstrating the success of our load balancing strategy. The parallelization strategy permitted almost complete reuse of the serial implementation and, due to its simplicity, was effected with very little additional code. It has shown very good strong scaling on small clusters. The library is currently able to achieve over 90% parallel efficiency for 32 processors, and over 85% parallel efficiency for 64 processors. The largest problem size that has been run with the PetFMM library at this point was 64 million particles on 64 processors.

At present, we are working on exposing more concurrency in the algorithm to exploit multi-core architectures; our real goal here is to extend these results to large clusters, and also to heterogeneous systems, including programmable graphics processing units (GPUs). With respect to delivering scalability to larger numbers of processors, we are investigating a strategy based on a recursive tree-cutting approach, where the methodology described here can generate large subtrees which are further cut to obtain sub-subtrees.

In terms of the extensibility of PetFMM, the first and most straightforward extension is to introduce 3D capability. Our methodology applies without modification to 3D, as our tree and particle objects are templated over the dimension. Moreover, the parallel model involving work and communication estimates, and the setup for optimization of the partition, all apply to 3D. The main modification here involves the implementation of a new kernel, incorporating the 3D series expansions and translation/transformation operators.

Future extensions of PetFMM include the implementation of different kernels, for example the Lennard-Jones potential for molecular dynamics applications, and we expect users to find PetFMM a good starting point for their scientific applications. To encourage peer-based collaboration, we are working to deliver an official release by the end of the year, consisting of a User's Manual, example code and verification methods.

Acknowledgments
Computing time was provided by the Advanced Computing Research Centre, University of Bristol. FAC acknowledges financial support from Airbus and BAE Systems under contract ACAD 01478. LAB acknowledges partial support from EPSRC under grant contract EP/E033083/1, and from Boston University College of Engineering.
References

[1] Christopher R. Anderson. An implementation of the fast multipole method without multipoles. SIAM J. Sci. Stat. Comput., 13(4):923-947, 1992.
[2] Andrew W. Appel. An efficient program for many-body simulation. SIAM J. Sci. Stat. Comput., 6(1):85-103, 1985.
[3] S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, M. Knepley, L. Curfman-McInnes, B. F. Smith, and H. Zhang. PETSc User's Manual. Technical Report ANL-95/11 - Revision 2.1.5, Argonne National Laboratory, 2002.
[4] L. A. Barba, A. Leonard, and C. B. Allen. Advances in viscous vortex methods - meshless spatial adaption based on radial basis function interpolation. Int. J. Num. Meth. Fluids, 47(5):387-421, 2005.
[5] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446-449, December 1986.
[6] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J. Comp. Chem., 4:187-217, 1983.
[7] J. Carrier, L. Greengard, and V. Rokhlin. A fast adaptive multipole algorithm for particle simulations. SIAM J. Sci. Stat. Comput., 9(4):669-686, 1988.
[8] Felipe A. Cruz and L. A. Barba. Characterization of the accuracy of the fast multipole method in particle simulations. Int. J. Num. Meth. Eng., 2009. Published online May 5, 2009.
[9] Z. Gimbutas and V. Rokhlin. A generalized fast multipole method for nonoscillatory kernels. SIAM J. Sci. Comput., 24(3):796-817, 2002.
[10] L. Greengard and W. D. Gropp. A parallel version of the fast multipole method. Comp. Math. Appl., 20(7):63-71, 1990.
[11] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comput. Phys., 73(2):325-348, 1987.
[12] Pascal Havé. A parallel implementation of the fast multipole method for Maxwell's equations. Int. J. Numer. Meth. Fluids, 43(8):839-864, 2003.
[13] R. W. Hockney and J. W. Eastwood. Computer Simulation Using Particles. McGraw-Hill, New York, 1981.
[14] George Karypis and Vipin Kumar. A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J. Par. Dist. Comp., 48:71-85, 1998.
[15] Matthew G. Knepley and Dmitry A. Karpeev. Mesh algorithms for PDE with Sieve I: Mesh distribution. Sci. Prog., 2009. To appear.
[16] J. Kurzak and B. M. Pettitt. Massively parallel implementation of a fast multipole method for distributed memory machines. J. Par. Dist. Comp., 65(7):870-881, 2005.
[17] Shuji Ogata, Timothy J. Campbell, Rajiv K. Kalia, Aiichiro Nakano, Priya Vashishta, and Satyavani Vemparala. Scalable and portable implementation of the fast multipole method on parallel computers. Comp. Phys. Comm., 153(3):445-461, 2003.
[18] James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel, Laxmikant Kale, and Klaus Schulten. Scalable molecular dynamics with NAMD. J. Comp. Chem., 26:1781-1802, 2005.
[19] W. T. Rankin. DPMTA: distributed parallel multipole tree algorithm. http://people.ee.duke.edu/~wrankin/Dpmta/, checked May 2009.
[20] W. T. Rankin. Efficient parallel implementations of multipole-based N-body algorithms. PhD thesis, Department of Electrical and Computer Engineering, Duke University, 1999.
[21] R. D. Skeel, I. Tezcan, and D. J. Hardy. Multiple grid methods for classical molecular dynamics. J. Comp. Chem., 23(6):673-684, 2002.
[22] Michael S. Warren and John K. Salmon. A parallel hashed oct-tree N-body algorithm. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, pages 12-21, New York, 1993. ACM.
[23] Lexing Ying, George Biros, and Denis Zorin. A kernel-independent adaptive fast multipole algorithm in two and three dimensions. J. Comput. Phys., 196(2):591-626, 2004.
[24] Lexing Ying, George Biros, Denis Zorin, and Harper Langston. A new parallel kernel-independent fast multipole method. In Supercomputing, 2003 ACM/IEEE Conference, page 14. IEEE Computer Society, 2003.