Implementation of a Parallel Tree Method on a GPU
Naohito Nakasato
Department of Computer Science and Engineering, University of Aizu, Aizu-Wakamatsu, Fukushima 965-8580, Japan
Email: [email protected]
Abstract
The kd-tree is a fundamental tool in computer science. Among other applications, kd-tree search (by the tree method) for the fast evaluation of particle interactions and for neighbor search is highly important, since it reduces the computational complexity of these problems from O(N^2) for a brute force method to O(N log N), where N is the number of particles. In this paper, we present a parallel implementation of the tree method running on a graphics processing unit (GPU). We give a detailed description of how we have implemented the tree method on a Cypress GPU. An optimization that we found important is a localized particle ordering that effectively utilizes cache memory. We present a number of test results and performance measurements, which show that executing the tree traversal in a force calculation on a GPU is practical and efficient.
1. Introduction
A technique for gravitational many-body simulations is a fundamental tool in astrophysical simulations because the gravitational force drives structure formation in the universe. The length scales that arise in structure formation range from less than 1 cm, for the aggregation of dust, to cosmological scales, for the formation of large-scale structure. At all scales, gravity is a key physical process for the understanding of structure formation, owing to its long-range nature.

Suppose we simulate structure formation with N particles. The flow of the many-body simulation is as follows. First, we calculate the mutual gravitational forces between the N particles, then integrate the orbits of the N particles, and repeat this process as necessary. Although it is simple in principle, the force calculation is a challenging task from the point of view of computer science. A simple, exact method for the force calculation requires O(N^2) computational complexity, which is prohibitively compute-intensive for large N. An exact force calculation is necessary in some types of simulations, such as few-body problems, the numerical integration of planets orbiting around a star (e.g., the Solar System), and the evolution of dense star clusters. For simulations that do not require exact forces, however, several approximation techniques have been proposed (Hockney & Eastwood, 1981; Barnes & Hut, 1986; Greengard & Rokhlin, 1987). The particle-mesh / particle-particle-mesh method (Hockney & Eastwood, 1981) and the tree method (Barnes & Hut, 1986) reduce the computational complexity of the force calculation to O(N log N). The fast multipole method (FMM) reduces it further to O(N) (Greengard & Rokhlin, 1987). Of these methods, the tree method has been used extensively in astrophysical simulations, since its adaptive nature is essential for dealing with clumpy structure in the universe (e.g., Bouchet & Hernquist, 1988).

Despite the O(N log N) complexity, computational optimization of the tree method by techniques such as vectorization and parallelization is necessary to accommodate demands for simulations with larger and larger N. Hernquist (1990), Makino (1990), and Barnes (1990) have reported various techniques to vectorize the force calculation with the tree method. Warren et al. (1992), Dubinski (1996), and Yahagi et al. (1999) have reported parallel tree methods for massively parallel processors (MPPs). In a recent publication (Springel et al., 2005), a simulation of large-scale structure formation in the universe with more than ten billion particles, using a parallel tree code running on an MPP, has been reported. Another computational technique to speed up the tree method utilizes the GRAPE special-purpose computer (Sugimoto et al., 1990; Makino & Taiji, 1998). Using a combination of vectorization techniques for the tree method, the tree method can be executed efficiently on a GRAPE system (Makino, 1991).

Cosmological simulations are a "grand challenge" problem. The Gordon Bell prizes have been awarded many times for cosmological simulations (Warren & Salmon, 1992; Fukushige & Makino, 1996; Warren et al., 1997, 1998; Kawai et al., 1999; Hamada et al., 2009).
In those simulations, both parallel tree codes (Warren & Salmon, 1992; Warren et al., 1997, 1998) and tree codes running on a GRAPE system (Fukushige & Makino, 1996; Kawai et al., 1999) or on a graphics processing unit (GPU) (Hamada et al., 2009) were used to perform cosmological simulations.

In the present paper, we describe our implementation of the tree method on a GPU. The rise of the GPU forces us to rethink our way of doing parallel computing, since the performance of recent GPUs has reached an impressive level of more than a Tflop/s. Several groups have reported GPU implementations of the brute force method, with its O(N^2) complexity (e.g., Nyland et al., 2007; Portegies Zwart et al., 2007; Belleman et al., 2008; Hamada & Iitaka, 2007). It is apparent, however, that for applications that do not require exact forces, it is possible to do much more efficient computation by the tree method. We have directly implemented the tree method on a GPU so that we can enjoy the speed of an O(N log N) algorithm. For small N, a brute force method on a GPU is faster than the tree method owing to its simplicity and regular memory access; for large N, however, the O(N log N) scaling of the tree method makes it far more efficient.
2. GPU architecture
In this section, we briefly summarize the architecture of the Cypress GPU that we used in the present work (most of the information is taken from AMD Inc. (2010)).
The Cypress GPU, from AMD, is the company's latest GPU and has many enhancements for general-purpose computing. It has 1600 arithmetic units in total, each capable of executing a single-precision floating-point fused multiply-add (FMA) operation. Five arithmetic units make up a five-way very-long-instruction-word (VLIW) unit called a stream core (SC); one Cypress processor therefore has 320 SCs. One SC can execute several combinations of operations, such as (1) five 32-bit integer operations, (2) five single-precision FMA operations, (3) four single-precision FMA operations with one transcendental operation, (4) two double-precision add operations, or (5) one double-precision FMA operation. Each SC has a register file of 1024 words, where one word is 128 bits long (four single-precision words or two double-precision words). A group of 16 SCs makes up a unit called a compute unit. At the top level of the GPU, there are 20 compute units, a controller unit called an ultra-threaded dispatch processor, and other units such as units for graphics processing, memory controllers, and DMA engines.

All SCs in a compute unit work in single-instruction-multiple-thread (SIMT) mode, i.e., 16 SCs execute the same instructions for four clock cycles to accommodate the latency of the arithmetic units. That is, 64 threads proceed together as a wavefront on the Cypress GPU. At the time of writing, the fastest Cypress processor runs at 850 MHz and offers a peak performance of 1600 x 2 x 0.85 GHz = 2.72 Tflop/s in single-precision operations. With double-precision operations, we have 320 x 2 x 0.85 GHz = 544 Gflop/s.

The external memory attached to the Cypress consists of 1 GB of GDDR5 memory with a 256-bit bus. It has a data clock rate of 4.8 GHz and offers a bandwidth of 153.6 GB/s. This external memory is accessed through four banks, as shown in Figure 1. In each bank, there is a second-level read cache (L2 cache). The total size of the second-level cache is 512 KB, i.e., 4 x 128 KB. The twenty compute units and the memory controllers are interconnected through a crossbar. Each compute unit has a first-level read cache (L1 cache) and a local data share (LDS), as depicted in Figure 1. The sizes of the L1 cache and LDS are 8 KB and 32 KB, respectively. Each L1 cache can fetch data at 54.4 GB/s on a cache hit; the aggregate bandwidth of the L1 caches on the Cypress GPU is thus 54.4 GB/s x 20 = 1.088 TB/s. This high memory bandwidth is a notable feature of this GPU. As we shall describe in the following section, taking advantage of the hardware-managed cache is critical to obtaining high performance on the Cypress GPU.

Figure 1: Block diagram of the Cypress GPU, with emphasis on the memory system.

In the present work, we programmed the Cypress GPU using an assembly-like language called IL (Intermediate Language). IL is like a virtual instruction set for GPUs from AMD. With IL, we have full control of every VLIW instruction. The programming model supported by IL is a single-instruction-multiple-data (SIMD) model at the level of the SC. In this programming model, a sequence of instructions generated from an IL program is executed on all SCs simultaneously with different input data.

A block of code written in IL is called a compute kernel. The device driver for the GPU compiles IL instructions into the corresponding machine code when we load a kernel written in IL. In a compute kernel, we explicitly declare what type of variable the input data is. In the main body of the IL code, we write arithmetic operations on the input data. Logically, each SC is implicitly assigned data that is different from that of every other SC. In a simple compute kernel, each SC operates only on its assigned data; operations such as this, as arise in pure stream computing, work with the highest efficiency. In a complex compute kernel, of the kind we explore in the present work, each SC not only operates on its assigned data but also explicitly loads random data that might be assigned to another SC. To accomplish such random access to external memory, we explicitly calculate the address of the data in the compute kernel.

The ATI Stream software development kit (SDK) for the Cypress GPU also supports OpenCL, a standard API with an extended C language (hereafter referred to as C for OpenCL) for writing compute kernels. In this work, we also present a compute kernel written in C for OpenCL (see Appendix A for the code). We believe that it is instructive to present our algorithm in C for OpenCL and that this makes the algorithm easy to understand. Both programming methods (IL and C for OpenCL) have pros and cons. With IL, we have the advantage of full control over the VLIW instructions, but a compute kernel written in IL is somewhat cumbersome. On the other hand, it is easier to start writing a compute kernel in C for OpenCL, but optimization for a particular GPU architecture is not straightforward. An advantage of programming with OpenCL is that we can also use it to program a general-purpose many-core CPU. In the following section, we compare implementations of the tree method on a GPU based on compute kernels written in IL and in C for OpenCL. We also compare the performance of a compute kernel written in C for OpenCL on a GPU and on a CPU.
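To make the distinction between the two access patterns concrete, the following is a minimal sketch in C for OpenCL; it is not taken from our kernels, and the array names and the indirection table perm are illustrative only:

    /* Minimal OpenCL C sketch (illustrative, not the paper's kernel):
       contrast a pure streaming access with an explicit gather whose
       address is computed inside the kernel. */
    __kernel void stream_and_gather(__global const float *in,
                                    __global const int   *perm,
                                    __global float       *out)
    {
        int i = get_global_id(0);  /* data implicitly assigned to this thread */
        float a = in[i];           /* streaming access: own element           */
        float b = in[perm[i]];     /* gather: computed address; may touch data
                                      assigned to another SC                  */
        out[i] = a + b;
    }

The tree walk described in Section 4 is an extreme case of the second pattern, which is why the hardware-managed cache matters so much.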
3. Bare performance of brute force method on a GPU
So far, we have developed around a dozen kernels in IL that we use for astrophysical many-body simulations. In this section, we report the performance of our implementation of a brute force method for computing gravitational forces. This code served as a basis for implementing a more sophisticated algorithm later.

To be precise, we have implemented a set of conventional equations expressed as

p_i = \sum_{j=1,\, j \neq i}^{N} p(x_i, x_j, m_j) = \sum_{j=1,\, j \neq i}^{N} \frac{m_j}{(|x_i - x_j|^2 + \epsilon^2)^{1/2}},

a_i = \sum_{j=1,\, j \neq i}^{N} f(x_i, x_j, m_j) = \sum_{j=1,\, j \neq i}^{N} \frac{m_j (x_j - x_i)}{(|x_i - x_j|^2 + \epsilon^2)^{3/2}},   (1)

where a_i and p_i are the force vector and potential for particle i, and x_i, m_i, and epsilon are the position of the particle, its mass, and a parameter that prevents division by zero, respectively. We can evaluate these equations with two nested loops on a general-purpose CPU. In the inner loop, we simultaneously evaluate the functions p and f; this requires 22 arithmetic operations, including one square root and one division, to compute the interaction between particles i and j. Since previous authors, starting from Warren et al. (1997), have used a conventional operation count for the evaluation of f_i and p_i, we have adopted a conventional count of 38 operations per interaction throughout this paper.

Elsen et al. (2006) reported an implementation of a brute force method for gravitational and other forces on an older GPU from AMD/ATi. One of the main insights obtained was that a loop-unrolling technique greatly enhanced the performance of the code. We have followed Elsen et al.'s approach and tried several different methods of loop unrolling. Fujiwara & Nakasato (2009) have reported our optimization efforts for older GPUs; here, we present a summary of our results. In Figure 2, we plot the computing speed of our optimized IL kernel for computing Eq. (1) as a function of N. We tested three GPU boards: RV770 GPUs running at 625 and 750 MHz, and a Cypress GPU running at 850 MHz. The three systems have peak computing speeds in single precision of 1.04, 1.2, and 2.72 Tflop/s, respectively. The highest sustained performance we have obtained so far has been on the Cypress GPU.
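As a reference point for the kernels discussed below, here is a minimal plain-C sketch of the two nested loops that evaluate Eq. (1) on a CPU; it is an illustration, not our actual host code:

    /* Minimal C sketch of Eq. (1): two nested loops over N particles.
       x[i] holds positions, m[i] masses, eps2 is epsilon squared. */
    #include <math.h>

    void brute_force(int n, const float x[][3], const float m[],
                     float eps2, float a[][3], float p[])
    {
        for (int i = 0; i < n; i++) {
            float ax = 0.0f, ay = 0.0f, az = 0.0f, pot = 0.0f;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                float dx = x[j][0] - x[i][0];
                float dy = x[j][1] - x[i][1];
                float dz = x[j][2] - x[i][2];
                float r2 = dx*dx + dy*dy + dz*dz + eps2; /* softened d^2 */
                float rinv = 1.0f / sqrtf(r2);  /* the one sqrt and division */
                float mr3inv = m[j] * rinv * rinv * rinv;
                pot += m[j] * rinv;             /* p(x_i, x_j, m_j) */
                ax += mr3inv * dx;              /* f(x_i, x_j, m_j) */
                ay += mr3inv * dy;
                az += mr3inv * dz;
            }
            a[i][0] = ax; a[i][1] = ay; a[i][2] = az;
            p[i] = pot;
        }
    }

The GPU kernels compute exactly this inner loop, unrolled and vectorized over many particles i in parallel.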
Figure 2: Performance of the brute force method on various GPUs as a function of N.

Even at this speed, however, the brute force method still has O(N^2) complexity. Therefore, if we need to do an astrophysical many-body simulation for large N, we need a smarter algorithm to do the job, since the recent standard for N in astrophysical simulations is at least 100,000 for complex simulations with baryon physics and 1,000,000 for simple many-body simulations.
4. Tree method on a GPU
The tree method (Barnes & Hut, 1986) is a special case of the general kd-tree algorithm. This method has been optimized to efficiently calculate the mutual forces between particles, and it reduces the computational complexity of the force calculation from O(N^2) for the brute force method to O(N log N). The trick is that instead of computing the exact force by a brute force method, it approximates the force from distant particles using a multipole expansion. There is evidently a trade-off between the approximation error and the extent to which we replace a group of distant particles by a multipole expansion. A tree structure that contains all of the particles is used to judge this trade-off efficiently.

The force calculation in the tree method is executed in two steps: (1) tree construction and (2) the force calculation proper. In the tree construction, we divide a cube that encloses all of the particles into eight equal sub-cells. The first cell is the root of the tree that we construct; it is called the root cell. Each sub-cell is then recursively subdivided in the same way until each cell contains zero or one particle. As the result of this procedure, we obtain an oct-tree.

In the force calculation, we traverse the tree to judge whether we should replace a distant cell, containing a group of particles that are geometrically close together, with the multipole expansion of those particles. If we do not replace the cell, we then traverse its sub-cells. If we do replace the cell, we calculate a particle-cell interaction. When we encounter a particle, we immediately calculate a particle-particle interaction. Given a particle with index i for which we want to compute the force, this procedure is expressed as pseudocode in Figure 3.

procedure treewalk(i, cell)
  if cell has only one particle
    force += f(i, cell)
  else
    if cell is far enough from i
      force += f_multipole(i, cell)
    else
      for k = 0, 7
        if cell->subcell[k] exists
          treewalk(i, cell->subcell[k])

Figure 3: Pseudocode for the force calculation by traversing the tree.

Note that subcell[] is an array of pointers to the cell's own sub-cells. In this pseudocode, f is a function that computes the particle-particle interaction, and f_multipole is a function that computes the particle-cell interaction. In the work described in this paper, since we consider only the monopole moment of a cell, both functions are expressed exactly as in Eq. (1). In principle, any higher-order moment could be used in the particle-cell interaction.

We follow this procedure starting from the root cell, with the following condition that tests whether a cell is far enough away. Let the distance between the particle and the cell be d. The cell is well separated from the particle if l / d < theta, where l is the size of the cell and theta is a parameter that controls the trade-off. Since the smaller l / d is, the more distant the cell is from the particle, this condition (called the opening condition) tests geometrically whether the cell is far from the particle. This recursive force-calculation procedure is almost the same as in the original algorithm of Barnes & Hut (1986).

An important feature of the tree method is that the tree traversals, and hence the force calculations, for different particles are completely independent of each other. Therefore, once the tree construction has been completed, the force calculation is a massively parallel problem. There are two possible ways to implement the tree method on a GPU to take advantage of this feature. One way is a method proposed by Makino (1991).
This method was proposed as a tree method for the special-purpose computer GRAPE. A GRAPE system consists of a host computer and one or more GRAPE boards, with the host controlling the boards. For a program running on the host, a GRAPE board acts like a subroutine that calculates the gravitational forces for given particles. We therefore need two steps to use a GRAPE system for a force calculation with the tree method: (1) construction of an interaction list on the host computer, and (2) the actual force calculation on the GRAPE board. The interaction list is a list of particles and distant cells that are supposed to interact with a given particle. After the construction of interaction lists for all particles is completed, we compute the force on each particle by sending its interaction list to the GRAPE board. These two steps are necessary because the GRAPE board does not have the ability to traverse the tree. Many authors have used this method extensively. Three winners and a finalist of the Gordon Bell prize have used variants of this method with different versions of the GRAPE system and with a GPU (Fukushige & Makino, 1996; Kawai et al., 1999; Hamada et al., 2009; Kawai & Fukushige, 2006). A drawback of this approach is that its performance is limited by the speed of the host computer, which is responsible for the tree traversal. This possible bottleneck, reminiscent of Amdahl's law, can be critical without a highly tuned implementation of treewalk() running on the host. Furthermore, in all of the results presented by Fukushige & Makino (1996), Kawai et al. (1999), Kawai & Fukushige (2006), and Hamada et al. (2009), extra force evaluations by a factor of two were required to obtain the best performance. Note that because of the extra force evaluations, the maximum force error these authors reported was better than the error obtained with the conventional tree method for a given theta.

Another way to implement the tree method, which we have followed in the present work, is to implement the whole procedure shown in Figure 3 on a GPU. The advantage of this approach is that only the tree construction, which requires relatively little time, is executed on the host, so that we utilize the massive computing power of the GPU as much as possible. More importantly, we can use our method in applications that require short-range interaction forces (Warren & Salmon, 1995). This is because it is possible to implement a neighbor search algorithm as a general tree-walk procedure of the kind shown in Figure 4. Two procedures, proc_particle and proc_cell, process the particle-particle and particle-cell interactions, respectively, and a function distance_test controls the treatment of a distant cell. The calculation of the gravitational force is one application of the general tree-walk procedure that has been very successful.

procedure general_treewalk(i, cell)
  if cell has only one particle
    proc_particle(i, cell)
  else
    if distance_test(i, cell) is true
      proc_cell(i, cell)
    else
      for k = 0, 7
        if cell->subcell[k] exists
          general_treewalk(i, cell->subcell[k])

Figure 4: Pseudocode for a general tree-walk procedure.
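A minimal C sketch of this general tree walk, with the three problem-specific routines supplied as callbacks, may make the structure clearer; the struct layout and signatures are illustrative, not our actual data structures:

    /* Illustrative C version of Figure 4: the walk is generic, and the
       physics lives entirely in the three callbacks. */
    typedef struct Cell {
        int          nparticle;    /* 1 if this node is a leaf (one particle) */
        struct Cell *subcell[8];   /* NULL where a sub-cell does not exist    */
    } Cell;

    typedef struct {
        void (*proc_particle)(int i, const Cell *c); /* particle-particle     */
        void (*proc_cell)(int i, const Cell *c);     /* particle-cell         */
        int  (*distance_test)(int i, const Cell *c); /* 1 if c is far enough  */
    } WalkOps;

    void general_treewalk(int i, const Cell *cell, const WalkOps *ops)
    {
        if (cell->nparticle == 1) {
            ops->proc_particle(i, cell);
        } else if (ops->distance_test(i, cell)) {
            ops->proc_cell(i, cell);
        } else {
            for (int k = 0; k < 8; k++)
                if (cell->subcell[k] != NULL)
                    general_treewalk(i, cell->subcell[k], ops);
        }
    }

Gravity, neighbor search, and SPH-style summations then differ only in the callbacks that are passed in.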
In our implementation of the tree method on a Cypress GPU, we first construct a tree on the host computer that controls the GPU. At this stage, there is no difference between our original tree code and the newly developed code for the GPU.

We need to take special care in implementing the tree-walk procedure on the GPU. Current GPU architectures do not support recursive procedures, except when a recursion can be fully expanded; such a full expansion is possible only if the recursion depth is fixed, but in the tree method it is impossible to know how deep the recursion will go without actually performing the tree traversal. We therefore adopted a method proposed by Makino (1990) that transforms the recursion in treewalk() into an iteration.

procedure treewalk_iterative(i)
  cell = the root cell
  while cell is not null
    if cell has only one particle
      force += f(i, cell)
      cell = cell->next
    else
      if cell is far enough from i
        force += f_multipole(i, cell)
        cell = cell->next
      else
        cell = cell->more
Figure 6: Pseudocode for an iterative tree-walk procedure.

A key feature is that, for a given cell, we do not need the whole array of pointers (subcell[]) to traverse the tree. We need only two pointers: to the cell that we visit next when the opening condition is true, and to the cell that we visit when it is false. These two pointers (hereafter called next and more) are easily obtained by a breadth-first traversal of the tree. Figure 5 shows next and more schematically. Note that a cell that has sub-cells has both a next and a more pointer, while a leaf cell (a particle in the present case), having no sub-cells, has only a next pointer. An iterative form of treewalk() using these two pointers is shown in Figure 6.

Figure 5: A tree with "more" and "next" pointers, shown by blue and red arrows, respectively.

We implemented the iterative procedure treewalk() rather directly in IL. The input data for this compute kernel consists of four arrays. The first contains the positions and masses of the particles and cells; we pack a position and a mass into a vector variable with four elements, so this array is an array of four-element vectors. The mass of a cell equals the total mass of the particles in the cell, and the position of the cell is the center of mass of those particles. The second and third arrays contain the "next" and "more" pointers, respectively; both are simple arrays. The fourth array contains the sizes of the cells; the size of a cell is necessary for testing the opening condition. See Appendix A for the definitions of these arrays.

In the present work, we adopted the following modified opening condition, expressed as

\frac{l}{\theta} + s < d,   (2)

where s is the distance between the geometric center of the cell and the center of mass of the cell. The modified condition of Eq. (2) takes the particle distribution inside the cell into account through s: if the particles gather at a corner of the cell, the effective size of the cell becomes larger. In Figure 7, we present a schematic view of a distant cell and a particle on which we are trying to calculate the force. In practice, we precompute the square of the effective size,

S_{\rm effective} = \left( \frac{l}{\theta} + s \right)^2,   (3)

and send S_effective instead of l for each cell. With S_effective, we do not need the square root of the squared distance d^2, and we simply compare S_effective with d^2 during the tree traversal.

In Figure 8, we present an abstract version of our compute kernel written in IL. In IL programming, each SC executes the compute kernel on its assigned data in parallel. In this code, own represents the specific cell assigned to each SC; =, load, and -> are not real IL instructions or operations but conventional symbols used here for the purpose of explanation. We have omitted the calculation of the load addresses for the arrays, since it is too lengthy to show in detail. In addition, the particle-particle and particle-cell interaction codes have been omitted because they simply compute the functions f and p in Eq. (1). In Appendix A, we present a working compute kernel written in C for OpenCL; a performance comparison between the IL and OpenCL implementations is given in the next section.
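In C, the precomputation of Eq. (3) and the comparison used during the traversal amount to the following minimal sketch (the function names are ours, for illustration):

    /* Host side: squared effective size of a cell, Eq. (3).
       l is the cell size, s the offset between the cell center and its
       center of mass, theta the opening-angle parameter. */
    float effective_size2(float l, float s, float theta)
    {
        float e = l / theta + s;
        return e * e;
    }

    /* During the walk: the opening test of Eq. (2). Comparing squared
       quantities means no square root of d^2 = r2 is needed per node. */
    int cell_is_far(float r2, float s_eff2)
    {
        return r2 > s_eff2;
    }

This is exactly the r2 > s_eff branch that appears in the abstract IL kernel of Figure 8.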
Table 1: Our test system
CPU: Intel Xeon E5520 x 2

In summary, the force calculation with our method proceeds in the following steps:
1. Construct a tree (host).
2. Compute the total mass, the center of mass, and the effective size of each cell (host).
3. Compute the "next" and "more" pointers (host; see the sketch after this list).
4. Send the input data to the GPU (host).
5. Perform the iterative tree walk, with the associated force calculation, for each particle (GPU).
6. Receive the force for each particle from the GPU (host).
We have indicated at the end of each step whether it is executed on the host or on the GPU.
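Step 3 can be done in a single pass over the tree. The paper obtains the two pointers by a breadth-first traversal; the sketch below uses an equivalent depth-first formulation, and the Cell layout is hypothetical:

    /* Link every node: `next` skips the subtree rooted at `cell`,
       `more` descends into it. Call as link_pointers(root, NULL). */
    #include <stddef.h>

    typedef struct Cell {
        int          nparticle;   /* 1 if this node is a leaf (one particle) */
        struct Cell *subcell[8];  /* NULL where a sub-cell does not exist    */
        struct Cell *next;        /* visited when the cell is NOT opened     */
        struct Cell *more;        /* first child; visited when it IS opened  */
    } Cell;

    void link_pointers(Cell *cell, Cell *next)
    {
        cell->next = next;
        if (cell->nparticle > 1) {            /* internal cell: has children */
            Cell *child[8];
            int nc = 0;
            for (int k = 0; k < 8; k++)
                if (cell->subcell[k] != NULL)
                    child[nc++] = cell->subcell[k];
            cell->more = child[0];
            for (int k = 0; k < nc; k++)      /* each child's next is its
                                                 sibling; the last child
                                                 inherits our own next       */
                link_pointers(child[k], (k + 1 < nc) ? child[k + 1] : next);
        }
    }

A leaf keeps only its next pointer, matching Figure 5: the walk of Figure 6 then never needs subcell[] at all.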
5. Tests and optimization
Here, we describe the results of some basic tests, to show that our code works correctly and to obtain its performance characteristics. We used the configuration shown in Table 1 for all results presented in this paper. In the basic tests, we used a set of particles randomly distributed in a sphere.

Figure 7: Schematic view of a distant cell and a particle (shown by a solid purple point). The black solid points are particles that belong to the cell. The large red point is the center of mass of the particles in the cell.

First, Table 2 shows how the computing time depends on N. Each value of the computing time was obtained by averaging the results of 20 runs. In this test, we set theta = 0.6. T_total and T_construction are the total time required for the force calculation and the time spent on the construction of the tree, respectively. Roughly, the tree construction took 20-28% of T_total. For all values of N used, we checked that there was effectively no error in the force computed by the GPU. All operations on the GPU were done in single precision, and we observed that the error was comparable to the single-precision machine epsilon, ~10^-7. (Note that this "error" is not the error due to the approximations in the force calculation itself.) We believe that it originates from a difference between the implementations of the inverse square root on the host and on the GPU, and we consider it insignificant for our purpose of astrophysical many-body simulations.

Table 2: Dependence of computing speed on N: no sorting. Columns: N (50K-800K), T_total (s), T_construction (s).

...declaration of I/O arrays and constants...
...initialize variables for accumulation...
xi = load own->x
yi = load own->y
zi = load own->z
cell = root
whileloop
    break if cell is null
    xj = load cell->x
    yj = load cell->y
    zj = load cell->z
    mj = load cell->m
    s_eff = load cell->s_eff
    dx = xj - xi
    dy = yj - yi
    dz = zj - zi
    r2 = dx*dx + dy*dy + dz*dz
    if cell is a particle
        ...compute particle-particle interaction...
        cell = load cell->next
    else
        if r2 > s_eff
            ...compute particle-cell interaction...
            cell = load cell->next
        else
            cell = load cell->more
        endif
    endif
endloop

Figure 8: Abstract IL code for our compute kernel that executes the iterative tree walk.

Table 3: Dependence of computing speed on N: particles sorted in Morton order. Columns: N (50K-800K), T_total (s), T_construction (s).

Regarding computing speed, randomly distributed particles are the most severe test, because two successive particles in the input data have a very high chance of being at widely separated positions. By the nature of the tree method, if two particles are close to each other, they are expected to be in the same cell and to interact with a similar list of particles and distant cells. This means that if two successive particles in the input data are geometrically close, the tree walk for the second particle almost certainly takes less time, owing to a higher cache-hit rate. To bring about such a situation, we can sort the particles so that successive particles are as close together as possible. Fortunately, such a sorting is easily obtained with the tree method by traversing the tree in depth-first order: in the course of the traversal, we add each particle encountered at a leaf node to a list, and after the traversal we use this list to shuffle the particles into nearly the desired order.
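A minimal C sketch of this collection pass follows; the Cell layout mirrors the earlier sketches, extended with a hypothetical ipart index stored at each leaf:

    /* Depth-first traversal that records particle indices in the order the
       leaves are encountered; `order` must have room for n entries.
       Returns the running count of particles collected. */
    typedef struct Cell {
        int          nparticle;   /* 1 if this node is a leaf                */
        int          ipart;       /* particle index at a leaf (hypothetical) */
        struct Cell *subcell[8];
    } Cell;

    int collect_depth_first(const Cell *cell, int *order, int count)
    {
        if (cell->nparticle == 1) {
            order[count++] = cell->ipart;    /* leaf: record the particle */
            return count;
        }
        for (int k = 0; k < 8; k++)          /* fixed child order gives a
                                                Morton-like space ordering */
            if (cell->subcell[k] != NULL)
                count = collect_depth_first(cell->subcell[k], order, count);
        return count;
    }

The particle arrays are then permuted with order[], which is the shuffle described above.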
This ordering of particles is called the Morton ordering. With this preprocessing, the speed of our method changed as shown in Table 3. Note that the times in Table 3 do not include the time required for the preprocessing. This is adequate because, in astrophysical many-body simulations, the tree structure is reconstructed at each time step, so we obtain this sorting for free. We observed that T_total with the Morton ordering was faster by a factor of 1.5-2.2, depending on N, than without the preprocessing. Moreover, T_construction also decreased in all cases, owing to better cache usage on the host. With the Morton ordering, the tree construction took roughly 14-27% of T_total.

The programming API for the Cypress GPU has a facility to report the cache-hit rate of the GPU. In Table 4, we show how the cache-hit rate depends on N and on the ordering of the particles. The results indicate that the performance of our method is significantly affected by the ordering of the particles. In the tests described in the following, we always used this preprocessing. Note that we could have obtained even better results if we had sorted the particles in the Peano-Hilbert order, which has been reported to be the optimal order for locality of data access and is used by some tree codes (e.g., Warren & Salmon, 1993).

Table 4: Dependence of cache-hit rate on N for different orderings of the particles
N       No sorting (%)    Morton ordering (%)
50K     75                93
100K    63                91
200K    55                87
400K    48                85
800K    43                80

In Figure 9, we present T_total as a function of N for three cases: the tree method with Morton ordering, the tree method without sorting, and the brute force method. Except for small N, the tree method with Morton ordering (theta = 0.6) outperforms the brute force method on the GPU.

Figure 9: Comparison between the tree method on a GPU and the brute force method on a GPU. T_total as a function of N is plotted for three cases. The N^2 and N log N scaling lines are also plotted for reference.

In Figure 10, we compare the performance for the following three cases: (a) a kernel written in IL running on a Cypress GPU, (b) a kernel written in C for OpenCL running on a Cypress GPU, and (c) a kernel written in C for OpenCL running on a multicore CPU. Since our test system had eight physical (16 logical) cores, the OpenCL kernel ran on the CPU with 16 threads. The last two cases show almost identical performance, even though the theoretical single-precision performance of the Cypress GPU is far higher than that of the CPU. Even the effective performance of the compute kernel written in IL is roughly ~1% of the theoretical single-precision performance, which is unsurprising given how memory-intensive the tree walk is. On the other hand, the kernel written in IL outperforms the kernel written in C for OpenCL by a sizable factor.

Figure 10: Comparison between the tree method using a kernel written in IL, the tree method using a kernel written in OpenCL on a GPU, and the tree method using a kernel written in OpenCL on a CPU. The N log N scaling line is also plotted for reference.

T_total depends on theta, which controls the error bound of the tree method. A larger theta means that more of the distant particles are replaced by a multipole expansion. In other words, for a smaller theta we need to perform a larger number of force calculations, and hence the computation takes longer; at the same time, the error due to the multipole expansion decreases. Practically, a force calculation by the tree method with a very small theta approaches the cost of the brute force method, and the tree method then offers no advantage. In Table 5, we show the dependence of T_total and the cache-hit rate on theta at fixed N.

Table 5: Dependence of T_total and the cache-hit rate on theta at fixed N. Columns: theta (from 0.2), T_total (s), cache-hit rate (%).

To measure the error in the force calculation, we define the average acceleration error as

a_{\rm error} = \frac{1}{N} \sum_{i=0}^{N-1} \frac{|a_i^{\rm tree} - a_i^{\rm direct}|}{|a_i^{\rm direct}|},   (4)

where a_i^tree and a_i^direct are the accelerations obtained by the tree method on the GPU and by the brute force method on the host computer, respectively. We use a similar definition for the error in the potential, p_error. For this test, we used a realistic particle distribution that represents a disk galaxy, generated with GalactICS (Kuijken & Dubinski, 1995). The particle distribution has three components, namely a bulge, a disk, and a halo, with a mass ratio of approximately 1:2:12. Tables 6 and 7 present a_error and p_error, respectively, for several different values of N and theta. Both a_error and p_error depend on theta as expected. We found no systematic error in a_i computed by the tree code on the GPU, although it would be desirable to accumulate a_i in double precision to reduce a_error at large N. There is only a negligible difference between the results computed by the compute kernels written in IL and in C for OpenCL.
6. Comparison with other work
Lefebvre et al. (2005) have implemented a tree data structure on a GPU for texture mapping, together with a tree traversal algorithm. Owing to the limitations of GPUs and of the SDKs for GPUs at that time, their method seems restricted to applications in computer graphics. A critical point is that the possible depth of the tree was limited, so we cannot directly employ their implementation for our purposes.
Gaburov et al. (2010) have reported another implementation of the tree method on a GPU. Our implementation and their approach share the same strategy, but they differ in detail, aside from the GPU architecture adopted. Both implement the tree walk on the GPU. The implementation of Gaburov et al. (2010) constructs interaction lists by means of a tree walk on the GPU and then computes the force on the GPU using these interaction lists; this requires three kernel invocations. In contrast, we do the tree walk and compute the force on the fly in a single kernel invocation.

Both approaches have pros and cons. With the Gaburov et al. (2010) approach, a fairly high compute efficiency was obtained, above the sustained ~30 Gflop/s of our kernel. On the other hand, the code of Gaburov et al. (2010) requires more floating-point operations than does our optimal tree code. We believe that our implementation is simpler than that of Gaburov et al. (2010), which requires multi-pass computations, and that our implementation is easier to extend to a general tree walk. In fact, we have extended our compute kernel written in IL to the SPH method (Gingold & Monaghan, 1977; Lucy, 1977) and obtained fairly good performance.
Gumerov & Duraiswami (2008) have reported an implementation of the fast multipole method (FMM) on a GPU. The FMM is a sophisticated algorithm that evaluates long-range interaction forces with a computational complexity of O(N). In the FMM, in addition to the replacement of distant particles by multipole expansions, local expansions are utilized to evaluate the force acting on a group of particles. Gumerov & Duraiswami (2008) reported results for N = 1,048,576, where the accuracy is controlled by the order p of the expansions. Figure 10 of Gumerov & Duraiswami (2008) indicates that the average relative error in the potential obtained with their choice of p is comparable to the relative error obtained with the tree method at the theta we use. For comparison, we measured T_total for N = 1,048,576 with theta = 0.6; the cache-hit rate of this test was 81%. The performance of our tree code and that obtained by Gumerov and Duraiswami with the FMM are comparable. Note that the Cypress GPU used in the present work is different from, and newer than, the GPU that Gumerov and Duraiswami used. Generally, the FMM is well suited to applications that require long-range interaction forces for uniformly distributed particles or sources, whereas the tree method is more robust for the highly clustered particles that typically arise in astrophysical many-body simulations. We therefore believe that our method is more suitable than that of Gumerov & Duraiswami (2008) for our purpose of astrophysical applications.
7. Conclusions
In this paper, we have described our implementation of the tree method on a Cypress GPU. By transforming the recursive tree traversal into an iterative procedure, we have shown that the execution of a tree traversal together with a force calculation on a GPU can be practical and efficient. In addition, our implementation shows performance comparable to that of a recently reported FMM code on a GPU.

We can expect further performance gains by fully utilizing the four-vector SIMD operations of the SCs of the GPU. Moreover, since 10-20% of T_total is spent on the tree construction, parallelizing this part on a multicore CPU will be an effective way to boost the total performance. Provided that we can easily extend our code to compute short-range interaction forces by a method such as SPH, we believe that a future extended version of our code will enable us to do realistic astrophysical simulations that involve baryon physics with N > 1,000,000 very rapidly. It is fairly easy to incorporate higher-order multipole expansion terms into our method, and this would be a natural extension of the present work. Another good application of the tree method on a GPU would be to simulations that adopt a symmetrized Plummer potential (Saitoh & Makino, 2010). We believe that our method is the best way to implement that proposal, and hence that we should obtain better accuracy and good performance in simulating galaxy evolution and formation with different mass resolutions.

Acknowledgments
The author would like to thank M. Sato and K. Fujiwara for their efforts to utilize RV770 GPUs for astrophysical many-body simulations. Part of this work is based on their undergraduate theses of 2008 at the University of Aizu.

Table 6: Dependence of the average acceleration error a_error on N and theta (N is a multiple of 1024). Columns: five values of N; rows: theta.

Table 7: Dependence of the average potential error p_error on N and theta. Columns: five values of N; rows: theta.

Appendix A. Working OpenCL code
In this appendix, we present our compute kernel written in C for OpenCL. We have tested this kernel with ATI Stream SDK 2.2. The function tree_gm is the entry point to the kernel. n is the number of particles. pos[i] is a float4 variable holding the positions and masses, i.e., x_i and m_i. acc_g[i] is a float4 variable for the accelerations and the potential, i.e., a_i and p_i. next[] and more[] are arrays that contain the two pointers described in Section 4. For all arrays, the data for the particles is at indices i from 0 to n - 1, and the data for the cells resides at i >= n. Finally, size[] contains the softening parameters of the particles for i < n and the sizes of the cells for i >= n. We believe that it is straightforward to prepare these arrays from any tree implementation.
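For concreteness, a host-side sketch of the layout these arrays assume is given below; the type and field names are illustrative, and only the index convention matters:

    /* Index convention of the kernel inputs: particles first, cells after.
       pos[i]  : x, y, z and mass of particle i (i < n) or cell i (i >= n);
       next[i] : index of the node visited when node i is not opened;
       more[i] : index of the first child, defined for cells (i >= n) only;
       size[i] : softening parameter for i < n, S_effective for i >= n.    */
    typedef struct { float x, y, z, m; } float4h;  /* host mirror of float4 */

    typedef struct {
        int      n;       /* number of particles                      */
        int      nnode;   /* particles + cells                        */
        float4h *pos;
        int     *next;    /* a negative index can serve as the null
                             pointer that terminates the walk         */
        int     *more;
        float   *size;
    } TreeArrays;

Any tree code that fills such arrays in this layout, for example with the pointer-linking pass sketched in Section 4, can drive the kernel.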
References

AMD Inc. (2010). ATI Stream Computing OpenCL Programming Guide, rev. 1.05.
Barnes, J. (1990). A Modified Tree Code: Don't Laugh: It Runs. Journal of Computational Physics, 87, 161-170.
Barnes, J., & Hut, P. (1986). A Hierarchical O(N log N) Force-Calculation Algorithm. Nature, 324, 446-449.
Belleman, R. G., Bédorf, J., & Portegies Zwart, S. F. (2008). High Performance Direct Gravitational N-body Simulations on Graphics Processing Units II: An Implementation in CUDA. New Astronomy, 13, 103-112.
Bouchet, F. R., & Hernquist, L. (1988). Cosmological Simulations using the Hierarchical Tree Method. Astrophysical Journal Supplement, 68, 521-538.
Dubinski, J. (1996). A Parallel Tree Code. New Astronomy, 1, 133-147.
Elsen, E., Houston, M., Vishal, V., Darve, E., Hanrahan, P., & Pande, V. (2006). N-body Simulation on GPUs. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (p. 188). New York: ACM.
Fujiwara, K., & Nakasato, N. (2009). Fast Simulations of Gravitational Many-body Problem on RV770 GPU. ArXiv e-prints.
Fukushige, T., & Makino, J. (1996). In Proceedings of Supercomputing '96. Los Alamitos: IEEE Computer Society Press.
Gaburov, E., Bédorf, J., & Portegies Zwart, S. F. (2010). Gravitational Tree-code on Graphics Processing Units: Implementation in CUDA. Procedia Computer Science, 1, 1113-1121 (ICCS 2010).
Gingold, R. A., & Monaghan, J. J. (1977). Smoothed Particle Hydrodynamics: Theory and Application to Non-spherical Stars. Monthly Notices of the Royal Astronomical Society, 181, 375-389.
Greengard, L., & Rokhlin, V. (1987). A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73, 325-348.
Gumerov, N. A., & Duraiswami, R. (2008). Fast Multipole Methods on Graphics Processors. Journal of Computational Physics, 227, 8290-8313.
Hamada, T., & Iitaka, T. (2007). The Chamomile Scheme: An Optimized Algorithm for N-body Simulations on Programmable Graphics Processing Units. ArXiv Astrophysics e-prints.
Hamada, T., Narumi, T., Yokota, R., Yasuoka, K., Nitadori, K., & Taiji, M. (2009). 42 TFlops Hierarchical N-body Simulations on GPUs with Applications in Both Astrophysics and Turbulence. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (pp. 1-12). New York: ACM.
Hernquist, L. (1990). Vectorization of Tree Traversals. Journal of Computational Physics, 87, 137-147.
Hockney, R., & Eastwood, J. (1981). Computer Simulation Using Particles. New York: McGraw-Hill.
Kawai, A., & Fukushige, T. (2006). $158/GFLOPS Astrophysical N-body Simulation with Reconfigurable Add-in Card and Hierarchical Tree Algorithm. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (p. 48). New York: ACM.
Kawai, A., Fukushige, T., & Makino, J. (1999). $7.0/Mflops Astrophysical N-body Simulation with Treecode on GRAPE-5. In Supercomputing '99: Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (CDROM) (p. 67). New York: ACM.
Kuijken, K., & Dubinski, J. (1995). Nearly Self-consistent Disc/Bulge/Halo Models for Galaxies. Monthly Notices of the Royal Astronomical Society, 277, 1341-1353.
Lefebvre, S., Hornus, S., & Neyret, F. (2005). Octree Textures on the GPU. GPU Gems, 2, 595-614.
Lucy, L. B. (1977). A Numerical Approach to the Testing of the Fission Hypothesis. Astronomical Journal, 82, 1013-1024.
Makino, J. (1990). Vectorization of a Treecode. Journal of Computational Physics, 87, 148-160.
Makino, J. (1991). Treecode with a Special-Purpose Processor. Publications of the Astronomical Society of Japan, 43, 621-638.
Makino, J., & Taiji, M. (1998). Scientific Simulations with Special-Purpose Computers: The GRAPE Systems. New York: John Wiley and Sons.
Nyland, L., Harris, M., & Prins, J. (2007). Fast N-body Simulation with CUDA. In GPU Gems 3 (pp. 677-696). New York: Addison-Wesley.
Portegies Zwart, S. F., Belleman, R. G., & Geldof, P. M. (2007). High-performance Direct Gravitational N-body Simulations on Graphics Processing Units. New Astronomy, 12, 641-650.
Saitoh, T. R., & Makino, J. (2010). The Natural Symmetrization for Plummer Potential. ArXiv e-prints.
Springel, V., White, S. D. M., Jenkins, A., Frenk, C. S., Yoshida, N., Gao, L., Navarro, J., Thacker, R., Croton, D., Helly, J., Peacock, J. A., Cole, S., Thomas, P., Couchman, H., Evrard, A., Colberg, J., & Pearce, F. (2005). Simulations of the Formation, Evolution and Clustering of Galaxies and Quasars. Nature, 435, 629-636.
Sugimoto, D., Chikada, Y., Makino, J., Ito, T., Ebisuzaki, T., & Umemura, M. (1990). A Special-purpose Computer for Gravitational Many-body Problems. Nature, 345, 33-35.
Warren, M., Germann, T., Lomdahl, P., Beazley, D., & Salmon, J. (1998). Avalon: An Alpha/Linux Cluster Achieves 10 Gflops for $15k. In Supercomputing '98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (CDROM) (pp. 1-11). Washington, DC: IEEE Computer Society.
Warren, M., & Salmon, J. (1992). Astrophysical N-body Simulations Using Hierarchical Tree Data Structures. In Supercomputing '92: Proceedings of the 1992 ACM/IEEE Conference on Supercomputing (pp. 570-576). Los Alamitos, CA: IEEE Computer Society Press.
Warren, M., & Salmon, J. (1993). A Parallel Hashed Oct-tree N-body Algorithm. In Supercomputing '93: Proceedings of the 1993 ACM/IEEE Conference on Supercomputing (pp. 12-21). New York: ACM.
Warren, M., Salmon, J., Becker, D., Goda, M., Sterling, T., & Wickelmans, G. (1997). In Proceedings of Supercomputing '97. Los Alamitos: IEEE Computer Society Press.
Warren, M. S., Quinn, P. J., Salmon, J. K., & Zurek, W. H. (1992). Dark Halos Formed via Dissipationless Collapse. I: Shapes and Alignment of Angular Momentum. Astrophysical Journal, 399, 405-416.