Towards accelerating Smoothed Particle Hydrodynamics simulations for free-surface flows on multi-GPU clusters
Daniel Valdez-Balderas, José M. Domínguez, Benedict D. Rogers, Alejandro J.C. Crespo
Modelling and Simulation Centre (MaSC), School of Mechanical, Aerospace & Civil Engineering, University of Manchester, Manchester, M13 9PL, UK
Environmental Physics Laboratory, Universidad de Vigo, Ourense, Spain
Abstract
Starting from the single graphics processing unit (GPU) version of the Smoothed Particle Hydrodynamics (SPH) code DualSPHysics, a multi-GPU SPH program is developed for free-surface flows. The approach is based on a spatial decomposition technique, whereby different portions (sub-domains) of the physical system under study are assigned to different GPUs. Communication between devices is achieved with the use of Message Passing Interface (MPI) application programming interface (API) routines. The use of the radix sort algorithm for inter-GPU particle migration and sub-domain "halo" building (which enables interaction between SPH particles of different sub-domains) is described in detail. With the resulting scheme it is possible, on the one hand, to carry out simulations that could also be performed on a single GPU, but now faster than on one of those devices alone. On the other hand, accelerated simulations can be performed with up to 32 million particles on the current architecture, which is beyond the capacity of a single GPU due to memory constraints. A study of weak and strong scaling behaviour, speedups and efficiency of the resulting program is presented, including an investigation to elucidate the computational bottlenecks. Last, possibilities for reducing the effects of overhead on computational efficiency in future versions of our scheme are discussed.

Keywords: Smoothed Particle Hydrodynamics, SPH, CUDA, GPU, multi-GPU, Graphics Processing Unit, computational fluid dynamics, Molecular Dynamics
1. Introduction
The applicability of particle-based simulations is typically limited by two different but related computational constraints: simulation time and system size. That is, to obtain physically meaningful information from a simulation, one must be able to simulate a large-enough system for long-enough times. In the particular case of the Smoothed Particle Hydrodynamics (SPH) method, certain types of applications, for example the study of coastal processes and flooding hydrodynamics, have until now been limited by the maximum number of particles that can be handled while keeping simulation times reasonable.

* Corresponding author. Email addresses: [email protected] (Daniel Valdez-Balderas), [email protected] (Benedict D. Rogers)
To overcome these limitations, various types of acceleration techniques have been employed, which can be grouped into three main categories based on the type of hardware used. On the one hand there are the traditional High Performance Computing (HPC) techniques which involve the use of hundreds or thousands of computing nodes, each hosting one or more Central Processing Units (CPUs) containing one or more computing cores. Those nodes are interconnected via a computer networking technology (e.g., Ethernet, Infiniband, etc.), and programmed with the help of protocols like the Message Passing Interface (MPI). For SPH, examples of this type of approach include the work of Maruzewski et al. [1], who carried out SPH simulations with up to 124 million particles on as many as 1024 cores of the IBM Blue Gene/L supercomputer. Another recent example in this field is that of Ferrari et al. [2], who reported calculations using up to 2 million particles on a few hundred CPUs. The drawback of this type of approach comes from the fact that, for SPH, an enormous
number of cores is needed, which requires considerable investment, including the purchase, maintenance, and power supply requirements of this type of equipment.

A second type of acceleration approach involves the use of Field-Programmable Gate Arrays (FPGAs). For example, Spurzem et al. [3] carried out SPH and gravitational N-body simulations using this type of technology for astrophysics problems, finding that it is useful for accelerating complex sequences of operations like those used in SPH. Since the main use of FPGAs in the literature is in the field of astrophysics (where long-range forces and variable smoothing lengths are typically employed), the use of FPGAs for free-surface flow applications (with short-range forces and fixed smoothing lengths) remains a relatively unexplored field.

The third type of acceleration technique used in SPH simulations, and the one on which this article focuses, involves the use of a type of hardware different from the CPU: the Graphics Processing Unit (GPU). The development of GPU technology is driven by the computer games industry, but GPUs have recently been exploited for non-graphical calculations, leading to the development of general purpose GPUs (GPGPUs). GPU programming is a parallel approach because each of these devices contains hundreds of computing cores, and multiple threads of execution are launched simultaneously. The use of GPUs for scientific computations has come to represent an exciting alternative for the acceleration of scientific computing software. The release of the Compute Unified Device Architecture (CUDA) and its software development kit (SDK) by NVIDIA in 2007 facilitated the popularization of these devices for general-purpose computing, but efforts in this direction existed even prior to that date. For example, as early as 2004, Amada et al. [4] carried out SPH simulations for real-time simulations of water. In 2007 Harada et al. [5] reported SPH simulations that ran an order of magnitude faster on GPUs than on CPUs. More recently, Hérault et al. [6] reported one to two orders of magnitude speedups of the GPU when compared to a single CPU. Among the recent efforts for SPH computations on GPUs, DualSPHysics [7] has proven to be a versatile computational SPH code for both CPU and GPU calculations. The GPU version of DualSPHysics maintains the stability and accuracy of the CPU version, while providing significant speedups over the latter. For a detailed description of DualSPHysics, please refer to [8].

Recently, in addition to performance, the energy efficiency of different types of hardware has become an increasingly important factor when choosing hardware for accelerating simulations. This can be seen, for example, in the list of the fastest supercomputers [9], which now includes energy efficiency along with performance specifications. In this realm, too, GPUs show promise. For example, a recent article by McIntosh-Smith et al.
[10], describing a methodology for energy efficiency comparisons, reports that in a particular case study of simulations of molecular mechanics-based docking applications, GPUs delivered both better performance and higher energy efficiency than multi-core CPUs.

However, GPU technology has its own constraints. In the particular case of SPH, the memory restricts the maximum number of SPH particles that can be efficiently simulated on a single device to approximately ten million or less, depending on the particular GPU and on the precision of the data types used (e.g., single or double). To go beyond this limit, this work extends DualSPHysics by introducing a spatial decomposition scheme with the use of the Message Passing Interface (MPI) protocol to obtain a multi-GPU SPH application. Thus, our approach combines the two types of parallelization described above: first, at a coarse level, by decomposing the problem and assigning different volume portions of the system to different GPUs, and second, at a fine level, by assigning one GPU thread of execution to each SPH particle. Our resulting multi-GPU SPH scheme, illustrated in Fig. 1, has the potential to bypass both the system size and simulation time limitations which have to date constrained the applicability of SPH in various engineering fields. (This work started prior to the release of CUDA 4.0, which allows direct communications among multiple GPUs; see Section 4 for more details.)

The rest of this article is organized as follows. Section 2 presents a brief overview of the main ideas behind SPH. In Section 3 the methodology used to implement the single-GPU version [8] (which is the starting point for the present work) is summarized. Section 4 describes our spatial decomposition, multi-GPU algorithm, whereby the physical system is divided into sub-domains, each of which is assigned to a different GPU. Section 5 describes the hardware. Section 6 presents the main results of our simulations using a small number of GPUs, including a study of weak and strong scaling, the bottlenecks of the program, and our strategy to diminish the effect of those bottlenecks and improve efficiency. Section 7 concludes with a summary and a description of ongoing and future work.

Figure 1: Snapshots of a multi-GPU SPH simulation using three GPUs, for a dam break with three obstacles. Portions of fluid in different sub-domains are displayed with a different color, and white lines near the sub-domain boundaries correspond to the halos.
2. Smoothed Particle Hydrodynamics
In order to better understand the specific challenges posed by a multi-GPU implementation of Smoothed Particle Hydrodynamics (SPH), a brief description of this method is now presented. SPH is a mesh-free, Lagrangian technique particularly well suited to studying problems that involve the highly non-linear deformation of fluid surfaces that occurs in free-surface flows, such as wave breaking and marine hydrodynamics [11]. Developed originally for the study of astrophysical phenomena, it now enjoys popularity in a variety of engineering fields, such as civil engineering (e.g., the design of coastal defenses), mechanical engineering (e.g., impact fracture in solid mechanics studies), and metallurgy (e.g., mould filling).

The problem in SPH consists of determining the evolution in time of the properties of a set of particles representing the fluid. In engineering applications, particles have short-range interactions with their neighbours, and the dynamics are governed by a set of simultaneous ordinary differential equations in time. In this section the classical SPH formulation used in our simulations is described (see [12]).

The starting point for the derivation of the SPH equations is the set of equations for the continuum description of dynamic fluid flow, namely, the Navier-Stokes equations [13]:

$$\frac{D\rho}{Dt} = -\rho\,\nabla\cdot\vec{v}, \qquad (1)$$
$$\frac{D\vec{v}}{Dt} = -\frac{1}{\rho}\nabla p + \vec{g}, \qquad (2)$$
$$\frac{De}{Dt} = -\frac{p}{\rho}\,\nabla\cdot\vec{v}. \qquad (3)$$

Here, ρ is the density, t is time, v⃗ and r⃗ represent velocity and position, respectively, p is the pressure, e is the energy, g⃗ is the acceleration due to gravity, and D/Dt represents the material or total time derivative.

Two key steps are used to discretize the Navier-Stokes equations: first, the integral representation of a field function, or kernel approximation, and second, the particle approximation. The value of a field function f(r⃗) at point r⃗ can be approximated by a weighted interpolation over the neighbouring volume in the following way:

$$f(\vec{r}) \simeq \int_{\Omega} f(\vec{r}\,')\, W(\vec{r}-\vec{r}\,', h)\, d\vec{r}\,', \qquad (4)$$

where W(r⃗ − r⃗′, h) is a smoothing function (also called the smoothing kernel), h is the smoothing length (used to characterize the shape and range of W), and the integral is over all space Ω.

The smoothing function is typically chosen to have the following properties: its integral over all space is unity; it approaches the Dirac delta function as h → 0; it has a compact support domain, i.e., W = 0 for |r⃗ − r⃗′| > κh, where κ is a scaling factor used to set the coarseness of the discretization of space (throughout this article κ = 2); it decreases monotonically as |r⃗ − r⃗′| increases; and, last, it is symmetric, W(r⃗ − r⃗′, h) = W(|r⃗ − r⃗′|, h). It can be shown that the kernel approximation is accurate to order two in h, O(h²) [14]. In this article, all of the results are obtained using the cubic spline smoothing function [13] with a constant support domain of size 2h.

The second step for the discretization of the Navier-Stokes equations is the particle approximation, which consists simply of replacing the continuous integral in Eq. (4) by a discrete sum, and writing the differential of volume as a finite volume ∆V_j in terms of density and mass, obtaining

$$f(\vec{r}_i) \simeq \sum_j f(\vec{r}_j)\, W(\vec{r}_i-\vec{r}_j, h)\, \frac{m_j}{\rho_j}, \qquad (5)$$

where m_j = ∆V_j ρ_j is the mass of the SPH particle, and the sum is performed over all neighbouring particles in the support domain of W.

Note that, in contrast with SPH simulations in astrophysics, e.g., Acreman et al. [15], a variable smoothing length h is not used here, since it is rarely required for simulations of free-surface flows. A variable smoothing length would increase the severity of diverging branches, load unbalance among threads, and irregular memory accesses, making an efficient GPU implementation of SPH a more challenging task than a fixed smoothing length implementation, like the one described in this article.

Figure 2: Illustration of the smoothing function, W(r⃗_i − r⃗_j, h), with a support domain of size 2h. The blue circles are the particles that represent the fluid. Particle i interacts with particle j only if |r⃗_i − r⃗_j| < 2h, due to the compact support property of the smoothing function, as explained in the text.

Figure 2 illustrates some of the properties of the smoothing function W(r⃗_i − r⃗_j, h). Part (a) of this figure shows particle i interacting with all other particles, j, within its radius of influence, or support domain, whereas part (b) shows the compact support in operation, the smoothness of the function, as well as its monotonically decreasing behaviour for increasing |r⃗_i − r⃗_j|.

Following a similar argument, the particle approximation for the spatial derivative of a function can be written in the following form:

$$\nabla f(\vec{r}_i) \simeq \sum_j \left[ f(\vec{r}_j) - f(\vec{r}_i) \right] \nabla_i W(\vec{r}_i-\vec{r}_j, h)\, \frac{m_j}{\rho_j}. \qquad (6)$$

Mathematically, therefore, the problem consists of performing localised interpolations or summations around each computation point where the properties of the fluid are evaluated.

In the present article, variations in the thermal properties of the fluid, governed by Eq. (3), are neglected, as the primary application of our approach is free-surface flows, where thermal properties do not play a significant role. Applying the SPH formulation to the governing equations (1)-(2) leads to the following set of equations [13]:

$$\frac{d\rho_i(t)}{dt} = \sum_j m_j\, \vec{v}_{ij} \cdot \nabla_i W_{ij}, \qquad (7)$$
$$\frac{d\vec{r}_i(t)}{dt} = \vec{v}_i(t), \qquad (8)$$
$$\frac{d\vec{v}_i(t)}{dt} = -\sum_j m_j \left( \frac{p_i}{\rho_i^2} + \frac{p_j}{\rho_j^2} + \Pi_{ij} \right) \nabla_i W_{ij} + \vec{g}, \qquad (9)$$

where W_ij ≡ W(r⃗_i − r⃗_j, h), g⃗ is gravity, and the artificial viscosity is given by Π_ij = −α c̄_ij ν_ij / ρ̄_ij if r⃗_ij · v⃗_ij < 0, and Π_ij = 0 otherwise. Here ν_ij = h v⃗_ij · r⃗_ij / (|r⃗_ij|² + η²), r⃗_ij ≡ r⃗_i − r⃗_j, v⃗_ij ≡ v⃗_i − v⃗_j, c̄_ij ≡ (c_i + c_j)/2, ρ̄_ij ≡ (ρ_i + ρ_j)/2, η² = 0.01 h², and α is a parameter that can be related to the Reynolds number for the specific free-surface problem. The pressure p_i at r⃗_i is governed by the equation of state

$$p_i = B\left[\left(\frac{\rho_i}{\rho_0}\right)^{\gamma} - 1\right],$$

where γ = 7, B = c_0² ρ_0 / γ, ρ_0 = 1000 kg/m³ is the reference density, and c_0 = √((dp/dρ)|_{ρ_0}) is the speed of sound at the reference density.

In this article dynamic boundary conditions [16] are used to represent solid boundaries. In this method, boundary particles obey the same dynamical equations as the fluid particles, but the integration update for position and velocity is not performed. Therefore, the density and pressure of a boundary particle vary, but its position and velocity remain fixed (or change in an externally imposed fashion in cases where solid boundaries are moving).
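To make the discretization above concrete, the following is a minimal sketch, written in CUDA C++ but not taken from the DualSPHysics source, of the cubic spline smoothing function with support radius 2h together with the equation of state given above. The 3-D normalisation constant 1/(π h³) and the function names are assumptions made for illustration only.

// Illustrative sketch (not the DualSPHysics implementation).
// Cubic spline kernel W(r,h) with compact support of radius 2h, and the
// Tait-type equation of state p = B[(rho/rho0)^gamma - 1] described in the text.
__device__ float cubicSplineW(float r, float h)
{
    const float q  = r / h;
    const float ad = 1.0f / (3.14159265f * h * h * h);    // assumed 3-D normalisation constant
    if (q < 1.0f)
        return ad * (1.0f - 1.5f * q * q + 0.75f * q * q * q);
    if (q < 2.0f) {
        const float t = 2.0f - q;
        return ad * 0.25f * t * t * t;
    }
    return 0.0f;                                           // compact support: W = 0 for r >= 2h
}

__device__ float pressureFromDensity(float rho, float rho0, float c0, float gamma)
{
    const float B = c0 * c0 * rho0 / gamma;                // B = c0^2 rho0 / gamma
    return B * (powf(rho / rho0, gamma) - 1.0f);
}

In a force computation, cubicSplineW (or, more precisely, its gradient) would be evaluated for every pair of particles closer than 2h, which is exactly the neighbour search problem addressed in the next section.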
3. Single-GPU implementation
To describe our multi-GPU implementation, it is useful to review the main strategy behind the single-GPU version, as well as other general considerations that need to be taken into account when performing SPH simulations. For a more detailed description of the single-GPU version of DualSPHysics please refer to [7] and [8].

Single-GPU DualSPHysics is a fully-on-GPU program, meaning that all the calculations, described in detail below, are performed on the GPU, and the data resides in GPU memory at all times, being copied to the host only when saving of the simulation results is required. In this way the time-consuming transfers of data between CPU and GPU are minimized. This fully-on-GPU strategy contrasts with other approaches which only partially port computations to the GPU [17].

Some, but not all, of the tasks involved in the GPU implementation of an SPH simulation are easily parallelized. The difficulties mainly arise from the Lagrangian nature of the method, where particles move in space. The following basic strategies were followed to increase performance: minimization of GPU-CPU data transfers; optimization of memory usage to increase bandwidth; minimization of divergent warps; optimization of memory access patterns; and maximization of occupancy to hide latency.

Figure 3: Illustration of a multi-GPU system with two nodes, each hosting two GPUs. More generally, a multi-GPU system consists of one or more computing nodes, each hosting one or more GPUs, in addition to one or more CPUs. Different nodes are connected via a computer networking technology, and transfer of information is performed with MPI.

An SPH simulation consists of solving the set of equations (7)-(9) at discrete points in time. To achieve this, a series of iterations is performed:
1. Find the neighbours of each particle, that is, all other particles within its support domain of 2h.
2. Sum the pairwise interactions between each particle and its neighbours.
3. Update the system with the use of an integrator.

We now describe the GPU implementation of those steps.
To determine the neighbours of each particle efficiently, the domain space is divided into cubic cells of the size of the kernel support domain, namely 2h. Then a CUDA kernel is launched, with one thread per particle, that assigns to each particle a label corresponding to the cell to which it belongs. The next step consists of using the radix sort algorithm [18] to order the array holding the particle identification labels, or IDs, according to the cell to which the particle belongs. Next, all arrays containing data information (position, velocity, etc.) are ordered according to the newly ordered ID array. Last, an extra array is generated with the index of the first particle of each cell, which enables neighbour identification during the computation of the particle interactions. (A minimal sketch of this cell-construction step is given at the end of this section.)

With the arrays re-ordered according to cells, and the index of the first particle of each cell, a new CUDA kernel is launched. One CUDA thread is assigned to each particle i to search for its neighbours. If particle i is in cell c, the thread will look for neighbours only in cell c and in cells adjacent to c. This CUDA thread will also compute the interaction between particle i and its neighbours, that is, the sum on the right-hand side of equations (7)-(9).

Once the interactions are computed, and the right-hand side of equations (7)-(9) is known, the system state can be updated by numerical integration. As with any other set of simultaneous first-order ordinary differential equations, a variety of integration schemes can be used [19], and here a second-order Verlet algorithm is used. The size of the integration time step varies throughout each simulation according to the Courant-Friedrichs-Lewy (CFL) condition [20] and the magnitude of the interactions. Here, minimum and maximum values of certain physical quantities need to be found among all particles, and for this, efficient CUDA reduction algorithms provided by NVIDIA are employed.

Previous work concerning HPC for SPH published prior to the appearance of GPUs reports results using either small CPU clusters or large supercomputers, such as Blue Gene. It is generally misleading to compare the performance of a specific number of GPU cores to a CPU core, as the architecture and potential performance are quite different. However, when demonstrating the advantages of GPU computations for engineering applications within industry, the ability to contrast the relative runtimes can be both accessible and useful. Hence, results can be expressed in terms of speedup and efficiency by comparing the number of cores versus a single core. Therefore, to analyse the performance of one GPU, the speedup in comparison with a single CPU core is also shown to give an idea of the order of speedup that is possible when using low-cost and accessible GPU cards, instead of large cluster machines.

We have observed (see Ref. [8]) that the GPU version of DualSPHysics runs on the order of 60 times faster than its single-threaded CPU version. For this comparison, an NVIDIA GTX480 GPU and an Intel Core i7 were used. In the CPU code, all of the standard CPU optimizations were applied, symmetry in the particle-particle interactions was exploited, and SIMD instructions that allow performing operations on data sets were used whenever possible. However, if a multi-threaded, OpenMP approach is used on a multi-core CPU instead (4 cores of an Intel Core i7 CPU, with 8 logical cores using Hyperthreading), the speedup of the GPU over the CPU was 14.
In this way, for the mentioned hardware one can estimate a speedup of approximately 60/nc, where nc is the number of CPU cores used.
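As a rough illustration of the cell-construction step described above, the following sketch labels each particle with the index of its 2h cell and then sorts the particle IDs by that label; thrust::sort_by_key performs a radix sort for integer keys on the GPU. The kernel, the grid layout, and all variable names are illustrative assumptions, not the DualSPHysics code itself.

// Illustrative sketch of the cell construction (assumed names, not the actual DualSPHysics code).
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// One thread per particle: compute the linear index of the 2h cell containing the particle.
__global__ void computeCellLabels(const float3* pos, unsigned* cell, unsigned* id,
                                  int np, float3 domainMin, float cellSize,
                                  int ncx, int ncy)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= np) return;
    int cx = (int)((pos[i].x - domainMin.x) / cellSize);   // cellSize = 2h
    int cy = (int)((pos[i].y - domainMin.y) / cellSize);
    int cz = (int)((pos[i].z - domainMin.z) / cellSize);
    cell[i] = (unsigned)(cx + cy * ncx + cz * ncx * ncy);  // linear cell label
    id[i]   = (unsigned)i;                                 // particle ID to be reordered
}

void sortParticlesByCell(unsigned* d_cell, unsigned* d_id, int np)
{
    // Radix sort of particle IDs by cell label (keys = cell, values = id).
    thrust::device_ptr<unsigned> keys(d_cell), vals(d_id);
    thrust::sort_by_key(keys, keys + np, vals);
    // The re-ordered id array is then used to gather position, velocity, etc.,
    // and the index of the first particle of each cell is extracted afterwards.
}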
4. Multi-GPU implementation
As mentioned in Section 3, single-GPU DualSPHysics has proved to be a viable option for accelerated SPH simulations. However, for a fully-on-GPU approach as presented here, there is a limitation on the number of particles that can be simulated due to the size of the GPU memory. For example, a commonly used NVIDIA GPU, the GTX480, with 1.5 GB of memory, can handle up to about 8 million SPH particles, and an NVIDIA Tesla M2050 can simulate a maximum of about 15 million particles. To go beyond the limits imposed by those constraints, a spatial decomposition scheme is introduced (illustrated in Fig. 1) with the use of MPI to communicate between different GPUs, which is explained in this section.

To understand the computational tasks required for a multi-GPU SPH simulation, we begin with a brief description of our multi-GPU system. Fig. 3 shows a schematic view of one such system, consisting of two computing nodes, each hosting one CPU and two GPUs. More generally, a multi-GPU system consists of one or more computing nodes, each hosting one or more GPUs, in addition to one or more CPUs. Nodes are connected with each other via a computer networking technology, and the transfer of data between nodes is executed using a protocol, in this case MPI, whereas each GPU is connected to its host via a PCI Express bus. Throughout the rest of this article, the words "CPU", "node" and "host" are used interchangeably.
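Since there is one MPI process per GPU, each process must select the device it controls before any CUDA calls are made. The following is a minimal sketch of how such a binding could look, under the assumption of one process per GPU and all GPUs on a node being visible to every process on that node; it is not the actual initialization code of our program.

// Illustrative sketch: bind one MPI process to one GPU (assumed layout, not the program's actual code).
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);            // GPUs visible on this node
    cudaSetDevice(rank % ngpus);           // process p_k controls one GPU on its node

    // ... allocate particle arrays on the selected GPU and run the simulation ...

    MPI_Finalize();
    return 0;
}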
The main idea behind the implementation of our scheme is to assign different parts of the physical system to different GPUs; that is, a volume domain decomposition technique is used. (An alternative solution to the limited size of GPU memory when doing single-GPU simulations is to keep the particle information on the host and transfer data to the GPU in batches, one after the other, until all particles have been processed. It is estimated that this procedure would be quite inefficient compared to a multi-GPU approach due to long GPU-to-host data transfer times, and the fact that there would be only one GPU processing the data of all particles.) The portion assigned to a GPU is referred to as a "sub-domain" throughout the rest of this article. Eventually, the intention is to use a multi-dimensional domain decomposition with load balancing; for the development of the algorithm presented herein, a one-dimensional decomposition is sufficient. Fig. 1 illustrates a domain decomposition scheme in operation for a three-dimensional dam break simulation with three obstacles. During and after each integration step, data needs to be transferred between GPUs due to, firstly, particles migrating from one sub-domain to another, and secondly, particles residing on a neighbouring sub-domain that are close enough to the boundary to influence the dynamics of particles in the domain of interest, that is, the "halo building" process. This requires three main steps: preparation of the data, GPU-CPU communications, and inter-CPU communications. Preparation of data is the process by which a GPU re-arranges the information to be passed to other GPUs before actually sending it. An example of this is the re-ordering of the arrays containing the state of particles when they step out of their sub-domain, so that the relevant information is efficiently packaged before it is copied to the CPU (host) memory.

Figure 4: (a) One-dimensional domain decomposition scheme for a computational system with N GPUs. The total physical volume is divided into N sub-domains, each of which is assigned to a different GPU. Data needs to be transferred between GPUs when there is flow from one sub-domain to another, and also when the halo needs to be updated. (b) Data needed to process the dynamics of particles within the range of interaction (a distance of twice the smoothing length, 2h, in our case) is stored in the halo of the sub-domain.

Figure 5: The "edges" of a sub-volume are used to construct the halo of neighbouring sub-domains. The illustrated sub-domain is assumed to have neighbouring sub-domains both to the left and to the right, and therefore the data arrays containing the particle state are ordered with the use of the radix sort routine, as explained in detail in the text.

Fig. 4a illustrates a one-dimensional domain decomposition scheme for a computational system with N GPUs. The total physical volume is divided accordingly into N sub-domains (boundaries are shown in red in Fig. 4), each of which is assigned to a different GPU. Fig. 4b shows two neighbouring domains, n and n + 1.
Figure 6: Strong scaling behaviour for the speedup of a dam break simulation with 5 million particles. The red curve corresponds to simulations run on the system with up to four GPUs, all residing on the same node, whereas the green curve shows results using one GPU per node.
Data from the edge (the portion of the sub-domain within a distance 2h of the sub-domain boundary) of domain n is used to construct the halo of the neighbouring domain n + 1, and vice versa. The width of the halo and the width of the edge are equal to the size of the interaction range, in our case twice the smoothing length, 2h.

For the sake of clarity, in the rest of this sub-section we assume a multi-GPU system with N nodes. Each node is labelled with the number k (with k = 1, 2, . . . , N) and hosts only one CPU (CPU_k) and one GPU (GPU_k). There is, then, one MPI process, p_k, per GPU, and p_k controls GPU_k, which is assigned to sub-domain S_k. After p_k sets the initial state Q_k (position, velocity, etc.) of the particles in S_k and its halo H_k, the program iterates over the following steps:

• Determine the integration step size, ∆t
• GPU_k: update state, Q_k
• GPU_k: if any particles now have positions corresponding to a different sub-domain S_j, with j ≠ k, then
  – GPU_k: label particles according to sub-domain
  – GPU_k: order arrays according to sub-domain label (using radix sort)
  – GPU_k: copy the data, D_send, of particles migrating from one sub-domain to another, from GPU_k to CPU_k
  – CPU_k: send D_send to p_j, with j ≠ k
  – CPU_k: receive data, D_receive, from p_j, with j ≠ k
  – GPU_k: copy D_receive from CPU_k to GPU_k
  – GPU_k: update the number of particles in sub-domain k
• if saving data, then
  – GPU_k: copy the data of all particles in the sub-domain from GPU_k to CPU_k
• GPU_k: label particles according to edge
• GPU_k: order arrays according to edge label (using radix sort)
• GPU_k: copy the data, E_send, of edge particles from GPU_k to CPU_k
• CPU_k: send E_send to p_j, with j ≠ k
• CPU_k: receive data, E_receive, from p_j, with j ≠ k
• GPU_k: copy E_receive from CPU_k to GPU_k
• GPU_k: update the number of particles in sub-domain k
• GPU_k: order arrays according to cubic cell of size 2h

Figure 7: Efficiency for the simulations described in Fig. 6.

Determining the size of ∆t, which is the same for all sub-domains, involves two kinds of reduction operations: one on the GPU, where each device GPU_k determines the minimum ∆t_k suitable for the numerical integration of the particles in its sub-domain, and another on the CPUs, whereby the minimum among all the ∆t_k is found with an MPI Allreduce operation (a minimal sketch of this two-stage reduction is given below).

In summary, the multi-GPU program iterates over three main steps:

1. Update particle state: each GPU updates the state of the particles within the sub-domain assigned to it. This involves (i) re-ordering the data arrays that hold the state of the particles to determine their neighbours efficiently (using the radix sort algorithm), (ii) computing the inter-particle interactions (e.g., the forces), and (iii) the actual update of the particle state.

2. Update sub-domains: the number of particles in each sub-domain must be updated to reflect particle migration due to fluid flow, and the state of the migrated particles must be transferred accordingly. This is achieved by re-ordering the data arrays (using the radix sort algorithm) according to the sub-domain in which the particles are located, and then transferring the necessary data from the GPU to the CPU, then among CPUs if the GPUs reside on different hosts, and last, from the CPU to the new GPU.

3. Update sub-domain halo: the halo holds the data of particles that exist on a neighbouring sub-domain but are close enough to influence the particles of the domain in question (within a distance of twice the smoothing length, 2h, in our case; see Fig. 4b). Here again the radix sort algorithm is used to re-order the data arrays according to the halo to which the particles may belong.

In its current version, the scheme does not overlap the computation on each GPU with the communications among those devices. Since the computation for the particles inside a given sub-domain requires an up-to-date state of the particles within its halo, it is not possible for a given GPU to send data to another device before the computation is finished, which prevents the overlap of computation and communication.

In regards to overlapping CPU and GPU computation, the highly parallelizable nature of the SPH algorithm, in conjunction with the high cost of GPU-CPU communications, makes fully-on-GPU schemes more efficient, at least in single-GPU computations. This has been discussed in Ref. [17] (e.g., see Figure 1 of that work). However, it is conceivable that in multi-GPU approaches, because data needs to be transferred among GPUs via the CPU, one could perform some of the computations on the CPU, thereby saving GPU-to-CPU communication time. This idea remains to be explored.
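The two-stage reduction used to determine the global time step can be sketched as follows: each process first finds the minimum candidate ∆t_k among the particles of its own sub-domain on the GPU (here with thrust::reduce, whereas the program described above uses NVIDIA's reduction kernels), and MPI_Allreduce then returns the minimum over all processes. The function and variable names are illustrative assumptions.

// Illustrative sketch of the global time-step selection (assumed names, not the program's code).
#include <mpi.h>
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cfloat>

float globalTimeStep(const float* d_dtCandidates, int np)
{
    // Stage 1: GPU reduction of the per-particle time-step candidates in this sub-domain.
    thrust::device_ptr<const float> p(d_dtCandidates);
    float dtLocal = thrust::reduce(p, p + np, FLT_MAX, thrust::minimum<float>());

    // Stage 2: minimum over all MPI processes, so every sub-domain advances with the same dt.
    float dtGlobal = 0.0f;
    MPI_Allreduce(&dtLocal, &dtGlobal, 1, MPI_FLOAT, MPI_MIN, MPI_COMM_WORLD);
    return dtGlobal;
}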
(At the time of writing, May 2011, the CUDA 4.0 Release Candidate has become available, and the production version is expected soon. CUDA 4.0 allows direct communications among multiple GPUs, and MPI transfers directly from the GPU memory without an intermediate copy to system (CPU) memory. However, this capability is only available on NVIDIA Tesla GPUs with the Fermi architecture (e.g., Tesla M2050), and the GPUs must reside on the same node.)
Figure 8: Efficiency loss is compared to the percentage of time consumed by various computational tasks in a multi-GPU simulation.

To illustrate the use of the radix sort algorithm for the halo updating procedure, we use Fig. 5. It is assumed that the sub-domain represented in this figure has neighbouring sub-domains both to the left and to the right. The particles labelled 2, 5, and 3 are within a distance 2h of the sub-domain boundary to the left and therefore form the left "edge", whereas the particle labelled 8 is within this distance of the sub-domain boundary to the right and forms the right "edge". Particles 1, 4, 6, and 7 are further than 2h from either boundary and therefore do not belong to any edge. When the update of the halo of the sub-domain neighbouring to the left is made, the state of the particles on the left edge needs to be passed to the GPU assigned to it. To achieve this, first a CUDA kernel is launched which assigns the label "0" to particles not belonging to any edge, the label "1" to particles on the left edge, and the label "2" to particles on the right edge:

halo = {0, 1, 1, 0, 1, 0, 0, 2},
id = {1, 2, 3, 4, 5, 6, 7, 8},

where "halo" and "id" are one-dimensional data arrays of integers, containing the assigned label and the particle identification number, respectively. Next, the radix sort routine is called to obtain a re-ordered array of the particle identification numbers, according to those labels:

halo = {0, 0, 0, 0, 1, 1, 1, 2},
id = {1, 4, 6, 7, 2, 3, 5, 8}.

The new version of the ID array is used to re-order all of the arrays containing the state of the particles, namely the one-dimensional arrays for positions, velocities, etc. Once this reordering is made, the CUDA function cudaMemcpy is used to transfer the relevant data from the GPU to the CPU, and, if the GPUs reside on different hosts, an inter-node communication is made before uploading the data in question onto the destination GPU. (A minimal sketch of this labelling and sorting step is given at the end of this section.)

A similar procedure to the construction of the halo is used for the data transfers due to particle migration. In this case, particles are assigned labels depending on the sub-domain to which they have migrated: "1" if they migrated to the left, "2" if they migrated to the right, and "0" otherwise, and data is transferred accordingly after reordering the data arrays containing the particle state, similarly to the construction of the halo.

There are three main features which make our scheme different from other multi-GPU approaches for SPH (for example, [17]):

1. the use of a low-level CUDA approach instead of a high-level, directive-based transformation of CPU to GPU code;
2. the full implementation of the computations on the GPU, instead of porting only part of the computations to the GPU and performing the rest on the CPU; and
3. the fact that the code development starts from a single-GPU code and progresses towards a multi-GPU application, instead of going from a multi-CPU to a multi-GPU program.

The use of a low-level approach, combined with the fact that the computation is fully implemented on the GPU [points (1) and (2)], facilitates a higher level of control of the hardware, the utilization of the memory hierarchies, as well as the use of highly optimized CUDA functions. These features are not as readily available in directive-based approaches.
Additionally, in regards to multi-GPU programming, the advantage of these two features is the straightforward availability of both the data and the optimized CUDA sub-routines to process it, for instance during the construction of the halo and the migration of particles, which are crucial steps in the scheme.

The advantage of point (3) is related to the previous two points: the data structures and tools needed for the multi-GPU scheme were present and thoroughly tested in the single-GPU version before making the extension to multi-GPU. For instance, the routines for sorting and for determining the labels of the first and last particle in a given region of space were already used in the single-GPU method for finding the neighbours of particles, and when implementing the multi-GPU version they were re-used for determining migrating particles as well as the edges and halos of a sub-domain, as explained in detail above.
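To tie the above together, the following sketch labels particles according to the edge they belong to, sorts the particle IDs by that label with a radix sort, and then moves the packed edge data through the CPU with cudaMemcpy and a blocking MPI exchange, as in the procedure described for Fig. 5. The labelling rule along the y direction and all names are illustrative assumptions; the actual program also gathers the full particle state (position, velocity, etc.) into the send buffer before the transfer.

// Illustrative sketch of the edge labelling and exchange (assumed names, not the actual code).
#include <mpi.h>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// One thread per particle: 0 = interior, 1 = left edge, 2 = right edge (1-D decomposition).
__global__ void labelEdges(const float* y, unsigned* halo, unsigned* id,
                           int np, float yMin, float yMax, float h2 /* = 2h */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= np) return;
    unsigned label = 0;
    if (y[i] < yMin + h2)      label = 1;
    else if (y[i] > yMax - h2) label = 2;
    halo[i] = label;
    id[i]   = (unsigned)i;
}

void exchangeEdgeData(unsigned* d_halo, unsigned* d_id, int np,
                      const float* d_send, float* d_recv,
                      float* h_send, float* h_recv,
                      int nSend, int nRecv, int neighbourRank)
{
    // Radix sort of the particle IDs by edge label packs the edge particles contiguously;
    // the arrays holding the particle state are then gathered in this order (not shown).
    thrust::device_ptr<unsigned> keys(d_halo), vals(d_id);
    thrust::sort_by_key(keys, keys + np, vals);

    // GPU -> CPU, inter-node exchange, CPU -> GPU.
    cudaMemcpy(h_send, d_send, nSend * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Sendrecv(h_send, nSend, MPI_FLOAT, neighbourRank, 0,
                 h_recv, nRecv, MPI_FLOAT, neighbourRank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_recv, h_recv, nRecv * sizeof(float), cudaMemcpyHostToDevice);
}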
5. Hardware
Simulations have been performed on two different systems. The first is part of a dedicated cluster at the University of Manchester, whose GPU section has six nodes, each hosting two NVIDIA Tesla M2050 GPUs. These feature the Fermi architecture and each possess 448 computing cores grouped in 14 streaming multiprocessors of 32 cores each, a clock rate of 1.15 GHz, 3 GB of GDDR5 memory, and a high-speed PCI-Express Gen 2.0 bus. The hosting nodes are connected via 1 Gigabit per second Ethernet technology, which, unfortunately, is much slower than Infiniband, with its current speed of 40 Gigabits per second and significantly lower latency, to which we currently have no access. Each of the hosting nodes also has two six-core AMD Opteron processors (CPUs), and runs Scientific Linux release 5.5.

The second system used for simulations is a single node hosting four GPUs at the University of Vigo, Spain. The GPUs are NVIDIA GTX480, which, like the M2050, feature the Fermi architecture and 448 computing cores, but have a faster clock rate of 1.4 GHz and a smaller memory of 1.5 GB GDDR5. The connection to the host is done via a PCI-Express bus. The hosting node has two Intel Xeon E5620 2.4 GHz CPUs, and since all four GPUs reside on the same host, there is no need for a computer networking technology.
6. Results
To test our multi-GPU program, simulations of a three-dimensional dam break with three obstacles, illustrated in Fig. 1, were performed. When assessing parallelization, it is useful to address questions of efficiency and scalability in order to determine the usefulness of the approach used. The aim of this section is to answer the following questions:

• Scaling: how do computing times change when the system size and the number of computing devices (in our case, the number of GPUs) vary?
• Where are the bottlenecks? Are they in the GPU-to-CPU or in the inter-CPU data transfers? Or are they in the data preparation routines?
• How does the overhead resulting from data transfers between GPUs (due to particle migration and halo construction) compare to the time spent by each GPU processing the motion of particles in its assigned sub-domain?

Figure 9: Percentage of time that each MPI process uses for inter-CPU communications for simulations using different numbers of GPUs. The horizontal axis corresponds to the label of the MPI process. Smaller times correspond to the processes assigned to the endpoints of the simulation box, as explained in more detail in the text.

The speedup s of the multi-GPU simulations is defined as a function of the number of GPUs N in the following way:

$$s(N) = \frac{T(N_{\mathrm{ref}})\, n_p(N)}{T(N)\, n_p(N_{\mathrm{ref}})}, \qquad (10)$$

where T is the simulation time divided by the number of steps N_steps, N_ref is the number of GPUs used as reference (in this article, N_ref = 1), and n_p is a measure of the system size, which in this article is the number of particles. The efficiency η is defined as the speedup divided by the number of GPUs:

$$\eta(N) = \frac{s(N)}{N}. \qquad (11)$$

Two types of scaling behaviour are investigated, weak and strong, which are now described.

(The simulation time per step, T, is used instead of the total simulation time T_sim because N_steps varies with n_p for simulations of a fixed physical time t and fixed physical system dimensions. The reason for this is that N_steps = t/∆t, where ∆t is the average integration step (averaged over the whole simulation, since an adaptive time step algorithm is used), and ∆t is typically proportional to the discretization length ∆p (the "size" of the SPH particle), which determines the number of particles, n_p ∝ 1/∆p³. Therefore, for a fixed physical time t, it is expected that N_steps ∝ n_p^{1/3}. Since the total simulation time should be proportional to both the number of particles and the number of steps (T_sim ∝ N_steps n_p), simulation times will follow T_sim ∝ n_p^{4/3}. This behaviour has been observed in our simulations.)

Figure 10: Comparison of the percentage of the computation time spent by different tasks carried out during a 4-GPU simulation of a dam break, using 7 million particles, where each GPU resides on a different computing node.

Strong scaling is studied by increasing the number of GPUs while leaving the system size (both the number of particles and the physical dimensions) fixed, n_p(N) = n_p(N_ref). Substituting this in (10) we obtain

$$s(N) = \frac{T(N_{\mathrm{ref}})}{T(N)}. \qquad (12)$$

In the ideal case the computation time is inversely proportional to the number of GPUs used, T(N) = c/N, where c is the proportionality constant, giving s(N) = N/N_ref. If the reference number of GPUs is N_ref =
1, then s(N) = N.

Fig. 6 shows the strong scaling behaviour of the speedup of our program, for a dam break simulation with 5 million particles, on the two systems described in Section 5. Five million particles was chosen because this number is close to the maximum number of particles that can currently fit into a single NVIDIA GTX480. Fig. 7 shows the corresponding efficiency for the system described in Fig. 6.

In Fig. 8, the efficiency loss, defined as 100 − η (where η is the efficiency, plotted in Fig. 7), is compared to the percentage of time consumed by various computational tasks in a multi-GPU simulation. From this, it can be observed (i) that a correlation exists between the combined overhead (the sum of inter-CPU communication, GPU-CPU communication, and data preparation times, in green circles) and the loss of efficiency (in red squares), and (ii) that the overhead increases with the number of GPUs mostly due to the inter-CPU communications (black triangles), since the overhead caused by GPU-CPU communications (red diamonds) and data preparation (blue triangles) increases much more slowly. Note that, since each MPI process in a given simulation measures its own communication times, and those times are in general different from one process to another, the longest time among all processes of a given simulation is used here. This will be further discussed for Fig. 9 below.

In future versions of our program, the total overhead is expected to be reduced with (a) the use of pinned memory for faster GPU-CPU data transfers, and (b) the utilization of an Infiniband switch, which should substantially decrease the inter-CPU communication times. Also, for the case of GPUs residing on the same host, the introduction of CUDA 4.0 should result in a further reduction of overhead. As mentioned earlier, a two-dimensional domain decomposition is envisaged, which will entail a significant increase of the inter-CPU communication times, especially on systems that use slow inter-CPU networking technology, such as Gigabit Ethernet, instead of Infiniband or faster versions of Ethernet. The reason for this is that, for a two-dimensional domain decomposition, each MPI process will need to send information to as many as eight other processes when doing the inter-CPU communications on a system with many nodes.

Figure 9 shows the percentage of time that each MPI process uses for inter-CPU communications for 2-, 3-, 4-, 5-, and 6-GPU simulations. The horizontal axis corresponds to the label of the MPI process. Each label is assigned in order of increasing y-coordinate to the 'slices' into which the simulation box has been subdivided (e.g., Fig. 1 for the particular case of three GPUs). One can see that the time is smaller for the processes
In Fig. 10 some of the most time-consuming tasks of a multi-GPU simulation are compared for a dam break with 7 million particles on four GPUs. This figure shows that the time that each GPU uses to compute the particle dynamics (once its halo has been updated) is nearly 80% of the total simulation time, dominating the overall computation. However, considerable time is also spent on inter-CPU communications and, to a lesser extent, on data preparation and GPU-CPU data transfers.

Simulations have also been performed using system sizes proportional to the number of GPUs: n_p(N) = Nc, where c is the proportionality constant. Substituting this in (10) we get

s(N) = \frac{T(N_{\rm ref}) \, N}{T(N) \, N_{\rm ref}}.    (13)

In the ideal case the time is the same regardless of the number of GPUs, T(N) = T(N_ref), giving s(N) = N/N_ref. If the reference number of GPUs is N_ref = 1, then s(N) = N.

Figure 11: Speedup as a function of the number of GPUs for weak scaling, i.e., simulations using a number of particles proportional to the number of GPUs. Plots are shown for n_sub (the number of particles per GPU) equal to 100 thousand, 1, 2, 4, and 8 million, together with the ideal scaling line.

Figure 12: (a) Domain decomposition scheme for a fixed number of particles n_sub per GPU and an increasing number of GPUs, corresponding to weak scaling. (b) For a given number of GPUs (three in this example), one can show that the ratio of particles in the halo n_halo to n_sub decreases, leading to the better scaling that can be seen in Fig. 11. See text for details.
Fig. 11 shows the results of simulations using a number of particles proportional to the number of GPUs. Plots are shown for 100 thousand, 1, 2, 4, and 8 million particles per GPU, with the largest simulation so far being for 32 million particles distributed among four GPUs. As expected, scaling improves with a growing number of particles per GPU, because the time spent on tasks that result in an overhead becomes a smaller percentage of the total computational time, which, as shown in Fig. 10, is dominated by the update of particle states performed by each GPU.

One must note that, for each curve in Fig. 11, the total number of particles in the system is increased by making a finer discretization of the continuum, that is, by using smaller particles (smaller smoothing length h), instead of increasing the physical dimensions of the simulation box (which are kept fixed). In this scenario, one can demonstrate that the larger the number n_sub of particles per GPU, the smaller the fraction n_halo/n_sub, which leads to better scaling.

To demonstrate this, consider the ratio

\frac{n_{\rm halo}}{n_{\rm sub}} \sim \frac{L H h}{L H W / N} = \frac{N h}{W},    (14)

where L, H, and W are the length, height and width of the simulation box, LHh is (up to a factor of order one, since the halo has thickness 2h) the volume of the halo, and LHW is the volume of the whole simulation box, which, divided by N, gives the volume of a sub-domain. Since the volume of one particle is proportional to h^3, for a fixed fluid volume the total number of particles in a simulation is given by n_p = N n_sub ∼ h^{-3}. We can therefore write

\frac{n_{\rm halo}}{n_{\rm sub}} \sim \frac{N^{2/3}}{n_{\rm sub}^{1/3}}.    (15)

This means that, for a fixed number of GPUs N, the number of particles in the halo, relative to the number of particles in the sub-domain, shrinks as n_sub grows. This is illustrated in Fig. 12. In part (a), a fixed number of particles per GPU with an increasing number of GPUs is shown, corresponding to weak scaling. One can observe the reduction of the size of the halo, which has a thickness of 2h. Part (b) of this figure also shows a decreasing halo size as the number of particles is increased while keeping the number of GPUs fixed.
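As a rough numerical illustration of Eq. (15) (our own back-of-the-envelope numbers, ignoring the box width W and all order-one prefactors), compare the smallest and largest loads per GPU of Fig. 11 at N = 4:

\left.\frac{n_{\rm halo}}{n_{\rm sub}}\right|_{n_{\rm sub}=10^{5}} \sim \frac{4^{2/3}}{(10^{5})^{1/3}} \approx 0.054,
\qquad
\left.\frac{n_{\rm halo}}{n_{\rm sub}}\right|_{n_{\rm sub}=8\times 10^{6}} \sim \frac{4^{2/3}}{(8\times 10^{6})^{1/3}} \approx 0.013 .

The relative halo cost is thus roughly four times smaller for 8 million particles per GPU than for 100 thousand, consistent with the better weak scaling of the larger runs in Fig. 11; only the trend, not the absolute values, is meaningful here.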
It is instructive to compare the scaling results presented in this article with those of other multi-GPU and multi-CPU approaches. Oger et al. [17] describe a multi-GPU SPH scheme that uses directive-based, partially ported GPU code. Although that report does not present the same kind of scaling analysis as we do here, one can extract some useful information from their data. In particular, from their Figure 9, one can obtain an approximation to their strong scaling behaviour by dividing the time their simulation takes on two nodes by the time it takes on four nodes, which in the ideal case would yield 2. In Ref. [17] this value is approximately 1.7 (for 155,500 particles) and 1.8 (for 2,474,000 particles). The same division using our data for five million particles gives approximately 1.7, showing consistency between the two schemes.

A comparison with multi-CPU codes is also interesting. For example, Maruzewski et al. [1] report both strong and weak scaling of SPH simulations on as many as 1024 CPU cores, with up to 124 million particles. As is often the case in this type of multi-CPU simulation, the scaling data (Figures 5, 6, and 7 in Ref. [1]) show no significant departure from the ideal value until the number of CPUs is relatively large, on the order of eight in this case.

A plausible explanation for the worse scaling behaviour on multi-GPU systems than on multi-CPU clusters has been proposed in Ref. [21] for Molecular Dynamics simulations. In essence, the part of the program that computes the motion of particles is accelerated by the GPU, but the inter-device communications are not. Therefore, the time spent on the latter, as a fraction of the total simulation time, grows more quickly as the number of computing devices is increased.
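This argument can be made semi-quantitative with a simple overhead model (ours, for illustration only, and not taken from Ref. [21]). Let T_1 be the per-step particle-update time of a single CPU core, a the acceleration factor of one GPU relative to that core, and t_comm a communication time per step that does not decrease with N. Then

T(N) \approx \frac{T_1}{a\,N} + t_{\rm comm},
\qquad
\eta(N) = \frac{T(1)}{N\,T(N)} \approx \frac{T_1/a + t_{\rm comm}}{T_1/a + N\,t_{\rm comm}} .

For a fixed network (fixed t_comm), a larger acceleration factor a shrinks the compute term faster, so the communication term becomes a significant fraction of the total at a smaller N; a GPU cluster therefore departs from ideal scaling earlier than a CPU cluster on the same interconnect, as observed above.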
7. Summary and future work
This paper presented a computational methodology to carry out three-dimensional, massively parallel Smoothed Particle Hydrodynamics (SPH) simulations across multiple GPUs. This has been achieved by introducing a spatial domain decomposition scheme into the single-GPU part of the DualSPHysics code, converting it into a multi-GPU SPH program.

By using multiple GPUs, on the one hand, simulations that could also be performed on a single GPU can now be run even faster than on one of these devices alone, leading to speedups of several hundred in comparison with a single-threaded CPU program. On the other hand, accelerated simulations with tens of millions of particles can now be performed, which would be impossible to fit on a single GPU due to memory constraints in a full-on-GPU simulation, where all data reside on the GPU. By being able to simulate this large number of particles, without the need for large, expensive clusters of CPUs, at speeds well beyond one hundred times faster than single-CPU programs, our software has the potential to bypass the limitations on system size and simulation time which have been constraining the applicability of SPH to various engineering problems.

The methodology features the use of MPI routines and of the sorting algorithm radix sort for the migration of particles between GPUs as well as for building the sub-domain 'halo'. A study of weak and strong scaling with a slow Ethernet connection shows that inter-CPU communications are likely to be the bottleneck of our simulations, but considerable overhead is also produced by data preparation and, to a lesser extent, by GPU-CPU data transfers. A possible solution to the overhead caused by the latter is the use of pinned memory, which so far remains unused in our program. The use of Infiniband instead of Ethernet should reduce the overhead caused by inter-CPU communications, and, for the case of GPUs residing on the same host, the recently released CUDA 4.0 will be introduced, which should further accelerate communications. Future work also includes the introduction of a dynamic load balancing algorithm, a multi-dimensional domain decomposition scheme, and floating body capabilities.
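For completeness, a minimal sketch of the pinned-memory optimisation mentioned above, using the standard CUDA runtime API (an assumption of how it could be wired in, not the DualSPHysics implementation): host staging buffers for halo and migration data are allocated as page-locked memory, which allows faster and asynchronous GPU-CPU copies.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  const size_t n = 1 << 20;                 // number of floats in the halo buffer (example size)
  const size_t bytes = n * sizeof(float);
  float *h_halo = nullptr, *d_halo = nullptr;

  // Page-locked (pinned) host allocation: enables faster, asynchronous transfers.
  cudaHostAlloc(reinterpret_cast<void**>(&h_halo), bytes, cudaHostAllocDefault);
  cudaMalloc(reinterpret_cast<void**>(&d_halo), bytes);
  cudaMemset(d_halo, 0, bytes);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Asynchronous device-to-host copy of halo data; the CPU could start packing
  // MPI messages from already-copied data while the rest is still in flight.
  cudaMemcpyAsync(h_halo, d_halo, bytes, cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  std::printf("copied %zu bytes of halo data via pinned memory\n", bytes);

  cudaStreamDestroy(stream);
  cudaFree(d_halo);
  cudaFreeHost(h_halo);
  return 0;
}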
Acknowledgements
The authors gratefully acknowledge the support of EPSRC EP/H003045/.

References

[1] P. Maruzewski, D. Le Touzé, G. Oger, F. Avellan, SPH high-performance computing simulations of rigid solids impacting the free-surface of water, Journal of Hydraulic Research 48 (Extra Issue) (2010) 126–134. doi:10.3826/jhr.2009.0011.
[2] A. Ferrari, M. Dumbser, E. F. Toro, A. Armanini, A new 3D parallel SPH scheme for free surface flows, Computers and Fluids 38 (6) (2009) 1203–1217. doi:10.1016/j.compfluid.2008.11.012.
[3] R. Spurzem, P. Berczik, G. Marcus, A. Kugel, G. Lienhart, I. Berentzen, R. Männer, R. Klessen, R. Banerjee, Accelerating astrophysical particle simulations with programmable hardware (FPGA and GPU), Computer Science Research and Development 23 (3-4) (2009) 231–239.
[4] T. Amada, M. Imura, Y. Yasumuro, Y. Manabe, K. Chihara, Particle-based fluid simulation on GPU, in: ACM Workshop on General-Purpose Computing on Graphics Processors and SIGGRAPH, 2004.
[5] T. Harada, S. Koshizuka, Y. Kawaguchi, Smoothed particle hydrodynamics on GPUs, in: Proceedings of Computer Graphics International, 2007.
[6] A. Hérault, G. Bilotta, R. A. Dalrymple, SPH on GPU with CUDA, Journal of Hydraulic Research 48 (supplement 1) (2010). doi:10.1080/00221686.2010.9641247.
[7] http://dual.sphysics.org.
[8] A. J. C. Crespo, J. Domínguez, A. Barreiro, M. Gómez-Gesteira, B. D. Rogers, GPUs, a new tool of acceleration in CFD: efficiency and reliability on smoothed particle hydrodynamics methods, PLoS ONE. doi:10.1371/journal.pone.002068.
[9] .
[10] S. McIntosh-Smith, T. Wilson, J. Crisp, A. A. Ibarra, R. B. Sessions, Energy-aware metrics for benchmarking heterogeneous systems, SIGMETRICS Performance Evaluation Review 38 (2011) 88–94.
[11] J. C. Campbell, R. Vignjevic, M. Patel, S. Milisavljevic, Simulation of water loading on deformable structures using SPH, CMES - Computer Modeling in Engineering & Sciences 49 (1) (2009) 1–21.
[12] M. Gómez-Gesteira, B. D. Rogers, R. A. Dalrymple, A. Crespo, State-of-the-art of classical SPH for free-surface flows, Journal of Hydraulic Research 48 (2010) 6–27.
[13] G. Liu, M. B. Liu, Smoothed Particle Hydrodynamics: a meshfree particle method, World Scientific, 2003.
[14] J. J. Monaghan, Why particle methods work, SIAM Journal on Scientific and Statistical Computing 3 (1982) 422–433.
[15] D. M. Acreman, T. J. Harries, D. A. Rundle, Modelling circumstellar discs with three-dimensional radiation hydrodynamics, Monthly Notices of the Royal Astronomical Society 403 (2010) 1143–1155. arXiv:0912.2030.
[16] A. Crespo, M. Gómez-Gesteira, R. A. Dalrymple, Boundary conditions generated by dynamic particles in SPH methods, Computers, Materials and Continua 5 (3) (2007) 173–184.
[17] G. Oger, E. Jacquin, M. Doring, P.-M. Guilcher, R. Dolbeau, P.-L. Cabelguen, L. Bertaux, D. Le Touzé, B. Alessandrini, Hybrid CPU-GPU acceleration of the 3-D parallel code SPH-flow, in: Proceedings of the 5th International SPHERIC Workshop, Ed. B. D. Rogers, 2010, pp. 394–400.
[18] N. Satish, M. Harris, M. Garland, Designing efficient sorting algorithms for manycore GPUs, in: Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2009.
[19] W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes in C (2nd ed.): the art of scientific computing, Cambridge University Press, New York, NY, USA, 1992.
[20] M. Gómez-Gesteira, B. D. Rogers, R. A. Dalrymple, A. Crespo, M. Narayanaswamy, SPHysics - development of a free-surface fluid solver - Part 1: Theory and formulations, Computers and Geosciences. doi:10.1016/j.cageo.2012.02.029.
[21] C. R. Trott, L. Winterfeld, P. S. Crozier, General-purpose molecular dynamics simulations on GPU-based clusters (2011). arXiv:1009.4330v2 [cond-mat.mtrl-sci].