A Smoothed Particle Hydrodynamics Mini-App for Exascale
Aurélien Cavelan, Rubén M. Cabezón, Michal Grabarczyk, Florina M. Ciorba
Aurélien Cavelan, University of Basel ([email protected])
Rubén M. Cabezón, University of Basel
Michal Grabarczyk, University of Basel
Florina M. Ciorba, University of Basel, Switzerland ([email protected])
May 7, 2020
Abstract
Smoothed Particle Hydrodynamics (SPH) is a particle-based, meshfree, Lagrangian method used to simulate multidimensional fluids with arbitrary geometries, most commonly employed in astrophysics, cosmology, and computational fluid dynamics (CFD). It is expected that these computationally demanding numerical simulations will significantly benefit from the up-and-coming Exascale computing infrastructures, which will be able to perform 10^18 FLOP/s. In this work, we review the status of a novel SPH-EXA mini-app, which is the result of an interdisciplinary co-design project between the fields of astrophysics, fluid dynamics, and computer science, whose goal is to enable SPH simulations to run on Exascale systems. The SPH-EXA mini-app merges the main characteristics of three state-of-the-art parent SPH codes (namely ChaNGa, SPH-flow, SPHYNX) with state-of-the-art (parallel) programming, optimization, and parallelization methods. The proposed SPH-EXA mini-app is a C++14 lightweight and flexible header-only code with no external software dependencies. Parallelism is expressed via multiple programming models, which can be chosen at compilation time with or without accelerator support, for a hybrid process+thread+accelerator configuration. Strong- and weak-scaling experiments on a production supercomputer show that the SPH-EXA mini-app can be efficiently executed with up to 267 million particles and with up to 65 billion particles in total on 2,048 hybrid CPU-GPU nodes.

1 Introduction
Smoothed Particle Hydrodynamics is a commonly used method to simulate the mechanics of continuum media. It is a pure Lagrangian technique, particle-based, meshfree, and adequate for simulating highly distorted geometries and very dynamical scenarios, while conserving momentum and energy by construction. As such, it has been used in many different fields, including computational fluid dynamics, solid mechanics, engineering, nuclear fusion, astrophysics, and cosmology. In this work we present a review of the status of an SPH mini-app designed for Exascale.

In most cases, actual SPH code implementations initially target a specific simulation scenario (or a subset thereof). This means that in many cases these hydrodynamics codes are implemented with the physics needed to solve a specific problem in mind, while parallelization, scaling, and resilience take a relevant role only later. This importance shift (or rebalancing) becomes relevant when addressing much larger problems that require much more computational resources, or when the computing infrastructures change or evolve. The philosophy behind the design of our SPH-EXA mini-app is the opposite. Knowing that the SPH technique is so ubiquitous, we developed the mini-app targeting the emerging Exascale infrastructures. We took into account the state-of-the-art SPH methodology, learnt from current production SPH codes, and implemented it with the design characteristics for performance that would be desirable in a code that could potentially reach sustained ExaFLOP/s, namely scalability, adaptability, and fault tolerance. To reach this goal, we employed state-of-the-art computer science methodologies, explored different parallelization paradigms and their combinations, and used solid software development practices, all within a close multi-directional interdisciplinary collaboration between scientists from fluid dynamics, astrophysics, cosmology, and computer science.

Lighter than production codes, mini-apps are algorithm-oriented and allow easy modifications and experiments [8]. Indeed, a single function can be evaluated using different strategies leading to different performance results, even if the physical result is unchanged. These evaluation strategies may rely on vectorization, node-level multi-threading, or cross-node parallelism. Their efficiency also depends on the platform configuration: presence of accelerators, generation of CPU, interconnection network fabric and topology, and others. Therefore, a mini-app is perfectly suitable as a portable code sandbox to optimize a numerical method, such as the SPH method, for Exascale.

The SPH-EXA mini-app is a C++14 lightweight and flexible header-only code with no external software dependencies, which works by default with double-precision data types. Parallelism is expressed via multiple programming models, which can be chosen at compilation time with or without accelerator support, for a hybrid node-core-accelerator configuration: MPI+OpenMP+OpenACC|OpenMP target offloading|CUDA. The SPH-EXA mini-app can be compiled with the GCC, Clang, PGI, Intel, and Cray C/C++ compilers (as long as the given compiler supports the chosen programming model). The code is open-source and is freely available on GitHub under the MIT license at: https://github.com/xxx/yyy

Weak-scaling experiments on a production supercomputer have shown that the SPH-EXA mini-app can execute with up to 65 billion particles on 2,048 hybrid CPU-GPU nodes at 67% efficiency. (For reference, the largest cosmological SPH simulation to date performed with the parent codes is of the order of 25 billion particles [2]; although the comparison is not entirely fair due to the different levels of physics included, it gives an order-of-magnitude reference.)
These results are, therefore, very promising, especially given that the efficiency of the code decreased very little when scaling from 512 to 2,048 nodes (see Section 5.2). Similar efficiency results are also reported in the strong-scaling results included in Figure 8.

This paper is organized as follows. Section 2 presents a short overview of other representative mini-apps from scientific computing and describes the differences compared to skeleton applications. Section 3 concentrates on the co-design aspects of this work. Section 4 includes all the details related to the present implementation of the SPH-EXA mini-app, including the implemented SPH version and the details on the employed parallel models and domain decomposition. Section 5 presents the results obtained regarding validation and verification, and the scaling experiments conducted. Finally, Section 6 discusses the next steps for the project and Section 7 presents the conclusions.

2 Related Work

Mini-apps or proxy-apps have received great attention in recent years, with several projects being developed. The Mantevo Suite [9], devised at Sandia National Laboratories for high performance computing (HPC), is one of the first large mini-app sets. It includes mini-apps that represent the performance of finite-element codes, molecular dynamics, and contact detection, to name a few.

Another example is CGPOP [12], a mini-app from oceanography that implements a conjugate gradient solver to represent the bottleneck of the full Parallel Ocean Program application. CGPOP is used for experimenting with new programming models and to ensure performance portability.

At Los Alamos National Laboratory, MCMini [13] was developed as a co-design application for Exascale research. MCMini implements Monte Carlo neutron transport in OpenCL and targets accelerators and coprocessor technologies.

The CESAR Proxy-apps [14] represent a collection of mini-apps belonging to three main fields: thermal hydraulics for fluid codes, neutronics for neutronics codes, and coupling and data analytics for data-intensive tasks.

The European ESCAPE project [15] defines and encapsulates the fundamental building blocks ('Weather & Climate Dwarfs') that underlie weather and climate services. This serves as a prerequisite for any subsequent co-design, optimization, and adaptation efforts. One of the ESCAPE outcomes is Atlas [16], a library for numerical weather prediction and climate modeling, with the primary goal of exploiting the emerging hardware architectures forecasted to be available in the next few decades. Interoperability across the variegated solutions that the hardware landscape offers is a key factor for an efficient software and hardware co-design [17], and is thus of great importance when targeting Exascale systems.

In the context of parallel programming models, research has been focusing on the efficient use of intra-node parallelism, able to properly exploit the underlying communication system through a fine-grain task-based approach.
The authors of [18] recommend keeping the build and execution system simple, so as not to discourage potential users. The build should be kept as simple as a Makefile, and the preparation of an execution run should require no more than a handful of command-line arguments: "if more than this level of complexity seems to be required, it is possible that the resulting MiniApp itself is too complex to be human-parseable, reducing its usefulness." [18]

The present work introduces the interdisciplinary co-design of an SPH-EXA mini-app with three parent SPH codes originating in the astrophysics academic community and the industrial CFD community. This represents a category not discussed in [18].
3 Co-Design

Since optimization is critical to achieve the scalability needed to exploit Exascale computers, the long-term goal of SPH-EXA [5] is to provide a parallel, optimized, state-of-the-art implementation of basic SPH operands together with classical test cases used by the SPH community. This can be implemented at different levels: employing state-of-the-art programming languages, dynamic load balancing algorithms, fault-tolerance techniques, and optimized tools and libraries.

Interdisciplinary co-design and co-development [9] allow to adequately involve the developers of the parent codes (in our case ChaNGa, SPH-flow, and SPHYNX), thus boosting and improving the design and the implementation of the SPH-EXA mini-app. We employ an interdisciplinary co-design approach that goes beyond the classical hardware-software approach and holistically co-designs between the SPH application, the algorithms it employs (SPH method, distributed tree, finding neighbors, load balancing, silent data corruption detection and recovery, optimal checkpointing intervals), and the HPC systems (via efficient parallelization with various classical and modern programming models). To gauge the efficiency of additional modern programming paradigms (e.g., asynchronous tasking) against the classical ones for SPH, we also parallelize the SPH mini-app using classical paradigms, such as MPI, OpenMP, CUDA, and the more recent OpenMP target offloading, OpenACC, and HPX. This way, we cover all aspects that influence the performance and scalability of SPH codes on modern and future HPC architectures.

In the first development stage, the focus has been on identifying and implementing the vanilla SPH solver (i.e., the one with the original equations of [20]), the general workflow of which is shown in Figure 1. The solver has been designed from scratch as a distributed application. To achieve maximum efficiency and scalability on current Petascale and upcoming Exascale systems, one of the main challenges is to minimize inter-process communication and synchronization. In particular, global synchronizations are avoided as much as possible, as they would result in all processes having to communicate with all others, leading to global idleness and loss of efficiency. Instead, we focused on developing a method that favors nearest-neighbor communication between computing nodes, and only relies on collective communication for simple, yet necessary operations (e.g., computing the new size of the computational domain or the total energy). Currently, the SPH-EXA mini-app includes a state-of-the-art implementation of the SPH equations (see Section 4.1), which has been built progressively atop the optimal initial version.
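To make the workflow of Figure 1 concrete, the following sketch outlines how one time-step could be organized as a sequence of per-particle passes, nearest-neighbor halo exchanges, and a single global reduction for the time-step. The ParticleData container and all function names are illustrative placeholders rather than the mini-app's actual interface, and the exact placement of the three halo exchanges per time-step is an assumption.

```cpp
// Simplified outline of one SPH time-step (illustrative names, not the real API).
#include <vector>

struct ParticleData { std::vector<double> x, y, z, vx, vy, vz, h, rho, p, u; };

void buildTreeAndDecompose(ParticleData&) {}    // octree update + load balancing (collective)
void exchangeHalos(ParticleData&) {}            // nearest-neighbor point-to-point exchange
void findNeighbors(ParticleData&) {}
void computeDensity(ParticleData&) {}           // density summation
void computeEquationOfState(ParticleData&) {}
void computeIAD(ParticleData&) {}               // integral approach to derivatives
void computeMomentumAndEnergy(ParticleData&) {} // momentum and energy rates of change
double computeMinTimeStep(const ParticleData&) { return 1.0e-4; } // global MIN reduction
void updateQuantities(ParticleData&, double) {} // positions, velocities, internal energy

void step(ParticleData& d)
{
    buildTreeAndDecompose(d);                // 1st global synchronization
    exchangeHalos(d);
    findNeighbors(d);
    computeDensity(d);
    computeEquationOfState(d);
    exchangeHalos(d);                        // refresh halos before derivative passes
    computeIAD(d);
    exchangeHalos(d);                        // third halo exchange of the time-step
    computeMomentumAndEnergy(d);
    const double dt = computeMinTimeStep(d); // 2nd global synchronization
    updateQuantities(d, dt);                 // optional 3rd: conservation checks
}
```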
4 The SPH-EXA Mini-App

The SPH-EXA mini-app is described in the following, together with its latest developments, supported physics features, and parallel programming models. The code is open-source and is freely available on GitHub under the MIT license at: https://github.com/xxx/yyy

4.1 The SPH Method

The SPH method has been implemented in the SPH-EXA mini-app following the formalism described in [1]. The main equations that express the calculation of the local density, and of the momentum and energy rates of change, are:

\rho_a = \sum_b m_b\, W_{ab}(h_a), \qquad (1)

\left(\frac{dv_i}{dt}\right)_a = -\sum_b m_b \left( \frac{P_a}{\Omega_a \rho_a^2}\, \mathcal{A}_{i,ab}(h_a) + \frac{P_b}{\Omega_b \rho_b^2}\, \mathcal{A}_{i,ab}(h_b) \right) + a^{AV}_{i,a}, \qquad (2)

\left(\frac{du}{dt}\right)_a = \frac{P_a}{\Omega_a \rho_a^2} \sum_b \sum_i m_b (v_{i,a} - v_{i,b})\, \mathcal{A}_{i,ab}(h_a) + \frac{1}{2} \sum_b \sum_i (v_{i,a} - v_{i,b})\, a^{AV}_{i,a}, \qquad (3)

a^{AV}_{i,a} = \frac{1}{2} \sum_b m_b\, \Pi'_{ab} \left\{ \frac{\mathcal{A}_{i,ab}(h_a)}{\rho_a} + \frac{\mathcal{A}_{i,ab}(h_b)}{\rho_b} \right\}, \qquad (4)

where subindex a is the particle index, b runs over its neighbors' indices, and i is the spatial dimension index. \rho, m, P, and v_i are the density, mass, pressure, and velocity components of the particle, respectively. W_{ab} is the SPH interpolation kernel, which depends on the local spatial resolution h, named smoothing length. \Omega are the grad-h terms that take into account the changes in the local smoothing length. \mathcal{A}_{i,ab}(h_a) are the terms of the integral approach to derivatives, IAD (see [28, 1] for more details), and a^{AV}_i are the artificial viscosity acceleration components, where

\Pi'_{ab} = \begin{cases} -\alpha\, v^{sig}_{ab}\, w_{ab} & \text{for } \mathbf{x}_{ab} \cdot \mathbf{v}_{ab} < 0, \\ 0 & \text{otherwise}, \end{cases} \qquad (5)

is the artificial viscosity dissipation term. v^{sig}_{ab} = c_a + c_b - w_{ab} is an estimate of the signal velocity between particles a, b, where c is the local speed of sound, and w_{ab} = \mathbf{v}_{ab} \cdot \mathbf{x}_{ab} / |\mathbf{x}_{ab}|. Finally, \bar{\rho}_{ab}^{-1} = 2\,(\rho_a + \rho_b)^{-1}.

Updates to the positions and velocities of the particles are done with a 2nd-order Press method, while the energy is updated via a 2nd-order Adams-Bashforth scheme.

The SPH-EXA mini-app currently implements the Sinc family of interpolating kernels, the benefits of which are described in [27]:

W^s_n(v, h, n) = \begin{cases} \frac{B_n}{h^d}\, S_n(\pi v) & \text{for } 0 \le v \le 1, \\ 0 & \text{for } v > 1, \end{cases} \qquad (6)

where S_n(\cdot) = [\mathrm{sinc}(\cdot)]^n, n is a real exponent, B_n a normalization constant, and d the spatial dimension. The function sinc is defined as \mathrm{sinc}(\pi v) = \sin(\pi v)/(\pi v), with v = |x_a - x_b|/h_a, and it is widely used in spectral theory.

Figure 1: SPH general workflow. Computational steps are performed for every particle in the simulation domain over a number of time-steps. Point-to-point and collective communications update the particle information after certain computational steps. The complexity of each step is shown in red, where n is the total number of particles and m is the number of neighboring particles of each particle, which depends on the smoothing length h.

Additional SPH kernels can be plugged in upon implementation (in the above cases in only 6 LOC), proving how such a mini-app can be useful in allowing easy and quick modifications to experiment with alternative solutions. While not all existing SPH techniques and algorithms need to be implemented, a number of them, such as the SPH interpolation kernels, artificial viscosity treatments, or generalized volume elements, can be later developed as separate interchangeable code modules.
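As an illustration of Eqs. (1) and (6), the sketch below evaluates the sinc kernel and accumulates the density of one particle over its neighbor list. The normalization constant Bn, the neighbor lists, and all function names are assumed for the example and do not correspond to the mini-app's actual interface.

```cpp
// Sketch: sinc^n kernel of Eq. (6) and density summation of Eq. (1).
#include <cmath>
#include <cstddef>
#include <vector>

constexpr double kPi = 3.141592653589793;

// W(v, h) = Bn / h^d * sinc^n(pi * v), with v = |x_a - x_b| / h_a and d = 3.
double sincKernel(double v, double h, double n, double Bn)
{
    if (v >= 1.0) return 0.0;                  // outside the compact support
    if (v == 0.0) return Bn / (h * h * h);     // sinc(0) = 1
    const double x = kPi * v;
    const double s = std::sin(x) / x;
    return Bn / (h * h * h) * std::pow(s, n);
}

// rho_a = sum_b m_b W_ab(h_a); the sum includes the particle's own contribution.
double density(std::size_t a, const std::vector<std::size_t>& neighbors,
               const std::vector<double>& x, const std::vector<double>& y,
               const std::vector<double>& z, const std::vector<double>& m,
               double h, double n, double Bn)
{
    double rho = m[a] * sincKernel(0.0, h, n, Bn);   // self contribution
    for (std::size_t b : neighbors)
    {
        const double dx = x[a] - x[b], dy = y[a] - y[b], dz = z[a] - z[b];
        const double v  = std::sqrt(dx * dx + dy * dy + dz * dz) / h;
        rho += m[b] * sincKernel(v, h, n, Bn);
    }
    return rho;
}
```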
The SPH-EXA mini-app kernels have been optimized to enable excellent automatic vectorization via compiler optimizations.

Finally, as astrophysical and cosmological scenarios are within the scope of the SPH-EXA mini-app, we also implemented a multipolar expansion algorithm to evaluate self-gravity, using the same tree structure that we use to find neighboring particles and based on which we perform the domain decomposition (see Section 4.3 for more details on the latter).

These components are common to many production SPH codes and represent the basis over which we can optimize and extend the functionality of the SPH-EXA mini-app.

4.2 Implementation

The SPH workflow illustrated in Figure 1 is implemented in code as shown in the sequence diagram included in Figure 2. To ensure code flexibility, extensibility, and readability we follow solid code-design principles, use continuous integration to avoid regressions, and name functions and classes appropriately. Additionally, we defined a coding standard for the project, which can be applied automatically with the clang-format tool.

Regarding core-level optimization, particles are kept in memory in an order that matches the octree, so that particles that are close to each other in 3D space are also close in memory, to minimize cache misses. Additionally, widely used values are precomputed and stored either in memory or in lookup tables.

The goal of developing the SPH-EXA mini-app using parallel programming is to provide a reference implementation in MPI+X. The MPI standard is the de facto communication library for distributed applications in HPC, due to the lack of an outperforming alternative for inter-node communication. OpenMP is the de facto standard for parallel multithreaded programming on shared-memory architectures, widely used in academia, industry, and government labs. The hybrid MPI+OpenMP programming model represents a solid starting point for the parallel and distributed execution of the SPH-EXA mini-app. However, vanilla MPI+OpenMP does not fully exploit the heterogeneous parallelism in the newest hardware architectures. Therefore, since version 4.5, the OpenMP standard [21] supports offloading of work to target accelerators. Other languages directly targeting accelerators have been proposed and accepted by the community, such as OpenACC (a directive-based programming model targeting a CPU+accelerator system, similar to OpenMP), CUDA (an explicit programming model for GPU accelerators), and OpenCL.

The SPH-EXA mini-app currently implements HPX in an experimental development branch to explore the efficiency of task-based models and their potential on (pre-)Exascale machines. The aims of exploring such a task-based asynchronous programming model are to overlap computations and communications, both intra-node and inter-node, and to remove all synchronizations and barriers.

In terms of I/O from/to the file system, the SPH-EXA mini-app has been designed from scratch to handle distributed data. The idea is that each computing node generates or loads a subset of the data. We currently support MPI I/O (in a branch) to perform parallel I/O operations at large scale by reading input and writing output binary data.
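A minimal sketch of what such a collective binary dump could look like with MPI I/O is shown below; the file layout (one contiguous block of doubles per rank) and the function name are assumptions for illustration, not the mini-app's actual output format.

```cpp
// Sketch: collective binary output of one particle field with MPI I/O.
#include <mpi.h>
#include <vector>

void writeField(MPI_Comm comm, const char* path, const std::vector<double>& field)
{
    // Global offset of this rank's block, counted in elements.
    long long local = static_cast<long long>(field.size()), offset = 0;
    MPI_Exscan(&local, &offset, 1, MPI_LONG_LONG, MPI_SUM, comm);
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) offset = 0;   // MPI_Exscan leaves rank 0's result undefined

    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, offset * sizeof(double), field.data(),
                          static_cast<int>(field.size()), MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```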
We plan to move to HDF5 in the coming months.

In terms of precision, round-off errors in single precision can add up to levels that might render the calculation useless, even when using code units, mostly due to the lack of accuracy in the evaluation of the gravitational forces using the tree. Nevertheless, this assessment is based on the experience with the parent codes and we have not yet studied this option in detail. Thus, we continue to use double-precision data types, while planning an exploration of a mixed-precision approach in future work.

The parallel program flow of the SPH-EXA mini-app for a typical test case (Rotating Square Patch [4]) is shown in Figure 2. The diagram describes the main execution loop, which is used to compute a time-step, and the sub-loops, where OpenMP and GPU offloading are used to accelerate the computation within single time-steps. Note that there are only three global synchronizations (MPI_Allreduce) across distributed-memory nodes. The first synchronization is used to count the number of tree nodes when performing the domain decomposition (described later in Section 4.3 and illustrated in Figure 3), the second synchronization is used to compute the minimum time-step (∆t) needed to advance the system in time, and the last synchronization is optional but useful for tracking the total momentum and energy in order to verify that they are conserved.
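The ∆t synchronization reduces each rank's locally admissible time-step to a single global minimum; a minimal sketch, assuming a per-particle array of candidate time-steps, is:

```cpp
// Global minimum time-step: one of the three MPI_Allreduce calls per iteration.
#include <mpi.h>
#include <algorithm>
#include <vector>

double globalMinTimeStep(const std::vector<double>& localDt, MPI_Comm comm)
{
    // Assumes every rank owns at least one particle.
    double localMin  = *std::min_element(localDt.begin(), localDt.end());
    double globalMin = 0.0;
    MPI_Allreduce(&localMin, &globalMin, 1, MPI_DOUBLE, MPI_MIN, comm);
    return globalMin;
}
```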
Figure 2: Sequence diagram of the SPH-EXA mini-app illustrating its parallel program flow, the use of the various parallel programming models, its complexity, and its synchronization points.

4.3 Domain Decomposition

The spatial 3D domain of particles is decomposed into cells using an oct-tree data structure. The oct-tree is global (every node keeps an identical copy of it) and it is created only once at the beginning of the execution. The tree is then simply updated every iteration: branches are added or removed from the tree as needed. This global tree only contains general information, such as the number of particles per cell, and does not need to be refined down to single particles. Only the 'top' part of the tree is virtually shared and maintained identical by every computing node. Compared to existing methods, this approach only requires two collective communications
(MPI_Allreduce): one to compute the number of particles in every cell of the tree and one to compute the new size of the spatial domain after particles have moved. Nevertheless, we aim at also circumventing these two collective communications, whose final objective is to rebuild the global tree to maintain particle load balance. Because particles can only move within the range of their local smoothing length, the upper levels of the global tree change at a much longer timescale than that of the particle dynamics. Therefore, these global communications can be performed scarcely while still avoiding particle load imbalance.

Domain cells (tree branches) are assigned to MPI processes using the global tree to guarantee data locality, as shown in Figure 3. The assignment process descends the tree down to nodes that do not contain fewer particles than the value globalBucketSize (a user-defined control parameter), which is typically equal to 128 particles. This way, certain processes can be assigned an additional tree node compared to others, but the difference in particles per process will be no more than the value assigned to globalBucketSize. This causes an imbalance equal to globalBucketSize/noOfParticlesPerMpiRank × 100%, i.e., for 1 million particles per MPI rank and a globalBucketSize of 128, the particle assignment imbalance will be equal to about 0.01%.

In practice, the imbalance is higher, because MPI ranks will have different numbers of halo particles, i.e., neighbors of an SPH particle located in the memory of another node, which need to be communicated and stored for the current iteration. Note that the amount of work does not increase (the number of neighbors remains the same, e.g., 300 per particle). However, more memory is required to store the additional particles. For example, the amount of halo particles for 300 neighbors varies from as much as 100% for a few thousand particles per MPI rank down to 1-5% for a few million particles per MPI rank.

In addition, particle data is reordered in the local memory of every computing node to follow the depth-first-search ordering sequence of the oct-tree (similar to a Morton ordering sequence, or to using space-filling curves, SFC, in general). This ensures that particles that are close together in the spatial domain are also close together in memory, resulting in increased cache efficiency. In terms of memory consumption, 1,459 B are needed per particle, i.e., 1.35 GB per 1 million particles, assuming 300 neighbors per particle. We use double-precision floating-point numbers for all physical properties and integers for storing each particle's neighbor indices. Currently we store the indices of the neighbors for each particle.
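The spatially coherent memory layout mentioned above can be illustrated with a space-filling-curve ordering; the sketch below sorts particles by a 3D Morton (Z-order) key. The mini-app itself follows the depth-first order of its octree, so this Morton variant is only an illustrative stand-in.

```cpp
// Sketch: reordering particles along a Morton (Z-order) space-filling curve.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Spread the lower 10 bits of x so that two zero bits separate consecutive bits.
std::uint64_t expandBits(std::uint64_t x)
{
    x &= 0x3ffu;
    x = (x | x << 16) & 0x030000FF;
    x = (x | x << 8)  & 0x0300F00F;
    x = (x | x << 4)  & 0x030C30C3;
    x = (x | x << 2)  & 0x09249249;
    return x;
}

// 30-bit Morton key from coordinates normalized to [0, 1).
std::uint64_t mortonKey(double x, double y, double z)
{
    auto q = [](double c) { return static_cast<std::uint64_t>(c * 1024.0) & 0x3ffu; };
    return expandBits(q(x)) | expandBits(q(y)) << 1 | expandBits(q(z)) << 2;
}

// Permutation that sorts particles by their Morton key (spatially close -> close in memory).
std::vector<std::size_t> mortonOrder(const std::vector<double>& x,
                                     const std::vector<double>& y,
                                     const std::vector<double>& z)
{
    std::vector<std::size_t> perm(x.size());
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::sort(perm.begin(), perm.end(), [&](std::size_t a, std::size_t b)
              { return mortonKey(x[a], y[a], z[a]) < mortonKey(x[b], y[b], z[b]); });
    return perm;
}
```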
4.4 Particle and Halo Exchange

Particles are exchanged between computing nodes in every simulation time-step when they move from one sub-domain to another. Halo particles are exchanged three times per time-step. The SPH-EXA mini-app relies on asynchronous point-to-point communications (MPI_Isend/MPI_Irecv) between nodes to avoid global synchronizations.
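A minimal sketch of such a nonblocking exchange with the neighboring ranks is given below; the per-neighbor send and receive buffers and their sizes are assumed to have been prepared by the domain decomposition, and all names are illustrative.

```cpp
// Sketch: nonblocking halo exchange with neighboring ranks only (no global barrier).
#include <mpi.h>
#include <cstddef>
#include <vector>

void exchangeHalos(const std::vector<int>& neighborRanks,
                   const std::vector<std::vector<double>>& sendBuf, // one buffer per neighbor
                   std::vector<std::vector<double>>& recvBuf,       // pre-sized per neighbor
                   MPI_Comm comm)
{
    std::vector<MPI_Request> requests(2 * neighborRanks.size());

    for (std::size_t i = 0; i < neighborRanks.size(); ++i)
    {
        MPI_Irecv(recvBuf[i].data(), static_cast<int>(recvBuf[i].size()), MPI_DOUBLE,
                  neighborRanks[i], 0, comm, &requests[2 * i]);
        MPI_Isend(sendBuf[i].data(), static_cast<int>(sendBuf[i].size()), MPI_DOUBLE,
                  neighborRanks[i], 0, comm, &requests[2 * i + 1]);
    }
    // Only the involved neighbor ranks synchronize here.
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(), MPI_STATUSES_IGNORE);
}
```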
Figure 4 shows the communication matrix of the SPH-EXA mini-app for an execution using 240 MPI processes. This illustration shows that processes only communicate with their close neighbors, reducing latency and contention compared with naive collective-communication-based implementations, while nodes remain loosely synchronized with their neighbors.

Figure 3: Domain decomposition for the Square Patch test case at time iteration 8,000, for 1M particles on 20 processes. Each color corresponds to a different MPI process.

Figure 4: Communication between individual processes for a 240 MPI-process execution of the SPH-EXA mini-app, with sender ID on the Y-axis and receiver ID on the X-axis.
4.5 Hybrid Computing

The SPH-EXA mini-app offloads the most compute-intensive kernels (density, IAD, and momentum equations) to GPUs. It can be compiled with or without support for multi-processing via MPI, multi-threading via OpenMP, and acceleration via OpenMP 4.5, OpenACC, or CUDA. It supports the following combinations (a minimal offloading sketch follows the list):

• MPI + OpenMP
• MPI + OpenMP + OpenMP 4.5 target offloading
• MPI + OpenMP + OpenACC offloading
• MPI + OpenMP + CUDA
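As an example of the directive-based variants, a per-particle loop can be offloaded with OpenMP 4.5 target directives as sketched below; the array names and loop body are placeholders, not the mini-app's actual kernels.

```cpp
// Sketch: offloading a per-particle loop with OpenMP 4.5 target directives.
void computeOnDevice(double* out, const double* in, long long n)
{
    // Map the inputs to the device, distribute the loop across teams/threads,
    // and map the results back to the host.
#pragma omp target teams distribute parallel for map(to: in[0:n]) map(from: out[0:n])
    for (long long i = 0; i < n; ++i)
    {
        out[i] = in[i]; // placeholder for the per-particle work (e.g., the density kernel)
    }
}
```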
Figure 5 shows the average execution time for a single time-step using 8 million particles on 4 hybrid CPU+GPU nodes, each comprising 12 cores (Intel Xeon E5-2695) and a single GPU card (NVIDIA Tesla P100), compiled using 5 different compilers with the -O2 (or equivalent) optimization flag, for each available parallel model. Some models are not available (denoted N/A) with all compilers, either because they are explicitly not supported (e.g., the Intel compiler does not support OpenACC offloading) or because the compilers were not configured to support offloading at the time of writing.

Figure 5: Average execution time per time-step for 5 compilers and for every parallel programming model available in the SPH-EXA mini-app.

        mpi+omp   mpi+omp+target   mpi+omp+acc   mpi+omp+cuda
clang   15.2 s    3.7 s            N/A           3.6 s
cray     7.4 s    3.5 s            N/A           3.8 s
pgi     15.8 s    N/A              4.3 s         4.4 s
gnu     15.1 s    N/A              N/A           3.8 s
intel    5.1 s    N/A              N/A           3.8 s

Trigonometric functions are part of the SPH interpolation kernel (see Section 4.1), which is one of the most frequently called functions in every SPH iteration. As a result, math routines such as cos(), sin(), and pow() are heavily used in the mini-app. Cray and Intel provide highly optimized math libraries with their compilers (the CSML and the MKL, respectively), which override the standard math function implementations and lead to inconsistencies in runtime between the different compilers for the MPI+OMP model. Note that this issue is not present with GPU offloading and CUDA, which both rely on NVIDIA's CUDA implementation of these functions, hence the consistent runtimes for the other models.

For this reason, we have implemented a lookup table which contains 20,000 precomputed kernel values, and we perform a linear interpolation with the relative distance between neighboring particles at runtime. This avoids calls to complex and/or costly mathematical functions. In addition, most calls to pow(x, n) rely on a small integer exponent n, so we can replace each call by the corresponding inline x*x*x*x..., which significantly outperforms the standard implementation for the GCC, Clang, and PGI compilers. Overall, these optimizations lead to similar and consistent execution times for all compilers, comparable to using the Intel or Cray compiler (not shown in Figure 5).
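The two optimizations just described can be sketched as follows; the table size, the kernel shape used to fill it, and the function names are illustrative assumptions rather than the mini-app's exact implementation.

```cpp
// Sketch: small-integer power without std::pow, and a tabulated kernel with
// linear interpolation at lookup time.
#include <cmath>
#include <cstddef>
#include <vector>

// x^n for a small non-negative integer exponent, as repeated multiplication.
inline double ipow(double x, int n)
{
    double r = 1.0;
    for (int i = 0; i < n; ++i) r *= x;
    return r;
}

// Precompute an (unnormalized) sinc^n kernel on v in [0, 1] once.
std::vector<double> buildKernelTable(int n = 6, std::size_t size = 20000)
{
    constexpr double kPi = 3.141592653589793;
    std::vector<double> table(size);
    for (std::size_t i = 0; i < size; ++i)
    {
        const double v = static_cast<double>(i) / static_cast<double>(size - 1);
        const double x = kPi * v;
        table[i] = (i == 0) ? 1.0 : ipow(std::sin(x) / x, n);
    }
    return table;
}

// Linear interpolation of the tabulated kernel at v = |x_a - x_b| / h.
double lookupKernel(const std::vector<double>& table, double v)
{
    if (v >= 1.0) return 0.0;   // outside the kernel support
    const double pos    = v * static_cast<double>(table.size() - 1);
    const std::size_t i = static_cast<std::size_t>(pos);
    const double frac   = pos - static_cast<double>(i);
    return table[i] * (1.0 - frac) + table[i + 1] * frac;
}
```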
5 Results

5.1 Validation and Verification

The complexity of the scenarios simulated in CFD and Astrophysics usually prevents performing simulations with continuously increased resolution and different codes, such that a convergence to zero differences in the results can be found. Often, it is neither possible nor reasonable to obtain sufficient computational resources to perform simulations that are "converged" throughout the computational domain in a mathematical sense. It is much more important to limit the deviations in under-resolved regimes by enforcing fundamental conservation laws. As a consequence, the overall physical properties of the simulated scenarios remain robust, even if slightly different results are obtained when using different codes to solve the same set of equations. Therefore, comparing the results of different hydrodynamical codes on the same initial conditions has proved to be highly beneficial to gain understanding of complex scenarios and of the behavior of the codes, and to discover their strengths and weaknesses. These comparisons are common in CFD and Astrophysics [6, 7, 10, 11].

The common test case chosen to validate the results obtained with the mini-app against those of the parent codes is the Rotating Square Patch. This test was first proposed by [4] as a demanding scenario for SPH simulations. The presence of negative pressure stimulates the emergence of tensile instabilities of numeric origin that induce an unrealistic clumping of particles. This leads to an unphysical evolution of the fluid and ultimately to the interruption of the simulation. Nevertheless, these instabilities can be suppressed either by using a tensile stability control or by increasing the order of the scheme [3]. As a consequence, this is a commonly used test in CFD to verify hydrodynamical codes, and it is employed in this work as a common test for the three parent codes.

The setup here is similar to that of [4], but in 3D. The original test was devised in 2D, but the SPH codes used in this work normally operate in 3D. To use a test that better represents the regular operation of the target codes, the square patch was set up with 100 × 100 particles in 2D and this layer was copied 100 times in the direction of the z-axis. This results in a cube of particles that, when applying periodic boundary conditions in the z direction, is similar to solving the original 2D test 100 times, while conserving the 3D formulation of the codes. The initial conditions are the same for all layers, hence they depend only on the x and y coordinates. The initial velocity field is given such that the square rotates rigidly:

v_x(x, y) = \omega y; \quad v_y(x, y) = -\omega x, \qquad (7)

where v_x and v_y are the x and y components of the velocity, and \omega = 5 rad/s is the angular velocity. The initial pressure profile consistent with that velocity distribution can be calculated from an incompressible Poisson equation and expressed as a rapidly converging series:

P = \rho \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} \frac{-\omega^2}{m n \pi^2 \left[ \left(\frac{m\pi}{L}\right)^2 + \left(\frac{n\pi}{L}\right)^2 \right]} \sin\!\left(\frac{m\pi x}{L}\right) \sin\!\left(\frac{n\pi y}{L}\right),

where \rho is the density and L is the side length of the square.

We also verified the results of the SPH-EXA mini-app against the outcome of simulating the same rotating square patch test, with the same initial conditions, with the parent codes. (Note that in the 3D case it is expected that instabilities arise in the z direction. Nevertheless, in all our experiments, the velocity of all particles in the z direction was, most of the time, 6 orders of magnitude lower than in the x or y directions. Only for a few particles close to the origin of coordinates, where all velocities are small, does this ratio increase noticeably; that is the point at which we stopped the SPH simulation. Hence, we consider the 3D test stable for the duration of the simulation and the results comparable to those found in the literature for the 2D case.) In particular, we compared the total angular momentum \vec{L}_{tot} among all codes. The mini-app yields a value of \vec{L}_{tot} that differs from the parent codes by a small relative amount, well within the differences among the parent codes themselves. This small discrepancy comes from the different implementations of relevant sections of the SPH codes, namely the way of calculating gradients and volume elements, evaluating the new time-step, and integrating the equations of motion. Nevertheless, despite these capital differences in implementation among the codes, the results converge well enough to ensure an adequate evolution of the system, pointing at the fact that the fundamentals of the SPH technique are correctly implemented in the SPH-EXA mini-app.
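A minimal sketch of how the initial velocity field of Eq. (7) can be assigned to the particle set is shown below; the particle container and function name are illustrative, not the mini-app's input format.

```cpp
// Sketch: rotating square patch initial velocities, v_x = omega*y, v_y = -omega*x.
#include <cstddef>
#include <vector>

struct Particles { std::vector<double> x, y, z, vx, vy, vz; };

void initRotatingSquarePatch(Particles& p, double omega = 5.0 /* rad/s */)
{
    for (std::size_t i = 0; i < p.x.size(); ++i)
    {
        p.vx[i] =  omega * p.y[i];  // Eq. (7)
        p.vy[i] = -omega * p.x[i];
        p.vz[i] = 0.0;              // initial conditions depend only on x and y
    }
}
```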
5.2 Scaling Experiments

In this section, we report the results of a weak-scaling and a strong-scaling experiment conducted on a top production supercomputer. The immediate goal is to assess the performance of the SPH-EXA mini-app on a Petascale system comprising both CPUs and GPUs. The long-term goal is to run on Exascale systems. The weak-scaling efficiency is obtained as T_seq/T_par × 100. For strong scaling, we plotted the average wall-clock time per iteration on a logarithmic scale. The average time per iteration is obtained from 10 iterations, due to the limited number of node hours available for this project. However, a single run of up to 8,000 iterations used for validation has shown that the time per iteration remains almost constant, with very small variations.

We show that the proposed SPH-EXA mini-app delivers promising results, executing simulations with 65 billion particles on 2,048 hybrid CPU+GPU nodes at 67% weak-scaling efficiency. Moreover, the achieved weak-scaling efficiency decreased very little when scaling up from 512 to 2,048 nodes while keeping 32 million particles/node. For strong scaling, the average time per iteration decreases almost linearly up to 1,024 nodes for a fixed problem size of 267 million particles. Memory consumption equals 1,459 B per particle, i.e., 1.35 GB per 1 million particles.

In addition, an independent performance audit of the SPH-EXA mini-app has recently been performed by RWTH Aachen as part of the POP2 CoE service for European Scientific Applications. The report analyzed the efficiency and scalability of the SPH-EXA mini-app through a set of strong-scaling experiments with up to 960 MPI ranks (40 nodes) and showed that both are very high.

The experiments were performed on the hybrid partition of the Piz Daint supercomputer using PrgEnv-Intel, Cray MPICH 7.7.2, and OpenMP 4.0 (version/201611). We used the hybrid partition of more than 5,000 Cray XC50 nodes. These hybrid nodes are equipped with an Intel E5-2690 v3 CPU (codename Haswell, each with 12 CPU cores) and a PCIe version of the NVIDIA Tesla P100 GPU (Pascal architecture, 3,584 CUDA cores) with 16 GB of second-generation high-bandwidth memory.

Figure 6 and Figure 7 show the results of a weak-scaling experiment conducted on the hybrid partition of the Piz Daint supercomputer. Both figures show the efficiency of the execution with an increasing number of nodes, relative to using a single computing node. The run performs the same 3D version of the CFD rotating square patch test described in Section 5.1.
Figure 6: Weak scaling of the SPH-EXA mini-app with 32 million particles per node.

Figure 7 shows the breakdown of efficiency by function. We observe that the decrease in efficiency is mostly due to the Domain Decomposition and Build Tree steps. Domain Decomposition is the most complex part of the code, responsible for distributing the particle data and performing load balancing. While the number of particles per process remains constant, the global tree (the top part of the tree that is kept identical on all processes) becomes larger, and this results in increased depth and an additional overhead as the number of nodes
(and particles) increases globally. However, the depth of the tree increases with log(n), where n is the total number of particles. This is reflected in Figure 7 (orange and blue lines), where the efficiency decreases rapidly when increasing the node count from 1 to 64, but then remains stable and decreases very little (note the logarithmic scale on the X-axis). All other computing steps scale almost perfectly.

Figure 7: Weak scaling per function of the SPH-EXA mini-app with 32 million particles per node (Domain Decomposition, Build Tree, Find Neighbors, Density, Equation of State (EOS), Integral Approach to Derivatives (IAD), Momentum and Energy).

These results are very encouraging and indicate that good scalability can be expected at much higher node counts. The SPH-EXA mini-app has not yet reached the tipping point where the code will not benefit from increased computing resources. If we assume the current efficiency trend, we can expect the code to run in excess of a trillion particles on a system consisting of 31,250 computing nodes with 32 million particles per node, which will be on par with future Exascale system node counts.

Figure 8 shows the results of a strong-scaling experiment conducted with the SPH-EXA mini-app on 267 million particles on the hybrid partition of the Piz Daint supercomputer using up to 2,048 nodes. The results were obtained by running the mini-app using MPI+OpenMP+CUDA. Additional dotted lines show the execution time per function of the mini-app.

While the SPH-EXA mini-app has a sub-optimal speedup, it is clear that all functions benefit from the additional computational resources, and that the execution time continues to decrease even when using 2,048 nodes.
Figure 8: Strong scaling of the SPH-EXA mini-app with 267 million particles (total execution time, ideal scaling, and per-function execution times).

In this strong-scaling scenario the total number of particles is fixed, and nodes receive fewer particles as we increase the number of computing nodes, e.g., there are 4.1 million particles per node with 64 computing nodes, but only 130,000 per node (i.e., just above 10,000 per core) when using 2,048 computing nodes. However, particles maintain the same number of neighbors. This means the overlap between nodes increases and more halo particles are needed.

Figure 9 shows the mean and max ratio of halo particles per node, with respect to the number of particles initially assigned to a node, as described in Section 4.3. While only 30% of extra halo particles are needed on average using 64 nodes, more than 200% are needed when using 2,048 nodes.

The impact of having more halo particles per node is two-fold: (1) additional memory is required and (2) particles are more likely to be distant in memory space, which increases the number of cache misses. Both aspects adversely impact performance. In a realistic setup, i.e., from 64 to 128 compute nodes in Figure 9, we aim at a few million particles per node and a few hundred thousand particles per core/thread. Such configurations also minimize the number of halo particles per node.
The weak-scaling experiment used a large fraction of the available memory on every computing node and on every GPU. The execution time for a single time-step remained, on average, well under a minute. Hence, a high-priority optimization is to reduce the memory footprint of the SPH-EXA mini-app, which will come at the expense of longer execution times but will allow us to address and execute larger problem sizes in the future.
Figure 9: Mean and max ratio of halo particles per computing node w.r.t. initially assigned particles per node, for the test case with 267 million particles.

Our current project allocation on Piz Daint did not allow us to run larger experiments at the time of writing. We are currently working on obtaining larger allocations to launch simulations using the full system.
6 Future Work

The mini-app provides a modular tool that can be used by the community to plug in additional physical processes or numerical methods. As there is a wide variety of very demanding physical scenarios that will be targeted by the SPH-EXA mini-app (e.g., supernova explosions, galaxy formation), having a highly scalable and modular code will encourage others in the community to contribute their own physics/numerics modules.

Our short-term objectives include reducing the memory footprint of the SPH-EXA mini-app, which will help increase the particle count per node beyond 32 million particles, adding features such as the possibility to select the desired generalized volume elements, and simulating increasingly complex scenarios. Regarding fault tolerance, we plan to include our novel detection method for silent data corruption [29], which is specifically designed for SPH applications, add automatic validation (through conserved quantities), and implement checkpointing at optimal intervals. Each new feature in the SPH-EXA mini-app will be iteratively tested, validated, and optimized for efficiency from the start.

As mid-term targets, we aim to overlap computations with inter-node communications. Therefore, we will prepare the SPH-EXA mini-app to be Asynchronous Multi-Tasking (AMT) ready. For this, we will enable the opportunity to compare different tasking frameworks, such as OpenMP and HPX. Moreover, this will also open the opportunity to delegate independent tasks to accelerators and CPUs at the same time, which, in view of the increasing hardware heterogeneity foreseen in pre-Exascale and Exascale systems, is crucial to achieve maximum load balance and performance. Additionally, we will employ multilevel (batch, process, and thread) scheduling for dynamic load balancing in the mini-app as a configurable option. This will allow us to systematically explore the interplay between load balancing at these levels to achieve the best possible load balance during execution.

In the end, the SPH-EXA project [5] follows a long-term vision, having set out to have a major impact in the scientific communities it gathers (and beyond in the longer run). The aim of SPH-EXA [5] is to reach the capabilities of present HPC systems and to push those of future HPC infrastructures for simulating the most complex phenomena at the highest resolution and longest physical times, such as exploring the explosion mechanisms of Type Ia and Core Collapse Supernovae, including nuclear reaction treatments via efficient nuclear networks and neutrino interactions with detailed transport, respectively. Another target for the SPH-EXA mini-app is modeling the small-scale fluid-dynamics processes involved in the assembly of the planetary building blocks while simultaneously capturing the large-scale dynamics of proto-planetary disks, and being able to simulate an entire population of galaxies from high to low redshift in a cosmological volume with enough resolution to model individual star-forming regions directly. Hence, a flexible and modular code that scales efficiently and robustly on a large number of nodes, accessing a mix of CPUs, GPUs, FPGAs, etc., is our long-term objective.
7 Conclusions

SPH-EXA [5] is an interdisciplinary project involving computer scientists, astrophysicists, and cosmologists that brings together state-of-the-art methods from these fields under a broad scope, and addresses important challenges at all levels of existing and emerging infrastructures by exploring novel programming paradigms and their combinations.

In this work, we described the current status of a novel and scalable SPH-EXA mini-app for simulating the SPH method on large hybrid HPC systems, efficiently utilizing both multi-core CPUs and GPUs. The SPH-EXA mini-app is open source and has no external dependencies. With fewer than 3,000 modern C++ LOC (with no compromise on performance), it is easy to run, understand, modify, and extend. The code is simple by design, making it easy to test and implement new SPH kernels or other performance optimizations. We performed an initial exploration of the efficiency of different combinations of hybrid CPU and GPU programming models with different compilers, verified and validated the SPH-EXA mini-app via computationally demanding simulations, and compared the results with those obtained by the parent codes. We also conducted a weak-scaling experiment that shows excellent scaling, with 67% efficiency on 2,048 hybrid nodes of the Piz Daint supercomputer and a loss of just 3% in efficiency when going from 512 to 2,048 nodes.

At this initial stage of the project we are already in a good position to explore the limits of what can be done on current top supercomputers. This opens the door to a great number of potential applications, not only of the SPH-EXA mini-app, but also of the lessons learned. The SPH-EXA mini-app has driven substantial improvements to its three parent codes (SPHYNX, ChaNGa, and SPH-flow) and has established a multi-directional knowledge transfer between the Computer Science, Astrophysics, and Computational Fluid Dynamics communities.
Acknowledgements
This work is supported in part by the Swiss Platform for Advanced Scientific Computing (PASC) project SPH-EXA [5] (2017-2021). The authors acknowledge the support of the Swiss National Supercomputing Centre (CSCS) via allocation project c16, where the calculations have been performed. Several performance results were provided by the Performance Optimisation and Productivity (POP) centre of excellence in HPC.
References

[1] R. M. Cabezón, D. García-Senz, and J. Figueira, SPHYNX: An accurate density-based SPH method for astrophysical applications, A&A, 606:A78, Oct. 2017.

[2] M. Tremmel, M. Karcher, F. Governato, M. Volonteri, T. R. Quinn, A. Pontzen, L. Anderson, and J. Bellovary, The Romulus cosmological simulations: a physical approach to the formation, dynamics and accretion models of SMBHs, MNRAS, 470, Sep. 2017.

[3] G. Oger, D. Le Touzé, D. Guibert, M. de Leffe, J. Biddiscombe, J. Soumagne, and J.-G. Piccinali, On distributed memory MPI-based parallelization of SPH codes in massive HPC context, Computer Physics Communications, 200:1-14, March 2016.

[4] A. Colagrossi, A meshless Lagrangian method for free-surface and interface flows with fragmentation, PhD thesis, Università di Roma, 2005.

[5] Florina M. Ciorba, Lucio Mayer, Rubén M. Cabezón, and David Imbert, SPH-EXA: Optimizing Smoothed Particle Hydrodynamics for Exascale Computing.

[6] M. Liebendörfer, M. Rampp, H.-Th. Janka, and A. Mezzacappa, Supernova Simulations with Boltzmann Neutrino Transport: A Comparison of Methods, The Astrophysical Journal, Volume 620, Number 2, 2005.

[7] O. Agertz, B. Moore, J. Stadel, D. Potter, F. Miniati, J. Read, L. Mayer, A. Gawryszczak, A. Kravtsov, A. Nordlund, F. Pearce, V. Quilis, D. Rudd, V. Springel, J. Stone, E. Tasker, R. Teyssier, J. Wadsley, and R. Walder, Fundamental differences between SPH and grid methods, Monthly Notices of the Royal Astronomical Society, 380:963-978. doi:10.1111/j.1365-2966.2007.12183.x

[8] R. F. Barrett, C. T. Vaughan, and M. A. Heroux, MiniGhost: A mini-app for exploring boundary exchange strategies using stencil computations in scientific parallel computing, Technical report No. SAND2012-2437, Sandia National Laboratories, 2012.

[9] M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willenbring, H. C. Edwards, A. Williams, M. Rajan, E. R. Keiter, H. K. Thornquist, and R. W. Numrich, Improving performance via mini-applications, Technical report No. SAND2009-5574, Sandia National Laboratories, 2009. mantevo.org/MantevoOverview.pdf

[10] E. J. Tasker, R. Brunino, N. L. Mitchell, D. Michielsen, S. Hopton, F. R. Pearce, G. L. Bryan, and T. Theuns, A test suite for quantitative comparison of hydrodynamic codes in astrophysics, Monthly Notices of the Royal Astronomical Society, 390:1267-1281. doi:10.1111/j.1365-2966.2008.13836.x

[11] R. M. Cabezón, K.-C. Pan, M. Liebendörfer, T. Kuroda, K. Ebinger, O. Heinimann, F.-K. Thielemann, and A. Perego, Core-collapse supernovae in the hall of mirrors. A three-dimensional code-comparison project, Astron. Astrophys., 619 (2018) A118.

[12] A. Stone, J. M. Dennis, and M. Mills Strout, The CGPOP Miniapp, version 1.0, Technical Report CS-11-103, Colorado State University, 2011.

[13] R. Marcus, MCMini: Monte Carlo on GPGPU, Technical Report LA-UR-12-23206, Los Alamos National Laboratory, 2012.

[14] T. C. Team, The CESAR codesign center: Early results, Technical report, Argonne National Laboratory, 2012.

[15] P. Bauer, N. Wedi, and W. Deconinck, ESCAPE: Energy-efficient scalable algorithms for weather prediction at Exascale, EU Horizon 2020.

[16] W. Deconinck, P. Bauer, M. Diamantakis, M. Hamrud, C. Kühnlein, P. Maciel, G. Mengaldo, T. Quintino, B. Raoult, P. K. Smolarkiewicz, and N. P. Wedi, Atlas: A library for numerical weather prediction and climate modelling, Computer Phys. Comm., 220:188-204, 2017.

[17] T. C. Schulthess, Programming revisited, Nature Physics, 11(5):369-373, May 2015.

[18] O. Messer, E. D'Azevedo, J. Hill, W. Joubert, S. Laosooksathit, and A. Tharrington, Developing MiniApps on modern platforms using multiple programming models, In 2015 IEEE International Conference on Cluster Computing, IEEE, Sep. 2015.

[19] S. Rosswog, Astrophysical smooth particle hydrodynamics, New Astronomy Reviews, Volume 53, Issues 4-6, 2009, Pages 78-104, ISSN 1387-6473, https://doi.org/10.1016/j.newar.2009.08.007

[20] J. J. Monaghan and R. A. Gingold, Shock Simulation by the Particle Method SPH, Journal of Computational Physics, Volume 52, Issue 2, pp. 374-389, 1983.

[21] OpenMP 4.5 Specifications, 2018.

[22] J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism, O'Reilly Media, 2007. ISBN:9780596514808.

[23] A. D. Robison, Composable Parallel Patterns with Intel Cilk Plus, Computing in Science & Engineering, vol. 15, no. 2, pp. 66-71, March-April 2013.

[24] B. L. Chamberlain et al., Parallel Programmability and the Chapel Language, The International Journal of High Performance Computing Applications, vol. 21, no. 3, Aug. 2007, pp. 291-312.

[25] H. Carter Edwards, C. R. Trott, and D. Sunderland, Kokkos, Journal of Parallel and Distributed Computing, vol. 74, no. 12, Dec. 2014, pp. 3202-3216.

[26] H. Kaiser, T. Heller, B. Adelstein-Lelbach, A. Serio, and D. Fey, HPX: A Task Based Programming Model in a Global Address Space, In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14), ACM, New York, NY, USA, 2014.

[27] R. M. Cabezón, D. García-Senz, and A. Relaño, A one-parameter family of interpolating kernels for Smoothed Particle Hydrodynamics studies, J. Comput. Phys., 227:8523-8540, 2008.

[28] D. García-Senz, R. M. Cabezón, and J. A. Escartín, Improving smoothed particle hydrodynamics with an integral approach to calculating gradients, A&A, 538, A9, February 2012.

[29] Aurélien Cavelan, Rubén M. Cabezón, and Florina M. Ciorba,