SWIFT: Using task-based parallelism, fully asynchronous communication, and graph partition-based domain decomposition for strong scaling on more than 100,000 cores
Matthieu Schaller, Pedro Gonnet, Aidan B. G. Chalk, Peter W. Draper
Matthieu Schaller
Institute for Computational Cosmology (ICC), Department of Physics, Durham University, Durham DH1 3LE, UK
[email protected]

Pedro Gonnet
School of Engineering and Computing Sciences, Durham University, Durham DH1 3LE, UK
Google Switzerland GmbH, 8002 Zürich, Switzerland

Aidan B. G. Chalk
School of Engineering and Computing Sciences, Durham University, Durham DH1 3LE, UK

Peter W. Draper
Institute for Computational Cosmology (ICC), Department of Physics, Durham University, Durham DH1 3LE, UK
ABSTRACT
We present a new open-source cosmological code, called SWIFT, designed to solve the equations of hydrodynamics using a particle-based approach (Smoothed Particle Hydrodynamics) on hybrid shared/distributed-memory architectures. SWIFT was designed from the bottom up to provide excellent strong scaling on both commodity clusters (Tier-2 systems) and Top100 supercomputers (Tier-0 systems), without relying on architecture-specific features or specialized accelerator hardware. This performance is due to three main computational approaches:

• Task-based parallelism for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores.

• Graph-based domain decomposition, which uses the task graph to decompose the simulation domain such that the work, as opposed to just the data, as is the case with most partitioning schemes, is equally distributed across all nodes.

• Fully dynamic and asynchronous communication, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring on tasks that rely on data from other nodes until it arrives.
In order to use these approaches, the code had to be re-written from scratch, and the algorithms therein adapted to the task-based paradigm. As a result, we can show upwards of 60% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold, on both x86-based and Power8-based architectures.
Keywords
Physics; Cosmology; Fluid dynamics; Smoothed Particle Hydrodynamics; Task-based parallelism; Asynchronous data transfer; Extreme scaling
1. INTRODUCTION
For the past decade physical limitations have kept the speed of individual processor cores constrained, so instead of getting faster, computers are getting more parallel. Systems containing up to 64 general-purpose cores are becoming commonplace, and the number of cores can be expected to continue growing exponentially, e.g. following Moore's law, much in the same way processor speeds were up until a few years ago.

As a consequence, in order to get faster, computer programs need to rely on better exploitation of this massive parallelism, i.e. they need to exhibit strong scaling, the ability to run roughly N times faster when executed on N times as many processors. Without strong scaling, massive parallelism can still be used to tackle ever larger problems, but not to make fixed-size problems faster.

Although this switch from growth in speed to growth in parallelism has been anticipated and observed for quite some time, very little has changed in terms of how we design and implement parallel computations. Branch-and-bound synchronous parallelism using OpenMP [9] and MPI [23], as well as domain decompositions based on geometry or space-filling curves [27], are still commonplace, despite both the architectures and problem scales having changed dramatically since their introduction.

The design and implementation of SWIFT [15, 25, 13], a large-scale cosmological simulation code built from scratch, provided the perfect opportunity to test some newer approaches, i.e. task-based parallelism, fully asynchronous communication, and graph partition-based domain decomposition.

This paper describes these techniques, which are not exclusive to cosmological simulations or any specific architecture, as well as the results obtained with them. While [15, 13] already describe the underlying algorithms in more detail, in this paper we focus on the parallelization strategy and the results obtained on larger Tier-0 systems.
2. PARTICLE-BASED HYDRODYNAMICS
Smoothed Particle Hydrodynamics [12, 20] (SPH) uses particles to represent fluids. Each particle p_i has a position x_i, velocity v_i, internal energy u_i, mass m_i, and a smoothing length h_i. The particles are used to interpolate any quantity Q at any point in space as a weighted sum over the particles:

Q(\mathbf{r}) = \sum_i m_i \frac{Q_i}{\rho_i} W(\|\mathbf{r} - \mathbf{r}_i\|, h)    (1)

where Q_i is the quantity at the i-th particle, h is the smoothing length, i.e. the radius of the sphere within which data will be considered for the interpolation, and W(r, h) is the smoothing kernel or smoothing function. Several different forms for W(r, h) exist, each with their own specific benefits and drawbacks.

The particle density ρ_i used in (1) is itself computed similarly, as a sum over the neighbouring particles within range:

\rho_i = \sum_{j,\, r_{ij} < h_i} m_j W(r_{ij}, h_i)    (2)

where r_{ij} = \|\mathbf{r}_i - \mathbf{r}_j\|. The particle accelerations (3) and the rate of change of their internal energies (4) are, in turn, given by similar pairwise sums over the same neighbourhoods (see [13] for the exact expressions). Each time-step of the simulation therefore involves two main operations:

1. Density computation: For each particle p_i, loop over all particles p_j within range of p_i and evaluate (2).

2. Force computation: For each particle p_i, loop over all particles p_j within range of each other and evaluate (3) and (4).

Finding the interacting neighbours for each particle constitutes the bulk of the computation. Many codes, e.g. in Astrophysics simulations [12], rely on spatial trees for neighbour finding [12, 16, 24, 26], i.e. k-d trees [6] or octrees [19] are used to decompose the simulation space. In Astrophysics in particular, spatial trees are also a somewhat natural choice as they are used to compute long-range gravitational interactions via a Barnes-Hut [5] or Fast Multipole [8] method. The particle interactions are then computed by traversing the list of particles and searching for their neighbours in the tree. In current state-of-the-art simulations (e.g. [22]), the gravity calculation corresponds to roughly one quarter of the calculation time whilst the hydrodynamics scheme takes approximately half of the total time. The remainder is spent in the astrophysics modules, which contain interactions of the same kind as the SPH sums presented here. Gravity could, however, be vastly sped up compared to commonly used software by employing more modern algorithms such as the Fast Multipole Method [8].

Although such tree traversals are trivial to parallelize, they have several disadvantages, e.g. with regards to computational efficiency, cache efficiency, and exploiting symmetries in the computation (see [13] for a more detailed analysis).
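To make the density stage concrete, the following is a minimal, self-contained sketch of equation (2) written as a naive O(N²) loop in C. The particle layout and the (unnormalised) kernel are illustrative placeholders, not SWIFT's actual data structures or kernel choices:

```c
#include <math.h>
#include <stdio.h>
#include <stddef.h>

/* Illustrative particle type; SWIFT's actual data layout differs. */
struct part {
  double x[3]; /* position */
  double m;    /* mass */
  double h;    /* smoothing length */
  double rho;  /* density, computed below via equation (2) */
};

/* Placeholder kernel with compact support of radius h; real codes use a
 * properly normalised kernel, e.g. a cubic spline. */
static double W(double r, double h) {
  const double q = r / h;
  return q < 1.0 ? (1.0 - q) * (1.0 - q) / (h * h * h) : 0.0;
}

/* Naive O(N^2) evaluation of equation (2):
 * rho_i = sum over all j with r_ij < h_i of m_j * W(r_ij, h_i). */
static void density(struct part *parts, size_t N) {
  for (size_t i = 0; i < N; i++) {
    double rho = 0.0;
    for (size_t j = 0; j < N; j++) {
      const double dx = parts[i].x[0] - parts[j].x[0];
      const double dy = parts[i].x[1] - parts[j].x[1];
      const double dz = parts[i].x[2] - parts[j].x[2];
      const double r = sqrt(dx * dx + dy * dy + dz * dz);
      if (r < parts[i].h) rho += parts[j].m * W(r, parts[i].h);
    }
    parts[i].rho = rho;
  }
}

int main(void) {
  struct part p[2] = {{{0.0, 0.0, 0.0}, 1.0, 1.0, 0.0},
                      {{0.5, 0.0, 0.0}, 1.0, 1.0, 0.0}};
  density(p, 2);
  printf("rho_0 = %g, rho_1 = %g\n", p[0].rho, p[1].rho);
  return 0;
}
```

Replacing this quadratic loop with one that only visits actual neighbours is precisely the role of the cell-based task decomposition described in the next section.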
3. PARALLELIZATION STRATEGY

One of the main concerns when developing SWIFT was to break with the branch-and-bound type parallelism inherent to parallel codes using OpenMP and MPI, and the constant synchronization between computational steps it results in.

If synchronization is the main problem, then asynchronicity is the obvious solution. We therefore opted for a task-based approach for maximum single-node, or shared-memory, performance. This approach not only provides excellent load-balancing on a single node, it also provides a powerful model of the computation that can be used to distribute the work equitably over a set of distributed-memory nodes using general-purpose graph partitioning algorithms. Finally, the necessary communication between nodes can itself be modelled in a task-based way, interleaving communication seamlessly with the rest of the computation.

Task-based parallelism is a shared-memory parallel programming paradigm in which a computation is broken down into a set of tasks which can be executed concurrently. In order to ensure that the tasks are executed in the right order, e.g. that data needed by one task is only used once it has been produced by another task, dependencies between tasks are specified and strictly enforced by a task scheduler. Additionally, if two tasks require exclusive access to the same resource, yet in no particular order, they are treated as conflicts and the scheduler ensures that they are not executed concurrently. Computations described in this way then parallelize trivially: each processor repeatedly grabs a task for which all dependencies have been satisfied and executes it until there are no tasks left.

The main advantages of using a task-based approach are:

• The order in which the tasks are processed, and how they are assigned to each processor, is completely dynamic and adapts automatically to load imbalances.

• If the dependencies and conflicts are specified correctly, there is no need for expensive explicit locking, synchronization, or atomic operations to deal with most concurrency problems.

• Each task has exclusive access to the data it is working on, thus improving cache locality and efficiency.

Task-based parallelism is not a particularly new concept and several implementations thereof exist, e.g. Cilk [7], QUARK [28], StarPU [3], SMP Superscalar [1], OpenMP 3.0 [11], Intel's TBB [21], and, to some extent, Charm++ [17]. Since none of these existing task-based libraries provided the flexibility required to experiment with different scheduling and communication techniques (SWIFT is an interdisciplinary effort between Computer Science and Astrophysics to study not only cosmological phenomena, but also novel simulation algorithms and parallel computing techniques), we chose to implement our own task scheduler in SWIFT, which has since been back-ported as the general-purpose QuickSched task scheduler [14]. In QuickSched and SWIFT, task dependencies are specified explicitly, as opposed to being implicitly derived from data dependencies, allowing us to more easily build complex task hierarchies. This also allowed us to extend the scheduler with the concept of task conflicts and to integrate the asynchronous communication scheme described further on. A minimal illustration of this kind of dependency bookkeeping is sketched below.
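The following single-threaded sketch shows one way such explicit dependency bookkeeping can be organised; it is our illustration, not QuickSched's actual API. A real scheduler would decrement the wait counters atomically, feed ready tasks to a queue served by many threads, and additionally guard conflicting tasks with per-resource locks:

```c
#include <stdio.h>

#define MAX_UNLOCKS 4

/* A task holds a count of unresolved dependencies and the list of tasks
 * it unlocks on completion (hypothetical layout, for illustration only). */
struct task {
  const char *name;
  int wait;                          /* number of unresolved dependencies */
  struct task *unlocks[MAX_UNLOCKS]; /* tasks that depend on this one */
  int nr_unlocks;
};

/* Declare that task 'b' may only run once task 'a' has completed. */
static void add_dependency(struct task *a, struct task *b) {
  a->unlocks[a->nr_unlocks++] = b;
  b->wait++;
}

/* Execute a task, then resolve the dependencies of its successors. A
 * parallel scheduler would push tasks whose count reaches zero onto a
 * shared ready-queue instead of recursing. */
static void run(struct task *t) {
  printf("running task %s\n", t->name);
  for (int k = 0; k < t->nr_unlocks; k++)
    if (--t->unlocks[k]->wait == 0) run(t->unlocks[k]);
}

int main(void) {
  struct task sort = {"sort"}, density = {"density"}, force = {"force"};
  add_dependency(&sort, &density);  /* density waits on sort */
  add_dependency(&density, &force); /* force waits on density */
  run(&sort);                       /* runs sort, density, force in order */
  return 0;
}
```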
Despite its advantages, and the variety of implementations, task-based parallelism is rarely used in practice (notable exceptions include the PLASMA project [2], which uses QUARK/StarPU, and the deal.II project [4], which uses Intel's TBB). The main problem is that to effectively use task-based parallelism, most computations need to be completely redesigned to fit the paradigm, which is usually not an option for large and complex codebases. Since we were re-implementing SWIFT from scratch, this was not an issue. The tree-based neighbour-finding described above was replaced with a more task-friendly approach as described in [13], in which the domain is first decomposed into a grid of cells of edge length larger than or equal to the largest particle radius. An initial set of interaction tasks is then defined over all cells and pairs of neighbouring cells, such that if two particles are close enough to interact, they are either in the same cell or they span a pair of neighbouring cells. These initial interaction tasks are then refined by recursively splitting cells that contain more than a certain number of particles and replacing tasks that span a pair of split cells with tasks spanning the neighbouring sub-cells. The resulting refined set of tasks contains all the cells and pairs of cells over which particle interactions must be computed.

The dependencies between the tasks are set following equations (2), (3), and (4), i.e. such that for any cell, all the tasks computing the particle densities therein must have completed before the particle forces can be computed, and all the force computations must have completed before the particle velocities may be updated. The task hierarchy is shown in Fig. 1, where the particles in each cell are first sorted (round tasks) before the particle densities are computed (first layer of square tasks). Ghost tasks (triangles) are used to ensure that all density computations on a cell of particles have completed before the force evaluation tasks (second layer of square tasks) execute. Once all the force tasks on a cell of particles have completed, the integrator tasks (inverted triangles) update the particle positions and velocities. The decomposition was computed such that each cell contains a roughly constant number of particles, which leads to tasks of up to a few milliseconds each.

Figure 1: Task hierarchy for the SPH computations in SWIFT, including communication tasks. Arrows indicate dependencies, i.e. an arrow from task A to task B indicates that A depends on B. The task colour corresponds to the cell or cells it operates on, e.g. the density and force tasks work on individual cells or pairs of cells. The task hierarchy is shown for three cells: blue, yellow, and purple. Although the data for the yellow cell resides on Node 2, it is required for some tasks on Node 1, and thus needs to be copied over during the computation using send/recv tasks (diamond-shaped). Figure adapted from [13].

Due to the cache-friendly nature of the task-based computations, and their ability to exploit symmetries in the particle interactions, the task-based approach is already more efficient than the tree-based neighbour search on a single core, and scales efficiently to all cores of a shared-memory machine [13].

Given a task-based description of a computation, partitioning it over a fixed number of ranks (using the MPI terminology) is relatively straightforward: we create a cell hypergraph in which

• each node represents a single cell of particles, and

• each edge represents the tasks connecting the cells.

Since in the particular case of SWIFT each task references at most two cells, the cell hypergraph is just a regular cell graph.

Any partition of the cell graph represents a partition of the computation, i.e. the nodes belonging to each partition each belong to a rank, and the data belonging to each cell resides on the partition/rank to which it has been assigned. Any task spanning cells that belong to the same partition needs only to be evaluated on that rank/partition, whereas tasks spanning more than one partition need to be evaluated on both ranks/partitions.

If we then weight each edge with the computational cost associated with its tasks, finding a good cell distribution reduces to finding a partition of the cell graph such that the maximum sum, over the partitions, of the weights of all edges within and spanning each partition is minimal (see Fig. 2). Since the sum of the weights is directly proportional to the amount of computation per rank/partition, minimizing the maximum sum corresponds to minimizing the time spent on the slowest rank.
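In symbols (our notation, not taken from the paper): writing w_e for the total cost of the tasks attached to edge e, and P_k for the set of cells assigned to rank k, the objective just described reads

\min_{\{P_k\}} \; \max_k \sum_{e \,:\, e \cap P_k \neq \emptyset} w_e

where every edge with at least one endpoint in P_k, including edges cut by the partition (whose tasks are executed on both sides), contributes its full cost w_e to the load of rank k.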
Computing such a partition is a standard graph problem (computing an optimal partition for more than two nodes is NP-hard), and several software libraries which provide good solutions exist, e.g. METIS [18] and Zoltan [10].

Figure 2: Illustration of the task-based domain decomposition in which the tasks (circles) are hyperedges that connect one or more resources (rectangles). The resources are partitioned along the thick dotted line. The blue and orange tasks are executed on the respective partitions, whereas the green tasks/hyperedges along the cut line are executed on both. The cost of this partition is the sum of the green tasks, which are computed twice, as well as the cost imbalance of the tasks executed in each partition.

In SWIFT, the graph partitioning is computed using the METIS library. The cost of each task is initially approximated via the asymptotic cost of the task type and the number of particles involved. After a task has been executed, its effective computational cost is measured and used instead; a minimal sketch of such a partitioning call is given below.

Note that this approach does not explicitly consider any geometric constraints, or strive to partition the amount of data equitably. The only criterion is the computational cost of each partition, for which the task decomposition provides a convenient model. We are therefore partitioning the computation, as opposed to just the data.

Note also that the proposed partitioning scheme takes neither the task hierarchy nor the size of the data that needs to be exchanged between partitions/ranks into account. This approach is therefore only reasonable in situations in which the task graph is wide rather than deep, i.e. the length of the critical path in the task graph is much smaller than the sum of all tasks, and in which communication latencies are negligible.
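As an illustration, a METIS call of this kind could look as follows. The CSR arrays below describe a toy ring of four cells with made-up task costs, not SWIFT's actual 3D cell grid, and error checking is omitted:

```c
#include <stdio.h>
#include <metis.h>

int main(void) {
  /* Toy cell graph: 4 cells in a ring, in METIS' CSR format.
   * xadj[i]..xadj[i+1] indexes the neighbours of cell i in adjncy. */
  idx_t nvtxs = 4, ncon = 1, nparts = 2;
  idx_t xadj[]   = {0, 2, 4, 6, 8};
  idx_t adjncy[] = {1, 3, 0, 2, 1, 3, 0, 2};

  /* Edge weights: the estimated (or, after a first run, measured) cost
   * of the pair tasks spanning each pair of cells, ordered as adjncy. */
  idx_t adjwgt[] = {10, 20, 10, 30, 30, 40, 20, 40};

  idx_t objval;  /* total weight of the edges cut by the partition */
  idx_t part[4]; /* resulting rank assignment for each cell */

  METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                      NULL /* vertex weights */, NULL /* vertex sizes */,
                      adjwgt, &nparts, NULL, NULL, NULL, &objval, part);

  for (idx_t i = 0; i < nvtxs; i++)
    printf("cell %d -> rank %d\n", (int)i, (int)part[i]);
  return 0;
}
```

Note that METIS minimizes the weight of the cut edges under a balance constraint, which approximates, rather than exactly matches, the max-load objective described above.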
Although each particle cell resides on a specific rank, the particle data will still need to be sent to any neighbouring ranks that have tasks that depend on this data, e.g. the green tasks in Fig. 2. This communication must happen twice at each time-step: once to send the particle positions for the density computation, and then again once the densities have been aggregated locally for the force computation.

Most distributed-memory codes based on MPI [23] separate computation and communication into distinct steps, i.e. all the ranks first exchange data, and only when the data exchange is complete does computation start. Further data exchanges only happen once computation has finished, and so on. This approach, although conceptually simple and easy to implement, has three major drawbacks:

• The frequent synchronization points between communication and computation exacerbate load imbalances,

• the communication phase consists mainly of waiting on latencies, during which the node's CPUs usually run idle, and

• during the computation phase, the communication network is left completely unused, whereas during the communication phase, all ranks attempt to use it at the same time.

It is for these reasons that in SWIFT we opted for a fully dynamic and asynchronous communication model in which local data is sent whenever it is ready, data from other ranks is only acted on once it has arrived, and there is no separation into communication and computation phases. In practice this means that no rank will sit idle waiting on communication if there is any computation that can be done.

This fits quite naturally within the task-based framework by modelling communication as just another task type, i.e. by adding tasks that send and receive particle data between ranks. For every task that uses data residing on a different rank, send and recv tasks are generated automatically on the source and destination ranks respectively. At the destination, the task is made dependent on the recv task, i.e. the task can only execute once the data has actually been received. This is illustrated in Fig. 1, where data is exchanged across two ranks for the density and force computations and the extra dependencies are shown in red.

The communication itself is implemented using the non-blocking MPI_Isend and MPI_Irecv primitives to initiate communication, and MPI_Test to check if the communication was successful and to resolve the communication task's dependencies. In the task-based scheme, strictly local tasks which do not rely on communication tasks are executed first. As data from other ranks arrives, the corresponding non-local tasks are unlocked and are executed whenever a thread picks them up. A minimal sketch of this pattern is shown below.

One direct consequence of this approach is that instead of a single send/recv call between each pair of neighbouring ranks, one such pair is generated for each particle cell. This type of communication, i.e. several small messages instead of one large message, is usually strongly discouraged since the sum of the latencies of the small messages is usually much larger than the latency of a single large message. This, however, is of no concern in SWIFT since no rank is actively waiting to receive the messages in order, and the communication latencies are covered by local computations. A nice side-effect of this approach is that communication no longer happens in bursts involving all the ranks at the same time, but is more or less evenly spread over the entire computation, and is therefore less demanding of the communication infrastructure.
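The sketch below (run with two MPI ranks) illustrates the send/recv-task pattern using the primitives named above; the buffer and the "local work" are placeholders for SWIFT's actual particle data and scheduler loop:

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch of communication modelled as tasks: rank 0's 'send' task fires
 * off a cell's particle data to rank 1, which keeps executing local work
 * and only polls with MPI_Test, never blocking on the transfer.
 * Run with: mpirun -np 2 ./a.out */
int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double cell_data[512] = {0}; /* placeholder for a cell's particles */
  MPI_Request req;

  if (rank == 0) {
    /* The 'send' task: initiate the transfer and return immediately. */
    MPI_Isend(cell_data, 512, MPI_DOUBLE, 1, /* tag */ 0,
              MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
  } else if (rank == 1) {
    /* The 'recv' task: post the receive, then poll between local tasks. */
    MPI_Irecv(cell_data, 512, MPI_DOUBLE, 0, /* tag */ 0,
              MPI_COMM_WORLD, &req);
    int done = 0;
    while (!done) {
      /* ... a real scheduler would pick up and run a local task here ... */
      MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    /* Transfer complete: tasks depending on this cell are now unlocked. */
    printf("rank 1: recv task done, unlocking dependent tasks\n");
  }

  MPI_Finalize();
  return 0;
}
```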
4. SCALING TESTS

In this section we present some strong scaling tests of the SWIFT code on different architectures for a representative cosmology problem.

In order to provide a realistic setup, the initial particle distributions used in our tests were extracted by resampling low-redshift outputs of the EAGLE project [22], a large suite of state-of-the-art cosmological simulations. By selecting outputs at late times, we constructed a simulation setup which is representative of the most expensive part of these simulations, i.e. when the particles are highly clustered and no longer uniformly distributed. This distribution of particles is shown in Fig. 3. In order to fit our simulation setup into the limited memory of some of the systems tested, we randomly down-sampled the particle count of the output to approximately 5.1 × 10⁸, 2.2 × 10⁸, and 5.1 × 10⁷ particles (the 512M, 216M, and 51M setups below), with periodic boundary conditions. Scripts to generate these initial conditions are provided with the source code. We then run the SWIFT code for 100 time-steps and average the wall-clock time of these time-steps after having removed the first and last ones, where disk I/O occurs.

Figure 3: The initial density field computed from the initial particle distribution used for our tests. The density ρ_i of the particles spans 8 orders of magnitude, requiring smoothing lengths h_i that change by almost three orders of magnitude across the simulation volume.

The initial load balancing between nodes is computed in the first steps and re-computed every 100 time-steps, and is therefore not included in the timings. Particles were exchanged between nodes whenever they strayed too far beyond the cells in which they originally resided. On all the machines, the code was compiled out of the box, without any tuning, explicit vectorization, or exploiting any other specific features of the underlying hardware.

For our first test, we ran SWIFT on the COSMA-5 DiRAC2 Data Centric System located at the University of Durham. The system consists of 420 nodes with 2 Intel Sandy Bridge-EP Xeon E5-2670 CPUs clocked at 2.6 GHz and 128 GByte of RAM each. The 16-core nodes are connected using a Mellanox FDR10 Infiniband network in a 2:1 blocking configuration. This system is similar to many Tier-2 systems available in most universities or computing facilities. Demonstrating strong scaling on such a machine is essential to show that the code can be efficiently used even on the type of commodity hardware available to most researchers.

The code was compiled with the Intel compiler version 2016.0.1 and linked to the Intel MPI library version 5.1.2.150 and METIS library version 5.1.0.

The simulation setup with 5.1 × 10⁷ particles was run on that system using 1 to 256 threads on 1 to 16 nodes, the results of which are shown in Fig. 5. For this strong scaling test, we used one MPI rank per node and 16 threads per node (i.e. one thread per physical core). We also ran on one single node using up to 32 threads, i.e. up to one thread per physical and virtual core. Fig. 4 shows the domain decomposition obtained via the task-graph decomposition algorithm described above for the 16-node run.

Figure 4: The particles of the initial conditions shown in Fig. 3, coloured according to the node they belong to after a load-balancing call on 32 nodes. As can be seen, the domain decomposition follows the cells in the mesh but is not made of regular cuts. Domains have different shapes and sizes.

Figure 5: Strong scaling test of SWIFT on the COSMA-5 machine with 51M particles, run from 1 to 256 threads (see text for hardware description). Left panel: code speed-up. Right panel: corresponding parallel efficiency. Using 16 threads per node (no use of hyper-threading) with one MPI rank per node, a good parallel efficiency (60%) is achieved when increasing the thread count from 1 (1 node) to 128 (8 nodes), even on this relatively small test case. The dashed line indicates the efficiency when running on one single node but using all the physical and virtual cores (hyper-threading). As these CPUs only have one FPU per core, the benefit from hyper-threading is limited to a 20% improvement when going from 16 cores to 32 hyper-threads.

For our next test, we ran SWIFT on the SuperMUC x86 phase 1 thin nodes located at the Leibniz Supercomputing Centre in Garching near Munich, currently ranked 23rd in the Top500 list. This system consists of 9 216 nodes with 2 Intel Sandy Bridge-EP Xeon E5-2680 8C CPUs clocked at 2.7 GHz. Each 16-core node has 32 GByte of RAM. The nodes are split into 18 "islands" of 512 nodes, within which communications are handled via an Infiniband FDR10 non-blocking tree.
These islands are then connected using a 4:1 pruned tree. This system is similar in nature to the COSMA-5 system used in the previous set of tests but is much larger, allowing us to test scalability at a much larger scale.

The code was compiled with the Intel compiler version 2015.5.223 and linked to the Intel MPI library version 5.1.2.150 and METIS library version 5.0.2.

The simulation setup with 5.1 × 10⁸ particles was run using 16 to 2 048 nodes (4 islands) and the results of this strong scaling test are shown in Fig. 6. For this test, we used one MPI rank per node and 16 threads per node, i.e. one thread per physical core.

Figure 6: Strong scaling test of SWIFT on the SuperMUC phase 1 machine with 512M particles, run from 16 to 2 048 nodes with 16 threads per node (see text for hardware description). Left panel: code speed-up. Right panel: corresponding parallel efficiency. Using 16 threads per node (no use of hyper-threading) with one MPI rank per node, an almost perfect parallel efficiency is achieved when increasing the node count from 16 (256 cores) to 2 048 (32 768 cores).

For our last set of tests, we ran SWIFT on the JUQUEEN IBM Blue Gene/Q system located at the Jülich Supercomputing Centre, currently ranked 11th in the Top500 list. This system consists of 28 672 nodes with an IBM PowerPC A2 processor clocked at 1.6 GHz and 16 GByte of RAM each. Of notable interest is the presence of two floating-point units per compute core. The system is composed of 28 racks each containing 1 024 nodes. The network uses a 5D torus to link all the racks.

This system is larger than the SuperMUC supercomputer described above and uses a completely different processor and instruction set. We use it here to demonstrate that our results are not dependent on the hardware being used.

The code was compiled with the IBM XL compiler version 30.73.0.13 and linked to the corresponding MPI and METIS library version 4.0.2.

The simulation setup with 2.2 × 10⁸ particles was first run using 512 nodes with one MPI rank per node and varying only the number of threads per node. The results of this test are shown in Fig. 7. The same setup was then run using 32 to 8 192 nodes with one MPI rank and 32 threads per node; the results of this strong scaling test are shown in Fig. 8.

Figure 7: Strong scaling test of the hybrid component of the code. The same calculation is performed on 512 nodes of the JUQUEEN Blue Gene supercomputer (see text for hardware description) using a single MPI rank per node and varying only the number of threads per node. The code displays excellent scaling even when all the cores and hardware multi-threads are in use.

Figure 8: Strong scaling test of SWIFT on the JUQUEEN Blue Gene machine with 216M particles, run from 32 to 8 192 nodes with 32 threads per node (see text for hardware description). Left panel: code speed-up. Right panel: corresponding parallel efficiency. Using 32 threads per node (2 per physical core) with one MPI rank per node, a parallel efficiency of more than 60% is achieved when increasing the node count from 32 (512 cores) to 8 192 (131 072 cores). On 8 192 nodes there are fewer than 27 000 particles per node and only a few hundred tasks, making the whole problem extremely hard to load-balance effectively.

5. DISCUSSION & CONCLUSIONS

The strong scaling results presented in the previous sections on three different machines demonstrate the ability of our framework to scale on both small commodity machines, thanks to the use of task-based parallelism at the node level, and on the largest machines (Tier-0 systems) currently available, thanks to the task-based domain decomposition and asynchronous communication schemes. We would like to emphasize that these results were obtained for a realistic test case without any micro-level optimization or explicit vectorization.

Excellent strong scaling is also achieved when increasing the number of threads per node (i.e. per MPI rank, see Fig. 7), demonstrating that the description of MPI (asynchronous) communications as tasks within our framework is not a bottleneck. One common conception in HPC is that the number of MPI communications between nodes should be kept to a minimum to optimize the efficiency of the calculation. Our approach does exactly the opposite, with a large number of point-to-point communications between pairs of nodes occurring over the course of a time-step. For example, on the SuperMUC machine with 32 nodes (512 cores), each MPI rank contains approximately 1.6 × 10⁷ particles, and SWIFT will generate around 58 000 point-to-point asynchronous MPI communications (a pair of send and recv tasks) per node and per time-step. Each one of these communications involves, on average, no more than 6 kB of data. Such an enormous number of small messages is discouraged by most practitioners, but seems to work well in practice. Dispatching communications over the course of the calculation, and not in short bursts as is commonly done, may also help lower the load on the network.
One time-step on 8 192 nodes of the JUQUEEN machine takes 63 ms of wall-clock time. All the loading of the tasks, the communications, and the running of the tasks take place in that short amount of time. Our framework can therefore load-balance a calculation over 2.6 × 10⁵ threads with remarkable efficiency.

We emphasize, as was previously demonstrated in [13], that SWIFT is also much faster than the GADGET-2 code [24], the de-facto standard in the field of particle-based cosmological simulations. The simulation setup that was run on the COSMA-5 system takes only a fraction of the 32 s of wall-clock time per time-step that the default GADGET-2 code requires on exactly the same setup with the same number of cores. The excellent scaling performance of SWIFT allows us to push this number further by simply increasing the number of cores, whilst GADGET-2 reaches its peak speed (for this problem) at around 300 cores and stops scaling beyond that. This unprecedented scaling ability, combined with future work on the vectorization of the calculations within each task, will hopefully make SWIFT an important tool for future simulations in cosmology and help push the entire field to a new level.

These results, which are in no way restricted to astrophysical simulations, provide a compelling argument for moving away from the traditional branch-and-bound paradigm of both shared and distributed memory programming using synchronous MPI and OpenMP. Although fully asynchronous methods, due to their somewhat anarchic nature, may seem more difficult to control, they are conceptually simple and easy to implement. The real cost of using a task-based approach comes from having to rethink the entire computation to fit the task-based setting.
This may, as in our specific case, lead to completely rethinking the underlying algorithms and reimplementing a code from scratch. (The SWIFT source code consists of less than 21 K lines of code, of which roughly only one tenth are needed to implement the task-based parallelism and asynchronous communication.) Given our results to date, however, we do not believe we will regret this investment.

6. ACKNOWLEDGEMENTS

This work would not have been possible without Lydia Heck's help and expertise. We acknowledge the help of Tom Theuns, James Willis, Bert Vandenbroucke, Angus Lepper, and the other contributors to the SWIFT code. We thank Heinrich Bockhorst and Stephen Blair-Chappell from INTEL for their help, obtained through the establishment of the ICC as an INTEL Parallel Computing Centre (IPCC).

7. REFERENCES

[1] SMP Superscalar (SMPSs) User's Manual, Barcelona Supercomputing Center, 2008.
[2] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009.
[3] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009, 23:187–198, 2011.
[4] W. Bangerth, R. Hartmann, and G. Kanschat. deal.II – a general purpose object oriented finite element library. ACM Trans. Math. Softw., 33(4):24/1–24/27, 2007.
[5] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446–449, 1986.
[6] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[7] R. Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System, volume 30. ACM, 1995.
[8] J. Carrier, L. Greengard, and V. Rokhlin. A fast adaptive multipole algorithm for particle simulations. SIAM Journal on Scientific and Statistical Computing, 9(4):669–686, 1988.
[9] L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55, 1998.
[10] K. Devine, E. Boman, R. Heaphy, B. Hendrickson, and C. Vaughan. Zoltan data management services for parallel dynamic applications. Computing in Science & Engineering, 4(2):90–96, 2002.
[11] A. Duran, R. Ferrer, E. Ayguadé, R. M. Badia, and J. Labarta. A proposal to extend the OpenMP tasking model with dependent tasks. International Journal of Parallel Programming, 37(3):292–305, 2009.
[12] R. A. Gingold and J. J. Monaghan. Smoothed particle hydrodynamics - theory and application to non-spherical stars. Monthly Notices of the Royal Astronomical Society, 181:375–389, 1977.
[13] P. Gonnet. Efficient and scalable algorithms for smoothed particle hydrodynamics on hybrid shared/distributed-memory architectures. SIAM J. Scientific Computing, 37(1), 2015.
[14] P. Gonnet, A. B. G. Chalk, and M. Schaller. QuickSched: Task-based parallelism with dependencies and conflicts. ArXiv e-prints, Jan. 2016.
[15] P. Gonnet, M. Schaller, T. Theuns, and A. B. Chalk. SWIFT: Fast algorithms for multi-resolution SPH on multi-core architectures. Trondheim, Norway, 2013.
[16] L. Hernquist and N. Katz. TREESPH - A unification of SPH with the hierarchical tree method. The Astrophysical Journal Supplement Series, 70:419–446, 1989.
[17] L. Kalé and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. In A. Paepcke, editor, Proceedings of OOPSLA'93, pages 91–108. ACM Press, September 1993.
[18] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
[19] D. Meagher. Geometric modeling using octree encoding. Computer Graphics and Image Processing, 19(2):129–147, 1982.
[20] D. J. Price. Smoothed particle hydrodynamics and magnetohydrodynamics. Journal of Computational Physics, 231:759–794, Feb. 2012.
[21] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, 2007.
[22] J. Schaye, R. A. Crain, R. G. Bower, M. Furlong, M. Schaller, T. Theuns, C. Dalla Vecchia, C. S. Frenk, I. G. McCarthy, J. C. Helly, A. Jenkins, Y. M. Rosas-Guevara, S. D. M. White, M. Baes, C. M. Booth, P. Camps, J. F. Navarro, Y. Qu, A. Rahmati, T. Sawala, P. A. Thomas, and J. Trayford. The EAGLE project: simulating the evolution and assembly of galaxies and their environments. MNRAS, 446:521–554, Jan. 2015.
[23] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI: The Complete Reference (Vol. 1): Volume 1 - The MPI Core, volume 1. MIT Press, 1998.
[24] V. Springel. The cosmological simulation code GADGET-2. MNRAS, 364:1105–1134, Dec. 2005.
[25] T. Theuns, A. Chalk, M. Schaller, and P. Gonnet. SWIFT: task-based hydrodynamics and gravity for cosmological simulations. In Proceedings of the 3rd International Conference on Exascale Applications and Software, pages 98–102. University of Edinburgh, 2015.
[26] J. Wadsley, J. Stadel, and T. Quinn. Gasoline: a flexible, parallel implementation of TreeSPH. New Astronomy, 9(2):137–158, 2004.
[27] M. S. Warren and J. K. Salmon. A parallel hashed oct-tree N-body algorithm. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, pages 12–21. ACM, 1993.
[28] A. YarKhan, J. Kurzak, and J. Dongarra. QUARK Users' Guide: QUeueing And Runtime for Kernels. Technical report, Innovative Computing Laboratory, University of Tennessee, 2011.