tinyMD: A Portable and Scalable Implementation for Pairwise Interactions Simulations

Rafael Ravedutti L. Machado, Jonas Schmitt, Sebastian Eibl, Jan Eitzinger, Roland Leißa, Sebastian Hack, Arsène Pérard-Gayot, Richard Membarth, Harald Köstler

Abstract
This paper investigates the suitability of the AnyDSL partial evaluation framework to implement tinyMD: an efficient, scalable, and portable simulation of pairwise interactions among particles. We compare tinyMD with the miniMD proxy application, which scales very well on parallel supercomputers. We discuss the differences between both implementations and contrast their performance on single-node CPU and GPU targets, as well as their scalability on the SuperMUC-NG and Piz Daint supercomputers. Additionally, we demonstrate tinyMD's flexibility by coupling it with the waLBerla multi-physics framework. This allows us to execute tinyMD simulations using the load-balancing mechanism implemented in waLBerla.
Keywords: Molecular Dynamics, Partial Evaluation, High Performance Computing, Load Balancing
1. Introduction
Nowadays, compute-heavy simulation software typically runs on different types of high-end processors to solve these simulations in a reasonable amount of time. This high-end hardware, however, requires highly specialized code that is precisely adapted to the respective hardware in order to get anywhere near peak performance.
This work is supported by the Federal Ministry of Education and Research (BMBF) as part of the MetaDL, Metacca, and ProThOS projects. Affiliations: Chair for System Simulation, University of Erlangen-Nürnberg; Regional Computer Center Erlangen (RRZE), University of Erlangen-Nürnberg; Saarland University, Saarland Informatics Campus; German Research Center for Artificial Intelligence (DFKI), Saarland Informatics Campus.
What is more, in some cases different algorithmic variants are more suitable for different kinds of hardware. For example, writing applications for a GPU, a massively parallel device with a very peculiar memory hierarchy, is very different from implementing applications for a CPU.

Very large problems even call for distributed systems, in which the workload must be partitioned among multiple computers within a network. This entails proper data communication between these systems; transfer latencies should be hidden as much as possible.

In this paper we focus on molecular dynamics (MD) simulations. These study the interactions among particles and how these interactions affect their motion. To achieve peak performance on these simulations, the implementation must consider the best data access pattern for the target architecture.

We base our implementation tinyMD on AnyDSL—a partial evaluation framework for writing high-performance applications and libraries. We compare this implementation with miniMD, a parallel and scalable proxy application that also contains GPU support; miniMD is written in C++ and is based on Kokkos [1]. Additionally, we couple our tinyMD application with the waLBerla [2, 3] multi-physics simulation framework. This allows us to exploit its load-balancing mechanism [4] within tinyMD simulations. We discuss the advantages that tinyMD provides in easing the coupling with different technologies.

We use the Lennard-Jones potential model to calculate the forces of the atoms in the experiments comparing against miniMD. For the load-balancing experiments, we rely on the Spring-Dashpot force model, which is common in discrete element method (DEM) simulations [5]. Our goal is to compare and discuss the differences regarding implementation and performance for both applications. We present experimental results for single-node performance on both CPU and GPU target processors, as well as multi-node CPU results on the SuperMUC-NG supercomputer and multi-node GPU results on the Piz Daint supercomputer.

1.1. Contributions

In summary, this paper makes the following contributions beyond our previous work [6]:

• We present our new tinyMD distributed-memory parallel implementation based upon AnyDSL and discuss its differences to miniMD—a typical C++ implementation based upon the Kokkos library to target GPU devices. For example, we use higher-order functions to build array abstractions. Due to AnyDSL's partial evaluator, these abstractions are not accompanied by any overhead (see Section 4).

• We demonstrate how flexibly tinyMD is implemented with AnyDSL, and how its communication code can be coupled with the waLBerla framework to use its load-balancing feature in tinyMD simulations (see Section 5).

• We show performance and scalability results for various CPU and GPU architectures, including multi-CPU results on up to 2048 nodes of the SuperMUC-NG cluster (98304 cores) and multi-GPU results on up to 1024 nodes of the Piz Daint cluster (see Section 6).

In order to make this paper as self-contained as possible, Section 3 provides the necessary background for both AnyDSL and MD simulations after discussing related work in Section 2.
2. Related Work
There is a wide effort on porting MD simulations to different target architectures while delivering good performance and scalability. The majority of the developed frameworks and applications take the traditional approach of using a general-purpose language to implement the simulation code.

GROMACS [7, 8, 9] is a versatile MD package used primarily for dynamical simulations of bio-molecules. It is implemented in C/C++ and supports most commonly used CPUs and GPUs. The package was initially released in 1991 and has been carefully optimized since then. GROMACS supports single instruction, multiple data (SIMD) processing to enhance instruction-level parallelism and hence increase CPU throughput, by treating elements as clusters of particles instead of individual particles [10]. GROMACS provides various hand-written SIMD kernels for different SIMD instruction set architectures (ISAs).

LAMMPS [11, 12] is an MD code with a focus on material modeling. It contains potentials for soft and solid-state materials, as well as coarse-grain systems. It is implemented in C++ and uses the message passing interface (MPI) for communication when executed on several nodes.

The Mantevo proxy application miniMD [13, 14] is based on LAMMPS and performs and scales well (in a weak sense) on distributed memory systems, albeit with a very restricted set of features. Since miniMD provides a very useful benchmark case for MD simulations in terms of performance and scalability, we choose it as the baseline for tinyMD.

MESA-PD [15] is a general particle dynamics framework. Its design principle of separating code and data allows arbitrary interaction models to be introduced with ease; it can thus also be used for molecular dynamics. Via its code generation approach using Jinja templates, it can be adapted very effectively to any simulation scenario. As the successor of the pe rigid particle dynamics framework, it inherits its scalability [16] and advanced load balancing functionalities [17].

All these MD applications have dedicated, hand-tuned codes for each supported target architecture. This is the most common approach for today's MD applications [18, 19, 20]. It requires much more effort, as all these different code variants must not only be implemented, but also optimized, tested, debugged, and maintained independently from each other.

Other applications rely on domain-specific languages to generate parallel particle methods [21], but this approach requires the development of specific compilation tools that are able to generate efficient code. In this paper we explore the benefits of using the AnyDSL framework, where we shallow-embed [22, 23] our domain-specific library into its front-end Impala and can then abstract device, memory layout, and communication pattern through higher-order functions. Thus, we use the compiler tool-chain provided by AnyDSL and do not need to develop specific compilation tools for MD or particle methods.
3. Background
AnyDSL [24] is a compiler framework designed to speed up the development of domain-specific libraries. It consists of three major components: the frontend Impala, its intermediate representation (IR) Thorin [25], and a runtime system. The syntax of Impala is inspired by Rust and allows both imperative and functional programming.
In contrast to Rust, Impala features a partial evaluator, which is controlled via filter expressions [26]:

    fn @(?n) pow(x: i32, n: i32) -> i32 {
        if n == 1 {
            x
        } else if n % 2 == 0 {
            let y = pow(x, n / 2);
            y * y
        } else {
            x * pow(x, n - 1)
        }
    }

In this example, the function pow is assigned the filter expression ?n (introduced via @). The ?-operator evaluates to true whenever its argument is statically known by the compiler. Now, at every call site, the compiler will instantiate the callee's filter expression by substituting all parameters with the corresponding arguments of that call site. If the result of this substitution is true, the function will get executed and any call sites within will receive the same treatment.

In the case of the function pow above, this means that the following call site will be partially evaluated because the argument provided for n is known at compile-time:

    let y = pow(z, 5);

The result will be recursively expanded to:

    let y = z * pow(z, 4);
Then:

    let z2 = pow(z, 2);
    let z4 = z2 * z2;
    let y = z * z4;
And finally:

    let z1 = z;
    let z2 = z1 * z1;
    let z4 = z2 * z2;
    let y = z * z4;

Note how this expansion is performed symbolically. In contrast to C++ templates or constexpr functions, the Impala compiler does not need to know the value of the parameter x to execute the function pow.
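Conversely, when the exponent is only known at run time, the filter ?n evaluates to false and no unrolling takes place. The following is a minimal sketch of our own; read_exponent stands in for any hypothetical source of a runtime value and is not an Impala built-in:

    // Hedged sketch: with a runtime-known exponent, pow is kept as an
    // ordinary recursive function in the generated code.
    let m = read_exponent(); // hypothetical runtime input
    let y = pow(z, m);       // residual call; nothing is expanded here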
Another important feature of Impala is its ability to perform triggered code generation. Impala offers built-in functions that allow the programmer to execute a given function on the GPU, or to vectorize or parallelize it. For instance, the syntax to trigger code generation for GPUs that support CUDA is as follows:

    let acc = cuda_accelerator(device_index);
    let grid = (n, 1, 1);
    let block = (64, 1, 1);
    with work_item in acc.exec(grid, block) {
        let id = work_item.gidx();
        buffer(id) = pow(buffer(id), 5);
    }

This snippet launches a CUDA kernel on the GPU with index device_index on a 1D grid with a block size of 64 × 1 × 1.

MD simulations are widely used today to study the behavior of microscopic structures. These simulations reproduce the interactions among atoms in these structures on a macroscopic scale, allowing us to observe the evolution of the particle system on a time-scale that is simpler to analyze.

Different areas resort to MD simulations: material science to study the evolution of specific materials, chemistry to analyze the evolution of chemical processes, and biology to reproduce the behavior of certain proteins and bio-molecules.

A system to be simulated consists of a number of atoms, the initial state (such as the atoms' positions and velocities), and the boundary conditions of the system. Here we use periodic boundary conditions (PBC) in all directions; hence, when particles cross the domain boundary, they reappear on the opposite side with the same velocity. Fundamentally, the evolution of MD systems is computed by solving Newton's second law, also known as the equation of motion (Equation 1). Knowing the force, it is possible to track the positions and velocities of the atoms by integrating this equation.

    F = m \dot{v} = m a    (1)

The force on each atom is based on its interactions with neighboring atoms. This computation is usually described by a potential function or force field. Many different potentials can be used to calculate the particle forces, and each one is suitable for different types of simulation. In this work, the potential used for the miniMD comparison is the Lennard-Jones potential (Equation 2)—a pair potential for van-der-Waals forces. With x_i the position of particle i and x_{ij} the distance vector between particles i and j, the force for the Lennard-Jones potential can be expressed as:

    F_{LJ}(x_i, x_j) = 24 \epsilon (\sigma / |x_{ij}|)^6 [ 2 (\sigma / |x_{ij}|)^6 - 1 ] x_{ij} / |x_{ij}|^2    (2)

Here, \epsilon determines the depth of the potential well, and \sigma specifies the distance at which the potential is zero.

For our load-balancing experiments, we use the Spring-Dashpot contact model (Equation 3) to provide a static simulation and to demonstrate the flexibility of our application. This contact model is commonly used in DEM to simulate rigid-body interactions (instead of the point masses of MD simulations); we therefore consider particles as spheres in these experiments. Consider the position x_i and velocity v_i of particle i.
Its force is defined as:

    F_{SD}(x_i, v_i, x_j, v_j) = K \xi + \gamma \dot{\xi}    (3)

where

    \xi = \hat{x}_{ij} (\sigma - |x_{ij}|) \Theta(\sigma - |x_{ij}|)    (4)
    \dot{\xi} = -\hat{x}_{ij} (\hat{x}_{ij} \cdot v_{ij}) \Theta(\sigma - |x_{ij}|)    (5)

with K being the spring constant, \gamma the dissipation constant, \hat{x}_{ij} the unit vector x_{ij}/|x_{ij}|, \sigma the sphere diameter, and \Theta the Heaviside step function.

Naively, an MD implementation iterates over all atom pairs in the system, which requires considering a quadratic number of pairs. For short-range interactions, however, we can use a Verlet list [27] instead to keep track of all nearby particles within a certain cutoff radius. Then, we compute the potential force for an atom only for the neighbors stored in its list. We regularly update the Verlet list to keep track of new atoms that enter the cutoff radius.

Although it is not necessary to rebuild the Verlet list at every time step, it is still a costly procedure, since it requires iterating over each pair of atoms. To speed up its construction, cell lists can be used to sort atoms according to their spatial position. As long as the cell size is equal to or greater than the cutoff radius, it suffices to iterate over the neighboring cells to build the Verlet list. (In some implementations the cell size can be less than the cutoff radius; the cell neighborhood to be checked must then be extended accordingly.) Figure 1 depicts the creation of the neighbor lists for a particle using cell lists. Neighbor lists can also be created for just half of the particle pairs (known as half neighbor lists). This allows both particle forces of a pair to be updated within the same iteration (the computed force is subtracted from the neighbor particle), but requires atomic operations to prevent race conditions.

Figure 1: Neighbor list creation example. The neighbor list is built for the red particle, and only the blue particles in the neighbor cells are checked for comparison. Particles within the green area with radius r are inserted into the red particle's neighbor list. The cell size s can be greater than or equal to r, but not less. The value of r is usually the cutoff radius plus a small value (called the Verlet buffer), so that neighbor lists do not have to be updated every time step.
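To make this construction concrete, the following is a minimal, hedged Impala-style sketch of the per-particle distance check described above. The helpers for_each_cell_neighbor and add_neighbor are hypothetical; the vector routines are those used by the tinyMD code shown in the next section, so this is an illustration rather than tinyMD's actual API:

    // Hedged sketch: fill the Verlet list of particle i from cell lists.
    // Only particles in the cells surrounding i's cell are examined; a
    // neighbor is recorded when it lies within radius r (cutoff + buffer).
    fn build_verlet_list(i: i32, pos_i: Vector3D, rsq_verlet: real_t) -> () {
        for_each_cell_neighbor(i, |j, pos_j| { // hypothetical helper
            if i != j {
                let del = vector_sub(pos_i, pos_j);
                if vector_len2(del) < rsq_verlet {
                    add_neighbor(i, j) // hypothetical helper
                }
            }
        })
    }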
4. The tinyMD Library
In this section we introduce and discuss tinyMD (available at https://github.com/AnyDSL/molecular-dynamics). We focus on the main differences in writing portable code with AnyDSL as opposed to traditional C/C++ implementations. We explore the benefits achieved by using higher-order functions to map code to different target devices and data layouts, and to implement flexible code for MPI communication. In the following we use the term particle to refer to atoms.

4.1. Device Abstraction
In order to map code parts for execution on the target device, tinyMD relies on the Device abstraction. It contains functions to allocate memory, transfer data, launch a loop on the target device, and perform even more complex device-dependent procedures such as reductions:

    struct Device {
        alloc: fn(i64) -> Buffer,
        transfer: fn(Buffer, Buffer) -> (),
        loop_1d: fn(i32, fn(i32) -> ()) -> (),
        // ...
    }

Similar to a Java interface or abstract virtual methods in C++, a Device instance allows tinyMD to abstract from the concrete implementation of these functions—in this case device-specific code. Unlike Java interfaces or virtual methods, however, the partial evaluator removes these indirections by specializing the appropriate device-specific code into the call sites. Each device supported in tinyMD possesses its own implementation: a function that returns a struct instance which contains several functions, and hence "implements the Device interface". For example, the CPU implementation looks like this:

    fn @device() -> Device {
        Device {
            alloc: |size| { alloc_cpu(size) },
            transfer: |from, to| {}, // no copy required
            loop_1d: @|n, f| {
                vectorized_range(get_vector_width(), 0, n, |i, _| f(i));
            },
            // ...
        }
    }

The vectorized_range function iterates over the particles. This function in turn calls the vectorize intrinsic that triggers the Region Vectorizer [28] to vectorize the LLVM IR generated by the Impala compiler. Although not covered in this paper, tinyMD employs the parallel intrinsic to spawn multiple threads for multiple CPU cores. The GPU implementation looks like this:

    fn @device() -> Device {
        Device {
            alloc: |size| { acc.alloc(size) },
            transfer: |from, to| { copy(from, to) },
            loop_1d: @|n, f| {
                // build grid_size and block_size for n
                acc.exec(grid_size, block_size, |work_item| {
                    let i = work_item.bidx() * work_item.bdimx() + work_item.tidx();
                    if i < n { f(i); }
                });
            },
            // ...
        }
    }
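As a usage illustration—a minimal sketch of our own, not code taken from tinyMD—any kernel expressed through loop_1d runs unchanged on whichever device() implementation is compiled in; scale_velocity is a hypothetical stand-in for real particle access:

    // Hedged sketch: the same call vectorizes on the CPU backend and
    // becomes a CUDA kernel launch on the GPU backend.
    fn scale_velocities(n: i32, factor: real_t) -> () {
        device().loop_1d(n, |i| {
            scale_velocity(i, factor) // hypothetical helper
        })
    }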
Device abstraction is sufficient to map our application to different tar-gets, it takes care of separating the parallel execution strategy from the computekernels and provides basic functions for the device. Together with the data man-agement abstractions (see Section 4.2) we attain performance portability withsubstantially less effort.
To store our simulation data, we built an ArrayData structure, which defines a multi-dimensional array in tinyMD. This data structure takes care of memory allocation on both device and host (if required). The actual accesses are abstracted away with the help of ArrayLayout. This idiom is similar to the Device abstraction and makes arrays in tinyMD both target- and layout-agnostic.

    struct ArrayData {
        buffer: Buffer,
        buffer_host: Buffer,
        size_x: i32,
        size_y: i32,
        host_mirror: bool
    };

    struct ArrayLayout {
        index_2d_fn: fn(ArrayData, i32, i32) -> i32,
        add_fn: fn(ArrayData, i32, real_t) -> ()
    };
ArrayData just contains sizes for the x and y dimensions.This can be easily extended to more dimensions, but for our application, twodimensions are enough. The following list demonstrates how to implement asimple row-major order array layout similar to C arrays (again, this is akin to“implementing the ArrayLayout interface”): fn @row_major_order_array(is_atomic: bool ) -> ArrayLayout { rrayLayout {index_2d_fn: @|array, x, y| { x * array.size_y + y },add_fn: @|array, i, v| { do_add(is_atomic, array, i, v) }}} Note that the layout can also be made atomic, by using the do_add functionthat uses atomic addition when the first argument is_atomic is true . TheAnyDSL partial evaluator takes care of generating the proper add instructions inthe final code with zero overhead when the is_atomic flag is known at compile-time. Similarly, we define a column-major order layout as used in Fortran: fn @column_major_order_array(is_atomic: bool ) -> ArrayLayout {ArrayLayout {index_2d_fn: @|array, x, y| { y * array.size_x + x },add_fn: @|array, i, v| { do_add(is_atomic, array, i, v) }}} For a more complex data layout, we show the definition of a clustered array,also known as an Array of Struct of Arrays (AoSoA). This is an array of structswhose elements are again clusters of elements as opposed to individual elements.With cluster sizes that are a power of two, we get: fn @clustered_array(is_atomic: bool , cluster_size: i32 ) -> ArrayLayout { let mask = cluster_size - 1; let shift = get_shift_size(cluster_size) - 1;ArrayLayout {index_2d_fn: @|array, x, y| { let i = x >> shift; let j = x & mask;cluster_size * (i * array.size_y + y) + j},add_fn: @|array, i, v| { do_add(is_atomic, array, i, v) }}} For most common use cases, the cluster size is known at compile-time, whichallows AnyDSL to specialize this data layout by directly replacing the pre-computed values of shift and mask . If more than one cluster size is used inthe application, different specialized codes are generated.12o bind both array structures mentioned above, we write different templatefunctions that perform operations such as get and set on values. We alsoprovide target functions to abstract whether the device or the host buffer mustbe used. The following snippet shows these target functions and the templatefor reading real_t elements in a 2-dimensional array: fn @array_dev(array: ArrayData) -> Buffer { array.buffer } fn @array_host(array: ArrayData) -> Buffer { array.buffer_host } type ArrayTargetFn = fn (ArrayData) -> Buffer; fn @array_2d_get_real(target_fn: ArrayTargetFn, layout: ArrayLayout, array: ArrayData,i: i32 , j: i32 ) -> real_t {bitcast[&[real_t]](target_fn(array).data)(layout.index_2d_fn(array, i, j))} In this case, we also resort to the partial evaluator to generate the specializedfunctions for all used targets and layouts. We can also map non-primitive datatypes to our arrays, the following example shows how to map our
We can also map non-primitive data types to our arrays. The following example shows how to map our Vector3D structure to N × 3 arrays; the structure contains x, y, and z elements of type real_t:

    // Get Vector3D value in 2D array with abstract target and layout
    fn @array_2d_get_vec3(target_fn: ArrayTargetFn, layout: ArrayLayout,
                          array: ArrayData, i: i32) -> Vector3D {
        Vector3D {
            x: array_2d_get_real(target_fn, layout, array, i, 0),
            y: array_2d_get_real(target_fn, layout, array, i, 1),
            z: array_2d_get_real(target_fn, layout, array, i, 2)
        }
    }

Notice that through set() and get() functions it is also possible to abstract data placed in scratchpad memory (such as shared or texture memory on GPU devices). This could be done by first staging data into these memories and then providing set and get functions that operate on them.

Finally, we can use these template functions to implement the abstractions for our particle data. For this, we created a Particle data structure that holds the proper functions to write and read particle information:

    struct Particle {
        set_position: fn(i32, Vector3D) -> (),
        get_position: fn(i32) -> Vector3D,
        // ...
    };

The Particle interface takes care of generating easy-to-use functions that map particle information to our ArrayData primitives. These functions can set and get particle properties, iterate over the neighbor lists, and perform any operation that relies on the data layout and target. Thus, we can generate Particle structures for specific targets (host or device) and for distinct layouts. This is done through the make_particle function:

    fn @make_particle(grid: Grid, target_fn: ArrayTargetFn,
                      vec3_layout: ArrayLayout, nb_layout: ArrayLayout) -> Particle {
        Particle {
            set_position: @|i, p| array_2d_set_vec3(target_fn, vec3_layout, grid.positions, i, p),
            get_position: @|i| array_2d_get_vec3(target_fn, vec3_layout, grid.positions, i),
            // ...
        }
    }

The structure generated by make_particle contains the specialized functions for the layouts we specified; these functions access buffers in the memory space provided by the target function. To achieve a full abstraction, we then put it all together with the device loop and the layout definitions:

    fn @ParticleVec3Layout() -> ArrayLayout { row_major_order_array(false) }

    fn @particles(grid: Grid, f: fn(i32, Particle) -> ()) -> () {
        device().loop_1d(grid.nparticles, |i| {
            f(i, make_particle(grid, array_dev, ParticleVec3Layout(), NeighborlistLayout()));
        });
    }
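The target function is just as interchangeable as the layouts. As a hedged sketch of our own (not tinyMD's actual API), a host-side view over the same buffers—useful, for instance, for initialization or I/O—only swaps array_dev for array_host and the device loop for a sequential range:

    // Hedged sketch: host-side counterpart of particles() above.
    fn @particles_host(grid: Grid, f: fn(i32, Particle) -> ()) -> () {
        range(0, grid.nparticles, |i| {
            f(i, make_particle(grid, array_host, ParticleVec3Layout(), NeighborlistLayout()))
        })
    }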
Data layout definitions such as ParticleVec3Layout can also be written in the device-specific code. Consider the NeighborlistLayout as an example: for the CPU, it is optimal to store data in particle-major order to enhance locality per thread during access, and therefore improve cache utilization. For GPUs, neighbor-major order is preferable because it enhances coalesced memory access: since each GPU thread computes a different particle, we keep the nth neighbor of each particle contiguous in memory, causing threads to access these data in the same iteration and therefore reducing the number of transactions required to load the entire data. Accordingly, NeighborlistLayout has the same definition as the ParticleVec3Layout shown above for CPU targets, and is defined as a column_major_order_array for GPU targets.

To show how simple it is to use our particles abstraction, consider the following example to compute particle-neighbor potentials:

    particles(grid, |i, particle| {
        let pos_i = particle.get_position(i);
        particle.neighbors(i, |j| {
            let pos_j = particle.get_position(j);
            let del = vector_sub(pos_i, pos_j);
            let rsq = vector_len2(del);
            if rsq < rsq_cutoff {
                let f = potential(del, rsq);
                // update force for i (and j if half neighbor lists are
                // being used) with the computed force
            }
        });
    });

We also provide a compute_potential syntactic sugar that can be used as follows to compute the Lennard-Jones potential (consider the lambda function passed here as the potential function used previously):

    let sigma6 = pow(sigma, 6);
    compute_potential(grid, half_nb, rsq_cutoff, @|del, rsq| {
        let sr2 = 1.0 / rsq;
        let sr6 = sr2 * sr2 * sr2 * sigma6;
        let f = 48.0 * sr6 * (sr6 - 0.5) * sr2 * epsilon;
        vector_scale(f, del) // returns the force to be added
    });

AnyDSL can generate specialized variants of compute_potential for full and half neighbor lists through the half_nb parameter. This avoids, in full-neighbor-list kernels, extra conditions that are only required with half neighbor lists. The same specialization is performed when building the neighbor lists.

All these abstractions provide a very simple and extensible way to work with different devices and data management strategies. The template functions and layout specifications can be extended to support more complex operations and map to different data types. Furthermore, they can be used to yield a domain-specific interface (such as the Particle abstraction) to improve the usability of our library. This is accomplished with no overhead in the final generated code, thanks to AnyDSL's partial evaluation.
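As a consistency check—this derivation is ours, not part of the original text—the constants 48.0 and 0.5 in the kernel above follow directly from Equation 2. With sr2 = 1/|x_{ij}|^2 and sr6 = \sigma^6 \cdot sr2^3 = (\sigma/|x_{ij}|)^6:

    F_{LJ} = 24 \epsilon (\sigma / |x_{ij}|)^6 [ 2 (\sigma / |x_{ij}|)^6 - 1 ] x_{ij} / |x_{ij}|^2
           = 48 \epsilon \cdot sr6 \cdot (sr6 - 0.5) \cdot sr2 \cdot x_{ij}

which is exactly f = 48.0 * sr6 * (sr6 - 0.5) * sr2 * epsilon applied to the distance vector del.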
4.3. Communication

The communication in our code is separated into three routines, as is also done in miniMD:

• Exchange: sends particles that left the current rank's subdomain to the neighbor process that now owns them; an exchanged particle becomes a local particle on the receiving process. This operation is not performed at every time step, but at an interval of n time steps, where n is adjustable by the application.

• Border definition: determines the particles that lie on the border of the current rank's subdomain; these particles are sent as ghost particles to the neighbor processes. This routine is also performed every n time steps.

• Synchronization: sends the border particles determined during border definition at every time step of the simulation. Since the number of particles to be sent is known beforehand, this is a less costly routine.

We provide a generic way to abstract the communication pattern in these routines through a higher-order function named communication_ranks. This function receives conditional functions that check whether particles must be exchanged or sent as ghost particles to neighbor ranks. PBC corrections can also be properly applied within these conditional functions, which greatly improves the flexibility of our communication code.

These conditional functions also help when coupling tinyMD with waLBerla, because they separate the logic that checks particle conditions from the packing and MPI send/receive routines, which stay untouched. When using the waLBerla domain partitioning, particle positions must be checked against the supplied waLBerla data structures (the block forest domain), whereas for the miniMD pattern we can just write simple comparisons in the 6 directions. The communication_ranks implementation separating both strategies is listed as follows:
    // Types for condition and communication functions
    type CondFn = fn(Vector3D, fn(Vector3D) -> ()) -> ();
    type CommFn = fn(i32, i32, CondFn, CondFn) -> ();

    fn communication_ranks(grid: Grid, body: CommFn) -> () {
        if use_walberla() {
            // Use waLBerla for communication
            // ...
            range(0, nranks as i32, |i| {
                let rank = get_neighborhood_rank(i);
                body(rank, rank,
                    // Conditions to send and receive from rank
                    @|pos, f| { /* Check for border condition with waLBerla */ },
                    @|pos, f| { // Exchange condition
                        for j in range(0, get_rank_number_of_aabbs(i)) {
                            let p = pbc_correct(pos); // PBC correction
                            if is_within_domain(p, get_neighbor_aabb(get_rank_offset(i) + j)) {
                                f(p);
                                break()
                            }
                        }
                    });
            });
        } else {
            // Use 6D stencil communication like miniMD
            body(xnext, xprev,
                // Conditions to send to xnext
                @|p, f| { if p.x > aabb.xmax - spacing * 2.0 { f(pbc_correct(p)); } },
                @|p, f| { if p.x > aabb.xmax - spacing { f(pbc_correct(p)); } });
            body(xprev, xnext,
                // Conditions to send to xprev
                @|p, f| { if p.x < aabb.xmin + spacing * 2.0 { f(pbc_correct(p)); } },
                @|p, f| { if p.x < aabb.xmin + spacing { f(pbc_correct(p)); } });
            // Analogous for y and z dimensions
        }
    }

The communication_ranks function can be used to pack particles and to perform the MPI communication, since it provides the functions to check which particles must be packed as well as the ranks to send data to and receive data from. The following snippet shows a simple usage to obtain the particles to be exchanged:

    communication_ranks(grid, |rank, _, _, exchange_positions| {
        particles(grid, |i, particle| {
            exchange_positions(particle.get_position(i), @|pos| {
                // Here pos is a particle position that must be exchanged;
                // it already contains the proper PBC adjustments
            });
        });
    });
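Collecting border (ghost) particles works analogously through the first condition function—a hedged sketch of our own, mirroring the exchange example above under the same assumptions:

    communication_ranks(grid, |rank, _, border_positions, _| {
        particles(grid, |i, particle| {
            border_positions(particle.get_position(i), @|pos| {
                // pos must be sent to `rank` as a ghost particle,
                // already PBC-corrected by the condition function
            });
        });
    });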
In miniMD, the 6-stencil communication pattern is hard-coded into the simulation, which makes the code inflexible to couple with other technologies. This also shows an important benefit of higher-order functions: code can be coupled with other technologies more easily by simply replacing functionality.
4.4. Benefits of Using AnyDSL

This section summarizes the benefits of using the AnyDSL framework for both tinyMD and MD applications in general.
Separation of Concerns.
We separate our force potential computation logic from device-specific code and rely on higher-order functions to map it to the proper target. This substantially reduces the effort of writing portable applications. We employ the same technique to abstract other procedures such as the MPI communication calls; here, we adopt higher-order functions to define communication patterns (see Section 4.3).
Layers of Abstractions.
We hide device-dependent code, like our data layouts (see Section 4.2) or the device mapping (see Section 4.1), behind functions. These abstractions have zero overhead due to the partial evaluator (see below).
Compile-Time Code Specialization.
We utilize Impala's partial evaluator to generate faster, specialized code when one or more parameters are known at compile-time. This significantly reduces the amount and complexity of the code while still being able to generate all desired variants at compile time.

5. Coupling tinyMD with waLBerla
In this section, we briefly present the fundamental concepts behind waLBerla needed to understand its load balancing mechanism. Its most important characteristic is its domain partitioning using a forest of octrees called a block forest. This kind of partitioning allows us to refine blocks in order to manage and distribute regions at smaller granularities. Furthermore, we explain how this block forest feature, written in C++, is integrated into our tinyMD Impala code.

waLBerla is a modern multi-physics simulation framework that supports the massive parallelism of current peta- and future exascale supercomputers. Domain partitioning in waLBerla is done through a forest of octrees. The leaf nodes, so-called blocks, can be coarsened or refined and distributed among different processes. Figure 2 depicts the waLBerla forest of octrees data structure with its corresponding domain partitioning.

For each local block, waLBerla keeps track of its neighbor blocks and the process ranks that own them. This information is important to determine which processes we must communicate with and against which blocks/subdomains we need to compare our particle positions.

To balance the simulation such that work is evenly distributed among the processes, it is first necessary to assign weights to each block in the domain. In this paper we use the number of particles located in a block as its weight. The weight is not only used during the distribution of blocks; it is also used to determine whether a block should be refined (when it reaches an upper threshold) or merged (when it reaches a lower threshold). Different algorithms are available in waLBerla to distribute the workload. They can be categorized into space-filling curves [29], graph partitioning, and diffusive schemes. In this paper we concentrate on space-filling curves, more specifically the Hilbert [30] and Morton [31] (or z-order) curves.

Coupling applications to combine their functionality can save a lot of effort, as one does not have to re-implement that functionality. In this paper, we chose to couple the load balancing implementation from waLBerla with tinyMD. As this part of the code does not have to be portable to different devices, we can rely on an external framework. One could try to implement portable code for it, but the amount of work and complexity to do so may not be worth the benefits. Instead, we exploit AnyDSL to optimize our simulation kernels, memory management, and other parts of the application where this is advantageous, and rely on existing implementations to avoid redoing work that yields no benefit in the end.

Figure 2: Schematic 2D illustration of the waLBerla block forest domain partitioning. On the left, the forest of octrees is depicted, where it is possible to observe the refinement of the blocks. On the right, the domain partitioning for the corresponding block forest is depicted. In 3D simulations, refined blocks produce 8 child blocks in the tree instead of 4.

Since our Impala implementation is eventually compiled to LLVM IR, we can link C/C++ code with it using Clang. Therefore, we write an interface between Impala and C++ in order to do the coupling. Section 4.3 already presented how our communication code is integrated through routines that fetch information provided by waLBerla. In this section, we focus on how these routines and other parts of the load balancing are written.
Figure 3: Transformation from the waLBerla block forest domain to tinyMD's regular AABB. The union of blocks is performed so that all blocks belonging to the current process are converted to tinyMD's simple AABB structure, which only supports rectangular domains. Cropping is performed to remove empty space and reduce the amount of allocated memory. In simulations where the empty space is not significant, the crop operation can be skipped.

The first part is to initialize the waLBerla data structures: the block forest is created using the bounding box information from tinyMD. When the block forest of a process is adjusted due to balancing, the tinyMD bounding box is changed as well; since tinyMD does not have a block forest implementation, we just transform the waLBerla block forest into a simple AABB structure. This is done by performing a union of all blocks that belong to the current process. In the end, the domain can occupy a larger region than necessary, but this does not affect the results, just the memory allocation.

In some extreme cases where we only fill part of the global domain with particles, the whole empty part of the domain can end up being assigned to a single process, causing it to allocate an amount of memory proportional to the global domain. To mitigate this issue, we simply define a process bounding box just large enough to contain its particles—in other words, we crop the domain. Since the tinyMD bounding box is only used to define cells and distribute the particles over them, this does not affect the correctness of the simulation. Figure 3 depicts how the union and cropping transform the grid in tinyMD.

To crop the AABB, a reduction operation is required. The following code shows how this is performed using the reduce_aabb function abstracted by the device:

    let b = AABB {
        xmin: grid.aabb.xmax,
        xmax: grid.aabb.xmin,
        // ... analogous for y and z
    };

    let red_aabb_fn = @|aabb1: AABB, aabb2: AABB| {
        AABB {
            xmin: select(aabb1.xmin < aabb2.xmin, aabb1.xmin, aabb2.xmin),
            xmax: select(aabb1.xmax > aabb2.xmax, aabb1.xmax, aabb2.xmax),
            // ... analogous for y and z
        }
    };

    let aabb = device().reduce_aabb(grid.nparticles, b, red_aabb_fn, @|i| {
        let pos = particle.get_position(i);
        AABB {
            xmin: pos.x,
            xmax: pos.x
            // ... analogous for y and z
        }
    });

Note that reduce_aabb is device-specific and is optimally implemented on the GPU, which also demonstrates the benefits obtained through the
Device abstraction presented in Section 4.1. One can also observe how simple it is with AnyDSL to execute device code with non-primitive data types—in this case the AABB structure. Furthermore, reductions can also be used to count the number of particles within a domain, which is useful to obtain the weights for the load balancing. The following code shows how this is written in tinyMD:

    let sum = @|a: i32, b: i32| { a + b };

    computational_weight = device().reduce_i32(nparticles, 0, sum, |i| {
        select(is_within_domain(particle.get_position(i), aabb), 1, 0)
    }) as u32;

    communication_weight = device().reduce_i32(nghost, 0, sum, |i| {
        select(is_within_domain(particle.get_position(nparticles + i), aabb), 1, 0)
    }) as u32;

The next step in the coupling is to provide a function in tinyMD to update the neighborhood of a process. We first build the list of neighbors using the waLBerla API in C++ and then call a tinyMD function to update the neighborhood data in tinyMD. An important part of this process is the conversion from the dynamic C++ container types to simple arrays of real_t (in the case of the boundaries) and integers (in the case of neighbor process ranks and the number of blocks per neighbor process). This conversion is necessary because (a) Impala does not support these dynamic C++ types and (b) code that executes on the GPU—more specifically the particle position checking presented in Section 4.3—also does not support these dynamic data types.

Finally, it is also necessary to perform serialization and de-serialization of our particle data during the balancing step. This is required because the block data must be transferred to the new process owners during the distribution. Since waLBerla is an extensible framework, it provides means to call custom procedures when a block must be moved from or to another process. This allows us to implement (de)serialization functions in tinyMD, which also update the local particles that are exchanged. The communication in these routines is entirely handled by waLBerla, hence no MPI calls are performed by tinyMD.
6. Evaluation
We evaluated tinyMD as well as miniMD on several CPU and GPU architectures. We chose the following CPUs:

Cascade Lake: Intel(R) Xeon(R) Gold 6246 CPU @ 3.30 GHz
Skylake: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40 GHz
Broadwell: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30 GHz

And the following GPUs:

Pascal: GeForce GTX 1080 (8 GB memory)
Turing: GeForce RTX 2080 Ti (11 GB memory)
Volta: Tesla V100-PCIe-32GB (32 GB memory)

We ran each simulation over 100 time steps, each with a time-step size of 0.005. We performed particle distribution over cells and reneighboring every 20 time steps, with double-precision floating point and full neighbor interaction. We use the Lennard-Jones potential with parameters ε = 1 and σ = 1 (see Equation 2). The particle setup in tinyMD is the same as in miniMD, as is the cutoff radius of 2.5 and the Verlet buffer of 0.3.

For multi-node benchmarks, particles are exchanged with neighbor ranks every 20 time steps before the distribution over cells and reneighboring, and communication to update the particle positions within ghost layers is done every time step with neighbor ranks.

For the CPU, tests were performed on a single core (no parallelism) with fixed frequency; we simulated a system configuration of unit cells with 4 particles per unit cell. For tinyMD, the Impala compiler generates Thorin code that is further compiled with Clang/LLVM 8.0.0. For miniMD, we use the Intel Compiler (icc) 18.0.5.

For the GPU benchmarks we simulate a system configuration of unit cells with 4 particles per unit cell. Both versions use the CUDA compilation tools V9.2.148, with CUDA driver version 10.2 and NVRTC version 9.1; for miniMD, the Kokkos variant is required to execute on GPUs.

Figure 4: Execution time in seconds (lower is better) for force calculation and neighbor list creation in a 100-time-step simulation on CPU architectures (panels A: miniMD (ref), B: miniMD (Kokkos), C: tinyMD (AoS), D: tinyMD (SoA), E: tinyMD (AoSoA); time split into Force, Neigh, and Other). Simulations were performed with unit cells of 4 particles each, on a single core (no parallelism) with fixed frequency. For AVX (Broadwell), AnyDSL emitted code with worse performance due to data gather and scatter operations; therefore results for scalar instructions are shown.

Figure 4 depicts the force calculation and neighbor list creation times of tinyMD and miniMD for the CPU architectures. For the AVX Broadwell architecture, tinyMD produced poorly vectorized code because of the gather and scatter
operations, and therefore the results for scalar operations are used; that is why the miniMD performance is much better on this architecture. Note that these pairwise interaction kernels do not provide a straightforward way of vectorizing, hence the compiler requires sophisticated vectorization analyses to achieve good performance on CPU cores. For the AVX512 processors Cascade Lake and Skylake, AnyDSL produced more competitive code compared to miniMD, although the generated code is still inferior to the one generated by the Intel compiler.

This demonstrates a limitation of AnyDSL: the auto-vectorizer is not capable of generating the most efficient variant for the potential kernels. This can be due either to (a) the compiler itself (AnyDSL/Impala code must be compiled with Clang) or (b) the transformations performed by the AnyDSL compiler.

In all cases, however, the neighbor list creation performance is better for tinyMD (on the AVX512 processors, it outperforms miniMD by more than a factor of two). This can be a result of both (a) specialized code generation for half- and full-neighbor list creation and (b) the use of vector instructions to check particle distances. As for the data layout, the structure of arrays is the best layout for the particle Vector3D data on CPU. This data layout enhances locality for data in the same dimension, and therefore speeds up gathering data into the vector registers.

Figure 5: Execution time in seconds (lower is better) for force calculation and neighbor list creation in a 100-time-step simulation on GPU architectures (panels A: miniMD (Kokkos), B: tinyMD (AoS), C: tinyMD (SoA), D: tinyMD (AoSoA)). Simulations were performed with unit cells of 4 particles each.

Figure 5 depicts the force calculation and neighbor list creation times of tinyMD and miniMD for the GPU architectures. For the Pascal and Volta architectures, all tinyMD variants outperform miniMD. It is also noticeable that the performance advantage comes from the force computation time, which demonstrates that tinyMD can generate performance-portable MD kernels for different devices. On the Turing architecture, miniMD is slightly faster than the slower tinyMD variants, while the fastest one (array of structures) moderately outperforms miniMD. The time difference between the distinct Vector3D array data layouts on GPU is mostly concentrated in the neighbor list creation, where we can observe that the array of structures delivers the best performance.
For the CPU weak scalability tests we chose the same miniMD setup used for the single-core evaluation. For every node involved, a system of unit cells was included in the simulation domain; hence, the total number of particles simulated is proportional to the number of nodes.

The tests were executed on the SuperMUC-NG supercomputer. MPI ranks are mapped to the cores of two Intel Skylake Xeon Platinum 8174 processors (24 physical cores each) on each node. The processors are accompanied by a total of 96 GB of main memory. All physical cores were used, resulting in 48 MPI ranks per node.

Figure 6: Weak-scaling comparison between tinyMD and miniMD for CPU on SuperMUC-NG (time to solution in seconds over node count). For 256 and 1024 nodes, miniMD crashed with memory violation errors; hence, those results were interpolated.

Figure 6 depicts the time to solution for both tinyMD and miniMD executing on different numbers of nodes. Although miniMD delivered superior results in our single-core experiments, tinyMD performs better in this configuration. This happens because here the neighbor list creation and communication times outweigh miniMD's faster force field kernel. tinyMD maintains perfect scaling for all node counts, whereas miniMD's parallel efficiency starts to degrade beyond 512 nodes. tinyMD therefore provides very competitive weak-scaling results in comparison to state-of-the-art applications.

We performed the experimental tests for GPU weak scalability on the Piz Daint supercomputer using its XC50 compute nodes. Each node consists of an NVIDIA Tesla P100 16GB GPU, together with an Intel Xeon E5-2690 v3 @ 2.60 GHz processor with 12 cores and 64 GB of main memory. Each MPI rank is mapped to a GPU in our simulation—so one rank per node. For each GPU, a system of unit cells was included in the simulation domain, resulting in a total number of particles proportional to the number of GPUs.

Figure 7 depicts the time to solution for tinyMD on GPU. From 1 to 4 nodes, tinyMD executes faster than on more nodes, which is expected because with fewer than 8 nodes there is no remote communication in all directions. Since GPU compute kernels are much faster than CPU ones, remote communication consumes a larger fraction of the total time, and this can affect the scalability. From 8 to 1024 nodes, where remote communication is performed in all directions, it is reasonable to state that tinyMD presents perfect scalability.

Figure 7: Weak-scaling results for tinyMD on the Piz Daint supercomputer (time to solution in seconds over node count). Each process is mapped to a node with one NVIDIA Tesla P100 16GB GPU.
For miniMD, we were not able to produce weak-scalability results because we could not compile it properly on the Piz Daint GPU cluster. The compiler versions currently available on the cluster either reported errors during compilation or produced faulty miniMD binaries that failed with unclear MPI errors at runtime. Nevertheless, the presented results for tinyMD are enough to demonstrate its weak-scaling capability on GPU clusters.
For the load balancing experiments, we again execute our tests on SuperMUC-NG with 48 CPU cores per node. In these experiments, the Spring-Dashpot contact model was used, with the stiffness and damping values set to zero, which means that the particles remain static during the simulation. In order to measure the balancing efficiency, we distribute the particles evenly over half of the domain, using a diagonal axis to separate the regions with and without particles. When more nodes are used in the simulation, the domain is extended in the same proportion; hence the ratio of particles to nodes remains the same.

We perform 1000 time steps during the simulation, and the load balancing is performed before the simulation begins. This provides a good way to measure the load balancing efficiency, because we can expect a speedup close to a factor of two. We use a system of unit cells per node (48 cores), with a fixed number of particles per node. We performed experiments with the Hilbert and Morton space-filling curve methods to balance the simulation.

Figure 8: Load-balancing results for tinyMD on SuperMUC-NG with the Spring-Dashpot contact model (time to solution in seconds for the Morton, Hilbert, and imbalanced runs). For each node involved, the domain is extended by a system of unit cells, and particles are then distributed over half of the domain using a diagonal axis.

Figure 8 depicts the time to solution for the load balancing experiments. Both the balanced and the imbalanced simulations scale well, and the performance benefit of the balanced simulations is clearly visible. As expected, the benefit is close to a factor of two, and the difference between the Morton and Hilbert methods is not significant, with Hilbert being more efficient at some node counts.

Although the load balancing feature works in GPU simulations, the communication code using the waLBerla domain partitioning takes a considerable fraction of their time. A different strategy in this communication code is therefore necessary for GPU accelerators to reduce the fraction of communication time relative to the potential kernels and thus benefit from load balancing. For this reason, we only show the experimental results on SuperMUC-NG CPU compute nodes.

7. Conclusion

This paper presents tinyMD: an efficient, portable, and scalable implementation of an MD application using the AnyDSL partial evaluation framework. To evaluate tinyMD, we compare it with miniMD, which is implemented in C++ and relies on the Kokkos library to be portable to GPU accelerators. We discuss the implementation differences regarding code portability, data layout, and MPI communication.

To achieve performance portability on most recent processors and supercomputers, we provide abstractions in AnyDSL that allow our application to be mapped to distinct hardware and be compiled with different data layouts. All this can be done with zero overhead due to partial evaluation, which is one of the main advantages we get from using AnyDSL.

Moreover, we couple our application with the waLBerla multi-physics framework to use its load balancing mechanism within tinyMD simulations. This emphasizes how our Impala code can be coupled to other implementations, avoiding work that would otherwise not generate benefits. Furthermore, we show how this coupling is made easier by using higher-order functions in the tinyMD communication code to abstract the communication pattern. This permits us to insert the waLBerla communication logic into tinyMD by simply replacing functionality.

Performance results show that tinyMD is competitive with miniMD on both single CPU cores and single GPU accelerators, as well as on supercomputers running on several compute nodes. Weak scalability results on the SuperMUC-NG and Piz Daint supercomputers demonstrate that tinyMD achieves perfect scaling on top supercomputers for the presented benchmarks. The load-balancing results on SuperMUC-NG demonstrate that our strategy for coupling tinyMD and waLBerla works as expected, since the balanced simulations in our experiments reach a speedup close to a factor of two compared to the imbalanced simulations when filling just half of the domain.

References
[1] H. C. Edwards, C. R. Trott, Kokkos: Enabling performance portability across manycore architectures, in: 2013 Extreme Scaling Workshop (XSW 2013), 2013, pp. 18–24. doi:10.1109/XSW.2013.7.

[2] M. Bauer, S. Eibl, C. Godenschwager, N. Kohl, M. Kuron, C. Rettinger, F. Schornbaum, C. Schwarzmeier, D. Thönnes, H. Köstler, U. Rüde, waLBerla: A block-structured high-performance framework for multiphysics simulations, Computers and Mathematics with Applications. doi:10.1016/j.camwa.2020.01.007.

[3] C. Godenschwager, F. Schornbaum, M. Bauer, H. Köstler, U. Rüde, A framework for hybrid parallel flow simulations with a trillion cells in complex geometries, in: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 1–12. doi:10.1145/2503210.2503273.

[4] F. Schornbaum, U. Rüde, Massively parallel algorithms for the lattice Boltzmann method on nonuniform grids, SIAM Journal on Scientific Computing 38 (2) (2016) C96–C126. doi:10.1137/15M1035240.

[5] P. A. Cundall, O. D. L. Strack, A discrete numerical model for granular assemblies, Géotechnique 29 (1) (1979) 47–65. doi:10.1680/geot.1979.29.1.47.

[6] J. Schmitt, H. Köstler, J. Eitzinger, R. Membarth, Unified code generation for the parallel computation of pairwise interactions using partial evaluation, in: 2018 17th International Symposium on Parallel and Distributed Computing (ISPDC), 2018, pp. 17–24. doi:10.1109/ISPDC2018.2018.00012.

[7] D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark, H. J. C. Berendsen, GROMACS: fast, flexible, and free, Journal of Computational Chemistry 26 (16) (2005) 1701–1718. doi:10.1002/jcc.20291.

[8] M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, E. Lindahl, GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers, SoftwareX 1-2 (2015) 19–25. doi:10.1016/j.softx.2015.06.001.

[9] S. Páll, M. J. Abraham, C. Kutzner, B. Hess, E. Lindahl, Tackling exascale software challenges in molecular dynamics simulations with GROMACS, in: S. Markidis, E. Laure (Eds.), Solving Software Challenges for Exascale, Springer International Publishing, Cham, 2015, pp. 3–27.

[10] S. Páll, B. Hess, A flexible algorithm for calculating pair interactions on SIMD architectures, Computer Physics Communications 184 (12) (2013) 2641–2650. doi:10.1016/j.cpc.2013.06.003.

[11] S. Plimpton, Fast parallel algorithms for short-range molecular dynamics, Journal of Computational Physics 117 (1) (1995) 1–19. doi:10.1006/jcph.1995.1039.

[12] W. M. Brown, A. Kohlmeyer, S. J. Plimpton, A. N. Tharrington, Implementing molecular dynamics on hybrid high performance computers – particle–particle particle-mesh, Computer Physics Communications 183 (3) (2012) 449–459. doi:10.1016/j.cpc.2011.10.012.

[13] M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko, D. K. Panda, Scalable miniMD design with hybrid MPI and OpenSHMEM, in: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, PGAS 2014, Eugene, OR, USA, October 6-10, 2014, pp. 24:1–24:4. doi:10.1145/2676870.2676893.

[14] L. Rieber, S. Mahony, miniMDS: 3D structural inference from high-resolution Hi-C data, Bioinformatics 33 (14) (2017) i261–i266. doi:10.1093/bioinformatics/btx271.
[15] S. Eibl, U. Rüde, A modular and extensible software architecture for particle dynamics, Proceedings of the 8th International Conference on Discrete Element Methods (DEM8).

[16] S. Eibl, U. Rüde, A local parallel communication algorithm for polydisperse rigid body dynamics, Parallel Computing 80 (2018) 36–48. doi:10.1016/j.parco.2018.10.002.

[17] S. Eibl, U. Rüde, A systematic comparison of runtime load balancing algorithms for massively parallel rigid particle dynamics, Computer Physics Communications 244 (2019) 76–85. doi:10.1016/j.cpc.2019.06.020.

[18] R. Salomon-Ferrer, D. A. Case, R. C. Walker, An overview of the Amber biomolecular simulation package, WIREs Computational Molecular Science 3 (2) (2013) 198–210. doi:10.1002/wcms.1121.

[19] B. Brooks, C. Brooks, A. MacKerell, L. Nilsson, R. Petrella, B. Roux, Y. Won, G. Archontis, C. Bartels, S. Boresch, A. Caflisch, L. Caves, Q. Cui, A. Dinner, M. Feig, S. Fischer, J. Gao, M. Hodoscek, W. Im, M. Karplus, CHARMM: The biomolecular simulation program, Journal of Computational Chemistry 30 (10) (2009) 1545–1614. doi:10.5167/uzh-19245.

[20] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kalé, K. Schulten, Scalable molecular dynamics with NAMD, Journal of Computational Chemistry 26 (16) (2005) 1781–1802. doi:10.1002/jcc.20289.

[21] S. Karol, T. Nett, J. Castrillon, I. F. Sbalzarini, A domain-specific language and editor for parallel particle methods, ACM Trans. Math. Softw. 44 (3). doi:10.1145/3175659.

[22] J. Gibbons, N. Wu, Folding domain-specific languages: Deep and shallow embeddings (functional pearl), SIGPLAN Not. 49 (9) (2014) 339–347. doi:10.1145/2692915.2628138.

[23] R. Leißa, K. Boesche, S. Hack, R. Membarth, P. Slusallek, Shallow embedding of DSLs via online partial evaluation, ACM SIGPLAN Notices 51 (2015) 11–20. doi:10.1145/2936314.2814208.

[24] R. Leißa, K. Boesche, S. Hack, A. Pérard-Gayot, R. Membarth, P. Slusallek, A. Müller, B. Schmidt, AnyDSL: a partial evaluation framework for programming high-performance libraries, PACMPL 2 (OOPSLA) (2018) 119:1–119:30. doi:10.1145/3276489.

[25] R. Leißa, M. Köster, S. Hack, A graph-based higher-order intermediate representation, in: Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015, San Francisco, CA, USA, February 07-11, 2015, pp. 202–212. doi:10.1109/CGO.2015.7054200.

[26] C. Consel, New insights into partial evaluation: the SCHISM experiment, in: ESOP '88, 2nd European Symposium on Programming, Nancy, France, March 21-24, 1988, Proceedings, 1988, pp. 236–246. doi:10.1007/3-540-19027-9_16.

[27] L. Verlet, Computer "experiments" on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules, Phys. Rev. 159 (1967) 98–103. doi:10.1103/PhysRev.159.98.

[28] S. Moll, S. Hack, Partial control-flow linearization, SIGPLAN Not. 53 (4) (2018) 543–556. doi:10.1145/3296979.3192413.

[29] M. Bader, Space-filling Curves - An Introduction with Applications in Scientific Computing, Vol. 9, Springer, 2013. doi:10.1007/978-3-642-31046-1.