RUMD: A general purpose molecular dynamics package optimized to utilize GPU hardware down to a few thousand particles
Nicholas P. Bailey, Trond S. Ingebrigtsen, Jesper Schmidt Hansen, Arno A. Veldhorst, Lasse Bøhling, Claire A. Lemarchand, Andreas E. Olsen, Andreas K. Bacher, Lorenzo Costigliola, Ulf R. Pedersen, Heine Larsen, Jeppe C. Dyre, Thomas B. Schrøder
“Glass and Time”, IMFUFA, Dept. of Science and Environment, Roskilde University, Roskilde, Denmark

*[email protected] **[email protected]

Abstract
RUMD is a general purpose, high-performance molecular dynamics (MD) simulation package running on graphical processing units (GPUs). RUMD addresses the challenge of utilizing the many-core nature of modern GPU hardware when simulating small to medium system sizes (roughly from a few thousand up to a hundred thousand particles). It has a performance that is comparable to other GPU-MD codes at large system sizes and substantially better at smaller sizes. RUMD is open-source and consists of a library written in C++ and the CUDA extension to C, an easy-to-use Python interface, and a set of tools for set-up and post-simulation data analysis. The paper describes RUMD's main features, optimizations and performance benchmarks.
This paper describes the Roskilde University Molecular Dynamics (RUMD) package. RUMD is a molecular dynamics [1, 2] (MD) code running on Graphical Processing Units (GPUs) from NVIDIA. RUMD was developed to achieve good performance at small and medium system sizes, while remaining competitive with other GPU-MD codes at large sizes. The attention paid to small sizes distinguishes RUMD from many other GPU-MD codes. It has been in development since 2008, and available as open-source software [3] since 2011. The newest version, 3.3, was released in June 2017.

The rise of GPU-based computation has been discussed by various authors [4-10]. MD is a good candidate for GPU-acceleration, as discussed by Stone et al. [6], since it involves a reasonably high arithmetic intensity, that is, number of floating point operations per memory access. Several groups have developed MD codes based on GPUs from scratch or incorporated GPU-acceleration into existing projects. Examples of the former include HOOMD-Blue [11-14], ACEMD [15], OpenMM [16, 17] and HAL's MD [18], while the latter include NAMD [19], LAMMPS [20], AMBER [21, 22], Gromacs [23] and GENESIS [24]. Other works involving GPU-based MD codes, going back to 2007, can be found in Refs. [25-36]. We omit a detailed exposition of GPU programming basics here; for a good overview of massive multi-threading using CUDA see the relevant section in the article by Anderson et al. [11]. For further information the reader can consult the book by Kirk and Hwu [37] as well as the CUDA programming guide [38].

The large computational power of modern GPUs comes primarily from the large number of hardware cores, each executing a number of software threads. Here we focus on two of the most recent architectures, Kepler (2012) [39] and Pascal (2016) [40]. As an example, the GeForce Gtx780Ti card (Kepler architecture) has 2880 cores and a theoretical single-precision peak-performance of 5.0 TFlops (5.0 × 10¹² floating point operations per second). A key element to achieve good performance from a GPU is that the number of active software threads should be much larger than the number of hardware cores in order to hide the latency of memory access. This makes it a challenge to utilize the GPU hardware when the number of particles N is relatively small (N ∼ 10³-10⁴). The obvious choice for parallelization, namely having one thread compute the forces for one particle, is clearly not efficient when the optimal number of threads exceeds the number of particles. There are three reasons to focus on utilizing the GPU hardware even at small system sizes: i) Simulating long time scales rather than large systems. This is of interest, for example, in the field of glass-forming liquids. Here a system size of order 10⁴ particles is considered large, but the interest is in studying as long time scales as possible. Note that finite size effects are relatively limited in these systems; for example Karmakar, Dasgupta, and Sastry [41] found convergence of diffusivity and relaxation time for a standard model glass-former already at N = 1000. ii) As a building block for multi-GPU simulations [42] (RUMD currently uses one GPU per simulation). If one wants to simulate a large system using 10 GPUs, the single-GPU performance obviously needs to be good for one tenth of that system size. iii) Much of the future development in GPU and other many-core hardware will probably be in increasing the number of physical cores.
Thus, what might today be considered a large system might in the future be considered a small/medium sized system where special care needs to be taken to utilize the GPU hardware. To optimize the use of the hardware, RUMD allows multiple threads per particle; this approach has also been considered in two recent publications [43, 44]. Finally, we note a very recent paper describing the use of large ensembles of MD simulations of small systems [45], an approach which would also be very much suited to running on GPUs.

The paper is organized as follows. Section 2 contains a brief overview of RUMD's features. The main part of the paper focuses on the methods used for calculating the non-bonding pair interactions and the generation of the neighbor-list. These are the most computationally demanding parts of an MD simulation and where RUMD distinguishes itself from most other GPU-MD codes. Section 3 discusses the challenges of utilizing the GPU hardware at small system sizes, and section 4 gives an overview of the optimization strategies employed in RUMD. Section 5 describes the calculation of non-bonding pair-forces, while sections 6 and 7 describe two different methods for generating the neighbor-list. Section 8 provides benchmarks of RUMD in comparison to three different GPU extensions of LAMMPS [20], as well as an analysis of the effect of the different optimizations employed in RUMD. Section 9 describes RUMD's performance for electrostatic (Coulomb) interactions. Section 10 provides a short summary.

    # Import RUMD
    import rumd
    from rumd.Simulation import Simulation

    # Create a simulation object, and import an initial configuration
    sim = Simulation("start.xyz.gz")

    # Create a pair potential and associate it with the simulation object
    pot = rumd.Pot_LJ_12_6(cutoff_method=rumd.ShiftedForce)
    pot.SetParams(i=0, j=0, Sigma=1.0, Epsilon=1.0, Rcut=2.5)
    sim.SetPotential(pot)

    # Create an integrator and associate it with the simulation object
    itg = rumd.IntegratorNVT(timeStep=0.004, targetTemperature=1.0)
    sim.SetIntegrator(itg)

    # Run a simulation. Data are saved on disk and can be analyzed by a number of tools
    sim.Run(1000000)

Figure 1: Script showing the Python code needed to run a simple simulation, in this case a single-component Lennard-Jones fluid simulated at constant temperature 1.0 for one million time steps of size 0.004 (Lennard-Jones units). The number of particles and the density are determined by the initial configuration contained in the file start.xyz.gz.
Below we list the main features of RUMD; for more information please see the tutorial and user manual included with the software and available from the project's website rumd.org.
Python interface: The user controls the software via a Python interface which allows simulations of considerable complexity to be implemented straightforwardly. An example of a simple user Python-script is given in Fig. 1.
Pair potentials: Several methods of truncating the pair potential are provided: simple truncation with no shift, truncation plus shift of the potential energy to ensure continuity, and truncation plus shift of the pair force [46] to ensure its continuity (this corresponds to adding a linear term in the potential).
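For reference, the shifted-force truncation mentioned above has the standard form of Ref. [46] (quoted here for convenience; this is the textbook expression, not code taken from RUMD):

    v_SF(r) = v(r) - v(r_c) - (r - r_c) v'(r_c)   for r < r_c,   and zero otherwise,

so that both the pair potential and the pair force vanish continuously at the cutoff r_c; the subtracted term linear in r is the "linear term in the potential" referred to above.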
Other interactions: Intramolecular interactions, including both rigid-body constraints [47, 48] and flexible terms: bond-stretching forces, angular forces and dihedral forces. Three-body angle-dependent non-bonding interactions [49]. Many-body interactions for metals based on effective medium theory [50].
Integrators: NVE (Verlet/leap-frog), NVT (Nosé-Hoover), NPT [51], NVU (geodesics on the constant potential energy surface) [52, 53]. Couette shear flow using the SLLOD equations of motion and Lees-Edwards boundary conditions.
File formats: Configurations are stored in the xyz format with extensions, compressed using gzip; data can be saved block-wise logarithmically in time for efficient use of disk space while allowing the study of a large range of time scales in a single simulation; molecular structure (bonds, angles and dihedrals) is specified in separate topology files. Tools for creating initial configurations and topology files are provided.
Analysis tools: Basic statistics of energy, pressure, etc. for thermodynamics. Measures of structure: radial distribution function, static structure factor, radius of gyration, mean-square end-to-end distance. Measures of dynamics: mean-square displacement, incoherent intermediate scattering function, non-Gaussian parameter, end-to-end vector autocorrelation function, Rouse-mode autocorrelation function. New analysis tools are added regularly. Analysis tools work on data stored during simulations and can be applied at the end of or during a simulation. The user may define customized on-the-fly analysis tools written in Python.
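As a reminder of the conventions behind two of these dynamical quantities, the standard definitions (as given in, e.g., Refs. [1, 2]) are

    ⟨Δr²(t)⟩ = (1/N) Σ_j ⟨|r_j(t) - r_j(0)|²⟩,    F_s(q, t) = (1/N) Σ_j ⟨exp(i q · [r_j(t) - r_j(0)])⟩;

the normalizations used by the analysis tools are assumed to follow these standard forms.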
Autotuner: A script for optimizing internal parameters: specifically, the choice of algorithm for generating the neighbor list, the neighbor-list skin size, and the way the generation of the neighbor list and the calculation of non-bonding forces are distributed among the GPU threads. The autotuner is described in the appendix.

RUMD is mostly implemented in single precision. This leads to a drift in the total energy when running long constant-energy (NVE) simulations, but is not an issue for NVT and NPT simulations where a thermostat is applied. RUMD can be made fully double precision by a search and replace in the source code; we are planning to implement a more elegant way for the user to choose between single and double precision. RUMD uses a single GPU per simulation; support for multiple-GPU simulations is planned for future development.
Consider NVIDIA's Kepler GK110 architecture that appeared in 2013. One of the Kepler design goals was power efficiency, which was partly achieved by increasing the number of cores while decreasing the clock speed compared to the previous Fermi architecture. Thus each streaming multiprocessor (of type SMX) has 192 cores, and the GPU has up to 15 SMX units. The GTX 780Ti card contains the maximum 15 SMX units, giving 2880 cores. Furthermore, the number of software threads needs to be much larger than the number of physical cores, in order to hide memory access latency. This poses a challenge when small systems of the order of thousands of particles are concerned.
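As a rough illustration of the numbers involved (combining figures quoted elsewhere in this paper rather than presenting a new measurement): with a single thread per particle, a system of N = 2048 particles supplies fewer threads (2048) than the 2880 cores of the GTX 780Ti, whereas with t_p = 8 threads per particle the kernels launch

    N_threads = N · t_p = 2048 × 8 = 16384 ≫ 2880,

in line with the autotuner's tendency (Table 2) to keep N · t_p at roughly 16000 or above.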
Figure 2: Schematic diagram representation of the two algorithms for neighbor-list generation and the force calculation algorithm. The latter uses multiple threads per particle (t_p), but an implementation also exists for the special case t_p = 1.

In order to use as many threads as possible, one must therefore have multiple threads computing the force on one particle. Having multiple threads per particle entails some overhead, in particular the summing of the force contributions over the threads allocated to a given particle. This means that as the system size increases, it becomes less useful to have more than one thread per particle. We control this by the parameter t_p (threads per particle, denoted TPerPart in the code), and let the autotuner pick the optimal value for a given simulation. The optimal value of t_p depends primarily on the number of particles, but also on density and the range of the potential. We use a separate kernel involving a single thread per particle for larger sizes (see Fig. 2); this is faster than setting t_p = 1 in the general kernel.

Rovigatti et al. have recently discussed the possible advantages of "vertex-based" (atom-decomposition [54], one thread per particle) versus "edge-based" (force-decomposition [54], one thread per interaction) parallelism [55]. Our approach includes the former and a range of intermediate cases, while not taking it to the extreme of one thread per interaction.

As in any general purpose MD software, a data structure to keep track of neighbors for the non-bonding pair interactions is necessary to reduce the complexity of the force calculation from O(N²) to O(N) [1, 2]. We use a classical Verlet-type neighbor list, stored as a two-dimensional fixed-size array of size N × n_max, where n_max is the assumed maximum number of neighbors per particle. If this happens to be exceeded, the neighbor-list is automatically re-allocated with doubled capacity. For smaller systems we set n_max = N from the start to avoid the overhead of checking for overflow. Neighbors within r_c + s are listed, where r_c is the maximum cut-off associated with the potential, and s is the extra skin included so that the neighbor-list does not need to be rebuilt every step. The optimal value of the skin is determined by the autotuner.

We now describe the methods employed in the calculation of short-range non-bonding forces and the generation of the neighbor-list. The main four optimizations are as follows:

1. Multiple threads per particle (t_p ≥ 1) in force calculation and neighbor-list generation. The autotuner chooses the best value for t_p.

2. Two methods for rebuilding the neighbor-list: an O(N²) method (t_p ≥ 1) for small system sizes, and an O(N) method (t_p = 1) for larger sizes. The autotuner picks the best method.

3. Use of the so-called "read only data-cache" for reading positions (for NVIDIA devices of so-called "compute capability" [38] at least 3.5 this can be done straightforwardly via the function __ldg()).

4. Use of pre-fetching when reading from the neighbor-list to compensate for memory access latency.

The force calculation kernel (the routine executed on the GPU) is shown in Fig. 3. Short-hand notation for common quantities used in this and the following CUDA-kernels is given in Table 1.
Table 1: Short-hand notation for common quantities used in CUDA-kernels.

  quantity                        name in kernel   CUDA variable
  Number of thread-blocks         NumBlocks        gridDim.x
  Particles per block (p_b)       PPerBlock        blockDim.x
  Threads per particle (t_p)      TPerPart         blockDim.y
  Particle index within block     MyP              threadIdx.x
  Thread index w.r.t. particle    MyT              threadIdx.y
  Index of thread-block           MyB              blockIdx.x
  Global index of particle        MyGP             MyP + MyB*PPerBlock
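To make the conventions of Table 1 concrete, the following sketch shows how a kernel using this thread layout could be launched and how a thread recovers its particle and per-particle thread indices. It is an illustration under our own assumptions (the kernel name, empty argument list and host wrapper are ours), not RUMD's actual host code:

    #include <cuda_runtime.h>

    // Illustrative kernel using the index conventions of Table 1.
    __global__ void ExampleKernel()
    {
        int MyP  = threadIdx.x;                // particle index within block
        int MyT  = threadIdx.y;                // thread index w.r.t. this particle
        int MyB  = blockIdx.x;                 // index of the thread-block
        int MyGP = MyP + MyB * blockDim.x;     // global index of particle
        // Each of the blockDim.y (= t_p) threads sharing a MyGP value would handle
        // every t_p-th neighbor of particle MyGP (cf. the loop in Fig. 3).
        (void)MyT; (void)MyGP;                 // placeholders; a real kernel would use these
    }

    // Host side: blockDim.x = p_b particles per block, blockDim.y = t_p threads per
    // particle, and enough blocks (gridDim.x) to cover all N particles.
    void LaunchExample(int numParticles, int PPerBlock, int TPerPart)
    {
        dim3 threadsPerBlock(PPerBlock, TPerPart);
        dim3 numBlocks((numParticles + PPerBlock - 1) / PPerBlock);
        ExampleKernel<<<numBlocks, threadsPerBlock>>>();
    }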
    __global__ void Calcf_NBL_tp( ... )
    {
        [ Declare shared memory ]
        float4 my_f = {0.0f, 0.0f, 0.0f, 0.0f};   // Initialize the force of this thread
        float4 my_r = LOAD(r[MyGP]);              // Read position of this particle
        int type_i = __float_as_int(my_r.w);      // Type of this particle
        [ Read information on the simulation box from device memory ]
        [ Copy potential parameters to shared memory ]
        __syncthreads();                          // Parameters loaded to shared memory before proceeding
        int nb, my_num_nbrs = num_nbrs[MyGP];     // Read number of neighbors
        nb_prefetch = nbl[nvp*MyT + MyGP];        // Read index of first neighbor
        for (int i=MyT; i<my_num_nbrs; i+=TPerPart) {
            [ Pre-fetch the index of the next neighbor, then compute the interaction with
              the current neighbor and add it to my_f ]
        }
        [ Sum the force contributions of the TPerPart threads belonging to this particle
          and write the total force to device memory ]
    }

Figure 3: The force calculation kernel; bracketed lines summarize blocks of code.

The pair potential to be used is chosen via a template argument, which determines how the function computing a single pair interaction, ComputeInteraction, is to be called. Templating is also used for some of the other user-chosen variables, including the type of boundary conditions (represented by a SimulationBox class) and the cutoff-method. This means that the force calculation kernel is compiled for all possible combinations of these parameters, and the user can choose the appropriate one at run time. The code for the conditional statements which allows this is tedious, but is generated automatically by a Python script. The main disadvantage of using templating is that it increases the compile time considerably.

6 Neighbor-list generation: Order-N²

This neighbor-list generating algorithm (see Fig. 4) has O(N²) complexity and is thus suitable only for small system sizes. In a serial code there would be a double loop; in a parallel code one loop (over particles whose neighbors are to be found) is handled completely by parallelization. Part of the loop over "other" particles is handled by looping over t_p-sized groups, while parallelization (the t_p threads for that particle) accounts for looping within these groups (we do not make use of Newton's third law). Shared memory is used to reduce the amount of reads from device memory; in a straightforward implementation without shared memory a total of N² reads of particle positions is necessary. By using a block-wise reading into the shared memory, this is reduced to N²/p_b, where p_b is the number of particles in a block (denoted PPerBlock in the code). From this consideration p_b should be as large as possible, but on the other hand a too large p_b value would mean that the number of blocks (≈ N/p_b) becomes too small to utilize all the available SMX multiprocessors. RUMD uses the autotuner to pick the optimal value of p_b.

The kernel uses t_p threads for a given particle to search for neighbors. This means that we have to deal with the situation that several or all of them find a neighbor at the same time, and the writing to the neighbor list should be performed without race-conditions. This is achieved by a so-called atomic operation. When several threads perform an atomic operation on the same variable, all operations are guaranteed to be performed in (an unspecified) sequential order. Here we use the atomic increment function, atomicInc(), which ensures that the number of neighbors is counted correctly. When a thread calls atomicInc(), the function returns the value the variable had before the increment of the given thread is performed. This is here used to specify a unique position in the neighbor list (nextNbrIdx).

The information about whether the neighbor-list needs to be rebuilt resides on the device, generated by a different kernel. The kernel in Fig. 4 is called at every timestep and checks via if(updateRequired) whether there is anything to be done. This is faster than copying the value of a flag to the host and having the host decide whether to launch the rebuild-kernel.
    __global__ void calculateNBL_N2( ... )
    {
        const unsigned int tid = MyP + MyT*PPerBlock;   // Thread-index within block
        [ Declare shared memory: s_r, s_Count, s_cut_skin2 ]
        if (updateRequired[0]) {
            if (MyT==0) s_Count[MyP] = 0;               // Count of neighbors for this particle
            [ Copy cut-offs plus skin squared to shared memory ]
            float4 my_r = r[MyGP];                      // Position of this particle
            // Loop over blocks of particles
            for (FirstGP=0; ... ) {
                [ Read a block of positions into shared memory; the TPerPart threads of this
                  particle test the distances to them and insert neighbors into the
                  neighbor-list using atomicInc() ]
            }
            [ Store this particle's number of neighbors and its position; one thread per
              block decrements updateRequired with atomicDec() when the block is done ]
        }
    }

Figure 4: Kernel for order-N² neighbor-list generation; bracketed lines summarize blocks of code.

updateRequired is initially equal to the number of thread-blocks. One thread from each block decrements it with an atomic operation (atomicDec()) when it (its thread-block) is done, so that when all blocks are finished, it is zero. At the next time step, assuming no particles have moved more than half the skin distance, updateRequired will still be zero and therefore the kernels immediately exit. Using an atomic operation to decrement updateRequired is necessary because the thread-blocks execute asynchronously, so none of them knows when/whether the others are finished, or even started; any unfinished blocks need to see a non-zero value of the counter.

The above means that for small systems the simulations are performed entirely on the GPU without any communication with the CPU (except when output is required). Avoiding the overhead associated with communication between the GPU and CPU is important for the performance at small system sizes.

7 Neighbor-list generation: Order-N

The order-N algorithm is based on a cell-index method [1, 2] and involves (1) dividing the simulation box into rectangular spatial cells whose size is related to the potential cutoff; (2) associating particles with the appropriate cell based on the coordinates; (3) sorting the particles according to cell-index and rearranging all particle data to the sorted order (this can be done quickly with the Thrust library [56]). The advantage of rearranging the particle data to the sorted order is two-fold: i) the information about which particles are in a given cell can be stored simply as two integers indicating the first (cellStart) and the last particle (cellEnd); ii) better performance of the data-cache when reading the particle information, both in the neighbor-list creation and in the force calculation.

    [ One thread per particle; for the particle with global index gtid: compute its cell
      coordinates (calculateCellCoordinates) and loop over the neighboring cells in
      x, y and z (dX, dY, dZ), computing otherCellIndex via calculateCellIndex ]
                Start = cellStart[otherCellIndex];
                End = cellEnd[otherCellIndex];
                for (int OtherP=Start; OtherP<=End; OtherP++) {
                    if (gtid != OtherP) {
                        float4 r_i = LOAD(r[OtherP]);
                        [ Read squared cutoff distance from shared memory based on types ]
                        [ Calculate squared distance dist2 ]
                        if (dist2 < RcutSk2)
                            [ If space insert index into NB-list and increment Count, else break ]
                    }
                }      // end for (int OtherP ...)
            }}}        // end for (int dZ ...)
        [ Store this particle's number of neighbors ]
        [ Store its position so can check when rebuild needed ]
        [ Detect whether ran out of space and set flag to inform host ]
        if (gtid==0) updateRequired[0] = 0;
    }  // if (gtid < numParticles)

Figure 5: Kernel for order-N neighbor-list generation. calculateCellCoordinates(...) calculates the coordinates of the cell that a particle belongs to. calculateCellIndex(...) calculates the index of a cell given its coordinates. The arrays cellStart and cellEnd contain the indices of the first and last particles, respectively, associated with a given cell.

The kernel in Fig. 5 is called after steps (1) to (3) have been carried out via a series of small kernels and Thrust operations. It involves, for a given particle, identifying its cell coordinates and looping over neighboring cells in three dimensions to find neighbors.
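To illustrate what the two helper functions named in the caption of Fig. 5 do, a minimal sketch is given below. The signatures, the binning convention and the periodic wrapping are our assumptions for the purpose of illustration and are not copied from the RUMD source:

    // Map a position to integer cell coordinates, assuming coordinates lie in
    // [-L/2, L/2) in each direction (convention assumed for this sketch).
    __device__ int3 calculateCellCoordinates(float4 r, float3 boxLength, int3 numCells)
    {
        int3 c;
        c.x = (int)floorf((r.x / boxLength.x + 0.5f) * numCells.x);
        c.y = (int)floorf((r.y / boxLength.y + 0.5f) * numCells.y);
        c.z = (int)floorf((r.z / boxLength.z + 0.5f) * numCells.z);
        return c;
    }

    // Map (possibly out-of-range) cell coordinates to a linear cell index, wrapping
    // periodically so that the loop over neighboring cells can cross the box boundary.
    __device__ int calculateCellIndex(int3 c, int3 numCells)
    {
        int cx = (c.x + numCells.x) % numCells.x;
        int cy = (c.y + numCells.y) % numCells.y;
        int cz = (c.z + numCells.z) % numCells.z;
        return cx + numCells.x * (cy + numCells.y * cz);
    }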
We have chosen cell lengths in each direction to be of order (not less than) (r_c + s)/2, where s is the neighbor-list skin. This means that the loop extends to plus/minus two cells in each direction, or 125 cells altogether. Such a choice of cell length means one searches a volume 58% of that searched when using cells of length r_c + s. (One must search for neighbors in a given particle's own cell and one (two) neighboring cells in each direction for cell length r_c + s (respectively (r_c + s)/2). In the former case the search volume is 27(r_c + s)³; in the latter it is 125((r_c + s)/2)³, giving a ratio (125/8)/27 ≈ 0.58.) This kernel is called with one thread per particle, since that is generally most efficient at larger sizes, which is also when the linear method of neighbor-list generation becomes relevant. It is conceivable that some gain at intermediate sizes could be achieved by implementing a t_p > 1 version of the kernel, but this has not been tried yet.

In this neighbor-list method the information about whether to rebuild the neighbor-list must be communicated to the host because several kernels and Thrust functions must be called (the use of Dynamic Parallelism, available since CUDA 5.0, could change this, but has not been tried). Thus the updateRequired flag is not used in the kernel because the kernel only runs at all if a rebuild is required; the flag is simply set to zero at the end by the thread handling particle 0.

8 Benchmarks and performance analysis

Figure 6: The Lennard-Jones LAMMPS benchmark: a melting FCC crystal is simulated at constant energy. LAMMPS results are downloaded from the LAMMPS webpage (see text). The vertical axis shows the number of simulated time steps per second of wall-clock time. At large system sizes all codes follow the ideal 1/N scaling, and the GPUs are 10-20 times faster than LAMMPS running on 12 Xeon cores. For RUMD good scaling is maintained down to quite small systems, N ∼ 2000, and at small system sizes RUMD is thus considerably faster than the three GPU versions of LAMMPS.

To illustrate the performance of RUMD, we first compare to the "Lennard-Jones" benchmark results published on the LAMMPS homepage (http://lammps.sandia.gov/bench.html); the comparison is shown in Fig. 6. The GPU versions of LAMMPS are efficient at large N. In this regime near perfect scaling with N is observed, and the GPU versions are 10-20 times faster than LAMMPS running on 12 Intel Xeon cores. At smaller system sizes the near perfect scaling breaks down: for two of the GPU versions of LAMMPS (the red and blue curves) running a simulation with 2000 particles takes as much time as one with 20000 particles; clearly the GPU hardware is under-utilized. In fact, for these two implementations it is faster to use the pure CPU version of LAMMPS at the smallest system sizes. RUMD, on the other hand, maintains reasonable (though not perfect) scaling down to around N = 2000. We have included even smaller system sizes, to illustrate that RUMD eventually also begins to struggle to utilize the hardware at very small system sizes.

Table 2 gives the parameters chosen by the autotuner, as a function of system size.
Figure 7: Analyzing the effect of optimization features in RUMD. The upper panel shows the performance of the full RUMD and three other versions in which one feature has been disabled: multiple threads per particle (t_p > 1), use of the read-only data-cache to read positions (__ldg()), and pre-fetching. The lower panel shows the same data in terms of the relative boost in performance that each feature gives, as a function of system size.

Table 2: Performance parameters chosen by the autotuner and the resulting TPS (Time Steps Per Second) on a Gtx780Ti card (5.0 TFlops sp, 336 GB/s).

  N        NB   p_b   t_p   skin    TPS
  512      N²   16    14    0.452   31201
  1024     N²   16    10    0.5     26729
  2048     N²   48    8     0.611   20504
  4096     N²   32    4     0.746   11222
  8192     N²   192   2     0.824   6292
  16384    N    192   1     0.5     4772
  32768    N    192   1     0.5     2710
  65536    N    128   1     0.452   1458
  131072   N    192   1     0.409   771
  262144   N    128   1     0.409   382
  524288   N    128   1     0.370   194
  1048576  N    128   1     0.370   100
  2097152  N    96    1     0.335   50

Except for the two smallest system sizes, the autotuner chooses the total number of threads (N × t_p) to be at least 16000. This illustrates the point made in the introduction, that the number of threads should be much larger than the number of physical cores (here 2880) to get good performance. The reason that fewer threads are used for the two smallest system sizes is probably that the required large t_p values inflict too large a penalty due to the sequential summation of the t_p different contributions to the force (see Fig. 3). The switch between the two methods for neighbor-list generation happens at around 8000 particles. In this range of system sizes both methods are sub-optimal and the autotuner compensates by increasing the skin size to make neighbor-list updates less frequent.

Figure 7 shows the effect of disabling different optimization features. The upper panel shows time steps per second like Fig. 6, but with different curves representing different disabled features (the black curve is with all features enabled). The most dramatic difference is when t_p = 1 is enforced, for small and medium systems. No difference is observed at larger N because there t_p = 1 is the optimal choice, see Table 2. Disabling the use of the read-only data-cache gives the green curve, a significant drop in performance across all sizes except the very smallest, while disabling pre-fetching gives a slight drop, more at larger sizes. The lower panel of Fig. 7 shows the same data, but plotted as the ratio of the speed of the full RUMD to that of RUMD with the given feature disabled. Plotting this ratio, on a linear scale, shows the relative effects more clearly. In particular, reading via the read-only data-cache gives an effect of order 40%, while pre-fetching has an effect of order 10% at the large sizes.

Table 3 gives results for RUMD running on a GTX 1080 card of the newer Pascal architecture. This card has a single precision theoretical peak-performance of 8.2 TFlops (8.9 TFlops in 'Boost' mode), and a memory bandwidth of 320 GB/s. These numbers are, respectively, 64% higher and 5% lower than the corresponding numbers for the Gtx 780Ti card. Table 3 shows speed-ups in the range 13 to 56%. The largest speed-ups are found at N = 2048 and N = 4096, where the N² NB-list method is chosen; this method has more calculations per read from memory (see section 6), so the increased computational speed without increase in memory bandwidth can better be utilized.
It should be noted that cards of the Pascal architecture with higher memory bandwidth have recently been released; our results suggest that these should perform even better at the large system sizes.

Table 3: Performance parameters chosen by the autotuner and the resulting TPS (Timesteps Per Second) on a Gtx1080 card (8.2 TFlops sp, 320 GB/s). The last column is the speed-up compared to the Gtx780Ti card (Table 2).

  N        NB   p_b   t_p   skin    TPS     SpeedUp
  512      N²   16    4     0.335   35354   1.13
  1024     N²   16    5     0.370   34226   1.28
  2048     N²   64    8     0.5     31454   1.53
  4096     N²   32    4     0.55    17508   1.56
  8192     N    32    1     0.611   8824    1.40
  16384    N    128   1     0.452   6617    1.33
  32768    N    32    1     0.553   3075    1.13
  65536    N    128   1     0.452   1798    1.23
  131072   N    192   1     0.409   948     1.23
  262144   N    192   1     0.409   495     1.30
  524288   N    128   1     0.370   239     1.23
  1048576  N    192   1     0.370   126     1.26
  2097152  N    192   1     0.370   58      1.16

9 Electrostatics

A general purpose MD code should include electrostatic (Coulomb) interactions, and these should be sufficiently accurate and computationally efficient. The smooth particle mesh Ewald method [22, 58-60], which can efficiently handle the long range part of the electrostatic interactions, is planned, but for our needs so far we have found it sufficient to use Coulomb forces with a large shifted-force cut-off as documented in Ref. [61]. In that paper it was shown that a shifted-force cutoff of order five inter-particle spacings gives, similar to the Wolf [57] method, results in excellent agreement with Ewald-based methods in bulk systems.

To benchmark the performance of Coulomb interactions, we performed simulations of a model molten salt in RUMD and the CPU version of LAMMPS. All particles have identical Lennard-Jones parameters ε and σ. The charges are ±√(4πε₀εσ) (50% each). The density is 0.3677σ⁻³ and the temperature 2ε/k_B. The density is the same as was used by Hansen and McDonald in their study of a similar model salt [62]. In Ref. [61] it was shown that a cutoff of 6.0σ, corresponding to roughly 330 neighbors per particle at this density, was sufficient to get satisfactory accuracy. The time step is 0.004√(mσ²/ε).

The data in Table 4 compare RUMD and the CPU-version of LAMMPS with different methods of evaluating Coulomb interactions. The benchmarks are expressed as MATS (million atom time steps per second) for ease of comparing different system sizes. The smallest speed-up of RUMD over LAMMPS is here a factor of two. This smaller speed-up compared to the previous section is primarily due to the LAMMPS benchmarks being run on a faster CPU system, a Dell 630 server with 36 Xeon cores. For comparison, at the time of writing the cost of the Dell 630 server is roughly 10,000$, whereas the cost of the GTX 780 Ti card is around 500$ (to this should be added the cost of a fairly standard PC, which can hold three GPU cards).
Table 4: Benchmarks for a system of charged Lennard-Jones particles (see text for details) for LAMMPS (CPU) and RUMD (Gtx 780Ti). The metric shown is MATS (millions of atom time steps per second). LAMMPS benchmarks were performed on a Dell 630 server with dual Intel Xeon E5-2699 v3 2.3 GHz CPUs for a total of 36 cores. Coulomb interactions were evaluated using a simple shifted-potential cutoff at distance 6.0 ("SP"), using the Wolf method [57] with the switching parameter α set to zero ("WOLF", equivalent to the shifted-force method), and the Particle-Particle Particle-Mesh method ("PPPM"). RUMD benchmarks were performed on an NVIDIA GTX 780 Ti, using a shifted-force cut-off (SF) at distance 6.0.

  Method        MATS (three system sizes, smallest to largest N)
  LAMMPS SP     3.26    5.66    6.06
  LAMMPS WOLF   2.21    3.57    3.40
  LAMMPS PPPM   0.54    0.40    0.35
  RUMD SF       9.41    11.3    16.1

10 Summary

We have described the RUMD software package for molecular dynamics simulation on GPUs, concentrating on the optimization strategies that distinguish it from most other GPU MD codes. We have documented its strong performance at small and medium system sizes and its performance comparable to other GPU-based MD codes at larger sizes. Work will continue on RUMD both with regard to features and optimization opportunities. The ability to split a simulation over multiple GPUs will also be considered, which will not just allow larger systems, but also even faster simulations of medium systems, given that RUMD already makes good use of the hardware for such sizes.

11 Appendix: The autotuner

Here we describe the algorithm used by the autotuner, which optimizes the choice of neighbor-list algorithm, the neighbor-list skin size, and the way the generation of the neighbor list and the calculation of non-bonding forces are distributed among the GPU threads (p_b and t_p). The basic strategy is to run a series of short simulations (a few hundred to a few thousand time steps) varying the different parameters, to find a set of parameters giving (close to) optimal performance. Not all possible combinations of parameters are attempted, in order to reduce the time taken for tuning (for Lennard-Jones-type systems without molecules or Coulomb interactions the autotuner process takes under a minute for small systems, several minutes for larger systems). The initial state of the system is stored so that all comparisons made by the autotuner involve runs of the same length starting from the same configuration.

If the autotuner is not used, RUMD uses default values which depend on the system size: the "n2" neighbor-list method (section 6) below a certain system size, otherwise "sort" (section 7); p_b = 32, t_p = 4 for smaller systems, otherwise p_b = 64, t_p = 1. The default skin value is 0.5, which assumes units of length such that the interparticle spacing is of order unity. In principle the default skin should be based on the interparticle spacing (e.g. ρ⁻¹ᐟ³/2), but in practice length units in MD are generally of order the interparticle spacing and the autotuner can quickly deal with a discrepancy. For some very small systems with not too small a cutoff it can be faster not to use a neighbor-list at all; the autotuner checks this possibility for sufficiently small systems.

The dependence of performance on the parameters p_b, t_p and neighbor-list skin is simple: the time taken shows a single minimum as a function of the parameter. This allows a relatively straightforward optimization strategy to be used. The number of steps run for the different stages depends on the system size (larger for smaller system sizes to get better timing) and can be altered by the user but should not need to be. The overall strategy is as follows:

1. Run some steps before tuning (default: 10000) to avoid the influence of transient effects (associated for example with having changed the temperature).

2. Run with default parameters to get a baseline performance.

3. Phase I optimization: With the default p_b and t_p, run with the different neighbor-list methods, "none", "n2", "sort". For each one the skin is optimized separately.
4. Phase II optimization: For the fastest neighbor-list method, and other methods within 20% of the fastest, optimize the parameters p_b and t_p using a double loop: first p_b, considering the values 16, 32, 48, 64, 96, 128 and 192. For each p_b, values of t_p are tested starting from 1 and increasing until 64. For each combination of p_b and t_p the skin length is re-optimized starting at the last identified optimal value.

5. If the neighbor-list method "n2" is included in phase II, it can still help to sort the particles once every few hundred time steps, typically for system sizes near the crossover from "n2" being optimal to "sort" being optimal. This is checked and the optimal sorting interval found.

6. If more than one neighbor-list method was optimized in phase II, make a final comparison between the phase II-optimized methods to choose the overall optimized set of parameters (except close to the cross-over from one method to another, the phase II optimization does not change which method is chosen, and in that case the difference is small anyway).

7. Run, using the optimized parameters, the same number of steps as for the baseline run to determine the overall improvement due to tuning.

8. Write the tuned parameters to a file so that repeating the simulation in the same directory with the same "user parameters" does not require re-tuning. For this purpose, "same user parameters" means: same number of particles of each type, same density and temperature and potential parameters (within a tolerance), same integrator type and timestep (within tolerance), and same GPU-type. The actual configuration does not have to be the same. If there is any doubt about re-using the previously found parameters, the file can just be deleted.

Some further details are noted here:

• The skin optimization starts from the default value (phase I) or the previously identified phase I-optimal value (phase II). Its value is increased and decreased in steps of 20% (phase I) or 10% (phase II) until a minimum is identified in the time taken. Attempting to optimize the skin to a precision of better than 10% is not worth the effort.

• The loop over t_p breaks out when one of the following three conditions is met: (i) the time taken exceeds the minimum time so far three times in a row; (ii) the time taken exceeds the minimum time so far by 5%; (iii) some GPU resource-limit is exceeded, either the number of threads per block (p_b t_p) or the total register count per block.

• The loop over p_b breaks out when the time taken (having optimized t_p and skin) exceeds the previous best by 10% or more.

• For very large systems it does not make sense to use anything other than the "sort" method for the neighbor-list; above certain system sizes the autotuner therefore omits "none" and "n2". Moreover, large systems generally require larger p_b, and so above certain system sizes the autotuner omits p_b = 16 and p_b = 32. Also for the largest systems only t_p = 1 is relevant; the autotuner then omits checking other values.

Acknowledgements

A large part of this work was sponsored by the Danish National Research Foundation's grant DNRF61.

References

[1] M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids, Oxford University Press (1987).

[2] D. Frenkel and B. Smit, Understanding Molecular Simulation: From Algorithms to Applications, Academic Press (1996).

[3] RUMD software is freely available at http://rumd.org.
[4] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krueger, A. E. Lefohn and T. J. Purcell, A survey of general-purpose computation on graphics hardware, Comput. Graph. Forum (1), 80 (2007), doi:10.1111/j.1467-8659.2007.01012.x.

[5] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone and J. C. Phillips, GPU computing, Proc. IEEE (5), 879 (2008), doi:10.1109/JPROC.2008.917757.

[6] J. E. Stone, D. J. Hardy, I. S. Ufimtsev and K. Schulten, GPU-accelerated molecular modeling coming of age, J. Mol. Graphics Modell. (2), 116 (2010), doi:10.1016/j.jmgm.2010.06.010.

[7] J. Nickolls and W. J. Dally, The GPU Computing Era, IEEE Micro (2), 56 (2010).

[8] R. M. Farber, Topical perspective on massive threading and parallelism, J. Mol. Graph. Model., 82 (2011), doi:10.1016/j.jmgm.2011.06.007.

[9] A. Harju, T. Siro, F. F. Canova, S. Hakala and T. Ratalaiho, Computational physics on graphics processing units, In: Manninen P., Oster P. (eds) Applied Parallel and Scientific Computing. PARA 2012. Lecture Notes in Computer Science, vol 7782. Springer, Berlin, Heidelberg (2013).

[10] M. J. Harvey and G. De Fabritiis, A survey of computational molecular science using graphics processing units, Wiley Interdiscip. Rev.-Comput. Mol. Sci. (5), 734 (2012), doi:10.1002/wcms.1101.

[11] J. A. Anderson, C. D. Lorenz and A. Travesset, General purpose molecular dynamics simulations fully implemented on graphics processing units, J. Comput. Phys. (10), 5342 (2008), doi:10.1016/j.jcp.2008.01.047.

[12] J. A. Anderson and A. Travesset, Molecular Dynamics on Graphic Processing Units: HOOMD to the rescue, Comput. Sci. Eng. (6), 8+ (2008).

[13] P. K. Jha, R. Sknepnek, G. I. Guerrero-Garcia and M. O. de la Cruz, A Graphics Processing Unit Implementation of Coulomb Interaction in Molecular Dynamics, J. Chem. Theory Comput. (10), 3058 (2010), doi:10.1021/ct100365c.

[14] T. D. Nguyen, C. L. Phillips, J. A. Anderson and S. C. Glotzer, Rigid body constraints realized in massively-parallel molecular dynamics on graphics processing units, Comput. Phys. Commun. (11), 2307 (2011), doi:10.1016/j.cpc.2011.06.005.

[15] M. J. Harvey, G. Giupponi and G. De Fabritiis, ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale, J. Chem. Theory Comput. (6), 1632 (2009), doi:10.1021/ct9000685.

[16] P. Eastman and V. S. Pande, Efficient Nonbonded Interactions for Molecular Dynamics on a Graphics Processing Unit, J. Comput. Chem. (6), 1268 (2010), doi:10.1002/jcc.21413.

[17] P. Eastman, M. S. Friedrichs, J. D. Chodera, R. J. Radmer, C. M. Bruns, J. P. Ku, K. A. Beauchamp, T. J. Lane, L.-P. Wang, D. Shukla, T. Tye, M. Houston et al., OpenMM 4: A Reusable, Extensible, Hardware Independent Library for High Performance Molecular Simulation, J. Chem. Theory Comput. (1), 461 (2013), doi:10.1021/ct300857j.

[18] P. H. Colberg and F. Höfling, Highly accelerated simulations of glassy dynamics using GPUs: Caveats on limited floating-point precision, Comput. Phys. Commun. (5), 1120 (2011), doi:10.1016/j.cpc.2011.01.009.

[19] J. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. Skeel, L. Kale and K. Schulten, Scalable molecular dynamics with NAMD, J. Comput. Chem. (16), 1781 (2005), doi:10.1002/jcc.20289.

[20] W. M. Brown, P. Wang, S. J. Plimpton and A. N. Tharrington, Implementing molecular dynamics on hybrid high performance computers - short range forces, Comput. Phys. Commun. (4), 898 (2011), doi:10.1016/j.cpc.2010.12.021.
[21] R. Salomon-Ferrer, D. A. Case and R. C. Walker, An overview of the Amber biomolecular simulation package, WIREs Comput. Mol. Sci. (2012), doi:10.1002/wcms.1121.

[22] R. Salomon-Ferrer, A. W. Goetz, D. Poole, S. Le Grand and R. C. Walker, Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh Ewald, J. Chem. Theory Comput., 3878 (2013), doi:10.1021/ct400314y.

[23] S. Páll, M. J. Abraham, C. Kutzner, B. Hess and E. Lindahl, Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS, LNCS, 3 (2015), Proceedings of EASC 2014.

[24] J. Jung, A. Naurse, C. Kobayashi and Y. Sugita, Graphics Processing Unit Acceleration and Parallelization of GENESIS for Large-Scale Molecular Dynamics Simulations, J. Chem. Theory Comput., 4947 (2016), doi:10.1021/acs.jctc.6b00241.

[25] J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco and K. Schulten, Accelerating molecular modeling applications with graphics processors, J. Comput. Chem. (16), 2618 (2007), doi:10.1002/jcc.20829.

[26] S. Guo-Liang, W. Jing-Wei, L. Zhen-Hua, W. Wen-Ning and F. Kang-Nian, Molecular Dynamics Simulation Using Graphics Processing Units, Chem. J. Chinese U. (12), 2425 (2008).

[27] W. Liu, B. Schmidt, G. Voss and W. Müller-Wittig, Accelerating molecular dynamics simulations using Graphics Processing Units with CUDA, Comput. Phys. Commun. (9), 634 (2008), doi:10.1016/j.cpc.2008.05.008.

[28] J. A. van Meel, A. Arnold, D. Frenkel, S. F. P. Zwart and R. G. Belleman, Harvesting graphics power for MD simulations, Mol. Simul. (3), 259 (2008), doi:10.1080/08927020701744295.

[29] M. S. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. Legrand, A. L. Beberg, D. L. Ensign, C. M. Bruns and V. S. Pande, Accelerating Molecular Dynamic Simulation on Graphics Processing Units, J. Comput. Chem. (6), 864 (2009), doi:10.1002/jcc.21209.

[30] J. Xu, Y. Ren, W. Ge, X. Yu, X. Yang and J. Li, Molecular dynamics simulation of macromolecules using graphics processing unit, Mol. Simul. (14), 1131 (2010), doi:10.1080/08927022.2010.506512.

[31] H. J. Myung, R. Sakamaki, K. J. Oh, T. Narumi, K. Yasuoka and S. Lee, Accelerating Molecular Dynamics Simulation Using Graphics Processing Unit, Bull. Korean Chem. Soc. (12), 3639 (2010), doi:10.5012/bkcs.2010.31.12.3639.

[32] J. A. Baker and J. D. Hirst, Molecular Dynamics Simulations Using Graphics Processing Units, Mol. Inf. (6-7), 498 (2011), doi:10.1002/minf.201100042.

[33] D. C. Rapaport, Enhanced molecular dynamics performance with a programmable graphics processor, Comput. Phys. Commun. (4), 926 (2011), doi:10.1016/j.cpc.2010.12.029.

[34] A. P. Ruymgaart, A. E. Cardenas and R. Elber, MOIL-opt: Energy-Conserving Molecular Dynamics on a GPU/CPU System, J. Chem. Theory Comput. (10), 3072 (2011), doi:10.1021/ct200360f.

[35] K. Oguchi, Y. Shibuta and T. Suzuki, Accelerating Molecular Dynamics Simulation Performed on GPU, J. Jpn. I. Met. (7), 462 (2012).

[36] S. Páll and B. Hess, A flexible algorithm for calculating pair interactions on SIMD architectures, Comput. Phys. Commun. (12), 2641 (2013), doi:10.1016/j.cpc.2013.06.003.

[37] D. B. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann (2010).

[38] NVIDIA, NVIDIA CUDA Compute Unified Device Architecture Programming Guide, NVIDIA, Santa Clara, CA, USA, Version 6.5 (2014).

[39] NVIDIA, NVIDIA's next generation CUDA compute architecture: Kepler GK110, White paper. Available online, Version 1.0, 24 pp. (2012).
[40] NVIDIA, NVIDIA Tesla P100. The Most Advanced Datacenter Accelerator Ever Built, Featuring Pascal GP100, the World's Fastest GPU, Whitepaper. Available online, 45 pp. (2016).

[41] S. Karmakar, C. Dasgupta and S. Sastry, Growing length and time scales in glass-forming liquids, PNAS, 3675 (2009), doi:10.1073/pnas.0811082106.

[42] M. Bernaschi, M. Bisson and M. Fatica, Colloquium: Large scale simulations on GPU clusters, Eur. Phys. J. B (6), 158 (2015), doi:10.1140/epjb/e2015-60180-8.

[43] Z. Fan, T. Siro and A. Harju, Accelerated molecular dynamics force evaluation on graphics processing units for thermal conductivity calculations, Comput. Phys. Commun., 1414 (2013), doi:10.1016/j.cpc.2013.01.008.

[44] J. Glaser, T. D. Nguyen, J. A. Anderson, P. Lui, F. Spiga, J. A. Millan, D. C. Morse and S. C. Glotzer, Strong scaling of general-purpose molecular dynamics simulations on GPUs, Comput. Phys. Commun., 97 (2015), doi:10.1016/j.cpc.2015.02.028.

[45] S. M. A. Malek, R. K. Bowles, I. Saika-Voivod, F. Sciortino and P. H. Poole, "Swarm relaxation": Equilibrating a large ensemble of computer simulations (2017), arXiv:1710.10622.

[46] S. Toxvaerd and J. C. Dyre, Shifted forces in molecular dynamics, J. Chem. Phys., 081102 (2011), doi:10.1063/1.3558787.

[47] S. Toxvaerd, O. J. Heilmann, T. S. Ingebrigtsen, T. B. Schrøder and J. C. Dyre, Time-reversible molecular dynamics algorithms with bond constraints, J. Chem. Phys., 064102 (2009), doi:10.1063/1.3194785.

[48] T. S. Ingebrigtsen and J. C. Dyre, NVU dynamics. III. Simulating molecules at constant potential energy, J. Chem. Phys., 244101 (2012), doi:10.1063/1.4768957.

[49] B. M. Axilrod and E. Teller, Interaction of the van der Waals Type Between Three Atoms, J. Chem. Phys., 299 (1943), doi:10.1063/1.1723844.

[50] K. W. Jacobsen, P. Stoltze and J. K. Nørskov, A semi-empirical effective medium theory for metals and alloys, Surf. Sci., 394 (1996), doi:10.1016/0039-6028(96)00816-3.

[51] G. J. Martyna, D. J. Tobias and M. L. Klein, Constant pressure molecular dynamics algorithms, J. Chem. Phys. (5), 4177 (1994), doi:10.1063/1.467468.

[52] T. S. Ingebrigtsen, S. Toxvaerd, O. J. Heilmann, T. B. Schrøder and J. C. Dyre, NVU dynamics. I. Geodesic motion on the constant-potential-energy hypersurface, J. Chem. Phys., 104101 (2011), doi:10.1063/1.3623585.

[53] T. S. Ingebrigtsen, S. Toxvaerd, T. B. Schrøder and J. C. Dyre, NVU dynamics. II. Comparing to four other dynamics, J. Chem. Phys., 104102 (2011), doi:10.1063/1.3623586.

[54] S. Plimpton, Fast parallel algorithms for short-range molecular-dynamics, J. Comp. Phys., 1 (1995), doi:10.1006/jcph.1995.1039.

[55] L. Rovigatti, P. Šulc, I. Z. Reguly and F. Romano, A comparison between parallelization approaches in molecular dynamics simulations on GPUs, J. Comp. Chem., 1 (2015), doi:10.1002/jcc.23763.

[56] Thrust parallel algorithms library, http://thrust.github.io (2015).

[57] D. Wolf, P. Keblinski, S. R. Phillpot and J. Eggebrecht, Exact method for the simulation of Coulombic systems by spherically truncated, pairwise r(-1) summation, J. Chem. Phys. (17), 8254 (1999), doi:10.1063/1.478738.

[58] U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen, A Smooth Particle Mesh Ewald Method, J. Chem. Phys. (19), 8577 (1995), doi:10.1063/1.470117.

[59] M. J. Harvey and G. De Fabritiis, An Implementation of the Smooth Particle Mesh Ewald Method on GPU Hardware, J. Chem. Theory Comput., 2371 (2009), doi:10.1021/ct900275y.
[60] N. Ganesan, B. A. Bauer, T. R. Lucas, S. Patel and M. Taufer, Structural, Dynamic, and Electrostatic Properties of Fully Hydrated DMPC Bilayers From Molecular Dynamics Simulations Accelerated with Graphical Processing Units (GPUs), J. Comput. Chem. (14), 2958 (2011), doi:10.1002/jcc.21871.

[61] J. S. Hansen, T. B. Schrøder and J. C. Dyre, Simplistic Coulomb Forces in Molecular Dynamics: Comparing the Wolf and Shifted-Force Approximations, J. Phys. Chem. B (19), 5738 (2012), doi:10.1021/jp300750g.

[62] J. P. Hansen and I. R. McDonald, Statistical mechanics of dense ionized matter. IV. Density and charge fluctuations in a simple molten salt, Phys. Rev. A 11.