Heterogeneous Parallelization and Acceleration of Molecular Dynamics Simulations in GROMACS
Szilárd Páll, Artem Zhmurov, Paul Bauer, Mark Abraham, Magnus Lundborg, Alan Gray, Berk Hess, Erik Lindahl
Swedish e-Science Research Center, PDC Center for High Performance Computing, KTH Royal Institute of Technology, 100 44 Stockholm, Sweden
Department of Applied Physics, Swedish e-Science Research Center, Science for Life Laboratory, KTH Royal Institute of Technology, Box 1031, 171 21 Solna, Sweden
ERCO Pharma AB, Stockholm, Sweden
NVIDIA Corporation, UK
Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 171 21 Solna, Sweden
a) To whom correspondence should be addressed: [email protected], [email protected]
(Dated: June 17, 2020)
The introduction of accelerator devices such as graphics processing units (GPUs) has had profound impact on molecular dynamics simulations and has enabled order-of-magnitude performance advances using commodity hardware. To fully reap these benefits, it has been necessary to reformulate some of the most fundamental algorithms, including the Verlet list, pair searching and cut-offs. Here, we present the heterogeneous parallelization and acceleration design of molecular dynamics implemented in the GROMACS codebase over the last decade. The setup involves a general cluster-based approach to pair lists and non-bonded pair interactions that utilizes both GPUs and CPU SIMD acceleration efficiently, including the ability to load-balance tasks between CPUs and GPUs. The algorithm work efficiency is tuned for each type of hardware, and to use accelerators more efficiently we introduce dual pair lists with rolling pruning updates. Combined with new direct GPU–GPU communication as well as GPU integration, this enables excellent performance from single-GPU simulations through strong scaling across multiple GPUs and efficient multi-node parallelization.
I. INTRODUCTION
Molecular dynamics (MD) simulation has had tremendous success in a number of application areas over the last two decades, in part due to hardware improvements that have enabled studies of systems and timescales that were previously not feasible. These advances have also made it possible to introduce better algorithms, and longer simulations have enabled more accurate calibration of force fields against experimental data, all of which has contributed to increasing trust in computational studies. However, the high computational cost of evaluating forces between all particles, combined with integration over femtosecond-scale time steps, means that even on HPC systems a high rate of producing trajectories is imperative to sample dynamics covering adequate timescales; cost-efficiency and throughput are therefore of paramount importance.

The design choices in GROMACS are guided by a bottom-up approach to parallelization and optimization, partly due to the code's roots of high performance on cost-efficient hardware. This is not without challenges; good arguments can be made for focusing either top-down on scaling or just sticking to single-GPU simulations. However, by employing state-of-the-art algorithms and efficient parallel implementations, the code is able to target hardware and efficiently parallelize from the lowest level of SIMD (single instruction, multiple data) vector units to multiple cores and caches, accelerators, and distributed-memory HPC resources.

We believe this approach makes great use of limited compute resources to improve research productivity, and it is increasingly enabling higher absolute performance on any given resource. Exploiting low-level parallelism can be tedious and has often been avoided in favor of using more hardware to achieve the desired time-to-solution. However, the evolution of hardware is making this trade-off increasingly difficult. The end of microprocessor frequency scaling and the consequent increase in hardware parallelism means that targeting all levels of parallelism is a necessity rather than an option. The MD community has been at the forefront of investing in this direction, and our early work on scalable algorithms, fine-grained parallelism and low-level portable parallelization abstractions have been previous steps on this path.

Accelerators such as GPUs are expected to provide the majority of raw FLOPS in upcoming Exascale machines. However, the impact of GPUs can also be seen in low- to mid-range capacity computing, especially in fields like MD that have been able to utilize the high instruction throughput as well as single precision capabilities; this has had particularly high impact in making consumer GPU hardware available for scientific computing.

While algorithms with large amounts of fine-grained parallelism are well-suited to GPUs, tasks with little parallelism or irregular data access are better suited to CPU architectures. Accelerators have become increasingly flexible, but still require host systems equipped with a CPU. While there has been some convergence of architectures, the difference between latency- and throughput-optimized functional units is fundamental, and utilizing each of them for the tasks at which they are best suited requires heterogeneous parallelization. This typically employs the CPU also for scheduling work, transferring data and launching computation on the accelerator, as well as for inter- and intra-node communication.
Accelerator tasks are launched asynchronously using APIs such as CUDA, OpenCL or SYCL to allow concurrent CPU–GPU execution. The heterogeneous parallelization model adds complexity which comes at a cost, both in terms of hardware (latency, complex topology) and programmability, but it provides flexibility (every single algorithm does not need to be implemented on the accelerator) and opportunities for higher performance. Heterogeneous systems are evolving fast with very tightly coupled compute units, but the heterogeneity in HPC will remain and is likely best addressed explicitly.

Our first GPU support was introduced in GROMACS 4.5 and relied on homogeneous acceleration by implementing the entire MD calculation on the GPU. The same approach has been used by several codes (e.g. ACEMD, AMBER, HOOMD-blue, FENZI, DESMOND-GPU) and has the benefit that it keeps the GPU busy, avoiding communication as long as scaling is not a concern. However, this first approach also had shortcomings: only algorithms ported to the GPU can be used in simulations, which limits applicability in large community codes. Implementing the full set of MD algorithms on all accelerator frameworks is not practical due to porting and maintenance concerns. In addition, our experience showed that outperforming the highly optimized CPU code in GROMACS by only relying on GPUs was difficult, especially in parallel runs where the CPU-accelerated code excels. To make use of GPUs without giving up feature support, while providing speedup to as many simulation use-cases as possible, utilizing both CPU and GPU in heterogeneous parallelization was necessary.

Heterogeneous offload is employed by several MD codes (NAMD, LAMMPS, CHARMM, or GENESIS). However, here too the GROMACS CPU performance provided a challenge: since the tuned CPU SIMD kernels are already capable of achieving iteration rates around 1 ms per step without GPU acceleration, the relative speedup of adding an accelerator was less impressive at the time.

To address this, we started from scratch by recasting algorithms into a future-proof form to exploit both GPUs and CPUs (including multiple devices) to provide very high absolute performance while supporting virtually all features no matter what hardware is available. The pair-interaction calculation was redesigned with a cluster pair algorithm that fits modern architectures and replaces the traditional Verlet list based on particles.
Figure 1. Structure of an MD step. There is concurrency available for parallelization both for different force terms and within tasks (horizontal bars). The inner (grey) loop only computes forces and integrates, while the less frequent outer loop (blue dashed) involves tasks to decompose the problem.

Clusters are optimized to fit the hardware, and the classical cut-off setup has evolved into accuracy-based approaches for simulation settings to allow multi-level load balancing and on-the-fly tuning based on system properties. Together with CPU SIMD parallelization and multithreading, this has allowed efficient offloading of short-range non-bonded calculations to GPUs and brought major gains in performance.

New algorithms and the heterogeneous acceleration framework have made it possible to track the shift towards dense heterogeneous machines and to balance CPU/GPU utilization by offloading more work to powerful accelerators. The most recent release has almost come full circle to allow offloading full MD steps, but this version also supports most features of the MD engine by utilizing both CPUs and GPUs, it targets multiple accelerator architectures, and it provides scaling both across multiple accelerator devices and multiple nodes. This bottom-up heterogeneous acceleration approach provides flexibility, portability and performance for a wide range of target architectures, ranging from laptops to supercomputers and from CPU-only machines to dense multi-GPU clusters. Below we present the algorithms and implementations that have enabled it.
II. COMPUTATIONAL CHALLENGES IN MD SIMULATIONS
The core of classical MD is the time-evolution of particle systems by numerically integrating Newton's equations of motion. This requires calculating forces for every time step, which is the main computational cost. While this can be parallelized, the integration step is inherently iterative.

The total force on each particle involves multiple terms: non-bonded pair interactions (typically Lennard-Jones and electrostatics), bonded interactions (e.g. bonds, angles, torsions), and possibly terms like restraints or external forces. Given particle coordinates, each term can be computed independently (Fig. 1).

While there are good examples of MD applications with gigantic systems, the most common approach is to keep the simulation size fixed and small to improve absolute performance. Improving the time-to-solution thus requires strong scaling. Historically, virtually all time was spent computing forces, which made it straightforward to parallelize, but well-optimized MD engines now routinely achieve step iteration rates at or below a millisecond. Thus, the MD problem is increasingly becoming latency-sensitive, where synchronization, bandwidth and latency both within nodes and over the network are becoming major challenges for homogeneous as well as heterogeneous parallelization due to Amdahl's law. This has partly been compensated for by the increasing density of HPC resources, where jobs rely more on intra-node communication than on the inter-node network. This in turn has enabled a shift from coarse parallelism using MPI and domain decomposition to finer-grained concurrency within force calculation tasks, which is better suited for multicore and accelerator architectures.

Dense multi-GPU servers often make it possible for simulations to remain on a single node, but as the performance has improved, the previously fast intra-node communication has become the new bottleneck compared to the fast synchronization within a single accelerator or CPU. As a side-effect, simulations that a decade ago required extreme HPC resources are now routinely performed on single nodes (often with amazingly cost-efficient consumer hardware). This has commoditized MD simulations, but to explore biological events we depend on advancing absolute performance such that individual simulations cover dynamics in the hundreds of microseconds, which can then be combined with ensemble-level parallelism to sample multi-millisecond processes.

In the mid 2000s, processors hit a frequency wall and the increases in transistor count were instead used for more functional units, leading to multi- and manycore designs. Specialization has enabled improvements such as wider and more feature-rich SIMD units as well as application-specific instructions, and recent GPUs have also brought compute-oriented application-specific acceleration features. Compute units however need to be fed data, but memory subsystems and interconnects have not shown similar improvement. Instead, there has been an increasing discrepancy between the speed of computation and data movement. The arithmetic intensity per memory bandwidth required to fully utilize compute units has increased threefold for CPUs in the last decade, and nearly tenfold for GPUs. In particular, most accelerators still rely on communication over the slowly evolving PCIe bus, while their peak FLOP rate has increased five times and GROMACS compute kernel performance grew by up to an order of magnitude.
This imbalance has made heterogeneous acceleration and overlapping compute and communication increasingly difficult. This is partly being addressed through tighter host–accelerator integration, with NVIDIA NVLink among the first (other technologies include Intel CXL or AMD Infinity Fabric), and this trend is likely to continue.

MD as a field was established in an era where the all-important goal was to save arithmetic operations, which is even reflected in functional forms such as the Lennard-Jones potential (the power-12 repulsion is used as the square of the power-6 dispersion instead of an expensive exponential). However, algorithm design and parallelization is shifting from saving FLOPS to efficient data layout, reducing and optimizing data movement, overlapping communication and computation, or simply recomputing data instead of storing and reloading. This shifts the burden of extracting performance to the software, including authors of compilers, libraries and applications, but it pays off with significant performance improvements, and the resulting surplus of FLOPS suddenly available will enable the introduction of more accurate functional forms without excessive cost.

III. PARALLELIZATION OF MD IN GROMACS

A. The structure of the MD algorithm
The force terms computed in MD are independent and expose task parallelism within each MD step. Force tasks typically only depend on positions from the previous step and other constant data, although domain decomposition (DD) introduces an additional dependency on the communication of particle coordinates from other nodes. The per-step concurrency in computing forces (Fig. 1) is a central aspect in offload-based parallel implementations. The reduction to sum forces for integration acts as a barrier; only when new coordinates become available can the next iteration start. The domain decomposition algorithm on the other hand exposes coarse-grained data parallelism through a spatial decomposition of the particle system. Within each domain, finer-grained data parallelism is also available (in particular in non-bonded pair interactions), but to improve absolute performance it is the total step iteration rate in this high-level flowchart that has to be reduced to the order of 100 µs.
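To make the per-step task structure of Fig. 1 concrete, the sketch below outlines an MD step loop in C++-style pseudocode. All types and function names are hypothetical placeholders rather than the actual GROMACS interfaces, and the force tasks are shown as plain sequential calls; the point is only where concurrency is available and where the reduction forces a synchronization.

    #include <cstddef>
    #include <vector>

    struct State { std::vector<float> x, v, f; };   // coordinates, velocities, forces (flattened)
    using Forces = std::vector<float>;

    // Hypothetical task implementations; in a real engine each is a parallel kernel.
    void   domainDecomposition(State&) {}
    void   pairSearch(State&) {}
    Forces computeNonbondedForces(const State& s) { return Forces(s.f.size(), 0.0f); }
    Forces computePmeForces(const State& s)       { return Forces(s.f.size(), 0.0f); }
    Forces computeBondedForces(const State& s)    { return Forces(s.f.size(), 0.0f); }
    void   integrate(State&) {}

    void reduceForces(Forces& total, const Forces& a, const Forces& b, const Forces& c)
    {
        for (std::size_t i = 0; i < total.size(); ++i) { total[i] = a[i] + b[i] + c[i]; }
    }

    void mdLoop(State& s, int numSteps, int searchInterval)
    {
        for (int step = 0; step < numSteps; ++step)
        {
            if (step % searchInterval == 0)
            {
                domainDecomposition(s);   // infrequent outer-loop tasks (dashed loop in Fig. 1)
                pairSearch(s);
            }
            // Force tasks only read the previous-step coordinates and are independent,
            // so they can execute concurrently on different compute units.
            Forces fNonbonded = computeNonbondedForces(s);
            Forces fPme       = computePmeForces(s);
            Forces fBonded    = computeBondedForces(s);
            // The force reduction is the per-step barrier: only after it can integration
            // produce the coordinates that the next iteration depends on.
            reduceForces(s.f, fNonbonded, fPme, fBonded);
            integrate(s);
        }
    }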
B. Multi-level parallelism
Modern hardware exposes multiple levels of parallelism characterized by the type and speed of data access and communication between compute units. Hierarchical memory as well as intra- and inter-node interconnects facilitate handling data close to compute units. Targeting each level of parallelism has been increasingly important on petascale architectures, and GROMACS does so to improve performance.

On the lowest level, SIMD units of CPUs offer fine-grained data-parallel execution. Similarly, modern GPUs rely on SIMD-like execution called SIMT (single instruction, multiple thread), where a group of threads executes in lockstep (width typically 32–64). CPUs have multiple cores, frequently with multiple hardware threads per core. Similarly, GPUs contain groups of execution units (multiprocessors/compute units), but unlike on CPUs, distributing work across these cannot be controlled directly, which poses load balancing challenges. On the node level, multiple CPUs communicate through the system bus. Accelerators are attached to the host CPU or other GPUs using a dedicated bus. These CPU–GPU and GPU–GPU buses add complexity in heterogeneous systems and, together with the separate global memory, represent some of the main challenges in a heterogeneous setup. Finally, the network is essentially a third-level bus for inter-node communication. An important concern is not only fast access on each level, but also non-uniform memory access (NUMA): moving data between compute units has non-uniform cost. This also applies to intra-node buses, as each accelerator is typically connected only to one NUMA domain, not to mention inter-node interconnects, where the topology can have a large impact on communication latency.

The original GROMACS approach was largely focused on efficient use of low- to medium-scale resources, in particular commodity hardware, through highly tuned assembly (and later SIMD) kernels. The original MPI- (and PVM-) based scaling was less impressive, but in version 4.0 this was replaced with a state-of-the-art neutral-territory domain decomposition combined with fully flexible 3D dynamic load balancing (DLB) of triclinic domains. This is combined with a high-level task decomposition that dedicates a subset of MPI ranks to long-range PME electrostatics to reduce the cost of collective communication required by the 3D FFTs, which means multiple-program, multiple-data (MPMD) parallelization. Domain decomposition was initially also used for intra-node parallelism using MPI as well as our own thread-based MPI library implementation. Since the DD algorithm ensures data locality, this has been a surprisingly good fit to NUMA architectures, but it comes with challenges related to exposing finer-grain parallelism across cores and limits the ability to make use of efficient data exchange with shared caches. Algorithmic limitations (minimum domain size) also restrict the amount of parallelism that can be exposed in this manner. While the design had served well, significant extensions were required in order to target manycore and heterogeneous GPU-accelerated architectures.

The multilevel heterogeneous parallelization was born from a redesign that extended the parallelization to separately target each level of hardware parallelism, first introduced in version 4.6.
New algorithms and programming models have been adopted to expose parallelism with finer granularity. Our first major change was to redesign the pair-interaction calculation to provide a flexible and future-proof algorithm with accuracy-based settings and load balancing capabilities, which can target either wide SIMD or GPU architectures. On the CPU front, SIMD parallelism is used for most major time-consuming parts of the code. This was necessitated by Amdahl's law: as the performance of non-bonded kernels and PME improved, previously insignificant components such as integration turned into new bottlenecks. This was made fully portable by the introduction of the GROMACS SIMD abstraction layer, which started as a replacement of raw assembly with intrinsics and now supports a range of CPU architectures using 14 different SIMD instruction sets, with additional ones in development.

To utilize both CPUs and GPUs, intra-node parallelization was extended with an accelerator offload layer and multithreading. The offload layer schedules GPU tasks and data movement to ensure concurrent CPU–GPU execution, and it has evolved as more offload abilities were added. OpenMP multithreading was first introduced to improve PME scaling, and later extended to the entire MD engine. To allow assembling larger units of computation for GPUs, we increased the size of MPI tasks and have them run across multiple cores instead of dispatching work from a large number of MPI ranks per node.
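The idea behind the SIMD abstraction layer mentioned above can be illustrated with the minimal sketch below. The SimdReal type and its operations are hypothetical stand-ins rather than the actual GROMACS SIMD API, shown here with a trivial scalar fallback; a real backend would map the same operations onto SSE, AVX, NEON or other intrinsics so that kernel code written once runs at any supported width.

    #include <cstddef>

    // Hypothetical width-agnostic SIMD type with a scalar fallback implementation.
    struct SimdReal
    {
        static constexpr int width = 1;   // e.g. 8 for AVX, 16 for AVX-512 single precision
        float v;
    };
    inline SimdReal load(const float* p)              { return { *p }; }
    inline void     store(float* p, SimdReal a)       { *p = a.v; }
    inline SimdReal operator*(SimdReal a, SimdReal b) { return { a.v * b.v }; }

    // Kernel code is written once against the abstraction; here a simple scaling loop
    // that processes SimdReal::width elements per iteration (remainder handling omitted).
    void scaleVector(float* y, const float* a, float factor, std::size_t n)
    {
        SimdReal s = { factor };
        for (std::size_t i = 0; i + SimdReal::width <= n; i += SimdReal::width)
        {
            store(y + i, load(a + i) * s);
        }
    }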
Figure 2. Execution flows of (A) single-CPU and flavors of the CPU–GPU heterogeneous offload designs. Incremental task offloading is shown for (B) short-range non-bonded interactions (green), (C) PME (orange) and dynamic list pruning (blue cross-hatch), (D) bonded interactions (purple), and (E) integration/constraints (gray). With asynchronous offload, force reduction requires resolving dependencies. The explicit ones are enforced through synchronization (CPU) or asynchronous events (GPU), illustrated by black arrows. Implicit synchronization is fulfilled by the sequential dependencies between tasks on the CPU timeline, as well as those on the same GPU timeline (corresponding to in-order streams).

This avoids bottlenecks in scheduling and execution of small GPU tasks. Multithreading algorithms have also gone through several generations with a focus on data locality, cache optimizations, and load balancing, improving scalability to larger thread counts. Hardware topology detection based on the hwloc library is used to guide automated thread affinity setting, and NUMA considerations are taken into account when placing threads.

For single-node CPU-only parallelism, execution is still done with sequentially dependent tasks (Fig. 2A), which allows relying on implicit dependencies and sharing output across force calculations. Expressing the concurrency of Fig. 1 to allow parallel execution over multiple cores and a GPU has required explicitly expressing dependencies. The new design uses multi-threading and heterogeneous extensions for handling force accumulation and reduction. On the CPU, per-thread force accumulation buffers are used with cache-efficient sparse reduction instead of atomic operations. This is important e.g. for bonded interactions, where a thread typically only contributes forces to a small fraction of particles in a domain. When combined with accelerators, force tasks can be assigned to either CPU or GPU, with additional remote force contributions received over MPI. With forces distributed in CPU and GPU memories, we use a new reduction tree to combine all contributions. Explicit dependencies of this reduction for the single-GPU case are indicated by black arrows in Fig. 2. Fulfilling dependencies may require CPU–GPU transfers to the compute unit that does the reduction, and the heterogeneous schedule is optimized to ensure these overlap with computation (Fig. 2, panels C and D).
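A minimal sketch of the per-thread force-buffer idea is shown below as OpenMP-style host code; the names and the flat force array are hypothetical simplifications, not the actual GROMACS implementation, and a production version would restrict the reduction to the atom ranges each thread actually touched.

    #include <cstddef>
    #include <utility>
    #include <vector>
    #include <omp.h>

    // Accumulate scattered force contributions (atom index, force value) into per-thread
    // buffers, then reduce the buffers; no atomic operations are needed in either phase.
    void accumulateForces(std::vector<float>& f,
                          const std::vector<std::pair<int, float>>& contributions)
    {
        const int numThreads = omp_get_max_threads();
        std::vector<std::vector<float>> threadBuf(numThreads, std::vector<float>(f.size(), 0.0f));

        #pragma omp parallel
        {
            std::vector<float>& buf = threadBuf[omp_get_thread_num()];
            // Phase 1: each thread writes only to its own buffer.
            #pragma omp for
            for (long i = 0; i < static_cast<long>(contributions.size()); ++i)
            {
                buf[contributions[i].first] += contributions[i].second;
            }
            // Phase 2: each atom is summed over all thread buffers by exactly one thread.
            #pragma omp for
            for (long a = 0; a < static_cast<long>(f.size()); ++a)
            {
                for (int t = 0; t < numThreads; ++t) { f[a] += threadBuf[t][a]; }
            }
        }
    }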
IV. HETEROGENEOUS PARALLELIZATION

Asynchronous offloading in GROMACS is implemented using either the CUDA or OpenCL APIs, and has two main functionalities: explicit control of CPU–GPU data movement as well as asynchronous scheduling of concurrent task execution and synchronization. This design aims to maximize CPU–GPU execution overlap, to reduce the number of transfers by moving data early and keeping it on the accelerator as long as possible, to ensure transfers overlap with computation, and to optimize task scheduling for the critical path to reduce the time per step.
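The basic offload pattern can be sketched as follows in CUDA; the kernel and buffer names are hypothetical and the host buffers are assumed to be pinned so the copies are truly asynchronous. Coordinates are moved to the device early, the force kernel is launched asynchronously so CPU work can proceed, and an event marks when the forces have arrived back on the host.

    #include <cuda_runtime.h>

    __global__ void nonbondedKernel(const float4* x, float3* f, int n)
    {
        /* force kernel body omitted in this sketch */
    }

    void offloadedForceStep(const float4* h_x, float3* h_f,    // pinned host buffers (hypothetical)
                            float4* d_x, float3* d_f, int n,
                            cudaStream_t stream, cudaEvent_t forcesReady)
    {
        // Move coordinates to the GPU as early as possible, asynchronously.
        cudaMemcpyAsync(d_x, h_x, n * sizeof(float4), cudaMemcpyHostToDevice, stream);

        // Launch the force task; the call returns immediately so CPU force tasks
        // (e.g. bonded interactions) can run concurrently with the GPU.
        nonbondedKernel<<<(n + 127) / 128, 128, 0, stream>>>(d_x, d_f, n);

        // Copy the forces back and record the event that the CPU-side force
        // reduction will wait on.
        cudaMemcpyAsync(h_f, d_f, n * sizeof(float3), cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(forcesReady, stream);

        /* ... CPU force tasks execute here, overlapping with the GPU stream ... */

        cudaEventSynchronize(forcesReady);   // resolve the dependency before reduction
    }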
A. Offloading force computation
GROMACS initially chose to offload the non-bonded pair interactions to the GPU, overlapping with PME and bonded interactions being evaluated on the CPU (Fig. 2B). While this approach requires CPU resources, it has the advantage of supporting domain decomposition as well as all functionality, since any special algorithm can execute on the CPU. When combined with DD, interactions with particles not local to the domain depend on halo exchange. This is handled by splitting non-bonded work into two kernels: one for local-only interactions and one for interactions that involve non-local particles. Based on co-design with NVIDIA, stream priority bits were introduced in the GPU hardware and exposed in CUDA. This made it possible to launch non-local work in a high-priority stream to preempt the local kernel and return remote forces early, while the local kernel execution can overlap with communication. Currently only a single priority bit is available, but increasing this should facilitate additional offloading; this is a less complex solution than e.g. a persistent kernel with dynamic workload-switching. The intra-node load balancing, together with the control and data flow of the heterogeneous setup with short-range force offload (non-bonded and bonded), is illustrated in Fig. 3.

The gradual shift in CPU–GPU performance balance in heterogeneous systems brought the need for offloading further force tasks to avoid the CPU becoming a bottleneck or, from a cost perspective, not needing expensive CPUs. Consequently, we added offload of PME long-range electrostatics both in CUDA and OpenCL. PME is offloaded to a separate stream (Fig. 2C) to allow overlap with short-range interactions. The main challenge arises from the two 3D FFTs required. Simulations rely on small grids (typical dimensions 32–256), which has not been an optimization target in GPU FFT libraries, so scaling FFT to multiple GPUs would often not provide meaningful benefits. However, we can still allow multi-GPU scaling by reusing our MPMD approach and placing the entire PME execution on a specific GPU. We have also developed a hybrid PME offload to allow back-offloading the FFTs to the CPU, while the rest of the work is done on the GPU. This is particularly useful for legacy GPU architectures where FFT performance can be low. Additionally, it could be beneficial for strong scaling on next-generation machines with high-bandwidth CPU–GPU interconnects to allow grid transfer overlap and to exploit well-optimized parallel CPU 3D FFT implementations. The last force task to be offloaded was the bonded interactions. Without DD, this is executed on the same stream as the short-range non-bonded task (Fig. 2D). Both tasks consume the same non-bonded layout-optimized coordinates and share the force output buffer. With DD, the bonded task is scheduled on the non-local stream (Fig. 3), as it is often small and not split by locality.

The force offload design inherently requires data transfer to/from the GPU every step, copying coordinates to, and forces from, the GPU prior to force reduction (black boxes in Fig. 2B–E), followed by integration on the CPU. With accelerator-heavy systems this can render an offload-based setup CPU-limited. In addition, GPU compute to PCIe bandwidth is also imbalanced. High-performance interconnects are not common, and typically used only for GPU–GPU communication. This disadvantages the offload design as it leaves the GPU idle for part of each step, although this can partly be compensated for with the pair list pruning described in the algorithms section.
Pipelining force computation, transfer and integration, or using intra-domain force decomposition, can reduce the CPU bottlenecks. However, slow CPU–GPU transfers are harder to address by overlapping, since computation is faster than data movement. For this reason our recent efforts have aimed to increasingly eliminate CPU–GPU data movement and rely on direct GPU communication for scaling.
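The local/non-local split with stream priorities described above can be sketched as follows in CUDA; the kernel names and the layout placing non-local forces after local ones are hypothetical. The non-local kernel runs in a high-priority stream so its results can be returned and communicated early, while the local kernel keeps the GPU busy and overlaps with the halo exchange.

    #include <cuda_runtime.h>

    __global__ void nonbondedLocalKernel(const float4* x, float3* f, int nLocal)       { /* ... */ }
    __global__ void nonbondedNonlocalKernel(const float4* x, float3* f, int nNonlocal) { /* ... */ }

    void launchSplitNonbonded(const float4* d_x, float3* d_f, int nLocal, int nNonlocal,
                              float3* h_fNonlocal /* pinned host buffer for MPI */)
    {
        int leastPriority = 0, greatestPriority = 0;
        cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

        cudaStream_t localStream, nonlocalStream;
        cudaStreamCreateWithPriority(&localStream,    cudaStreamNonBlocking, leastPriority);
        cudaStreamCreateWithPriority(&nonlocalStream, cudaStreamNonBlocking, greatestPriority);

        // The local kernel is launched first and fills the device ...
        nonbondedLocalKernel<<<(nLocal + 127) / 128, 128, 0, localStream>>>(d_x, d_f, nLocal);
        // ... but the high-priority non-local kernel preempts it, so remote force
        // contributions become available as early as possible.
        nonbondedNonlocalKernel<<<(nNonlocal + 127) / 128, 128, 0, nonlocalStream>>>(d_x, d_f, nNonlocal);

        // Return the non-local forces (assumed stored after the local ones) for the MPI
        // halo exchange while the local kernel continues executing.
        cudaMemcpyAsync(h_fNonlocal, d_f + nLocal, nNonlocal * sizeof(float3),
                        cudaMemcpyDeviceToHost, nonlocalStream);
        cudaStreamSynchronize(nonlocalStream);
        /* MPI send of h_fNonlocal proceeds here, overlapping with localStream */
    }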
B. Offloading complete MD iterations

To avoid the data transfer overhead, GROMACS now supports executing the entire innermost iteration, including integration, on accelerators (Fig. 2E). This can fully remove the CPU from the critical path, and it reduces the number of synchronization events. At the same time, the CPU is employed for pair search and domain decomposition (done infrequently), and special algorithms can use the now free CPU resources during the GPU step. In addition to the force tasks performed on the GPU, the data ownership for the particle coordinates, velocities and forces is now also moved to the GPU. This allows shifting the previous parallelization trade-off and minimizing GPU idle time. The inner MD loop however still supports forces that are computed on the CPU, and often the cycle of copying the data from/to the GPU and evaluating these forces on the CPU takes less time than the force tasks assigned to the GPU (Fig. 2E). The CPU can now be considered a supporting device to evaluate forces not implemented on the GPU (e.g. CMAP corrections), or those not well-suited for GPU evaluation (e.g. pulling forces acting on a single atom). This keeps the GPU code base to a minimum and balances the load by assigning a different set of tasks to the GPU depending on the simulation setup and hardware configuration. This is highly beneficial both for high-throughput and multi-simulation experiments on GPU-dense compute resources, and for upgrading old systems with a high-end GPU. At the same time, it also allows making efficient use of communication directly between GPUs, including dedicated high-bandwidth/low-latency interconnects where available.
Figure 3. Multi-domain heterogeneous control and data flow with short-range non-bonded and bonded tasks offloaded. Horizontal lines indicate CPU/GPU timelines with inner MD steps (grey) and pair-search/DD (dashed blue). Data transfers are indicated by vertical arrows (solid for CPU–GPU, dashed for MPI; H2D is host-to-device, D2H device-to-host). The area enclosed in green is concurrent CPU–GPU execution, while the red indicates no overlap (pair search and DD). Task load balancing is used to increase CPU–GPU overlap (dotted arrows) by shifting work between PME and short-range non-bonded tasks (A) and balancing CPU-based pair search/DD with list pruning (B).

In our most recent implementation, data movement can be automatically routed directly between GPUs instead of staging communication through CPU memory. When a CUDA-aware MPI library is used, communication operates directly on GPU memory spaces. Our own thread-MPI implementation relies on direct CUDA copies. Additionally, by exchanging CUDA events, it can use stream synchronization across devices, which allows fully asynchronous communication offload, leaving the CPU free from both compute and coordination/wait. The external MPI implementation requires additional CPU–GPU synchronization prior to communication, but allows the new functionality to be used across multiple nodes. Much of the GPU–GPU communication, either between short-range tasks or between short-range and PME tasks, is of a halo-exchange nature where non-contiguous coordinates and forces are exchanged, which requires GPU buffer packing and unpacking operations. In particular for this, keeping the outer loop of domain decomposition and pair search on the CPU turns out to be a clear advantage, since the index map building is a somewhat complex random-access operation, but once complete the data is moved to the accelerator and reused across multiple simulation time steps.
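The buffer packing and direct GPU communication described above can be sketched as follows, assuming a CUDA-aware MPI implementation and hypothetical names. The index map built on the CPU during domain decomposition is reused on the GPU every step to gather the non-contiguous halo coordinates into a contiguous send buffer, which is then handed to MPI as a device pointer.

    #include <cuda_runtime.h>
    #include <mpi.h>

    // Gather halo coordinates into a contiguous send buffer using the CPU-built index map.
    __global__ void packHaloCoordinates(const float4* x, const int* indexMap,
                                        float4* sendBuf, int nSend)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nSend)
        {
            sendBuf[i] = x[indexMap[i]];
        }
    }

    void haloExchangeCoordinates(const float4* d_x, const int* d_indexMap,
                                 float4* d_sendBuf, float4* d_recvBuf,
                                 int nSend, int nRecv, int sendRank, int recvRank,
                                 cudaStream_t stream, MPI_Comm comm)
    {
        packHaloCoordinates<<<(nSend + 127) / 128, 128, 0, stream>>>(d_x, d_indexMap, d_sendBuf, nSend);
        cudaStreamSynchronize(stream);   // with an external MPI, sync before communicating

        // With CUDA-aware MPI the device pointers are passed directly, so the data is
        // not staged through CPU memory (a float4 is communicated as 4 floats).
        MPI_Sendrecv(d_sendBuf, 4 * nSend, MPI_FLOAT, sendRank, 0,
                     d_recvBuf, 4 * nRecv, MPI_FLOAT, recvRank, 0,
                     comm, MPI_STATUS_IGNORE);
    }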
V. ALGORITHM DETAILS

A. The cluster pair algorithm
The Verlet list and linked cell list algorithms for finding particles in spatial proximity and constructing lists of short-range neighbors were some of the first algorithms in the field and are cornerstones of MD. However, while the Verlet list exposes a high degree of parallelism, its traditional formulation expresses this in an irregular form, which is ill-suited for SIMD-like architectures. To reduce the execution imbalance due to varying list lengths, padding or binning have been used. However, the community has largely converged on reformulating the problem by grouping interactions into fixed-size work units instead.

A common approach is to assign different i-particles to each SIMT thread, requiring a separate j-particle to be loaded for each pair interaction. This leads to memory access divergence, which becomes a bottleneck in SIMD-style implementations, even with arithmetically intensive interactions. Sorting particles to increase spatial locality for better caching improves performance, but the inherently scattered access is still inefficient. Early efforts used GPU textures to improve data reuse, but this is hard to control as the effectiveness depends on the size of the j-lists relative to cache.

Given the need for increasing the arithmetic-to-memory operation ratio, we formulated an algorithm that regularizes the problem and increases data reuse. The goal is to load j-particle data as efficiently and rarely as possible and reuse it for multiple i-particles, roughly similar to blocking algorithms in matrix–matrix multiplication. Our cluster pair algorithm uses a fixed number of particles per cluster, and pairs of such clusters rather than individual particles are the unit of computing short-range interactions. Hence, we compute interactions between i-clusters of N particles and a list of j-clusters each of M particles. M is adjusted to the SIMD width, while N allows balancing data reuse with register usage. In addition to a data layout that allows efficient access, and the fact that N × M interactions are calculated for every load/store, the algorithm is easy to adapt to new architectures or SIMD widths. Since the algorithmic efficiency will be higher for smaller clusters, we can also place two sets of N i-cluster particles in a wide SIMD register of length 2M, which our SIMD layer supports on all hardware where sub-register load/store operations are efficient. The clusters and pair list are obtained during search using a regular grid in the x and y dimensions, but binning the resulting z columns into cells with a fixed number of particles (in contrast to fixed-size cells) that define our clusters (Fig. 4, left). The irregular 3D grid is then used to build the cluster pair list.

Following the approach used for CPU SIMD, choosing M to match the GPU execution width may seem suitable. However, as GPUs typically have large execution width, the resulting cluster size would greatly increase the fraction of zero interactions evaluated. The raw FLOP-rate would be high, but efficiency low. To avoid this, our GPU algorithm calculates interactions between all pairs of an i–j cluster pair, instead of assigning the same i- and different j-particles to each GPU hardware thread. Hence, we adjust N × M to the GPU execution width. However, evaluating all N × M interactions of a cluster pair in parallel would eliminate the i-particle data reuse. The arithmetic intensity required to saturate modern processors is quite similar across the board, so restoring data reuse is imperative.
We achieve this by introducing a super-cluster grouping (Fig. 4). A joint pair list is built for the super-cluster as the union of j-clusters that fall in the interaction sphere of any i-cluster. This improves arithmetic saturation at the cost of some pairs in the list not containing interacting particles, since not all j-clusters are shared by all i-clusters. To minimize this overhead, the super-clusters are kept as compact as possible, and the search is optimized to obtain close to cubic cluster geometry; we use an eight-way grouping obtained from a 2 × 2 × 2 grouping of cells on the search grid. The joint j-list would lead to substantial overhead if interactions were computed with all clusters, similarly to the challenge with large regular tiling. We avoid this elegantly by skipping cluster pairs with no interacting particles based on an interaction bitmask stored in the pair list. This is illustrated in Fig. 4, where lighter-shaded j-clusters do not interact with some of the i-clusters; e.g. jm can be skipped for all but one i-cluster. As M × N is adjusted to match the execution width, the interaction masks allow efficient divergence-free skipping of entire cluster pairs. Our organization of the pair-interaction calculation is similar to that used by others, with the key difference that those approaches rely on larger fixed-size tiles and use other techniques to reduce the impact of large grouping.

The interaction mask describes a j–i relationship, swapping the order of the standard i–j formulation. Consequently, the loop order is also swapped, and our GPU implementation uses an outer loop over the joint j-cluster list and an inner loop over the eight i-clusters. This has two main benefits: First, 8 bits per j-cluster is sufficient to encode the interaction mask, instead of needing a variable-length structure. Second, the force reduction becomes more efficient. Since we utilize Newton's third law to only calculate interactions once, we need to reduce forces both for i- and j-particles. At the end of an outer j iteration, all interactions of the j-particles loaded will have been computed and the results can be reduced and stored. At the same time, accumulating the i-particle partial forces requires only a small set of accumulators (8 × N in total). Exclusions are also represented per j-cluster as bitmasks and enforced simultaneously with the interaction cut-off, just as the interaction masks. In our typical target systems, on average approximately four of the eight i-clusters contain interactions with j-particles. Hence, about half of the inner loop checks result in skips, and we have observed these to cost 8–12%, which is a rough estimate of the super-cluster overhead. In comparison, the normal cut-off check has at most 5–10% cost in the CUDA implementation.

For PME simulations, we calculate the real-space Ewald correction term in the kernel. On early GPU architectures (and CPUs with low FMA FLOPS) a tabulated F · r is most efficient. On all recent architectures, we have instead developed an analytical function approximation of the correction force. This yields better performance as it relies on FMA arithmetic, despite the >15% increase in kernel instruction count.

Multiprocessor-level parallelism is provided by assigning each thread block a list element that computes interactions of all particles in a super-cluster. To avoid conditionals, separate kernels are used for different combinations of electrostatics and Lennard-Jones interactions and cut-offs, and whether energy is required or not. The cluster algorithm has been implemented both in CUDA and OpenCL and tuned for multiple GPU architectures. On NVIDIA and recent AMD GPUs with 32-wide execution we use an 8 × 4 cluster setup, on AMD GPUs with 64-wide execution 8 × 8, and on Intel hardware with 8-wide execution a 4 × 2 setup.
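The structure of such a kernel is sketched below in simplified CUDA. The data layout, names and parameters are hypothetical, only a single Lennard-Jones interaction type is shown, and exclusions, the Ewald correction, shared-memory staging and the j-force reduction are all omitted; the sketch only illustrates the outer loop over the joint j-cluster list, the inner work over the eight i-clusters, and the bitmask-based skipping.

    #include <cuda_runtime.h>

    // Hypothetical pair-list entry: one j-cluster plus the 8-bit mask of which of the
    // eight i-clusters in the super-cluster it interacts with.
    struct JEntry { int jCluster; unsigned int iMask; };

    // One block of 32 threads processes one super-cluster (8 i-clusters x 4 particles);
    // each thread owns one i-particle and accumulates its force in registers.
    __global__ void clusterPairKernel(const float4* x, float3* f,
                                      const int* superClusterAtom,     // atom index per super-cluster slot
                                      const JEntry* jList, const int* jListStart,
                                      float rcut2, float c6, float c12)
    {
        const int slot     = threadIdx.x;        // 0..31 within this super-cluster
        const int iCluster = slot / 4;           // which of the 8 i-clusters the thread belongs to
        const int iAtom    = superClusterAtom[blockIdx.x * 32 + slot];
        const float4 xi    = x[iAtom];
        float3 fi          = make_float3(0.0f, 0.0f, 0.0f);

        // Outer loop over the joint j-cluster list of this super-cluster.
        for (int entry = jListStart[blockIdx.x]; entry < jListStart[blockIdx.x + 1]; ++entry)
        {
            const JEntry je = jList[entry];
            // Skip whole i-/j-cluster pairs: all 4 threads of an i-cluster branch together.
            if ((je.iMask >> iCluster) & 1u)
            {
                for (int jj = 0; jj < 4; ++jj)   // M = 4 particles per j-cluster
                {
                    const float4 xj = x[je.jCluster * 4 + jj];
                    const float dx = xi.x - xj.x, dy = xi.y - xj.y, dz = xi.z - xj.z;
                    const float r2 = dx * dx + dy * dy + dz * dz;
                    if (r2 < rcut2 && r2 > 0.0f)
                    {
                        const float rinv2 = 1.0f / r2;
                        const float rinv6 = rinv2 * rinv2 * rinv2;
                        // Lennard-Jones force divided by r, multiplied onto the distance vector.
                        const float fscal = (12.0f * c12 * rinv6 * rinv6 - 6.0f * c6 * rinv6) * rinv2;
                        fi.x += fscal * dx; fi.y += fscal * dy; fi.z += fscal * dz;
                    }
                }
            }
        }
        // Each atom belongs to exactly one super-cluster, so this write is race-free here;
        // the real kernel also reduces and stores the j-particle forces (Newton's third law).
        f[iAtom].x += fi.x; f[iAtom].y += fi.y; f[iAtom].z += fi.z;
    }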
B. Algorithmic work efficiency and pair-list buffers

Figure 4. Cluster pair setups with four particles per cluster (N = M = 4). Left: the i-cluster (green) with its buffered cut-off (green dashed line); clusters with filled circles have interactions within the buffered cut-off of at least one particle in the i-cluster, while particles in clusters intersected by the buffered cut-off that fall outside of it represent an extra implicit buffer. Right: hierarchical super-clusters on GPUs. Four i-clusters (green, magenta, red, and blue) are grouped into a super-cluster. Dashed lines represent the buffered cut-offs of each i-cluster. Clusters with any particle in any of these regions will be included in the common pair list. Particles of j-clusters in the joint list are illustrated by discs filled in black to gray; black indicates clusters that interact with all four i-clusters, while lighter gray shading indicates a cluster that only interacts with 1–3 i-cluster(s), e.g. jm interacts with only one i-cluster.

Figure 5. Algorithmic work efficiency of the particle-based (1 × 1) and cluster-based (8 × 4) Verlet approaches, defined as the fraction of interactions calculated that are within the actual cut-off, shown as a function of buffer size for three common cut-off distances. The trade-off is that larger cluster sizes enable greater computational efficiency, while increased buffers enable longer reuse of the pair list.

The cluster algorithm trades computing interactions known to evaluate to zero for improved execution efficiency on SIMD-style architectures. To quantify the amount of additional work, we calculate the parallel work efficiency of the algorithm as the fraction of the interactions calculated that are within the actual cut-off; it is worth noting that this metric is below 100% even for the particle-based algorithm, since the pair-list buffer always adds pairs outside the cut-off. Moreover, particles in the pair list that fall outside the buffered cut-off can be made use of: these provide an extra implicit buffer that allows using a shorter explicit buffer when evaluating pair interactions while maintaining the accuracy of the algorithm. One example of a cluster contributing to the implicit buffer is illustrated in Fig. 4. Making use of this implicit buffer increases the practical parallel efficiency relative to the standard particle-based algorithm. With a box of SPC/E water, the particle-based 1 × 1 setup requires a 0.218 nm buffer to reach the same error tolerance as the 8 × 4 cluster setup does with a 0.105 nm buffer.

We believe this is a striking example of the importance of moving to tolerance-based settings instead of rules of thumb or heuristics to control accuracy. All algorithms in a simulation affect the accuracy of the final results, and while the acceptable error varies greatly between problems, it will be dominated by the worst part of the algorithm; there is little benefit from evaluating only some parts more accurately. To control the effect of missing pair interactions close to the cut-off, our implementation estimates the Verlet buffer for a given upper bound for the error in the energy. The estimate is based on the particle masses, temperature, pair interaction functions and constraints, also taking into account the cluster setup and its implicit buffering. Since first introduced, we have refined the buffer estimate to account for constrained atoms rotating around the atom they are constrained to rather than moving linearly, which allows tighter estimates for long list lifetimes. The upper bound for the maximum drift can be provided by the user as a tolerance setting in the simulation input. We use 0.005 kJ/mol/ps per atom as default, but the tolerance and hence the drift can be arbitrarily small. For the default setting, the implicit buffer turns out to be sufficient for a water system or solvated biomolecule with PME electrostatics and 20 fs pair-list update intervals, and no extra explicit buffer is thus required in this case. The actual energy drift caused by these settings is 0.001 kJ/mol/ps per atom, a factor of 5 smaller than the upper bound. For comparison, typical constraint algorithms result in drifts of a similar magnitude, so being significantly more conservative than this will usually not improve the overall error in a simulation.

We see several advantages to this approach, and would argue the field in general should move to requested tolerances instead of heuristic settings. First, the user can set a single parameter that is easier to reason about and that will be valid across systems and temperatures. Second, it will enable innovation in new algorithms that maintain accuracy (instead of performance improvements becoming a race towards the least accurate implementation). Finally, since other parameters can be optimized freely for the input and run conditions, we can make use of this for advanced load balancing to safely deviate from classical setups by relying on maintaining accuracy rather than arbitrary settings, as described below.
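As an illustration of how a requested tolerance can be turned into a buffer length, the sketch below bisects on a monotonically decreasing drift estimate. The estimator here is a crude placeholder; the actual GROMACS estimate is an analytical model based on particle masses, temperature, the pair interaction functions, constraints and the cluster geometry.

    #include <cmath>

    // Placeholder upper bound on the energy drift (kJ/mol/ps per atom) caused by missed
    // pair interactions for a given Verlet buffer; only its monotonic decrease with the
    // buffer length matters for the bisection below.
    double estimateDriftUpperBound(double buffer /* nm */, double listLifetime /* ps */,
                                   double temperature /* K */)
    {
        const double typicalDisplacement = 0.02 * std::sqrt(temperature / 300.0) * listLifetime;
        return std::exp(-buffer / typicalDisplacement);
    }

    // Smallest buffer that satisfies the requested drift tolerance (kJ/mol/ps per atom).
    double bufferForTolerance(double tolerance, double listLifetime, double temperature)
    {
        double lo = 0.0, hi = 1.0;   // nm; the upper bracket is assumed large enough
        for (int iter = 0; iter < 50; ++iter)
        {
            const double mid = 0.5 * (lo + hi);
            if (estimateDriftUpperBound(mid, listLifetime, temperature) > tolerance) { lo = mid; }
            else                                                                     { hi = mid; }
        }
        return hi;   // conservative: the first value known to satisfy the tolerance
    }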
C. Non-bonded pair interaction kernel throughput

The throughput of the cluster-based pair interaction algorithm depends on the number of interactions per particle, hence on particle density. This varies across applications, from coarse-grained to all-atom bio-molecular systems or liquid-crystal simulations. The effective pair throughput (counting only non-zero interactions) is also influenced by the buffer length and the conditionally enforced cut-off in the GPU kernels. When comparing CUDA GPU and AVX512 CPU kernels with same-size clusters (identical work efficiency), the raw throughput of the CPU kernels reaches peak performance already at low pair counts, while the GPU kernels benefit from increasing i-particle data reuse with longer j-lists (Fig. 6). The effective pair throughput shows a monotonic increase, since more pairs will be inside the cut-off with more particles in the interaction range, as expected from Fig. 5.
Figure 6. Non-bonded pair interaction throughput of CUDA GPU and AVX512 CPU kernels as a function of particles in the half cut-off sphere. The raw throughput includes zero interactions, while effective throughput only counts non-zero interactions. Pair ranges typical of all-atom and coarse-grained/gas simulations are indicated. Measurements were done using a 157k particle Lennard-Jones system to minimize input size effects.

For typical all-atom simulations, the effective GPU kernel throughput gets close to 100 Ginteractions/s, while the corresponding throughput on a 20-core CPU is an order of magnitude lower.

To provide good performance for small systems and enable strong scaling, it is important to achieve high kernel efficiency already at limited particle counts. To illustrate performance as a function of system size for all-atom systems, Fig. 7 shows actual pair throughput for SPC/E water systems with 1 nm cut-off, Ewald electrostatics, 40-step search frequency and default tolerances. The GROMACS CPU kernels achieve peak performance already around 3000 atoms (using up to 40 threads), and, apart from the largest devices, within 10% of peak GPU pair throughput is reached around 48k atoms. GPU throughput is up to seven-fold higher at peak than on CPUs and, although it suffers significantly with smaller inputs (up to five-fold lower than peak), for all but the very smallest systems the GPU kernels reach higher absolute throughput. The challenge for small systems is the overhead incurred from kernel invocation and other fixed-cost operations. Pair list balancing also comes at a slight cost, and while it is effective when there are enough lists to split, it is limited by the amount of work relative to the size of the GPU. The sub-10k atom systems simply do not have enough parallelism to execute in a balanced manner on the largest GPUs. In contrast, CPU kernels exhibit a slight decrease for large inputs due to cache effects.

D. The pair list generation algorithm
GROMACS uses a fixed pair list lifetime instead of heuristic updates based on particle displacement, since the Maxwell–Boltzmann distribution of velocities means the likelihood of requesting a pair list update at a given step approaches unity as the size of the system increases. In addition, the accuracy-based buffer estimate allows the pair search frequency to be picked freely (it will only influence the buffer size), unlike the classical approach that requires it to be carefully chosen considering simulation settings and run conditions.
Figure 7. Force-only pair interaction kernel performance as a function of input size. The throughput indicates how CPU kernels have less overhead for small systems, while GPU kernels achieve significantly higher throughput from moderate input sizes. Multiple generations of consumer and professional CPU and GPU hardware are shown (NVIDIA RTX 2080, Quadro GP100, Quadro P6000 and Tesla V100; AMD Radeon Instinct MI50 and Vega FE; Intel Core i9 7920X with 24 threads, AMD Ryzen 9 3900X with 24 threads, and Intel Xeon Gold 6148 with 40 threads); CUDA kernels are used on NVIDIA GPUs, OpenCL on AMD, and AVX2 and AVX512 on the AMD and Intel CPUs, respectively. All cores and threads were used for CPUs.

We decided early to keep the pair search on the CPU. Given the complex algorithms involved, our main reason was to ensure parallelization, portability and ease of maintenance, while getting good performance (reducing GPU idle time) through algorithmic improvements. The search uses a SIMD-optimized implementation and, to further reduce its cost, the hierarchical GPU list is initially built using cluster bounding-box distances, avoiding expensive all-to-all particle distance checks. A particle-pair distance based list pruning is then carried out on the GPU, which eliminates non-interacting cluster pairs. This can also further adapt the setup to the hardware execution width by splitting j-clusters; e.g. for current Intel GPUs the 4 × 4 setup produced by the search is split into 4 × 2 to match the 8-wide execution.
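The bounding-box filter used during search can be sketched as follows in plain C++; the types are hypothetical. The minimum distance between two axis-aligned cluster bounding boxes is a lower bound on any particle-pair distance between the clusters, so a cluster pair can be rejected without an all-to-all particle distance check, with per-particle distances only checked later by the pruning pass on the GPU.

    struct BoundingBox { float lo[3]; float hi[3]; };   // axis-aligned box around one cluster

    // Squared minimum distance between two cluster bounding boxes; zero if they overlap.
    inline float boxDistanceSquared(const BoundingBox& a, const BoundingBox& b)
    {
        float d2 = 0.0f;
        for (int d = 0; d < 3; ++d)
        {
            float gap = 0.0f;
            if      (b.lo[d] > a.hi[d]) { gap = b.lo[d] - a.hi[d]; }
            else if (a.lo[d] > b.hi[d]) { gap = a.lo[d] - b.hi[d]; }
            d2 += gap * gap;
        }
        return d2;
    }

    // A cluster pair enters the (buffered) list only if the boxes can contain a particle
    // pair within the list cut-off rlist.
    inline bool clusterPairInRange(const BoundingBox& a, const BoundingBox& b, float rlist)
    {
        return boxDistanceSquared(a, b) < rlist * rlist;
    }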
E. Dual pair list with dynamic pruning

Domain decomposition and pair list generation rely on irregular data access, and their performance has not improved at the same rate as compute kernels. Trading their cost for more pair interaction work by increasing the search frequency has drawbacks. First, as the buffer increases, the overhead becomes large (Fig. 5). The trade-off is also sensitive to inputs and runtime conditions, with a small optimal window. In order to address this, we have developed an extension to the cluster algorithm with a dual pair list setup that uses a longer outer and a shorter inner list cut-off. The outer list is built very infrequently, while frequent pruning steps produce a pair list based on the inner cut-off, typically with close to zero explicit buffer. As pruning operates on the regularized particle data layout produced by the pair search, it comes at a much lower cost (typically <1% of the total runtime) than using a long buffer-based pair list in evaluating pair interactions. This avoids the previous trade-off and reduces the cost of search and DD without force computation overhead. With GPUs, pruning is done in a rolling fashion, scheduled in chunks between force computations of consecutive MD steps, which allows it to overlap other work such as CPU integration (Fig. 3).
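A host-side sketch of this schedule is given below with hypothetical task launchers; the real implementation issues GPU kernels in separate streams and chooses the chunk count based on the list lifetimes. The outer list is rebuilt rarely, the force kernels always consume a list pruned to the inner cut-off, and the pruning work for the next inner list is spread in chunks over the intervening steps so it overlaps with other work.

    // Hypothetical task launchers standing in for GPU kernels / CPU tasks.
    void buildOuterPairList() {}
    void launchForceKernels() {}
    void launchRollingPruneChunk(int chunk, int numChunks) {}
    void integrateStep() {}

    void runDualListSchedule(int numSteps, int nstlistOuter, int nstlistInner)
    {
        const int pruneChunks = nstlistInner;   // spread pruning over the inner-list lifetime
        for (int step = 0; step < numSteps; ++step)
        {
            if (step % nstlistOuter == 0)
            {
                buildOuterPairList();           // infrequent CPU search with the long outer cut-off
            }
            launchForceKernels();               // consumes the current inner (pruned) list

            // Rolling pruning: one chunk of the outer list is pruned against the inner
            // cut-off each step, in a separate stream, so a fresh inner list is ready
            // when the current one expires; this overlaps with e.g. CPU integration.
            launchRollingPruneChunk(step % pruneChunks, pruneChunks);

            integrateStep();
        }
    }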
F. Multi-level load balancing
Both data and task decomposition contribute to load imbalance on multiple levels of parallelism. Sources of data-parallel imbalance include inherently irregular pair interaction data (varying list sizes), non-uniform particle density (e.g. membrane protein simulations using united-atom lipids) resulting in non-bonded imbalance across domains, and non-homogeneously distributed bonded interactions (solvent does not have as many bonds as a protein). With task parallelism, when using MPMD or GPU offload, task load imbalance is also a source of imbalance between MPI ranks or between CPU and GPU. Certain algorithmic choices like pair search frequency or electrostatics settings can shift load between parts of the computation, whether running in parallel or not.

In particular for small systems it is a challenge to balance work between the compute units of high-performance GPUs. Especially with irregular work, there will be a kernel "tail" where only some compute units are active, which leads to inefficient execution. In addition, with small domains per GPU with DD, there may be too few cluster lists to process in a balanced manner. To reduce this imbalance and the kernel tail it would be desirable to control block scheduling, but this is presently not possible on GPUs. Instead, we tune scheduling indirectly through indexing order. The GPU pair interaction work is reshaped by sorting to avoid long lists being scheduled late. In addition, when the number of pair lists is too low for efficient execution on the given hardware, we heuristically split lists to increase the available parallelism.

Kernel tail effects can also be mitigated by overlapping compute kernels. This requires enough concurrent work available to fill idle GPU cores, and that it is expressed such that parallel execution is possible. GPU APIs do not allow fine-grained control of kernel execution and instead the hardware scheduler decides on an execution strategy. By using multiple streams/queues with asynchronous event-based dependencies, our GPU schedule is optimized to maximize the opportunity for kernel overlap. This reduces the amount of idle GPU resources due to kernel tails as well as those caused by scheduling gaps during a sequence of short dependent kernels (e.g. the 3D-FFT kernels used in PME).
Figure 8. CPU–GPU load balancing between short- and long-range non-bonded force tasks. The PME load balancing seeks to minimize total wall-time, here at 1.226 nm (green dashed circle), by increasing the electrostatics direct-space cut-off while also scaling the PME grid spacing. This shifts load from the CPU PME task (blue dashed) to the non-bonded GPU task (red). System: alcohol dehydrogenase (95k atoms).

With PME electrostatics, the split into real and reciprocal space provides opportunities to rebalance work at constant tolerance by scaling the cut-off together with the PME grid spacing. This was first introduced as part of our MPMD approach with an external tool. This load balancing was later automated and implemented as an online load balancer, originally to allow shifting work from the CPU to the GPU. This approach now works remarkably well also with dedicated PME ranks, both in CPU-only runs and when using multiple GPUs. The load balancer is automatically triggered at startup and scans through cut-off and PME grid combinations (Fig. 8) using CPU cycle counters to find the highest-performance alternative. The load balancer typically converges in a few thousand steps, apart from noisy environments such as multi-node runs with network contention. Significant efforts have been made to ensure the robustness of the algorithm. It accounts for measurement noise, avoids cache effects, mitigates interference from CPU/GPU frequency scaling, and it reduces undesirable interaction with the DD load balancer (which could change the domain size while the cut-off is scaled).

Nevertheless, load balancing comes with a trade-off in terms of increased communication volume. In addition, linearithmic (FFT) or linear (kernel) time-complexity reciprocal-space work is traded for quadratic time-complexity real-space work (Fig. 8). To mitigate waste of energy we impose a cut-off scaling threshold to avoid increasing GPU load in heavily CPU-bound runs. The performance gain from PME load balancing depends on the hardware; with balanced CPU–GPU setups it is up to 25%, but in highly imbalanced cases much larger speedups can be observed.

The dual pair list algorithm allows us to avoid most of the drawbacks of shifting work to direct space, since the list pruning is significantly cheaper than evaluating interactions. This makes the balancing less sensitive and easier to use, and where the previous approach saturated around pair list update intervals of 50–100 steps, the dual list with pruning can allow hundreds of steps, which is particularly useful in reducing CPU load to maximize GPU utilization in runs that offload the entire inner iteration (Fig. 3).

Figure 9. Performance evolution of GROMACS from the first version with heterogeneous parallelism support, on identical hardware. Performance is shown for the 142k atom GluCl benchmark on two hardware configurations with varying CPU–GPU balance, using one Intel Xeon E5 2620v4 CPU and NVIDIA GeForce GTX 1080 / GTX 1080Ti GPUs.

Finally, the domain decomposition achieves load balancing by resizing spatial domains, thereby redistributing particles between domains and shifting work between MPI ranks. With force offload the use of GPUs is largely transparent to the code, but extensions to the DLB algorithm were necessary. Support for timing concurrent GPU tasks is limited, in particular in CUDA. We account for GPU work in DLB through the wall-time the CPU spends waiting for results, labeled accordingly on the CPU timeline in Fig. 3. However, this can introduce jitter when a GPU is assigned to multiple MPI ranks. GPUs are not partitioned across MPI ranks, but work is scheduled in an undefined order and executed until completion (unless preempted).
Hence, the CPU wait can only be systematically measured on some of the ranks sharing a GPU, while not on others. This leads to spurious imbalance, and to avoid it we redistribute the CPU wall-time spent waiting for the GPU evenly across the MPI ranks assigned to the same GPU, to reflect the average GPU load and eliminate execution order bias.
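A minimal sketch of this correction is shown below as MPI host code with a hypothetical setup: ranks that share a physical GPU are grouped into a communicator, and each rank's measured wait time is replaced by the average over that group before it enters the DLB cost model.

    #include <mpi.h>

    // Replace a rank's measured GPU wait time by the average over all ranks on the node
    // that share the same GPU, removing the execution-order bias described above.
    double averagedGpuWaitTime(double myWaitTime, int gpuId, MPI_Comm nodeComm)
    {
        int nodeRank = 0;
        MPI_Comm_rank(nodeComm, &nodeRank);

        // Ranks with the same gpuId end up in the same sub-communicator.
        MPI_Comm gpuComm;
        MPI_Comm_split(nodeComm, gpuId, nodeRank, &gpuComm);

        double sum = 0.0;
        int numRanks = 1;
        MPI_Allreduce(&myWaitTime, &sum, 1, MPI_DOUBLE, MPI_SUM, gpuComm);
        MPI_Comm_size(gpuComm, &numRanks);
        MPI_Comm_free(&gpuComm);

        return sum / numRanks;   // in practice the communicator would be created once and reused
    }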
VI. PERFORMANCE BENCHMARKS
Faster hardware has been a blessing for simulations, but as shown in Fig. 9 the improvements in algorithms and software described here have at least doubled performance on the same hardware, even for older cost-efficient GPUs, and with latest-generation consumer cards the improvement is almost fourfold over the last five years. Given the low-end CPU and high-end GPU combination, new offload modes bring significant performance improvements when offloading either only PME or the entire inner iteration to the accelerator.

Figure 10 shows the impact of different offload setups for single-GPU runs. As expected, the CPU-only run scales with the number of cores. When the non-bonded task is offloaded, the performance increases significantly for both systems, but it does not saturate even when using all cores, indicating the CPU is oversubscribed. This is confirmed by the large jump in performance when the PME task too is offloaded. The GPU now becomes the bottleneck for computations, and the curves saturate when enough CPU cores are used; adding more will not aid performance.
Figure 10. Single-GPU performance for (A) RNase (24k atoms) and (B) the GluCl ion channel (142k atoms), both with the AMBER99 force field. Different offload setups are shown, with the tasks assigned to the GPU listed in the legend. Hardware: AMD Ryzen 3900X, NVIDIA GeForce RTX 2080 Super.

Consequently, when the bonded forces are also offloaded there is a performance regression, in particular for the membrane protein system with its many torsions (Figure 10B). With the GPU force tasks taking longer than the CPU force tasks, the data transfers between host and device no longer overlap effectively with compute. This is solved by offloading the entire innermost MD loop, including coordinate constraining and updating, which yields another significant jump in performance despite the CPU now being mostly idle. To make use of this idle resource, one can move the bonded force evaluation from the GPU back to the CPU; this is beneficial when the entire CPU cycle is faster than the evaluation of the non-bonded and PME forces on the GPU. The cross-over point depends on the system, but it is generally faster to evaluate bonded forces on the CPU, apart from very dense systems or cases where only a single CPU core is available per GPU.
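As a rough illustration of this trade-off (not the heuristic GROMACS itself applies), the C++ sketch below compares a step in which bonded forces are serialized on the GPU with one in which CPU bonded work overlaps the GPU non-bonded and PME work. All task times are assumed inputs, e.g. from per-step timing measurements.

```cpp
#include <algorithm>
#include <cstdio>

struct TaskTimes
{
    double gpuNonbondedPme; // GPU non-bonded + PME force time per step
    double gpuBonded;       // bonded force time if evaluated on the GPU
    double cpuBonded;       // bonded force time if evaluated on the CPU
};

// Returns true when evaluating bonded forces on the CPU gives a shorter step:
// the CPU bonded work is "free" as long as it hides under the concurrent GPU work.
bool bondedOnCpuIsFaster(const TaskTimes& t)
{
    const double stepGpuBonded = t.gpuNonbondedPme + t.gpuBonded;          // serialized on the GPU
    const double stepCpuBonded = std::max(t.gpuNonbondedPme, t.cpuBonded); // CPU work overlaps the GPU
    return stepCpuBonded < stepGpuBonded;
}

int main()
{
    // Plenty of CPU cores per GPU: CPU bonded work hides under the GPU tasks.
    const TaskTimes manyCores{ 2.0, 0.5, 1.2 };
    // A single CPU core per GPU: CPU bonded work becomes the bottleneck.
    const TaskTimes oneCore{ 2.0, 0.5, 3.5 };
    std::printf("many cores: bonded on CPU faster? %d\n", bondedOnCpuIsFaster(manyCores));
    std::printf("one core:   bonded on CPU faster? %d\n", bondedOnCpuIsFaster(oneCore));
    return 0;
}
```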
Figure 11. Cumulative performance of ensemble runs as a function of the number of Accelerated Weight Histogram walkers executed simultaneously, for different offload scenarios. The benchmark system is aquaporin (109992 atoms per ensemble member, CHARMM36 force field). Performance was measured on a dual Intel Xeon E5-2620 v4 server with four NVIDIA GeForce GTX 1080 GPUs (A) and four NVIDIA GeForce RTX 2080 GPUs (B).

Figure 11 shows the cumulative performance of such ensemble runs on a single node equipped with two CPUs and four GPUs. With medium-performance GPUs, the best performance is achieved when the CPU is used for computing the bonded forces, with everything else evaluated on the GPU (Figure 11A), and the most efficient throughput is obtained with four ensemble members running on each GPU. Using the GPU for the full MD loop is still the second-best case when only a single run is executed per GPU, but with many runs per GPU there are more options for overlapping transfer and compute tasks when the CPU handles updates (integration) and/or bonded forces. However, for these somewhat older GPUs the difference between the worst- and best-performing cases is only about 25%. When pairing the same older CPUs with recent GPUs (Figure 11B), the balance changes appreciably, and it is no longer justified to use the CPUs even for the lightest compute tasks: performing all tasks on the GPU more than doubles performance compared to leaving updates and bonded forces on the CPU. Another advantage is that the throughput does not change significantly with more ensemble members per GPU, which allows for greater flexibility, and the absolute performance is always highest when only a single ensemble member is assigned to each GPU.

Finally, the work on direct GPU communication now also enables quite efficient multi-GPU scaling combined with outstanding absolute performance. Fig. 12 illustrates the effect of the direct GPU communication optimizations through results for the satellite tobacco mosaic virus (STMV) benchmark (1M atoms, 2 fs steps) on up to four compute nodes, each equipped with four NVIDIA Tesla V100 GPUs. Intra-node communication uses NVLink and inter-node communication MPI over InfiniBand. We believe this configuration is a good match to emerging next-generation HPC systems, which focus on good balance for mixed workloads. When using reaction-field instead of PME, the scaling is excellent all the way to 16 GPUs. While this is a less common choice for electrostatics, it highlights the efficiency and benefits of the GPU halo exchange combined with GPU update, and shows the performance and scaling possible when avoiding the challenges of small 3D FFTs, the extra communication between direct- and reciprocal-space GPUs, and the task imbalance inherent to PME. When using PME electrostatics, the relative scaling is good up to 8 GPUs (50% efficiency) when the GPU halo exchange is combined with direct GPU PME task communication, and there are again clear benefits from combining this with the GPU update path. Beyond this we are currently limited by the restriction to a single PME GPU when offloading lattice summation, which becomes a bottleneck both in terms of communication and in terms of compute imbalance. Nevertheless, we believe the absolute performance of 55 ns/day is excellent.
Figure 12. Multi-GPU and multi-node scaling performance for the STMV benchmark (1M atoms). Performance is shown for staged communication through the CPU, direct GPU communication, and the additional benefit of GPU integration. Left: when using reaction-field to avoid lattice summation, the scaling is excellent and we achieve iteration rates of around 1 ms of wall-clock time per step over 4 nodes with 4 GPUs each. Right: with PME, the scaling currently becomes limited by the use of a single PME GPU for offloading, but the absolute performance is high (note the different scales). All runs use 1 MPI task per GPU, except the 2-GPU PME runs, which use 4 MPI ranks to improve task balance.

VII. DISCUSSION
Microsecond-scale simulations have not only become routine but eminently approachable with commodity hardware. However, it is only the barrier of entry that has been lowered: bridging the time-scale gap from hundreds of microseconds to milliseconds in single trajectories still requires special-purpose hardware. Nevertheless, general-purpose codes have unique advantages in terms of flexibility, adaptability and portability to new hardware, not to mention that it is relatively straightforward to introduce new special-purpose algorithms, e.g. for including experimental constraints in simulations. There are also great opportunities in using massive-scale resources for efficient ensemble simulations, where the main challenge is cost-efficient generation of trajectories. Hence, we believe that a combination of new algorithms, efficient heterogeneous parallelization and large-scale ensemble algorithms will characterize MD in the Exascale era, and performance advances in the core MD codes will always multiply the advances obtained from new, clever sampling algorithms.

The flexibility of the GROMACS engine comes with challenges both for developers and users. There is a range of options to tune, from algorithmic parameters to parallelization settings, and many of these are related to fundamental shifts towards much more diverse hardware that the entire MD community has to adapt to. Our approach has been to provide a broad set of heuristics-based defaults to ensure good performance out of the box, but by increasingly moving to tolerance-based settings we aim both to improve the quality of simulations and to make life easier for users. Still, efficiently using a multitude of different hardware combinations for either throughput or single long simulations is challenging. As described here, many of the steps have been automated, but dynamically deciding e.g. the algorithm-to-resource mapping on a complex compute node (or which code flavor to use) requires a fully dynamic auto-tuning approach. Developing a robust auto-tuning framework and integrating it with the multi-level load balancers is especially demanding due to the complex feature set and broad use cases of codes such as GROMACS, but it is something we are actively working on.

To reach the performance for which ASICs are needed today, MD engines need to be capable of iteration times below 100 µs.
Such iteration rates are possible with GROMACS for small systems such as the villin headpiece. Reaching this performance for larger systems, such as membrane proteins with hundreds of thousands of atoms, will require a range of further improvements. In terms of parallelization in GROMACS, improving the efficiency of GPU task scheduling, CPU tasking and the overlap of communication with computation will be necessary. When it comes to algorithms, we expect PME to remain the long-range interaction method of choice at low scale, but the limitations of the 3D-FFT many-to-many communication for strong scaling require a new approach. With recent extensions to the fast multipole method, we expect it to become the algorithm of choice for the largest parallel runs. Future technological improvements, including faster interconnects and closer on-chip integration, as well as advances in both traditional and coarse-grained reconfigurable architectures, could allow getting closer to this performance target.

We expect to see several of these advancements in the future. In the meantime, we believe the present GROMACS implementation provides a major step forward in terms of absolute performance as well as relative scaling, by being able to use almost arbitrarily balanced combinations of CPU and GPU functional units. It is our hope that this will help enable a wide range of scientific applications on everything from cost-efficient consumer GPU hardware to large HPC resources.

DATA AVAILABILITY
The data that support the findings of this study are available in the supplementary material and in the following repositories: the methodology, topologies, input data and parameters for the benchmarks are published at https://doi.org/10.5281/zenodo.3893789; the source code for the multi-GPU scaling is available at https://doi.org/10.5281/zenodo.3890246.

ACKNOWLEDGMENTS
This work was supported by the Swedish e-Science Research Center, the BioExcel CoE (H2020-EINFRA-2015-1-675728), the European Research Council (209825, 258980), the Swedish Research Council (2017-04641, 2019-04477), and the Swedish Foundation for Strategic Research. NVIDIA, Intel and AMD are kindly acknowledged for engineering and hardware support. We thank Gaurav Garg (NVIDIA) for CUDA-aware MPI contributions, Aleksei Lupinov and Roland Schulz (Intel) for heterogeneous parallelization work, and Stream HPC for OpenCL contributions; the project would not be possible without all the contributions from the greater GROMACS community.
REFERENCES

C. Kutzner, S. Páll, M. Fechner, A. Esztermann, B. L. de Groot, and H. Grubmüller, "More bang for your buck: Improved use of GPU nodes for GROMACS 2018," Journal of Computational Chemistry, 2418–2431 (2019), 1903.05918.
H. H. Loeffler and M. D. Winn, "Large biomolecular simulation on HPC platforms II. DL POLY, Gromacs, LAMMPS and NAMD," Tech. Rep. (STFC, 2012).
M. Schaffner and L. Benini, "On the Feasibility of FPGA Acceleration of Molecular Dynamics Simulations," Tech. Rep. (ETH Zurich, Integrated Systems Lab IIS, 2018), arXiv:1808.04201.
N. Yoshii, Y. Andoh, K. Fujimoto, H. Kojima, A. Yamada, and S. Okazaki, "MODYLAS: A highly parallelized general-purpose molecular dynamics simulation program," International Journal of Quantum Chemistry (2014).
W. M. Brown, A. Semin, M. Hebenstreit, S. Khvostov, K. Raman, and S. J. Plimpton, "Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency," in International Conference for High Performance Computing, Networking, Storage and Analysis, SC (IEEE Press, 2016), pp. 82–95.
W. McDoniel, M. Höhnerbach, R. Canales, A. E. Ismail, and P. Bientinesi, "LAMMPS' PPPM long-range solver for the second generation Xeon Phi," in Lecture Notes in Computer Science, Vol. 10266 LNCS, edited by J. M. Kunkel, R. Yokota, P. Balaji, and D. Keyes (Springer International Publishing, Cham, 2017), pp. 61–78, 1702.04250.
B. Acun, D. J. Hardy, L. V. Kale, K. Li, J. C. Phillips, and J. E. Stone, "Scalable molecular dynamics with NAMD on the Summit system," IBM Journal of Research and Development, 4:1–4:9 (2018).
B. Hess, C. Kutzner, D. van der Spoel, and E. Lindahl, "GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation," Journal of Chemical Theory and Computation, 435–447 (2008).
S. Páll and B. Hess, "A flexible algorithm for calculating pair interactions on SIMD architectures," Computer Physics Communications, 2641–2650 (2013), 1306.1737.
M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl, "GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers," SoftwareX, 1–7 (2015).
S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R. Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, and E. Lindahl, "GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit," Bioinformatics (Oxford, England), 845–854 (2013).
M. J. Harvey, G. Giupponi, and G. D. Fabritiis, "ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale," Journal of Chemical Theory and Computation, 1632–1639 (2009).
A. W. Götz, M. J. Williamson, D. Xu, D. Poole, S. Le Grand, and R. C. Walker, "Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 1. Generalized Born," Journal of Chemical Theory and Computation, 1542–1555 (2012).
R. Salomon-Ferrer, A. W. Goetz, D. Poole, S. Le Grand, and R. C. Walker, "Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh Ewald," Journal of Chemical Theory and Computation (2013).
J. A. Anderson, C. D. Lorenz, and A. Travesset, "General purpose molecular dynamics simulations fully implemented on graphics processing units," Journal of Computational Physics, 5342–5359 (2008).
N. Ganesan, M. Taufer, B. Bauer, and S. Patel, "FENZI: GPU-Enabled Molecular Dynamics Simulations of Large Membrane Regions Based on the CHARMM Force Field and PME," 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, 472–480 (2011).
M. Bergdorf, S. Baxter, C. A. Rendleman, and D. E. Shaw, "Desmond/GPU Performance as of November 2016," Tech. Rep. (D. E. Shaw Research, 2016).
J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, and K. Schulten, "Accelerating molecular modeling applications with graphics processors," Journal of Computational Chemistry, 2618–2640 (2007).
W. M. Brown, A. Kohlmeyer, S. J. Plimpton, and A. N. Tharrington, "Implementing molecular dynamics on hybrid high performance computers – Particle–particle particle-mesh," Computer Physics Communications, 449–459 (2012).
A.-P. Hynninen and M. F. Crowley, "New faster CHARMM molecular dynamics engine," Journal of Computational Chemistry (2013), 10.1002/jcc.23501.
C. Kobayashi, J. Jung, Y. Matsunaga, T. Mori, T. Ando, K. Tamura, M. Kamiya, and Y. Sugita, "GENESIS 1.1: A hybrid-parallel molecular dynamics simulator with enhanced sampling algorithms on multiple computational platforms," Journal of Computational Chemistry, 2193–2206 (2017).
S. Páll, M. J. Abraham, C. Kutzner, B. Hess, and E. Lindahl, "Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS," in Solving Software Challenges for Exascale, Lecture Notes in Computer Science, Vol. 8759, edited by S. Markidis and E. Laure (Springer, Cham, 2015), pp. 3–27.
C. Kutzner, R. Apostolov, and B. Hess, "Scaling of the GROMACS 4.6 molecular dynamics code on SuperMUC," in International Conference on Parallel Computing – ParCo2013 (2013).
J. R. Perilla and K. Schulten, "Physical properties of the HIV-1 capsid from all-atom molecular dynamics simulations," Nature Communications (2017), 10.1038/ncomms15959.
T.-S. Lee, D. S. Cerutti, D. Mermelstein, C. Lin, S. Legrand, T. J. Giese, A. Roitberg, D. A. Case, R. C. Walker, and D. M. York, "GPU-Accelerated Molecular Dynamics and Free Energy Methods in Amber18: Performance Enhancements and New Features," Journal of Chemical Information and Modeling, 2043–2050 (2018).
K. Rupp, "CPU, GPU and MIC Hardware Characteristics over Time," (2019).
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark, and H. J. C. Berendsen, "GROMACS: fast, flexible, and free," Journal of Computational Chemistry, 1701–1718 (2005).
S. Pronk, E. Lindahl, and P. M. Kasson, "Coupled Diffusion in Lipid Bilayers upon Close Approach," Journal of the American Chemical Society (2015), 10.1021/ja508803d.
Lindahl, Abraham, Hess, and van der Spoel, "GROMACS 2020.2 source code," (2020).
R. Schulz, B. Lindner, L. Petridis, and J. C. Smith, "Scaling of multimillion-atom biological molecular dynamics simulation on a petascale supercomputer," Journal of Chemical Theory and Computation (2009), 10.1021/ct900292r.
F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault, and R. Namyst, "hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications," (IEEE, 2010), pp. 180–186.
C. Kutzner, S. Páll, M. Fechner, A. Esztermann, B. L. de Groot, and H. Grubmüller, "Best bang for your buck: GPU nodes for GROMACS biomolecular simulations," Journal of Computational Chemistry, 1990–2008 (2015), 1507.00898.
J. E. Stone, A.-P. Hynninen, J. C. Phillips, and K. Schulten, "Early Experiences Porting the NAMD and VMD Molecular Simulation and Analysis Software to GPU-Accelerated OpenPOWER Platforms," in Lecture Notes in Computer Science, Vol. 9137, edited by J. M. Kunkel and T. Ludwig (Springer International Publishing, Cham, 2016), pp. 188–206.
L. Verlet, "Computer 'Experiments' on Classical Fluids. I. Thermodynamical Properties of Lennard-Jones Molecules," Physical Review, 98–103 (1967).
R. Hockney, S. Goel, and J. Eastwood, "Quiet high-resolution computer models of a plasma," Journal of Computational Physics, 148–158 (1974).
J. Yang, Y. Wang, and Y. Chen, "GPU accelerated molecular dynamics simulation of thermal conductivities," Journal of Computational Physics, 799–804 (2007).
J. van Meel, A. Arnold, D. Frenkel, S. Portegies Zwart, and R. Belleman, "Harvesting graphics power for MD simulations," Molecular Simulation, 259–266 (2008).
M. S. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. Legrand, A. L. Beberg, D. L. Ensign, C. M. Bruns, and V. S. Pande, "Accelerating molecular dynamic simulation on graphics processing units," Journal of Computational Chemistry, 864–872 (2009).
S. Pennycook, C. Hughes, M. Smelyanskiy, and S. Jarvis, "Exploring SIMD for Molecular Dynamics, Using Intel Xeon Processors and Intel Xeon Phi Coprocessors," (2013).
P. Gonnet, "A simple algorithm to accelerate the computation of non-bonded interactions in cell-based molecular dynamics simulations," Journal of Computational Chemistry, 570–573 (2007).
P. Eastman and V. S. Pande, "Efficient nonbonded interactions for molecular dynamics on a graphics processing unit," Journal of Computational Chemistry, 1–5 (2009).
U. Welling and G. Germano, "Efficiency of linked cell algorithms," Computer Physics Communications, 611–615 (2011).
N. Bailey, T. Ingebrigtsen, J. S. Hansen, A. Veldhorst, L. Bøhling, C. Lemarchand, A. Olsen, A. Bacher, L. Costigliola, U. Pedersen, H. Larsen, J. Dyre, and T. Schrøder, "RUMD: A general purpose molecular dynamics package optimized to utilize GPU hardware down to a few thousand particles," SciPost Physics, 038 (2017), 1506.05094.
L. Barba and R. Yokota, "How Will the Fast Multipole Method Fare in the Exascale Era?" SIAM News, 8–9 (2013).
J. Glaser, T. D. Nguyen, J. A. Anderson, P. Lui, F. Spiga, J. A. Millan, D. C. Morse, and S. C. Glotzer, "Strong scaling of general-purpose molecular dynamics simulations on GPUs," Computer Physics Communications, 97–107 (2015), 1412.3387.
M. J. Abraham and J. E. Gready, J. Comput. Chem., 2031–2040 (2011).
V. Lindahl, J. Lidmar, and B. Hess, "Accelerated weight histogram method for exploring free energy landscapes," The Journal of Chemical Physics, 044110 (2014).
I. Ohmura, G. Morimoto, Y. Ohno, A. Hasegawa, and M. Taiji, "MDGRAPE-4: a special-purpose computer system for molecular dynamics simulations," Phil. Trans. R. Soc. A, 20130387 (2014).
J. Grossman, B. Towles, B. Greskamp, and D. E. Shaw, "Filtering, Reductions and Synchronization in the Anton 2 Network," (IEEE, 2015), pp. 860–870.
D. S. Shamshirgar, R. Yokota, A. K. Tornberg, and B. Hess, "Regularizing the fast multipole method for use in molecular simulation," Journal of Chemical Physics, 1–9 (2019).
C. Yang, T. Geng, T. Wang, R. Patel, Q. Xiong, A. Sanaullah, C. Wu, J. Sheng, C. Lin, V. Sachdeva, W. Sherman, and M. Herbordt, "Fully integrated FPGA molecular dynamics simulations," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (ACM, New York, NY, USA, 2019), pp. 1–31.
N. Srivastava, H. Rong, P. Barua, G. Feng, H. Cao, Z. Zhang, D. Albonesi, V. Sarkar, W. Chen, P. Petersen, G. Lowney, A. Herr, C. Hughes, T. Mattson, and P. Dubey, "T2S-Tensor: Productively generating high-performance spatial hardware for dense tensor computations," (2019), pp. 181–189.
S. Páll, "Supplementary Information for Heterogeneous Parallelization and Acceleration of Molecular Dynamics Simulations in GROMACS," (2020).