Porting WarpX to GPU-accelerated platforms
A. Myers (a), A. Almgren (a), L. D. Amorim (a), J. Bell (a), L. Fedeli (e), L. Ge (b,a), K. Gott (a), D. P. Grote (c), M. Hogan (b), A. Huebl (a), R. Jambunathan (a), R. Lehe (a), C. Ng (b), M. Rowan (a), O. Shapoval (a), M. Thévenet (d), J.-L. Vay (a), H. Vincenti (e), E. Yang (a), N. Zaïm (e), W. Zhang (a), Y. Zhao (a) and E. Zoni (a)

(a) Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
(b) SLAC National Accelerator Laboratory, Menlo Park, CA 94025, USA
(c) Lawrence Livermore National Laboratory, Livermore, CA 94550, USA
(d) Deutsches Elektronen-Synchrotron (DESY), Hamburg 22607, Germany
(e) LIDYL, CEA-Université Paris-Saclay, CEA Saclay, 91191 Gif-sur-Yvette, France
ARTICLE INFO

Keywords: Exascale Computing, Particle-in-Cell Methods, Accelerator Modelling

ABSTRACT

WarpX is a general purpose electromagnetic particle-in-cell code that was originally designed to run on many-core CPU architectures. We describe the strategy followed to allow WarpX to use the GPU-accelerated nodes on OLCF's Summit supercomputer, a strategy we believe will extend to the upcoming machines Frontier and Aurora. We summarize the challenges encountered, lessons learned, and give current performance results on a series of relevant benchmark problems.
1. Introduction
WarpX [20] is a fully electromagnetic Particle-in-Cell (PIC) code that is being developed as part of the US Department of Energy's Exascale Computing Project [8]. Originally designed for particle accelerator modelling, and in particular for the study of laser- and beam-driven wakefield accelerators, it has also been used to study several other topics in the field of laser-plasma interaction, such as probing the onset of Quantum Electrodynamics (QED) in extreme fields and laser-ion acceleration.

WarpX implements the well-known electromagnetic PIC method for solving the motion of relativistic, charged particles in the presence of electromagnetic fields. In addition, it also includes support for many advanced features, such as: perfectly matched layers (PMLs) [16], a pseudo-spectral (PSATD) Maxwell solver [23], and multi-physics options such as ionization, QED pair creation [14, 10], a QED vacuum polarization solver [5], and binary collisions [15]. WarpX currently supports 2D, 3D and azimuthally decomposed geometries [12] and has the ability to operate in a Lorentz-boosted reference frame [18]. It also includes support for mesh refinement and dynamic load balancing through the AMReX library [26].

WarpX was originally designed with many-core architectures in mind. While the high-level operations such as time-stepping and MPI parallelization were implemented in C++ using AMReX data structures, the core PIC operations, such as current deposition, field gathering, and the various particle pushers and field solvers, were handled by the PICSAR library of Fortran kernels [22]. These routines were highly optimized for Intel's Knights Landing architecture found on supercomputers such as NERSC's Cori and ALCF's Theta platforms and featured hand-vectorized versions of the core PIC operations. Related works presenting early adoptions of accelerator hardware in PIC codes are presented in Refs. [2, 3, 17, 22].

As the ECP focus shifted towards GPU-based machines such as Summit, Frontier, and Aurora, the question naturally arose of what to do with the Fortran kernels in PICSAR. CUDA Fortran provided a way forward for platforms with NVIDIA hardware such as Summit [21], but it was not clear what support for Fortran would look like on AMD or Intel hardware. Likewise, OpenACC provided an easy-to-use model for offloading Fortran routines to NVIDIA GPUs, but again it was not clear how that would work with non-NVIDIA GPUs, and compiler support was limited to a single vendor. OpenMP offered better prospects for portability; however, early implementations suffered performance problems when compared to OpenACC on NVIDIA hardware.

Ultimately, the choice was made to port the PIC kernels in PICSAR from Fortran to C++, and to offload kernels using either CUDA, HIP, or DPC++, depending on whether NVIDIA, AMD, or Intel hardware is targeted. This removed any need for mixed language programming, which adds substantial complication to the codebase and also defeats important compiler optimizations such as inlining. Additionally, C++ usually gets better and, importantly, earlier support from vendors, owing to its prominence in industry relative to Fortran. Finally, CUDA, HIP, and DPC++ offer a relatively consistent programming model across all three target platforms. Any implementation differences between the three could be entirely hidden in a performance portability layer; in our case, in the ParallelFor routines in AMReX (see Section 2.5). In this manner, NVIDIA, AMD, and Intel GPUs could all be supported, with little to no change to the
WarpX code required. Today, WarpX is a C++14 application with an optional, standardized Python interface (PICMI) that can be used to drive simulations interactively [4], and it runs on NVIDIA, AMD, and Intel GPUs. The core of the GPU support is the
ParallelFor kernel launching method from AMReX. In what follows, we give a brief overview of the features in AMReX that WarpX uses to enable parallelization and GPU support. Then, we summarize several key lessons learned from the experience of scaling up the code on Summit. In particular, we discuss the importance of memory management, of optimizing for memory footprint as well as run time, of optimizing parallel communication routines, and of properly utilizing the memory hierarchy. Finally, we present weak and strong scaling results from a uniform plasma problem setup and performance numbers obtained on a plasma acceleration stage benchmark problem.
2. Parallelization
WarpX leverages the AMReX framework for parallelization and GPU offloading. In what follows we summarize the main features of the hierarchical parallelism model that WarpX uses to run on multi-node CPU and GPU platforms.
WarpX makes use of the AMReX-provided tools for describing block-structured adaptive mesh refinement (AMR) hierarchies. From the simplest to the most complex, these are the IntVect, which describes a point in an integer index space; the Box, which describes a region in the same index space and consists of low- and high-end IntVects plus a third IntVect that describes the staggering (i.e., whether the box is cell-, node-, face-, or edge-centered); the BoxArray, which describes a collection of Boxes at a given level of refinement; and the DistributionMapping, which describes how each of those Boxes is mapped to MPI ranks. DistributionMappings can be user-generated, or AMReX can generate them for a given BoxArray using a number of algorithms: round robin, knapsack, and space filling curve. By default, the boxes in WarpX are assigned to MPI ranks according to the space filling curve algorithm, which attempts to put nearby boxes on the same rank. When dynamic load balancing is employed, the user can select either the space filling curve algorithm, which attempts to maintain spatial locality, or the knapsack algorithm, which provides the most flexibility to achieve a balanced work distribution.
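To make the relationship between these objects concrete, the following fragment (an illustrative sketch of ours, not code taken from WarpX; it assumes amrex::Initialize has already been called) builds a small layout and lets AMReX generate the DistributionMapping:

#include <AMReX_IntVect.H>
#include <AMReX_Box.H>
#include <AMReX_BoxArray.H>
#include <AMReX_DistributionMapping.H>

void make_layout ()
{
    using namespace amrex;

    // A 256^3 index space covering cells (0,0,0) through (255,255,255).
    Box domain(IntVect(0,0,0), IntVect(255,255,255));

    // Chop the domain into boxes that are at most 128 cells on a side.
    BoxArray ba(domain);
    ba.maxSize(128);

    // Assign the boxes to MPI ranks (space filling curve by default).
    DistributionMapping dm(ba);
}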
The basic mesh data structure in AMReX is the FArrayBox, which is a multi-dimensional array of floating point values defined on a given Box. FArrayBoxes can be single- or multi-component and are used to represent scalar or vector physical quantities. In the multi-component case, the components are stored in a Struct-of-Array style. A distributed collection of FArrayBoxes defined on a given BoxArray and DistributionMapping is called a MultiFab. The MultiFab data structure is what stores the core mesh data fields in WarpX - the electric and magnetic fields, the current density, and so forth.

The core particle data structure in AMReX is the Particle, which consists of a collection of real and integer components. In WarpX, the majority of these components are stored in Struct-of-Array style. The exceptions are the particle positions and a 64-bit integer identification number, which are stored together in a separate struct. A ParticleContainer is a distributed collection of particles associated with a given AMR hierarchy. In WarpX, each particle species (driver beam, plasma electrons, ions, etc.) is stored in a separate ParticleContainer. Particles are assigned to AMR levels and grids based on their physical positions. The particles in the ParticleContainer can then be looped over grid-by-grid, and PIC operations such as field gathering, particle pushing, and current deposition can be performed.
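A short sketch of how these containers fit together follows; the component counts and variable names are illustrative choices of ours, not WarpX's actual layout. Here ba and dm are the BoxArray and DistributionMapping from the previous sketch, and geom is an amrex::Geometry built from the same domain:

// A current-density field with 1 component and 2 guard cells on the layout above.
amrex::MultiFab jx(ba, dm, 1, 2);

// A particle container with 3 extra real components per particle (e.g. ux, uy, uz),
// stored Struct-of-Array style; one container per species.
using PC    = amrex::ParticleContainer<0, 0, 3, 0>;
using ParIt = amrex::ParIter<0, 0, 3, 0>;
PC electrons(geom, dm, ba);

// Loop over the grids owned by this rank and process their particles in turn.
const int lev = 0;
for (ParIt pti(electrons, lev); pti.isValid(); ++pti) {
    auto& aos     = pti.GetArrayOfStructs();   // particle positions and id/cpu
    const auto np = aos.numParticles();
    // ... gather fields, push, and deposit current for the np particles on this grid ...
}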
The above data structures naturally lend themselves to an "MPI+X" hierarchical approach to parallelism, where "X" is one of OpenMP, CUDA, HIP, or DPC++ for on-node accelerated compute. Boxes are assigned to MPI tasks, and we typically use a form of over-decomposition so that each MPI task is responsible for processing multiple boxes. This allows for more flexible grid structures and also aids load balancing by swapping boxes across ranks. AMReX provides
MultiFab and ParticleContainer iterator objects that instruct each rank to loop over its local grids, processing each one in turn. When processing individual grids, an accelerated compute backend such as OpenMP or CUDA can be selected to perform the actual computations. See the section on ParallelFor for more information.
Communicating mesh and particle data between MPI ranks is handled by the AMReX framework. In particular, WarpX makes use of the following parallel communication routines:

• FillBoundary: This method is used to fill guard cells for the mesh data (e.g. the electric and magnetic field components, and current / charge densities). It fills the data in the guard cells with the (possibly more recent) data from the corresponding valid cells. Here, "valid" refers to cells that are uniquely owned by the grid in question, as opposed to ghosted copies that may exist on other grids / MPI processes.

• SumBoundary: This operation is analogous to FillBoundary, except that instead of copying from valid to guard, it takes the values in the guard cells and adds them to the corresponding valid cells. This is useful when performing current and charge deposition for particles that are near the edge of grid boundaries. These particles add some of their weights to guard cells, and these contributions are summed into the proper valid cells by the SumBoundary method.

• SyncNodal: This function is used when staggered or node-centered grids are employed to represent physical quantities, for example, when using the Yee grid to represent E, B, and J. In this case, some points in the discretization are represented on multiple grids and potentially multiple MPI ranks. Note that, unlike in the cell-centered case, no one grid can be said to uniquely "own" these points. To prevent spurious numerical effects, it is necessary to synchronize these shared nodal points so that they have exactly the same values to machine precision. The SyncNodal method accomplishes this. Several different options for deciding which value to use are implemented, e.g. simple or weighted averaging. By default, WarpX simply chooses an arbitrary value as the "winner" and overrides the others.

• ParallelCopy: This is the most general form of parallel communication for mesh data in AMReX. It performs a copy on intersection from one MultiFab to another, even when those MultiFabs have different BoxArrays and DistributionMappings. This is needed, for example, when copying data between different levels of refinement, when performing regridding or load balancing operations, and when copying data between the PML grids and the rest of the domain.

• Particle Redistribution: This refers to putting particles back on the proper level and grid after they have been pushed. AMReX includes two versions of this operation, one in which the particles are assumed to only move between neighboring ranks, and another in which they are allowed to move between any two ranks in the MPI communicator. The former, local version is the one used most often during normal time stepping, while the latter is used when performing load balancing.

These parallel communication routines have been optimized for hybrid CPU/GPU platforms, in particular Summit. All run fully on the GPUs, meaning that they do not trigger any unnecessary host/device copies of mesh or particle data. Communication has been refactored to reduce the number of GPU kernels launched, in particular when performing packing and unpacking operations on MPI send and receive buffers (see Section 3.3). Finally, WarpX can take advantage of GPU-aware MPI implementations that can operate directly on device data pointers, if one is available. This can be enabled using a runtime option. When it is turned off, right before the MPI sends are performed, the data to be sent is copied into pinned memory buffers on the host and the MPI exchanges are made between host memory buffers instead.
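A stripped-down sketch of how these calls appear at the AMReX level is shown below; the variable names jx, pml_jx, electrons, and geom are illustrative, and in WarpX the actual call sites are buried inside the field solve and deposition steps:

// After current deposition: fold guard-cell contributions back into valid cells,
// then refresh the guard cells from the neighboring valid data.
jx.SumBoundary(geom.periodicity());
jx.FillBoundary(geom.periodicity());

// General copy-on-intersection between MultiFabs with different BoxArrays /
// DistributionMappings, e.g. from the PML grids onto the regular grids.
jx.ParallelCopy(pml_jx);

// After the particle push: put particles back on their proper level and grid.
electrons.Redistribute();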
The core of the GPU support in WarpX consists of a series of ParallelFor functions provided by the AMReX framework. These are similar to those provided by the performance portability layers Kokkos [6] and RAJA [1], but have been tailored towards the needs of structured grid applications. The idea of these functions is that they separate the details of how the loop is performed from the loop body, which describes the operation done on each element. The body is supplied using C++ lambda functions, which capture (device) variables from the surrounding scope. When amrex::ParallelFor is compiled with CUDA as the backend, it translates into a GPU kernel launch. When it is compiled without GPU support, it translates into a normal for loop on the host. Using this approach, a single codebase can be maintained that can run on CPUs and on multiple GPU platforms. CPU loops can be further tiled for individual threads to aid vectorization, which is outside the scope of this GPU-focused paper and thus omitted for brevity.

Listings 1 and 2 show two examples of ParallelFor. The first specifies the loop bounds using an AMReX Box object, which results in a 3-dimensional loop over the cells in the box. The second shows a one-dimensional ParallelFor, which loops over the particles in a grid.

amrex::ParallelFor(bx,
    [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
    {
        dstarr(i,j,k,0) = srcarr(i,j,k,0);
    });

Listing 1: A ParallelFor routine operating on a single box of mesh data. In this example, the threading will be performed over the cells of a 3-dimensional box. AMReX arrays use Fortran index order.

amrex::ParallelFor(np,
    [=] AMREX_GPU_DEVICE (int i)
    {
        amplitude[i] = 0.0_rt;
    });

Listing 2: A one-dimensional ParallelFor used to thread over all the particles in a grid.
Parallel reductions are useful in many places in WarpX, including both "on-node" reductions over OpenMP threads (or other, GPU-based compute backends) and "off-node" reductions over MPI ranks. For example, these types of reductions are found in diagnostic functions for particles and beams that act as an in situ data reduction technique, which can be performed at high frequency compared to costly full data output. Performing these operations efficiently in parallel, in a way that is portable between CPU and various GPU platforms, is non-trivial. To aid in this task, AMReX provides generic functions for performing reduction operations in a
performance-portable way. These functions can be used at the level of contiguous arrays by passing in data pointers, or they can work on higher-level AMReX data containers to, e.g., perform reductions over all the cells in a MultiFab, or all the particles in a ParticleContainer.

A feature of the AMReX implementation of parallel reductions is that it provides an API for performing multiple reductions in one pass on any combination of data types and reduction operators. When running on GPUs, all these operations are done in a single kernel launch. These reduction operations have been implemented and tested for CUDA, HIP, and DPC++, as well as on CPU platforms using OpenMP.
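As an illustration, a fused reduction that computes both a sum and a maximum over the particles in one pass might look roughly like the following sketch. The pointers w and ux are placeholders of ours; the template names follow the AMReX ReduceOps/ReduceData interface, whose exact spelling may differ between AMReX versions:

// One kernel launch evaluates both reductions over np particles.
amrex::ReduceOps<amrex::ReduceOpSum, amrex::ReduceOpMax> reduce_op;
amrex::ReduceData<amrex::Real, amrex::Real> reduce_data(reduce_op);
using ReduceTuple = typename decltype(reduce_data)::Type;

reduce_op.eval(np, reduce_data,
    [=] AMREX_GPU_DEVICE (int i) -> ReduceTuple
    {
        return {w[i], amrex::Math::abs(ux[i])};
    });

ReduceTuple result = reduce_data.value();
amrex::Real total_weight = amrex::get<0>(result);
amrex::Real max_ux       = amrex::get<1>(result);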
3. Lessons from Summit
In the following section, we summarize some key lessons learned from our experience of porting WarpX to Summit. These fall under three main areas: issues relating to memory management and overall footprint, issues relating to parallel communication, and finally, the importance of cache utilization on GPU platforms.
With the trend towards GPU computing, the importance of optimizing codes for memory consumption has increased. Consider the example of Summit. Summit has 4608 nodes, each of which has 608 GB of host memory (512 GB DDR4 + 96 GB HBM2), for a system total of 2.8 Petabytes. This is considerably more than Cori's KNL nodes, which have a total aggregate memory of 1.1 Petabytes. However, if we consider only device memory, each Summit node has 6 NVIDIA V100 GPUs with 16 GB of memory each, for a total of only 440 TB, substantially less than Cori phase II. This means that, provided one wants to run in a mode in which the problem fits entirely on the GPUs (which is desirable considering the performance penalties associated with frequent host / device data transfers), one actually cannot run as big a problem on Summit as one could fit on Cori. This makes reducing the memory footprint of a simulation code quite impactful in terms of enabling production-level science calculations.

Reducing the memory footprint can have performance implications as well. Originally in WarpX, every particle stored persistent values for the electric and magnetic fields interpolated to the particle's position. In addition to the storage overhead, these values need to be communicated every time particles change MPI domains, and shuffled around in memory every time particles are sorted (see Section 3.4.2). Additionally, if the performance of a GPU kernel is memory-bound, meaning that its performance is limited by the rate at which data can be transferred from main memory to the streaming multiprocessors on the GPUs, then increasing the arithmetic intensity of those kernels by streaming less data and recomputing values on-the-fly can improve their overall performance.

Recently, WarpX removed the persistent electric and magnetic fields at the particle positions in favor of re-gathering these values inside GPU kernels as they are needed. For this, the field gathering and particle pushing kernels were fused together in the PIC loop, resulting in less data that needed to be streamed to the processors in a given timestep. In addition to reducing the memory footprint, this also led to a ≈ 25% speedup in the overall run time on several key benchmarks. When the field values at the particle positions are needed more than once in a step, as, for example, when modelling additional effects such as ionization or using certain diagnostics, the gather operation is simply performed multiple times. Finally, we are currently exploring other means of reducing the overall memory footprint of WarpX, including exploiting single / mixed precision and employing compression.

Dynamic memory allocation is many times more expensive on GPU than on CPU architectures. This fact, combined with common programming patterns involving temporary variables, can lead to drastic performance penalties on GPU systems. For example, consider the code in Listing 3. This snippet demonstrates how to loop over mesh data using the AMReX data structures. The MFIter object instructs each MPI rank to loop over the grids it owns. For each grid, we resize a temporary scratch space called tmp, then launch a ParallelFor kernel to do some calculations using it. The Elixir is not essential to the point, but it keeps the scratch space alive in memory until the kernel is finished working with it; this is needed due to the asynchronous nature of GPU kernel launches. If every call to resize the buffer ended up triggering cudaMalloc and cudaFree calls, this could easily become the dominant cost of this routine. Another place this comes up is in out-of-place sorting and partitioning operations, which require a temporary buffer in which to store the result.

One way to mitigate this is to refactor application codes to keep temporary buffers alive in memory instead of letting them go out of scope. However, this is error-prone and labor-intensive. Instead, AMReX provides a number of memory arena classes, which allocate memory in large chunks and dole out pieces of it as the application runs. Thus, even though WarpX makes frequent use of temporary variables, during most time steps there are no calls to cudaMalloc or cudaFree. These Arenas have a number of different options for managing memory fragmentation; currently, the default in AMReX is to use a "first fit" strategy. AMReX provides memory arenas that use host, device, pinned, and managed memory. WarpX uses these Arenas for all of its mesh and particle data structures. By default, when running on NVIDIA GPUs, WarpX places all of its core data in managed memory.
FArrayBox tmp;
for (MFIter mfi(mf); mfi.isValid(); ++mfi)
{
    const Box& bx = mfi.tilebox();
    tmp.resize(bx);
    Elixir eli = tmp.elixir();
    auto const& tmp_arr = tmp.array();
    amrex::ParallelFor(bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            compute_tmp(i, j, k, tmp_arr);
        });
}

Listing 3: Example of ParallelFor. This code can be compiled to run on CPU with OpenMP or GPU with CUDA, HIP, or DPC++.
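For temporaries that are not FArrayBoxes, the same arenas can be used directly. The following is a minimal sketch of ours, assuming only the Arena alloc/free interface shown here:

// Grab scratch device memory from an AMReX arena instead of calling
// cudaMalloc/cudaFree directly. The arena hands out pieces of a large,
// pre-allocated pool, so this stays cheap even when done every time step.
auto* scratch = (amrex::Real*) amrex::The_Arena()->alloc(n*sizeof(amrex::Real));

amrex::ParallelFor(n, [=] AMREX_GPU_DEVICE (int i) noexcept
{
    scratch[i] = 0.0;
});
amrex::Gpu::streamSynchronize();   // ensure the kernel is done before freeing

amrex::The_Arena()->free(scratch);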
Once the initial port of WarpX to NVIDIA GPUs was complete, the initial experience was that compute kernels such as current deposition and field gathering were much faster on V100 hardware than on KNL. However, the same was not true for the AMReX parallel communication routines. The primary reason for this was that the parallel communication routines involved many small "copy on intersection" operations between neighboring boxes, especially when packing and unpacking MPI send and receive buffers. These operations involved little to no computation but launched many small kernels that packed and unpacked data buffers. Thus, the dominant cost in these routines was the latency associated with the kernel launches, which could be fused into a much smaller number of launches. After optimization, each MPI rank makes only one kernel launch to pack and unpack its MPI buffers, which led to greatly improved performance on Summit.
As with CPU-based many-core architectures, rearranging computations so that they properly exploit the memory hierarchy can lead to significant performance increases on V100 GPUs. In this section, we discuss a case study of this effect; specifically, how periodic sorting of particle data, so that it is processed in a cache-friendly way, can greatly improve the performance of PIC operations like field gathering and current deposition on the V100. First, however, we describe the current deposition algorithm we use, and how it differs between CPU and GPU runs, in more detail.
In PIC codes, most operations are straightforward to parallelize, since particles can be threaded over and processed independently without needing to worry about potential race conditions. Charge and current deposition operations, however, require special consideration, since when threading over particles there is the potential for collisions, as multiple threads may attempt to update the same cell simultaneously.

In WarpX, our approach to concurrent scatter operations in particle deposition kernels varies depending on whether we are running with OpenMP or CUDA/HIP/DPC++ as the parallel backend. With OpenMP, the particles on a grid are sorted onto smaller sub-regions called tiles. OpenMP threads are mapped to tiles, which they process simultaneously. Each OpenMP thread deposits particles onto its own private deposition buffer with enough cells to capture the support of all the particles on the tile. There is no need for atomics at this stage, since each thread has its own buffer. After deposition onto the buffer is complete, the buffer values are added to the values for the full grid using atomic writes. Thus atomics are only needed on a per-cell basis, not a per-particle basis. When running on GPUs, however, we perform atomic writes directly to global memory for each particle. This, along with periodic sorting, is sufficient to get good performance on NVIDIA V100 GPUs.
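Schematically, the GPU path therefore reduces to a single ParallelFor with one atomic update per particle-cell pair, as in the simplified nearest-grid-point deposit below. The variable names are illustrative, and the actual WarpX kernels use higher-order shape factors and Esirkepov current deposition:

// rho_arr is an amrex::Array4 view of the charge density on this grid; x, y, z, w
// are device pointers to particle positions and weights; plo and dxi hold the
// domain lower corner and inverse cell sizes.
amrex::ParallelFor(np, [=] AMREX_GPU_DEVICE (int ip) noexcept
{
    int i = static_cast<int>((x[ip] - plo[0]) * dxi[0]);
    int j = static_cast<int>((y[ip] - plo[1]) * dxi[1]);
    int k = static_cast<int>((z[ip] - plo[2]) * dxi[2]);
    // Atomic add straight into global memory; with OpenMP, WarpX instead deposits
    // into thread-private tile buffers and adds them to the grid atomically afterwards.
    amrex::Gpu::Atomic::Add(&rho_arr(i, j, k), w[ip]);
});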
Figure 1:
The effect of sorting interval (i.e., sorting every N time steps) and sort bin size on the overall performance on a uniform plasma benchmark. The x-axis shows the sort interval, while the y-axis shows the overall time to take 100 steps, including the cost of the sorting. A sort interval greater than the total number of steps means that the particles are never re-sorted during the run.

Periodic sorting of the particles on each grid by their spatial locations, so that particles that are close to each other in memory also interact with cells that are close to each other in memory, exploits the memory hierarchy on the GPUs more effectively than processing them in an unordered fashion. This is particularly true when particles are moving with high velocities, such that they frequently change cells. In that case, even if particles are sorted at a particular time, they will rapidly become disordered, leading to significant performance degradation in the particle-mesh operations.

Note that we differentiate between binning, which computes a permutation array that assigns particle indices to cells with a user-defined bin size, and sorting, which uses this permutation array to actually reorder the particle data in memory. Cache utilization requires full sorting, but for many operations simply knowing the cell-sorted indices is sufficient. AMReX provides a GPU-capable implementation of the counting sort operation that can be used to perform both of these operations. Internally, it is built using a GPU implementation of parallel prefix sum, which is based on Ref. [13] and works on NVIDIA, AMD, and Intel GPUs.
Figure 2:
Roofline analysis of the 3rd-order Esirkepov current deposition [7] kernel in WarpX on a single V100 GPU, with and without particle sorting. In the memory streaming limit, three different lines are shown, corresponding to the bandwidths of the L1 and L2 caches as well as that of the main high-bandwidth memory (HBM) on the GPU. Likewise, in the compute-bound regime, two different values are used for the peak floating point performance: with and without taking advantage of fused multiply-add instructions. The arithmetic intensity (A.I.) is measured three times for each kernel, using the memory traffic for each level of the memory hierarchy. For the sorted version, the fact that the A.I. is significantly lower for the L1 and L2 data points shows that we are getting substantial reuse in both levels of cache. Conversely, the fact that the data points are all on top of each other for the unsorted run indicates that without sorting, the degree of reuse is poor.
Figure 3:
Same as Figure 2, but for the fused gather-and-push kernel in WarpX. Again, there is substantial cache reuse when sorting is employed, although for this kernel performance still appears to be limited by HBM bandwidth, even with sorting.

In addition to the presented cache-utilization optimization, sorting and/or binning particles is needed for the modeling of particle-particle interactions. The PIC method by default only models particle-mesh interactions and mesh updates. WarpX implements binary collisions, which depend on a prior binning of neighboring particles, to address various applications in accelerator and beam physics.

Figure 1 shows the results of a parameter study in which the bin size and sorting interval were varied; for example, a sorting interval of 4 with a given bin size means that particles were sorted into supercells of that size every 4 timesteps. On this problem, the optimal choice is to sort by cell (i.e. a bin size of one cell) every time step, and the difference between sorting optimally and not sorting at all is a factor of ≈ 7, with most of the improvement coming from the current deposition and fused gather-and-push kernels. However, this problem is a special case, because the particles change cells more often than in most WarpX applications. Currently, the default in WarpX, used throughout Section 4, is to sort the particles by their PIC cell every 4 time steps.

Note that the Redistribute() function in AMReX does not maintain this cell-sorted order for particles that have left one grid and been migrated to another. However, this only applies to particles that have changed grids, typically only a small subset of the total that are near the "surface" of a grid. The bulk of the particles on a grid will maintain their sorted order in between Redistribute() calls.
Figures 2 and 3 show the results of a roofline analysis [24] of the current deposition and fused gather-and-push kernels in WarpX, which are the two most computationally expensive operations. Our analysis followed the methodology of [25]. For this test, we used a uniform plasma setup with 8 particles per cell and gave the particles a large thermal velocity, so that they frequently change cells. To eliminate effects associated with unified memory paging, we ran the problem for a total of 100 steps and only profiled the last one.

The roofline analysis reveals three things. First, as already demonstrated, sorting the particles gives significantly better performance on V100 GPUs than not sorting them. Second, the fact that the arithmetic intensity measured using the memory traffic for the L1 and L2 caches is significantly lower than that measured for HBM indicates that, in the sorted run, we are getting significant reuse in both of these levels of cache. Third, the arithmetic intensity for the current deposition in the sorted run is right up against the streaming limit for the L2 cache. This indicates that the performance of this kernel is now limited by the L2 cache bandwidth. Gather and push, on the other hand, is likely still limited by HBM bandwidth. Taken together, these results suggest that these kernels should get significantly better performance on the A100, which has a larger L2 cache and higher HBM bandwidth than the V100.
4. Performance Results
In this section, we give current performance results on Summit for two key benchmark problems. We concentrate on two areas: the scaling of the code on a uniform plasma test case, and the performance on a plasma accelerator benchmark problem.
In order to test the scaling of WarpX in an idealized setting, as well as to gauge the speedup associated with using accelerated nodes, we have performed a weak scaling
study using a uniform plasma setup on OLCF's Summit supercomputer. The base case for this scaling study used a 256 x 256 x 384 cell domain with a box size of 128 and ran on 1 Summit node; thus, on the GPU-accelerated runs, each GPU was responsible for processing two 128³-sized boxes. Particles were initially distributed uniformly with 8 particles per cell. We used the standard Yee FDTD solver for these runs, with Esirkepov current deposition and third order shape functions. For the weak scaling study, the number of Summit nodes was doubled along with the number of cells (and particles therein) in the x-, y-, or z-direction, while holding everything else constant, maintaining a constant workload per node. We continued this process up to 2048 nodes, about half of the Summit machine. Overhead associated with time spent in problem initialization, memory allocation, etc., was minimized by running for a total of 100 steps.

The results are shown in Figure 4. We performed the above scaling study twice, once using all six GPUs per Summit node, and again using only the POWER9 CPUs. For both runs, we used 6 MPI tasks per node. For the GPU-accelerated runs, we used one GPU per MPI task, and for the CPU-only case, we used 7 OpenMP threads per task, so that all 42 cores on the node were active. Using these results, we can characterize both the weak scaling behavior of the CPU and GPU versions of WarpX, as well as the overall speedup obtained on Summit from using the accelerators. In both cases, the code scales well up to 2048 nodes. The weak scaling efficiency, defined as the total time taken for 100 time steps on 1 node divided by the total taken on 2048 nodes, is 81% for the GPU case and 90% for the CPU case. The difference in scaling efficiency between the CPU and GPU runs can be attributed to the fact that, because the local work is significantly faster when using the V100s, communication operations like FillBoundary, which are inherently harder to scale, become relatively more expensive. Additionally, the speedup from the accelerators at all scales tested was a factor of approximately 30. This speedup refers to the total run time, including time associated with host / device memory traffic and communication, not to isolated compute kernels.
We have also conducted a series of strong scaling tests, using a very similar uniform plasma problem setup as before. The only difference is that the box size has been reduced, to allow for more GPUs / MPI tasks to be used as the problem is strong scaled. There is some overhead associated with doing this, since with smaller boxes, the surface-to-volume ratio of ghost cells is higher. Other than the box size, the parameters are all the same as before. We use a series of problem sizes, each scaled up by a factor of 2 in terms of the number of cells and particles in the domain. For each one, we conduct a series of five runs, increasing the number of MPI tasks by a factor of 2 each time. Thus, in the fifth run, the run time should have decreased by a factor of 16, assuming perfect strong scaling. By the time we have multiplied the number of MPI ranks by 16, this problem has reached the point where the compute work and the communication work take approximately the same amount of time, so we would not expect the problem to scale further than that.

Figure 4: Results of a weak scaling study on a uniform plasma setup on Summit. The x-axis shows the number of Summit nodes, while the y-axis is the number of particles advanced per nanosecond. Both the CPU and GPU versions of the code scale well, and the overall speedup associated with using the accelerators is ∼ 30.

The smallest scaling study in this series goes from 1 to 16 nodes, while the largest goes from 256 to 4096, nearly the entire machine. The scaling efficiency, defined as the time a run should take assuming perfect strong scaling within a problem size and perfect weak scaling from the base problem size divided by the actual run time, is plotted in Figure 5. The efficiencies after strong scaling by a factor of 16 for each problem size vary from approximately 70% for the smallest case to approximately 50% for the largest.

The above tests were highly idealized in several ways. First, the workload was perfectly uniform at the initial time, and approximately uniform at later times, subject only to random fluctuations in the particle density from cell to cell. Second, the number of particles per cell, 8, is significantly higher than used in some WarpX physics applications. Laser-wakefield acceleration runs, for example, tend to use about 2 particles per cell on average, which can change the performance profile of the code. To evaluate WarpX on this important science scenario, the following setup was used, designed to mimic the essential features of modelling a single plasma-accelerator stage from WarpX's challenge problem. This is also the benchmark problem used to determine a Figure-of-Merit (FOM) for the ECP Key Performance Parameters (KPP) assessment. As a KPP-1 project, WarpX needs to show at least a factor of 50 increase in its FOM over the baseline on the eventual exascale hardware.

In this setup, an accelerated particle beam is tracked using the moving window feature in WarpX, in which the simulation domain itself shifts along with the beam at speed c. Additionally, the entire simulation is modeled in a Lorentz-boosted reference frame [18], using a gamma boost of 30. New plasma is continuously injected at the right-hand side of the domain, while particles that leave the domain at the left-hand side are removed from the simulation. The plasma consists of two particles per cell (one electron and one proton), while the accelerated beam is comprised of electrons. To mitigate the numerical Cherenkov instability in the modeling of a relativistically flowing plasma, the Godfrey filter [9] is applied to the electromagnetic fields prior to gathering them to the particle positions. For the algorithmic options, we have used the Vay particle pusher [19], the Cole-Karkkainen-Cowan FDTD solver [11], and energy-conserving field gathering. We have again used Esirkepov current deposition with 3rd-order interpolation. To minimize the computer time needed to conduct these simulations, we initialize the problem with the simulation domain entirely filled with plasma, which would normally not be the case when modelling an accelerator stage.

To gauge the impact of using accelerated nodes on this more realistic problem setup, we have measured the FOM on Summit, defined as

    FOM = num_cells * (α + β * ppc) / avg_time_per_it        (1)

where num_cells is the total number of grid points in the simulation, α = 0.1 is a heuristic grid update cost, β = 0.9 is the corresponding particle update cost, ppc is the average number of particles per cell, and avg_time_per_it is the average time per iteration after 1000 steps. We performed this measurement on 4263 Summit nodes, and extrapolated this number to the full machine assuming perfect weak scaling. Our baseline FOM was measured on NERSC's Cori using the original Warp code. The baseline FOM value, measured in March 2019 on 6625 Cori nodes and extrapolated to the 9668 nodes on the full machine, was 2.2e10. The corresponding value on Summit, measured in July 2020, was 2.5e12, over a factor of 100 improvement over the baseline. Additionally, the best CPU-only FOM obtained using the WarpX code was 1.0e11, also measured in March 2019. So there is a substantial (25x) improvement in our FOM measured with WarpX from using the GPUs on Summit, as compared to Cori. These values are all summarized in Table 1, along with several other data points showing the evolution of WarpX's FOM over time.
Figure 5:
Strong scaling studies for a variety of problem sizes. Each tick type refers to a different problem size. The x-axis shows the number of Summit nodes, and the y-axis shows scaling efficiency, defined as the time a run should take assuming perfect strong scaling within a problem size and perfect weak scaling from the base problem size, divided by the actual run time.
Table 1
Progress in the FOM measurement over time. Code: either the original Warp code (baseline) or WarpX. Date: the date when the measurement was taken. Machine: which computer was used to make the measurement. Nodes: how many nodes the measurement was performed on; there are 9668 KNL nodes on Cori and 4608 nodes on Summit. FOM: the figure of merit, extrapolated from the number of nodes the measurement was taken on to the full machine.

Code    Date     Machine      Nodes   FOM
Warp    3/2019   Cori (KNL)   6625    2.2e10
WarpX   3/2019   Cori (KNL)   6625    1.0e11
WarpX   6/2019   Summit       32      8.6e11
WarpX   6/2019   Summit       1000    7.8e11
WarpX   9/2019   Summit       2560    6.8e11
WarpX   1/2020   Summit       2560    1.0e12
WarpX   2/2020   Summit       4263    1.2e12
WarpX   6/2020   Summit       4263    1.4e12
WarpX   7/2020   Summit       4263    2.5e12
5. Conclusion
We have summarized the approach taken to porting WarpX, which was originally designed for many-core CPU architectures, to take advantage of GPU-accelerated nodes. This approach is largely based on the amrex::ParallelFor set of performance portability functions. We have presented several key lessons learned from the port, including the importance of managing memory allocation and the code's overall memory footprint, the importance of minimizing the effect of kernel launch latency in MPI communication routines, and the importance of utilizing the cache hierarchy on V100 GPUs. The GPU port of WarpX scales up to nearly all of Summit and currently sees good improvements in its KPP-1 figure of merit on Summit relative to its baseline.
6. Acknowledgements
This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and
hardware technology, to support the nation's exascale computing imperative. This work was performed in part under the auspices of the U.S. Department of Energy by Lawrence Berkeley National Laboratory under Contract DE-AC02-05CH11231, SLAC National Accelerator Laboratory under Contract AC02-76SF00515, and Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

WarpX is developed as an open source project and is available at https://github.com/ECP-WarpX/WarpX. The presented code versions correspond to the monthly releases of the code between 3/2019 and 10/2020. The data that support the findings of this study are available under DOI: 10.5281/zenodo.4277941.
References

[1] Beckingsale, D.A., Burmark, J., Hornung, R., Jones, H., Killian, W., Kunen, A.J., Pearce, O., Robinson, P., Ryujin, B.S., Scogland, T.R., 2019. RAJA: Portable performance for large-scale scientific applications, in: 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), pp. 71–81.
[2] Burau, H., Widera, R., Hönig, W., Juckeland, G., Debus, A., Kluge, T., Schramm, U., Cowan, T.E., Sauerbrey, R., Bussmann, M., 2010. PIConGPU: A fully relativistic particle-in-cell code for a GPU cluster. IEEE Transactions on Plasma Science 38, 2831–2839.
[3] Bussmann, M., Burau, H., Cowan, T.E., Debus, A., Huebl, A., Juckeland, G., Kluge, T., Nagel, W.E., Pausch, R., Schmitt, F., Schramm, U., Schuchart, J., Widera, R., 2013. Radiative signatures of the relativistic Kelvin-Helmholtz instability, in: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ACM, New York, NY, USA. pp. 5:1–5:12. URL: https://github.com/ComputationalRadiationPhysics/picongpu.
[4] CAMPA Collaboration. Particle-In-Cell Modeling Interface (PICMI). URL: https://github.com/picmi-standard/picmi.
[5] Carneiro, P., Grismayer, T., Fonseca, R., Silva, L., 2017. Quantum electrodynamics vacuum polarization solver. arXiv:1607.04224.
[6] Carter Edwards, H., Trott, C.R., Sunderland, D., 2014. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing 74, 3202–3216. Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
[7] Esirkepov, T., 2001. Exact charge conservation scheme for particle-in-cell simulation with an arbitrary form-factor. Computer Physics Communications 135, 144–153.
[8] Exascale Computing Project. Homepage.
[9] Godfrey, B., Vay, J.L., 2014. Suppressing the numerical Cherenkov instability in FDTD PIC codes. Journal of Computational Physics 267.
[10] Gonoskov, A., Bastrakov, S., Efimenko, E., Ilderton, A., Marklund, M., Meyerov, I., Muraviev, A., Sergeev, A., Surmin, I., Wallin, E., 2015. Extended particle-in-cell schemes for physics in ultrastrong laser fields: Review and developments. Phys. Rev. E 92, 023305.
[11] Karkkainen, M., Gjonaj, E., Lau, T., Weiland, T., 2006. Low-dispersion wake field calculation tools, in: Proc. of International Computational Accelerator Physics Conference, Chamonix, France. pp. 35–40.
[12] Lifschitz, A., Davoine, X., Lefebvre, E., Faure, J., Rechatin, C., Malka, V., 2009. Particle-in-cell modelling of laser–plasma interaction using Fourier decomposition. Journal of Computational Physics 228, 1803–1814.
[13] Merrill, D., Garland, M., 2016. Single-pass Parallel Prefix Scan with Decoupled Lookback. Technical Report NVR2016-001. NVIDIA Research. URL: https://research.nvidia.com/sites/default/files/pubs/2016-03_Single-pass-Parallel-Prefix/nvr-2016-002.pdf.
[14] Nikishov, A., 1970. Pair production by a constant external field. Sov. Phys. JETP 30, 660.
[15] Pérez, F., Gremillet, L., Decoster, A., Drouin, M., Lefebvre, E., 2012. Improved modeling of relativistic collisions and collisional ionization in particle-in-cell codes. Physics of Plasmas 19, 083104.
[16] Shapoval, O., Vay, J.L., Vincenti, H., 2019. Two-step perfectly matched layer for arbitrary-order pseudo-spectral analytical time-domain methods. Computer Physics Communications 235, 102–110.
[17] Surmin, I., Bastrakov, S., Matveev, Z., Efimenko, E., Gonoskov, A., Meyerov, I., 2016. Co-design of a particle-in-cell plasma simulation code for Intel Xeon Phi: A first look at Knights Landing, in: Algorithms and Architectures for Parallel Processing, Springer International Publishing, Cham. pp. 319–329.
[18] Vay, J.L., 2007. Noninvariance of space- and time-scale ranges under a Lorentz transformation and the implications for the study of relativistic interactions. Physical Review Letters 98, 130405/1–4.
[19] Vay, J.L., 2008. Simulation of beams or plasmas crossing at relativistic velocity. Physics of Plasmas 15, 056701.
[20] Vay, J.L., Almgren, A., Amorim, L.D., Bell, J., Ge, L., Gott, K., Grote, D.P., Hogan, M., Huebl, A., Jambunathan, R., Lehe, R., Myers, A., Ng, C., Park, J., Rowan, M., Shapoval, O., Thévenet, M., Zhang, W., Zhao, Y., Zoni, E., 2020. Toward the modeling of chains of plasma accelerator stages with WarpX. Journal of Physics: Conference Series 1596, 012059. URL: https://github.com/ECP-WarpX/WarpX.
[21] Vazhkudai, S.S., de Supinski, B.R., Bland, A.S., Geist, A., Sexton, J., Kahle, J., Zimmer, C.J., Atchley, S., Oral, S., Maxwell, D.E., Larrea, V.G.V., Bertsch, A., Goldstone, R., Joubert, W., Chambreau, C., Appelhans, D., Blackmore, R., Casses, B., Chochia, G., Davison, G., Ezell, M.A., Gooding, T., Gonsiorowski, E., Grinberg, L., Hanson, B., Hartner, B., Karlin, I., Leininger, M.L., Leverman, D., Marroquin, C., Moody, A., Ohmacht, M., Pankajakshan, R., Pizzano, F., Rogers, J.H., Rosenburg, B., Schmidt, D., Shankar, M., Wang, F., Watson, P., Walkup, B., Weems, L.D., Yin, J., 2018. The design, deployment, and evaluation of the CORAL pre-exascale systems, in: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 661–672.
[22] Vincenti, H., Lobet, M., Lehe, R., Sasanka, R., Vay, J.L., 2017. An efficient and portable SIMD algorithm for charge/current deposition in particle-in-cell codes. Computer Physics Communications 210, 145–154.
[23] Vincenti, H., Vay, J.L., 2016. Detailed analysis of the effects of stencil spatial variations with arbitrary high-order finite-difference Maxwell solver. Computer Physics Communications 200, 147–167.
[24] Williams, S., Waterman, A., Patterson, D., 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 65–76.
[25] Yang, C., Kurth, T., Williams, S., 2020. Hierarchical roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system. Concurrency and Computation: Practice and Experience 32, e5547.
[26] Zhang, W., Almgren, A., Beckner, V., Bell, J., Blaschke, J., Chan, C., Day, M., Friesen, B., Gott, K., Graves, D., Katz, M.P., Myers, A., Nguyen, T., Nonaka, A., Rosso, M., Williams, S., Zingale, M., 2019.
AMReX: a framework for block-structured adaptive mesh refinement. Journal of Open Source Software 4, 1370. URL: https://ccse.lbl.gov/AMReX.