VPIC 2.0: Next Generation Particle-in-Cell Simulations
Robert Bird, Nigel Tan, Scott V. Luedtke, Stephen Lien Harrell, Michela Taufer, Brian Albright
Abstract—VPIC is a general purpose Particle-in-Cell simulation code for modeling plasma phenomena such as magnetic reconnection, fusion, solar weather, and laser-plasma interaction in three dimensions using large numbers of particles. VPIC's capacity in both fidelity and scale makes it particularly well-suited for plasma research on pre-exascale and exascale platforms. In this paper we demonstrate the unique challenges involved in preparing the VPIC code for operation at exascale, outlining important optimizations to make VPIC efficient on accelerators. Specifically, we show the work undertaken in adapting VPIC to exploit the portability-enabling framework Kokkos and highlight the enhancements to VPIC's modeling capabilities to achieve performance at exascale. We assess the achieved performance-portability trade-off through a suite of studies on nine different varieties of modern pre-exascale hardware. Our performance-portability study includes weak-scaling runs on three of the top ten TOP500 supercomputers, as well as a comparison of low-level system performance of hardware from four different vendors.
Index Terms—Simulation, Portability, Plasma Physics, Particle-in-Cell
1 INTRODUCTION
Many plasma physics phenomena can only be understood through the use of kinetic Particle-in-Cell (PIC) simulations. VPIC [1]–[3] has been at the forefront of this research to study plasma physics phenomena including magnetic reconnection, fusion, solar weather, and laser-plasma interaction. VPIC was deployed to perform high performance multi-trillion particle simulations on over 2 million MPI processes on the Trinity supercomputer and other computing clusters at Los Alamos National Laboratory [3], [4].

• Nigel Tan and Michela Taufer are with the University of Tennessee at Knoxville, Knoxville, TN, USA. E-mail: [email protected], [email protected].
• Robert Bird, Scott V. Luedtke, and Brian Albright are with Los Alamos National Laboratory, Los Alamos, NM, USA. E-mail: [email protected], [email protected], [email protected].
• Stephen Lien Harrell is with the Texas Advanced Computing Center at the University of Texas at Austin, Austin, TX, USA. E-mail: [email protected].

The original VPIC code (referred to throughout as VPIC 1.0) was built to minimize data movement and maximize performance; given a new HPC platform, ad-hoc re-engineering of the code had guaranteed the continuation of performance. However, the increasing heterogeneity of emerging exascale platforms makes it no longer practical to optimize the VPIC code for every available compute platform. A new multi-architectural VPIC code is needed to ensure portability and performance across exascale platforms, but there are three challenges to building this code. First, the build system complexity grows significantly with the number of architectures supported. Second, the code must allow usage of native hardware-focused APIs and libraries. Hardware vendors and library developers put significant effort into optimizing common operations and function calls. Not allowing access to existing optimized libraries creates more work for the developers and reduces performance. The third, and probably the most tangible and difficult, challenge is loss in performance. Portable code comes at the cost of performance due to differences in parallelism approaches among platforms. Writing portable code requires forcing a single parallelism model onto many platforms. The mismatch between model and platform leads to a loss in performance.

There are multiple methods to overcome these challenges and to create a multi-architectural VPIC code while reducing long-term maintenance. These include pragma-based directives such as OpenMP and template-based library approaches such as Kokkos [5] and Raja [6]. Template-based library methods are particularly powerful because they can reduce the amount of code needed to support multi-architectural codes. Furthermore, frameworks like Kokkos and Raja can make architectural code-branching decisions at compile time while minimizing the architecture-dependent code required by the application developers. In this paper, we address the three challenges listed above to transform the original VPIC code (VPIC 1.0), a CPU-only, platform-specific optimized code, into VPIC 2.0, a multi-architectural, portable code, with the support of the Kokkos abstraction layer. The single VPIC source code runs across a diverse range of modern hardware.

Transforming VPIC into a high performance portable code capable of running on emerging heterogeneous platforms requires several changes. Using Kokkos effectively involves changes to both VPIC's data structures and user interfaces.
VPIC 2.0 alters data storage to leverage the benefits of Kokkos Views (advanced multidimensional arrays). Using Views for storage allows us to automatically adjust the memory layout to best match the target hardware. Extending the user interface is required to maintain backwards compatibility and improve the performance of simulation diagnostics. Other transformations that we introduce focus on further improving the portability and performance of VPIC 2.0 through code restructuring and optimizations.
VPIC 2.0 scalability and portability is tested across nine platforms including the Sierra, Frontera, and Selene supercomputers. We demonstrate excellent weak scaling, especially on the GPU-based platforms, where we see less than 10% slowdown between 64 and up to 7,200 GPUs. The rise in GPU cache size opens up intriguing possibilities for super-linear speedup at tangible scientific scales, as shown in our strong scaling results. Our portability study demonstrates the effectiveness of VPIC 2.0 across all nine platforms and highlights both the challenges and strengths of portable code. VPIC 2.0 is easy to adapt and run on emerging hardware platforms with limited manual work. The portability of VPIC 2.0 enables the use of cutting edge accelerators for greater performance and facilitates longer simulations, more advanced diagnostics, and faster turnaround time. The contributions of this paper are as follows:

• An effective approach for performance portability
• The performance and portability study on nine platforms
• The study of weak and strong scaling across CPU and GPU platforms (including 3 top 10 machines)
• The analysis of the costs and benefits of portable code
• Insights on the impacts of VPIC 2.0 on future scientific discovery

The rest of this paper is structured as follows: Sec. 2 presents key changes in modern computing and provides an overview of Kokkos; Sec. 3 introduces the physics behind VPIC; Sec. 4 presents our transformation of the VPIC code needed to achieve portability; Sec. 5 shows performance and scalability results; Sec. 6 describes the impact of the newly released VPIC on scientific discovery at exascale; Sec. 7 cites related work; and Sec. 8 summarizes this work.
2 BACKGROUND
Supercomputing has a rich history of exploring and exploiting different aspects of parallelism. From the first vector supercomputers in the 1970s, to the massive clusters of computers of today, accelerators have been used to offload time-consuming tasks from the primary compute unit(s). This was first seen with specific-purpose co-processors [7], [8], and later evolved to include general purpose accelerator hardware [9] and GPUs [10] as the cornerstone of accelerator based computing.

Accelerators today provide a highly parallelized compute capacity. We have seen a few major approaches in accelerators in the last 10 years. The first is the GPGPU, which uses hundreds of graphics cores that often only compute at single or half precision. The second modern accelerator type contains many general purpose cores which are generally slower than the primary compute unit. Some examples of this type are the Intel Knights Corner [11] and Sunway SW26010 [12] accelerators. Finally, while a large amount of SIMD capability now sits on the package of most modern CPUs, a third group of accelerators exists that provides additional vector capability, such as the NEC Vector Engine [13].

These characteristics and the accompanying execution models of modern accelerators create many challenges for maintaining and improving performance, which is the primary motivation for using accelerators in high performance computing. Porting an existing code base to a new accelerator architecture while maintaining performance can have many facets, including: novel memory models; explicit concerns around data locality and data movement; unfamiliar execution models; and changes in the scale of parallelism. Any of these issues can be core to maintaining performance through an architecture change. Further compounding this complexity is that once a code base is ported to a new platform and is performant, there may now be a new branch or code variant to maintain.
Kokkos [5] is a suite of libraries and abstractions for developing high performance C++ applications that are portable across many architectures, including all major CPUs and accelerators. Kokkos achieves high performance and portability by mapping its own programming model to various native backend parallel programming languages such as CUDA or HIP.

There are three major elements of the Kokkos ecosystem: the Kokkos kernels libraries, the Kokkos tools interface, and the Kokkos core programming model. The Kokkos kernels libraries provide common linear algebra (e.g., BLAS, SparseBLAS) and graph algorithms. Kokkos tools are comprised of an interface and a set of utilities that connect to the Kokkos runtime and enable lightweight debugging and profiling. The Kokkos programming model enables developers to control memory layout, placement of data, parallel execution, and execution devices. Kokkos maps specified data characteristics and execution policies to the target hardware backends using a set of rules for automatically determining different hardware specific parameters, removing the user from the mapping task.

The Kokkos programming model abstractions manage common data structures and parallel execution as seen in Fig. 1. Kokkos data structures have three key extensions defining the memory space, layout, and traits. The memory space determines where the memory is allocated. This abstraction enables the use of different memory technologies (e.g., High Bandwidth Memory or HBM, DDR SDRAM, and Non-Volatile Memory). Memory layout controls the order in which memory is allocated and stored, affecting the memory access pattern. Kokkos includes the most common layouts (e.g., row/column major, strided, and tiled). These layouts are important for maximizing performance and interoperability with various linear algebra libraries. Kokkos also supports various memory traits that affect how memory is accessed (e.g., streaming, atomic, restricted) and offers the user fine grained control that can be used to expose additional information or optimization opportunities. Kokkos Views are the most common data structure used in VPIC. Views are reference counted, multi-dimensional arrays with optional memory traits to control memory access. These memory traits, combined with hardware settings determined at compile time, enable Kokkos applications to operate on significantly different hardware platforms with a single code path needed to be written by the developer. For example, GPU applications generally require memory to be set up on the
host and copied to the device. Kokkos addresses this pattern by having a HostMirror View that acts as the host copy of a device View. Developers can use a deep_copy operation to move data between Views, including between a host and a device. If the target device is the same as the host, as in the case of a CPU-only build, any unnecessary deep_copy turns into a no-op.

Fig. 1: Core Kokkos abstractions for ensuring performance portability [14].

Kokkos's parallel executions are defined via execution spaces, patterns, and policies. An execution space defines where an operation is to be run (e.g., CPU or GPU and the execution mechanisms). The pattern of execution (or parallel dispatch) defines the parallel operation to execute (e.g., parallel_for, reduce, scan, or task_spawn). Execution policies define how the operation should be executed (e.g., over a range of indices, with a team of threads, or using a task-graph). Given an application, when it is time to define its parallel execution, Kokkos can either automatically choose hardware specific settings or allow developers to use their own settings for finer performance tuning.
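To make the host/device pattern concrete, the following minimal sketch (not taken from the VPIC source; the array name and sizes are illustrative) shows how a Kokkos View, its host mirror, deep_copy, and a parallel_for dispatch fit together. On a CPU-only build the mirror aliases the original allocation and the copies become no-ops.

  #include <Kokkos_Core.hpp>

  int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
      const int n = 1000000;
      // Device allocation; memory space and layout follow the enabled backend.
      Kokkos::View<float*> field("field", n);

      // Host-side mirror used for initialization and diagnostics.
      auto field_host = Kokkos::create_mirror_view(field);
      for (int i = 0; i < n; ++i) field_host(i) = 1.0f;

      // Copy host -> device (a no-op when host and device share memory).
      Kokkos::deep_copy(field, field_host);

      // Parallel dispatch on the default execution space.
      Kokkos::parallel_for("scale_field", n, KOKKOS_LAMBDA(const int i) {
        field(i) *= 2.0f;
      });

      // Copy device -> host before any host-side output.
      Kokkos::deep_copy(field_host, field);
    }
    Kokkos::finalize();
    return 0;
  }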
3 VECTOR PARTICLE-IN-CELL (VPIC) OVERVIEW
The Particle-In-Cell method is among the most widely used computational tools in plasma physics. Being fully kinetic, PIC codes are used when fluid approximations to plasma physics fail. With its first principles approach to the physics and fast explicit time stepping, PIC codes have been used to simulate plasmas ranging from micron- and femtosecond-scale short-pulse laser-plasma interactions [15], [16] to lightminute- and day-scale supermassive black hole accretion disks [17], [18]. Other uses include: laser-plasma instability [19] and cross-beam energy transfer [20] simulations for inertial confinement fusion; magnetic flux ropes in the heliosphere [21]; magnetic reconnection in laboratory plasmas [22]; and Earth's magnetosphere [23].

The PIC method solves the Maxwell-Boltzmann system of equations

\nabla \cdot \vec{E} = \frac{1}{\epsilon_0}\rho,  (1a)
\nabla \times \vec{E} = -\frac{\partial \vec{B}}{\partial t},  (1b)
\nabla \cdot \vec{B} = 0,  (1c)
\nabla \times \vec{B} = \mu_0 \vec{J} + \mu_0 \epsilon_0 \frac{\partial \vec{E}}{\partial t},  (1d)

\partial_t f_s + \vec{v} \cdot \nabla_x f_s + q_s \left[ \vec{E} + \vec{v} \times \vec{B} \right] \cdot \nabla_p f_s = C_s(f),  (2)

where $\nabla$ is the del operator, $\vec{E}$ is the electric field, $\vec{B}$ is the magnetic or B field (in Tesla), $\rho$ is the charge density, $\vec{J}$ is the current density, $\epsilon_0$ and $\mu_0$ are the vacuum permittivity and permeability, respectively, $C_s(f)$ is the collision operator, and $f_s = f_s(\vec{x}, \vec{v}, t) = \langle F_s(\vec{x}, \vec{v}, t) \rangle$ is the plasma distribution function, where the brackets denote the ensemble average and $F_s$ is the microscopic phase-space distribution function, the subscript $s$ denotes the species, $\vec{x}$ is position, $\vec{v}$ is velocity, and $\vec{p}$ is momentum.

Conceptually, the PIC method solves the electromagnetic fields on a grid and moves particles continuously through space. The particles move in response to the fields according to the Lorentz force law, and deposit current on the grid. They do not directly interact with each other, unless collisions are modeled. The fields evolve according to Maxwell's equations, Eqs. 1, including the current from the particles.

The plasma distribution function is represented by quasiparticles, also called simulation particles or macroparticles. Quasiparticles are single particles of a particular species in the simulation that represent a number of real particles given by their weight, $w$. The PIC method thus samples the smooth distribution function $f_s$, which is the ensemble average of the "ground truth" microscopic $F_s$. Typically, $w \gg 1$.

Furthermore, quasiparticles have a shape (i.e., a distribution of mass and charge over some spatial extent). For a given particle shape, $\phi$, the distribution function is represented in PIC as

f_s(\vec{x}, \vec{v}, t) = \sum_{i=1}^{N_s} w_i \, \phi(\vec{x} - \vec{x}_i(t)) \, \delta(\vec{v} - \vec{v}_i(t)).  (3)

Typically, the particle shape functions are B-splines:

b_0(\xi) = \begin{cases} 1 & \text{if } |\xi| < 1/2 \\ 0 & \text{otherwise} \end{cases},  (4)

b_{n+1}(\xi) = \int_{-\infty}^{\infty} b_0(\xi - \xi') \, b_n(\xi') \, d\xi',  (5)

where $\xi$ is the distance to the quasiparticle center normalized to the grid spacing. Zeroth order is called the top-hat shape, and first is the triangle. The 3D shape function is typically the tensor product of the 1D shape functions in each dimension. The field experienced by a quasiparticle with a finite spatial extent is

\vec{E}_i = \int \vec{E} \, \phi(\vec{x} - \vec{x}_i) \, d\vec{x},  (6)

\vec{B}_i = \int \vec{B} \, \phi(\vec{x} - \vec{x}_i) \, d\vec{x}.  (7)

Since the fields are defined on the grid, evaluating these integrals requires an interpolation scheme [24], [25], the details of which can be messy.

All PIC codes at their core consist of two coupled solvers: the particle pusher and the field solver. The particle pusher moves plasma particles freely through space based on surrounding EM fields and calculates currents arising from this motion. The field solver solves Maxwell's equations on a fixed grid, accounting for the currents from the particles.
These core algorithms fully reproduce the behavior of collisionless, relativistic, kinetic plasmas. Additional physics, such as collisions and special boundary conditions, can be added with additional physics modules.

Maxwell's equations are solved using the widely used finite-difference time-domain (FDTD) method [26] on the Yee staggered grid [27]. The staggered Yee grid allows for easily implemented centered, second order accurate derivatives. The particle pusher solves the relativistic equations of motion

\frac{d\vec{x}}{dt} = \vec{v},  (8)

and

\frac{d(\gamma\vec{v})}{dt} = \frac{q}{m}\left(\vec{E} + \vec{v} \times \vec{B}\right).  (9)

VPIC uses the Boris pusher [28], a second-order accurate pusher that employs the leapfrog method, slightly modified to avoid cyclotron motion aliasing, similar to the method used in [29]. The Boris pusher splits the equations of motion into an acceleration from the $\vec{E}$ field and a rotation about the $\vec{B}$ field. A particle is first pushed a half-step, based on its previous momentum. The fields are interpolated to this half-step position, the momentum is updated with these fields, and the particle is moved the full step. Finally, the particle is moved to the three-halves position to calculate the current used by the field solver, and the three-halves position is discarded.

More formally, the difference equations are

\frac{\vec{x}^{n+\frac{1}{2}} - \vec{x}^{n-\frac{1}{2}}}{\Delta t} = \vec{v}^{n},  (10)

and

\frac{\vec{p}^{n+1} - \vec{p}^{n}}{\Delta t} = q\left(\vec{E}^{n+\frac{1}{2}} + \vec{v}^{n+\frac{1}{2}} \times \vec{B}^{n+\frac{1}{2}}\right),  (11)

where $\vec{v} = \vec{p}/(m\gamma)$, and

\vec{v}^{n+\frac{1}{2}} = \frac{\vec{p}^{n} + \vec{p}^{n+1}}{2m\gamma^{n+\frac{1}{2}}}.  (12)
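The rotation/acceleration split described above can be written compactly in velocity form. The sketch below is a minimal, non-relativistic illustration of the Boris split (half acceleration, rotation about B, half acceleration, leapfrog position update); it is not VPIC's implementation, and the Vec3 type and variable names are ours.

  struct Vec3 { double x, y, z; };

  Vec3 add(Vec3 a, Vec3 b)     { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
  Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
  Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
  }
  double dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }

  // One Boris update of velocity v and position x for charge q and mass m,
  // given E and B interpolated to the particle, over time step dt.
  void boris_push(Vec3& x, Vec3& v, Vec3 E, Vec3 B, double q, double m, double dt) {
    const double qmdt2 = 0.5 * q * dt / m;
    Vec3 v_minus = add(v, scale(E, qmdt2));          // half acceleration from E
    Vec3 t = scale(B, qmdt2);                        // rotation vector
    Vec3 s = scale(t, 2.0 / (1.0 + dot(t, t)));
    Vec3 v_prime = add(v_minus, cross(v_minus, t));  // rotation about B
    Vec3 v_plus  = add(v_minus, cross(v_prime, s));
    v = add(v_plus, scale(E, qmdt2));                // second half acceleration
    x = add(x, scale(v, dt));                        // leapfrog position update
  }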
4 VPIC FOR HETEROGENEOUS PLATFORMS

Two components of the VPIC code need to be redesigned in order to deploy the Kokkos abstraction layer effectively: (1) the VPIC data structure elements and (2) the user interface.
VPIC operates by defining a simulation space divided into a grid of cells, and modeling particle movement across these cells. In other words, particles are distributed across an n-dimensional (n-D) space that is decomposed into an n-D grid. Each simulated particle is a macroparticle with a defined weight (i.e., the number of real particles modeled by each macroparticle). There are four key data elements in VPIC that must be moved to Kokkos Views to ensure portability. They are the macroparticles, electromagnetic (EM) fields, interpolators, and current accumulators. Fig. 2a describes the original C++ definition of macroparticle storage in VPIC 1.0 before the Kokkos port. Each macroparticle is a 32-byte structure (i.e., 3 floats for voxel position dx, dy, and dz, 3 floats for normalized momentum ux, uy, uz, 1 float for weight, and 1 integer for cell index).

  struct Particle {
    float dx;   // Position in (x, y, z) dimensions
    float dy;
    float dz;
    float ux;   // Momentum in (x, y, z) dimensions
    float uy;
    float uz;
    float w;    // Weight of simulated particle
    int i;      // Index of cell where particle resides
  };
  Particle particles[N];  // Particle storage
  particles[0].dx;        // Access dx from particle 0

(a) Original VPIC macroparticle.

  View<float*[7]> particles(N);  // Particle data
  namespace particle_var {
    enum p_v {  // Particle member enum for clean access
      dx = 0,   // Position in (x, y, z) dimensions
      dy,
      dz,
      ux,       // Momentum in (x, y, z) dimensions
      uy,
      uz,
      w         // Weight of simulated particle
    };
  };
  View<int*> particle_indices(N);  // Particle indices
  // Access dx from particle 0
  particles(0, particle_var::dx);

(b) VPIC macroparticle using a Kokkos View.

Fig. 2: Code changes for storing and accessing particle data using Kokkos.

The EM fields are represented in the original C++ VPIC 1.0 code as an 80-byte structure (i.e., 3 floats for the $\vec{E}$ field, 1 float for the divergence of $\vec{E}$ error, 3 floats for the $\vec{B}$ field, 1 float for the divergence of $\vec{B}$ error, 3 floats for free current, 1 float for charge density, 3 floats for the transverse current adjustment field, 1 float for the bound charge density, and 8 short integers for identifying the various materials at different places in the grid); interpolators are represented as a 72-byte structure (i.e., 18 floats for the various interpolator values); and current accumulators are represented as a 48-byte structure (i.e., 4 floats for the current in the x direction, 4 floats for the y direction, and 4 floats for the z direction).

Porting VPIC to use Kokkos requires that we store the data structures in Kokkos Views. Care must be taken during the conversion to Kokkos Views because the conversion approach directly affects the memory layout. Memory layout is important for performance, with different hardware performing better with different layouts and parallelism techniques (i.e., thread parallelism and vector parallelism) [30]. In general there are two major memory layouts, array of structs (AoS) and struct of arrays (SoA). The AoS stores multiple instances of a struct in an array such that they are contiguous in memory. The SoA layout, on the other hand, stores multiple instances of a struct in a single struct with arrays for each structure member.
Each member stores values in a contiguous array. Fig. 3 demonstrates the differences between the layouts using particle position coordinates as an example. The AoS layout has each coordinate (dx, dy, dz) in memory one after the other. The SoA layout has separate arrays for each dimension. The AoS layout is favored by traditional CPU hardware. The array is easily broken into chunks that each thread can operate on independently. Cache efficiency on CPUs also improves with the AoS layout due to the higher likelihood of fitting an entire struct in a single cache line. On the other hand, the SoA layout is preferred for hardware with small caches such as GPUs. On GPUs, many threads execute the same instruction. The small cache size and high memory access latency make each memory operation very costly. The SoA layout ensures that memory is accessed in a contiguous manner. Accessing contiguous blocks of memory helps the GPU use its full memory bandwidth.

Fig. 3: Comparison of Struct of Arrays (SoA) and Array of Structs (AoS) memory layouts for storing position in 3D cartesian coordinates.

Simply replacing arrays with Views may be inefficient, as this decision limits memory layout and decreases performance when, for example, running VPIC on GPUs. Swapping arrays for Views restricts the data to a CPU-favoring AoS layout. Instead of swapping, we use a 2D View to store data such that the layout is automatically chosen based on the target hardware (i.e., CPU or GPU). Fig. 2 shows the application of this principle when moving VPIC's particle data to a Kokkos View. The original VPIC macroparticle code (VPIC 1.0) in Fig. 2a stores particles in an array of particles. Fig. 2b shows how the data is stored and accessed in a 2D Kokkos View in VPIC 2.0. The first and second index determine which particle and member variable to access, respectively.

While converting structures to 2D Views, we consider two types of structures: structures with all members having the same member type and structures with multiple member types. Data with structures containing a single member type (e.g., only float) are simply moved to a 2D Kokkos View where the first dimension determines the View size while the second dimension corresponds to the different struct members. For data with structures containing multiple member types, such as the macroparticle in Fig. 2a, members of a given type are gathered in their own View and multiple Views are deployed. Fig. 2b shows how the particle cell index is separated into its own View due to having a different type than the other macroparticle items. Table 1 summarizes the types of members and Views for the four key VPIC data elements.

For convenience, we define an enum with enumerals matching the original struct fields. An enum provides a clear definition for accessing struct fields stored in Views. Fig. 2 shows the difference in code for the particles. In Fig. 2a the particle is defined using plain C++. Transforming the code with Kokkos results in the particle definition in Fig. 2b.
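The layout choice described above maps directly onto Kokkos layout types: a LayoutRight 2D View keeps all members of one particle adjacent (AoS-like), while LayoutLeft keeps each member's values adjacent across particles (SoA-like); by default Kokkos selects LayoutRight for host backends and LayoutLeft for CUDA. The snippet below is an illustrative sketch of forcing each layout explicitly, not code from VPIC.

  #include <Kokkos_Core.hpp>

  // Two explicit layouts for the same logical 2D particle array (N particles x 7 floats).
  // LayoutRight: particles(i, m) and particles(i, m+1) are adjacent -> AoS-like.
  // LayoutLeft : particles(i, m) and particles(i+1, m) are adjacent -> SoA-like.
  using AoSLike = Kokkos::View<float*[7], Kokkos::LayoutRight>;
  using SoALike = Kokkos::View<float*[7], Kokkos::LayoutLeft>;

  void layout_demo(int num_particles) {
    AoSLike aos("aos_particles", num_particles);
    SoALike soa("soa_particles", num_particles);

    Kokkos::parallel_for("touch", num_particles, KOKKOS_LAMBDA(const int i) {
      aos(i, 0) = 1.0f;  // neighboring members of particle i sit together
      soa(i, 0) = 1.0f;  // member 0 of neighboring particles sits together
    });
  }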
Data Structure    Member Types           Transformation
Particles         float and int          Multiple Views
EM Fields         float and short int    Multiple Views
Interpolators     float                  Single View
Accumulators      float                  Single View

TABLE 1: Member types and necessary code transformations for the four key data structures. Heterogeneous data requires separate Kokkos Views for each member type. Homogeneous structures only need a single Kokkos View.

Both the declaration and data access code for particle storage are similar to the original VPIC 1.0 code. This conversion approach provides a single interface to managing data across multiple platforms and allows domain experts to focus on scientific discovery.
VPIC has several functions that allow the user to collect diagnostics and adjust the simulation. Users define their desired adjustments and diagnostics in C++, along with the frequency at which VPIC should call the functions. Porting VPIC to Kokkos requires changes to the user interface for backwards compatibility and performance.

VPIC data is stored in Kokkos Views with an explicit separation between host and device data. This separation is a requirement for parallel execution and needs changes to the code to maintain backwards compatibility with existing VPIC decks. The existing interface assumes all operations are done on the old data structures. Thus data must be copied between the old structures and Kokkos Views. The explicit split between host and device data requires that both structures are synchronized before and after any user diagnostics or adjustments. As a result, data must be copied from the device to the host and copied again to the original VPIC data structures. We add a set of flags for controlling data movement for user defined functions. By default, VPIC automatically handles data movement such that legacy codes run without changes.

Shuffling data among the host, device, and original structures is costly. Users can avoid the performance impact from moving data by writing their user functions with Kokkos. Functions written using the Kokkos API can directly operate on the data regardless of where it resides. Thus, unnecessary data movement is avoided. This feature also enables users to easily parallelize their code and integrate it into VPIC seamlessly. Under these circumstances, users need only to write their functions and set the appropriate flags; a sketch of such a Kokkos-based diagnostic is shown below.
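For example, a user diagnostic written against the Kokkos API can reduce over the device-resident particle data without any host copies. This is a hedged sketch assuming the 2D particle View and enum of Fig. 2b; it is not VPIC's actual diagnostic interface.

  #include <Kokkos_Core.hpp>

  // Sum of particle weights, computed where the data lives (host or device).
  // Assumes a 2D particle View laid out as in Fig. 2b.
  float total_weight(Kokkos::View<float*[7]> particles) {
    const int n = static_cast<int>(particles.extent(0));
    float sum = 0.0f;
    Kokkos::parallel_reduce("total_weight", n,
      KOKKOS_LAMBDA(const int i, float& local) {
        local += particles(i, 6);  // index 6 = w in the particle_var enum of Fig. 2b
      }, sum);
    return sum;
  }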
We perform additional code re-engineering to further im-prove portability and performance.
There are two changes to the VPIC code that enable broader portability across platforms. First, we move away from the vectorization formulation heavily deployed in VPIC 1.0. Second, we replace the explicit data duplication used for scatter-reduction patterns with the more portable ScatterView container provided by Kokkos.
The original VPIC 1.0 code was designed to exploit vectorization with explicit SIMD code. While highly performant on CPUs, the formulation does not fit the GPU programming model and adds significant complexity. We re-engineer the code to match the Kokkos model instead; such a model generalizes well across both CPUs and GPUs. As the Kokkos project develops, we will improve the code to better expose vector parallelism.

A ScatterView is a helpful structure for concurrent reductions. Concurrent updates are implemented in two ways depending on the target hardware. On CPUs, each thread duplicates the data and operates locally before collecting updates from all threads and updating the data. This is done to avoid slow atomic operations and improve cache performance. On GPUs, caches are smaller and atomic operations are faster, so updates are done with atomics. Electrical current accumulation in VPIC uses arrays of accumulators so that each thread can collect contributions independently before updating the EM fields. A ScatterView accomplishes the same goal, so we replace the accumulator arrays with a ScatterView that updates the current directly.
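To illustrate the scatter-reduce pattern, the sketch below accumulates per-cell contributions through a Kokkos ScatterView, which transparently chooses data duplication on CPUs and atomics on GPUs. The cell-density example and names are illustrative, not VPIC's current-deposition code.

  #include <Kokkos_Core.hpp>
  #include <Kokkos_ScatterView.hpp>

  // Accumulate one contribution per particle into its cell (illustrative).
  void accumulate_density(Kokkos::View<int*> cell_of_particle,
                          Kokkos::View<float*> cell_density) {
    auto scatter = Kokkos::Experimental::create_scatter_view(cell_density);
    const int n = static_cast<int>(cell_of_particle.extent(0));

    Kokkos::parallel_for("deposit", n, KOKKOS_LAMBDA(const int p) {
      auto access = scatter.access();        // per-thread duplicate or atomic view
      access(cell_of_particle(p)) += 1.0f;   // contribution of particle p
    });

    // Combine duplicates into the target (a no-op when atomics were used underneath).
    Kokkos::Experimental::contribute(cell_density, scatter);
  }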
For improving performance on GPU systems we integrate four optimizations in VPIC: restructuring code to keep data resident in the GPU; exploiting hierarchical parallelism; changing particle sorting order; and specializing the reduction code.

Data movement in Kokkos is done explicitly. In our first optimization, we restructure VPIC to avoid data transfers due to the high cost of data movement in modern systems. Data and simulation settings are set on the host CPU during initialization by default. Simulation data is then copied to the execution device and kept there throughout the simulation. Data movement between device and host is only done for communication and user diagnostics. On CPU-only platforms, where host and device are the same, data copies are eliminated automatically. This code structure ensures high performance on GPUs while Kokkos prevents any unnecessary data movement for CPU-only systems.

Executing code in parallel introduces several tuning parameters for controlling how work is partitioned between the parallel threads of execution. The default settings determined by Kokkos achieve good performance and clean code. Still, performance can be further improved using hierarchical parallelism. Kokkos supports three levels of hierarchical parallelism: (1) thread teams, (2) individual threads, and (3) vector parallelism. In our second optimization, we apply hierarchical parallelism to the particle pusher as it is responsible for the majority of the computation. By explicitly defining the number of teams and threads per team, we exert more control over data access patterns and reuse. This optimization improves performance for the particle pusher and is easily applied to multiple target architectures. A sketch of the team-based dispatch pattern is given after this subsection's remaining optimizations.

Particle sorting is a performance optimization in VPIC that significantly speeds up the particle pusher. The particle pusher must load interpolation data based on the particle's cell index. Current accumulation is also based on the cell index. Each cell must accumulate the generated current from all particles inside the cell. Thus keeping particles sorted based on cell index improves cache reuse. This feature is applied for CPU architectures, as the GPU memory hierarchy requires a change in sorting order. The traditional sorting order results in many threads trying to access the same address, which prevents the GPU from using the full memory bandwidth. In our third optimization, we address this issue by sorting the particles such that threads access memory contiguously (e.g., thread 0 accesses entry 0 and thread 1 accesses entry 1). This implementation not only improves the memory access pattern but also reduces the number of simultaneous writes to the same memory location.

GPU memory and atomic operation performance varies greatly between different manufacturers and hardware generations. Our fourth and final optimization uses the interoperability capabilities of Kokkos to implement target specific thread team reductions on GPUs. In particular, we introduce specialized CUDA code using warp intrinsics to accelerate current accumulation and reduce the number of atomic writes. Performance improvements are hardware dependent. The specialized code provides significant performance improvements for GPUs.
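The hierarchical dispatch mentioned above can be expressed with a Kokkos TeamPolicy: a league of teams, threads within each team, and (optionally) vector lanes within each thread. The following is a minimal sketch of that pattern over blocks of particles; the block size, names, and loop body are illustrative rather than VPIC's pusher.

  #include <Kokkos_Core.hpp>

  // Process particles in blocks: one team per block, team threads per particle.
  void push_blocks(Kokkos::View<float*[7]> particles, int particles_per_team) {
    using team_policy = Kokkos::TeamPolicy<>;
    using member_type = team_policy::member_type;

    const int num_particles = static_cast<int>(particles.extent(0));
    const int num_teams = (num_particles + particles_per_team - 1) / particles_per_team;

    Kokkos::parallel_for("push_blocks",
      team_policy(num_teams, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type& team) {
        const int first = team.league_rank() * particles_per_team;
        const int remaining = num_particles - first;
        const int count = (remaining < particles_per_team) ? remaining : particles_per_team;

        // Threads of the team cooperate over the particles of this block.
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, count), [&](const int j) {
          const int p = first + j;
          particles(p, 0) += particles(p, 3);  // e.g., advance dx by ux (illustrative)
        });
      });
  }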
5 PERFORMANCE AND PORTABILITY STUDY
Our study of the newly optimized VPIC 2.0 code's performance and portability is executed on a diverse group of hardware platforms. For the performance, we consider both weak and strong scaling. For the portability, we present insights on the effectiveness of VPIC to run across multiple architectures.
We present results from five GPU-based platforms and four CPU-based platforms. The architectures are sorted chronologically based on the year of release. The platform configurations are as follows.
NVIDIA Pascal P100 (2016) experiments are run on the Kodiak computer at Los Alamos National Laboratory. This computer consists of 133 nodes, each with an Intel Xeon E5-2695 v4 CPU with 256 or 512 GB of DRAM and four NVIDIA Tesla P100 GPUs, on a Mellanox InfiniBand EDR Fat-Tree interconnect.
NVIDIA Volta V100 (2017) experiments are run on the Sierra supercomputer at Lawrence Livermore National Laboratory. Sierra is a 125 petaflop, IBM Power Systems AC922 hybrid architecture system comprised of 4320 nodes, each with two 44-core IBM Power9 processors running at 2.3–3.8 GHz with 256 GB of RAM and four NVIDIA V100 GPUs, each with 16 GB of HBM2 DRAM.
NVIDIA Turing RTX 5000 (2018) experiments use the RTX partition of the Frontera cluster at the Texas Advanced Computing Center. This partition consists of nodes containing two Intel Xeon E5-2620 v4 CPUs running at 2.10 GHz with 128 GB of RAM. Nodes are connected by Mellanox Infiniband FDR at 56 Gb/s configured in a fat-tree topology with a 4:1 oversubscription. Each node also contains four NVIDIA Quadro RTX 5000 GPUs with 16 GB of GDDR6 DRAM.
AMD Vega Radeon VII (2019) experiments are executed on a single node consisting of two AMD EPYC 7302 16-core processors with 64 GB of DRAM and four AMD Radeon VII GPUs with 16 GB of HBM2 DRAM each.
NVIDIA Ampere A100 (2020) experiments are completed on the NVIDIA DGX SuperPOD [31], Selene, at NVIDIA. Each node within this system contains dual AMD EPYC 7742 CPUs with 64 cores per socket running at 2.25 GHz and 2 TB of DRAM. The nodes are connected together using Mellanox Infiniband HDR at 200 Gb/s within a fat-tree topology that is configured for a 5:4 oversubscription. Each node contains eight NVIDIA A100 GPUs with 80 GB of HBM2 DRAM each.
Intel Knights Landing (2016) experiments use the KNL partition of the Trinitite testbed at Los Alamos National Laboratory. This partition consists of 100 nodes of a Cray XC40 system configured with a 3D Aries Dragonfly interconnect. Each node contains one 68-core Knights Landing processor with 16 GB of high-bandwidth MCDRAM and 96 GB of DRAM.
AMD Zen 2 (2019) experiments are run on the Bell cluster at Purdue University [32]. Each node contains two AMD EPYC 7662 64-core processors operating at 2.0 GHz and 256 GB of DRAM. Each node is connected with Mellanox Infiniband HDR at 100 Gb/s in a fat-tree configuration that is oversubscribed 3:1.
Intel Cascade Lake (2019) experiments are executed on the Frontera cluster at the Texas Advanced Computing Center [33]. Each node in Frontera contains two Intel Xeon Platinum 8280 processors with 28 cores operating at 2.70 GHz and 192 GB of memory. The nodes are connected with Mellanox Infiniband HDR at 100 Gb/s. The interconnect is configured in a fat-tree topology with an oversubscription factor of 11:9.
Fujitsu A64FX (2020) experiments were run on the Ookami cluster at Stony Brook University. Ookami is an HPE Apollo 80 system with 174 nodes. Each node contains a Fujitsu A64FX Arm processor [34] with 32 GB of high-bandwidth memory and a 512 GB SSD. Nodes are connected in a 2-level tree using non-blocking HDR-200.
For a test problem to study weak scaling, we adapt a large 3D simulation of stimulated Brillouin scattering (SBS) in a single laser speckle [19] relevant for inertial confinement fusion experiments. Running the full simulation to completion would be prohibitive in our scaling tests, so we modify the simulation to be much shorter while preserving the numerical properties of the full simulation. Instead of a single very long laser pulse with a slow ramp in intensity, we use two counter-propagating laser pulses that ramp up linearly in field strength (quadratically in intensity) from zero at the start to twice the intensity of [19] at the end of the simulation. The simulation is stopped just before the two pulses collide. This arrangement gets particles moving near the start of the simulation, and thus is computationally similar to the majority of a full length simulation, and less like the uninteresting uniform cold plasma of early time steps.

We simulate a fully ionized nitrogen plasma of temperatures $T_e = 600$ eV and $T_i = 150$ eV, with electron density $n_e$ a fixed fraction of the critical density $n_{cr} = \epsilon_0 m_e \omega_L^2 / e^2$ for a laser with angular frequency $\omega_L = 2\pi c/\lambda_L$. The plasma fills the simulation volume. Two identical laser pulses are incident on the plasma from the positive and negative x direction. Each pulse is a focused Gaussian beam of wavelength $\lambda_L = 527$ nm and beam waist $w_0 = 4\ \mu$m, focused at the center of the simulation volume. The pulses increase linearly in field strength over time, reaching their maximum intensity at the end of the simulation, $t_{stop}$. The grid has resolution $6000 \times 1500 \times 1500$, for a cell size very near the Debye length, and a domain decomposition of $60 \times 15 \times 15$ across 13,500 GPUs. Tests smaller than this full scale simulate a proportional fraction of the volume and use only one laser pulse. Tests using 32 or fewer GPUs have an adjusted number of particles per cell to account for boundary conditions (there is a "buffer" region at the simulation boundary where particles cannot be placed to separate particle and field boundary conditions), making the maximum number of particles a rank has at initialization equal for all tests.

We run the weak scaling on three cutting-edge, GPU-based platforms (i.e., Sierra with Power9 CPUs and V100 GPUs, Selene with EPYC 7742 CPUs and A100 GPUs, and Frontera with Xeon E5-2620 v4 CPUs and RTX 5000 GPUs). The reported runtime is measured on up to 7,200 GPUs on Sierra, 1,024 on Selene, and 256 on Frontera; the times are normalized to the 64-GPU runs. Fig. 4 demonstrates the results from the three GPU-based platforms and their different GPUs, spanning both commercial and consumer grade cards. Runtime measurements for Frontera (RTX 5000) are the average of five runs with a standard deviation of 2.268. VPIC runtime in general is very consistent. Sierra (V100) and Selene (A100) measurements are from a single run. The weak scaling results show excellent performance across the GPU architectures considered. The best scaling performance is shown by the NVIDIA Volta (V100) hardware on the Sierra supercomputer. The tests on Sierra's 7,200 GPUs show only a 3.45% slowdown versus the 64 GPU run. This is a testament to the capability of VPIC 2.0 to preserve performance while gaining in portability.
The other platforms (i.e., Selene and Frontera) scale about half as well as Sierra on the configurations that were made available to us for our testing.

Fig. 4: Weak scaling on three GPU-based platforms (i.e., Sierra with Power9 CPUs and V100 GPUs, Selene with EPYC 7742 CPUs and A100 GPUs, and Frontera with Xeon E5-2620 v4 CPUs and RTX 5000 GPUs). Runtime is normalized to the 64 GPU runs, below which boundaries cause a reduction in per GPU communication.

We demonstrate the ability of VPIC 2.0 to run on CPU-based architectures for two platforms: the Bell cluster at Purdue University using EPYC 7662 CPUs and the Frontera cluster with Xeon 8280 CPUs only (without the use of GPUs). Fig. 5 shows the weak scaling on the two platforms. The presented runtimes are the average of five runs and are normalized to the 64 CPU runs, as in the GPU-based testing. For our tests, we run multiple MPI ranks per socket, thus the per socket communication is constant for 4 or more sockets. While these performance results are notably slower than the GPU runs in absolute terms, they still add a lot of value from the perspective of our portability study. Furthermore, as accelerators become the driving architectures for large scale simulations, the community gathering around VPIC is expected to use GPU-based platforms rather than CPU-only ones.

Fig. 5: Weak scaling on two CPU-based platforms (i.e., the Bell cluster at Purdue University using EPYC 7662 CPUs and the Frontera cluster with Xeon 8280 CPUs only). Runtimes are normalized to the 64 CPU runs to match the GPU tests.
The cache increase on the A100, to 40 MB from 6 MB on the V100, has tangible implications for how these two cards are best utilized in VPIC 2.0. A significant part of the particle push is depositing the current generated from particle movement to grid points. VPIC 2.0 periodically sorts the particles by voxel such that the number of grid points being written to is minimal at any given time. Minimizing the number of grid points to write to allows the grid points to be stored in cache, resulting in faster writes. The A100 raises the intriguing possibility of storing the entire grid in cache, obviating the particle sort, and possibly producing super-linear speedup behavior in a strong scaling test as the total available cache increases.

We investigate this feature with tests similar to the weak scaling tests above, but with particle sorting turned off and only using the two larger supercomputers (i.e., Sierra with the V100 and Selene with the A100). The number of particles is kept constant while grid size varies. Fig. 6 shows the number of particle pushes per nanosecond as a function of grid size. The figure confirms that there is indeed a grid size where particle push performance increases dramatically, both for V100 and A100. Performance peaks for the V100 with approximately 4 particle pushes per nanosecond at 13,824 grid points. On the other hand, the A100 reaches 6 particle pushes per nanosecond with a grid size of 85,184. Furthermore, the performance jump occurs at roughly 6 times more grid points for A100 vs. V100, similar to the increase in cache. If this increase is from cache effects, we should see super-linear speedup in the following strong scaling tests. We suspect the performance drop for very few grid points, and thus extremely high particles per cell, on A100 is a result of colliding atomic writes in the current deposition.

Fig. 6: Number of particle pushes per nanosecond as a function of grid size on Sierra with V100 and Selene with A100. The total number of particles is held constant and all simulation time not in the particle push is excluded.

Carefully selecting the size of our grid to match the peak performance in Fig. 6, we run strong scaling tests on Sierra with V100 GPUs (i.e., from 1 to 32 GPUs), as shown in Fig. 7, and on Selene with A100 GPUs (i.e., from 8 to 512 GPUs), as shown in Fig. 8. As suggested by Fig. 6, we do indeed observe super-linear scaling for both architectures. For V100, we observe a 25x speedup for an 8x increase of GPUs, from 1 to 8. Beyond 8, the GPUs are very empty and communication overhead starts to cancel out the super-linear speedup. On A100, we see a 19x speedup for an 8x
increase of GPUs, from 8 to 64. Unlike the V100, we see near-ideal scaling all the way to 512 GPUs, the largest allocation we were able to test.

Fig. 7: Strong scaling on Sierra with V100 GPUs. As the number of GPUs increases, more of the grid is kept in cache; as a result we get super-linear speedup. Communication costs eventually catch up with more than 16 GPUs.

The super-linear speedup from storing the grid in the cache has practical applications for certain classes of problems. Simulations with fewer grid points or high particle per cell requirements (e.g., cross-beam energy transfer inertial confinement fusion simulations [20]) can cut their runtime down significantly.

Smaller, non-full-scale simulations also benefit significantly from the super-linear speedup. The faster runtime enables domain experts and developers to quickly iterate through test parameters and tune their simulations before moving on to full-scale runs.
Fig. 8: Strong scaling on Selene with A100 GPUs. The A100 GPUs maintain super-linear speedup with more GPUs than the V100 test thanks to the larger cache.

Our portability study assesses how effectively VPIC 2.0, a single source code, can run across multiple architectures. To show the achieved portability, we present the performance of each architecture listed in Sec. 5.1 for VPIC 2.0.

Fig. 9 shows a comparison of the different runtime measurements in seconds for the GPU-based platforms when running on 8 GPUs, with 1 MPI rank per GPU. The total simulation time is shown and is predominately compute time, as communication time is minimal at 8 MPI ranks. For the NVIDIA based platforms (i.e., P100, V100, RTX 5000, and A100) we see a large jump between P100 and V100, which can be explained by the increased atomic operation support in the V100. We attribute the further jump from V100 to A100 to the increase in both cache size and FP32 performance. The RTX 5000, while slower than the V100, performs admirably for a workstation card and proves itself very capable of single-precision scientific simulations. The AMD Radeon VII underperforms when compared to the RTX 5000, despite having similar theoretical compute performance and superior memory bandwidth. The difference in performance can be attributed to two factors: (1) the maturity of the Kokkos AMD backend; and (2) the fact that to date NVIDIA cards have been used as the development and testing platform for this work. The performance of VPIC 2.0 may further increase as Kokkos improves the AMD backend.

Fig. 9: Total simulation time for small scale GPU performance (8 cards). Results are ordered chronologically (based on the GPU releases).

Fig. 10 provides a comparison of CPU-based platforms, including chips from AMD, Intel, and ARM/Fujitsu. For test normalization we compare 8 sockets of each platform, which represents a regime where MPI costs are non-negligible. While it is an impressive feat that the same source code of VPIC 2.0 can run on such a diverse range of hardware, it is interesting to note that the AMD Zen 2 is the only CPU platform that achieves performance comparable to that shown in the GPU study. We attribute the lower performance on CPU-based platforms to the limited vectorization support in the current version of Kokkos (version 3.3). For further optimization, one could formulate the VPIC 2.0 code in an expression that is better able to vectorize. We do not address vectorization in this work as that introduces non-trivial code divergence, and the focus of this study is on achieving portability over platform specific performance.
It is our belief that for cutting-edge performance on a specific target platform, some level of specialization will always be required. In many cases the portability abstraction layer may be able to achieve this, but other times the burden will fall to the programmer.

Fig. 10: Compute and communication time for small scale CPU performance (8 sockets). Results are ordered chronologically (based on the CPU releases).

With a portability study such as ours, which relies on an underlying toolkit to facilitate the code's operation on a variety of platforms, it is somewhat inevitable that the performance results reflect the priorities and maturity of the framework, as well as the facets of the target hardware. It is clear from these results that the code, as written in its present state, maps best to GPU platforms. Proposals have been made to extend Kokkos SIMD support, and we look forward to being able to leverage these improvements for VPIC 2.0.
6 SCIENTIFIC IMPACTS OF VPIC 2.0
The ability for any code to readily deploy itself on a new machine, with limited manual refactoring effort, is a huge win that actively drives down the time to scientific discovery. As the heterogeneity of modern platforms continues to increase to meet the demands of exascale, effective use of these platforms translates directly to codes such as VPIC being able to run simulations at unprecedented scales, and opens the door to performing analysis and investigation which was previously intractable. Sec. 5.2 demonstrates that VPIC 2.0, by following the methodology laid out in this paper, is well prepared for the exascale challenge.

While enabling portability, VPIC 2.0 is also able to assure performance. The performance-portability trade-off allows simulations to complete quickly and without multiple restarts by using accelerators and emerging architectures; such platforms are not optimized for or supported by the original VPIC code (VPIC 1.0). The reduction in execution costs when using accelerators, coupled with the portability presented in this paper, facilitates the following changes in VPIC simulations:

• Simulating more timesteps in fewer core-hours enables simulations which advance further in time, allowing us to set our sights towards performing longer fully resolved electromagnetic simulations than were previously possible.
• Advanced and expensive diagnostics can now be performed within the timestep, which allows for more accurate analysis. The extensive analysis enables new fundamental insights into particle acceleration modeling at unprecedented scales [35].
• Previous simulations, which took considerable resources, can now be run multiple times to analyze stochastic variation in, for example, short-pulse laser experiments, which in turn can generate sufficient data for machine learning analysis.

The newly released VPIC code opens up new avenues of science, carrying the domain of plasma physics research to exascale level simulations.
7 RELATED WORK
While the computational characteristics of PIC codes are fairly well understood, much of the community's optimization and development effort to date has been focused around platform and hardware specific optimization [36], [37]. As heterogeneity within leading supercomputers increases [38], it is no longer viable to periodically optimize for a single target platform, making portability more important than ever. Within the PIC community, portability is under-explored, and our work seeks to address this weakness. Example projects such as PIConGPU [39], [40] and the AMITIS project [41] demonstrate the applicability of the PIC algorithm to GPU architectures, but do not make a concerted effort to address the portability issues presented by the increasing diversity of modern HPC platforms.

We leverage the Kokkos framework to address many aspects of the performance-portability trade-off. While Kokkos represents a fairly modern approach to code portability, it has already established itself as a key technology in the race for performance portability, and has successfully helped a variety of applications run on multiple large-scale platforms. These target applications span science domains including finite element analysis [42], computational fluid dynamics [43], and even deep neural networks [44]. Of particular note is previous work by the molecular dynamics community to leverage Kokkos for performance portability [45]–[47]. Molecular dynamics is an important workload for Sandia National Laboratory, the creators of Kokkos, and is also the best explored workload that shares computational similarities with the PIC method. Both methods are particle dominated, and rely on the fast processing of large numbers of particles to simulate the kinetic level behavior of a large system. It is important to note, however, that Kokkos is not the only project which seeks to address issues of performance-portable code development. These tools can be split into two types: high-level frameworks (such as Kokkos, RAJA [6], or HPX [48]) and lower-level tools for which code can be generated (such as OpenMP, SYCL, or HIP). A comparison of these technologies has already been the subject of previous studies [49], [50]. Our selection of Kokkos to support VPIC's portability is based on the fact that Kokkos allows for high-level programmability while still offering good low-level performance.
8 CONCLUSION
In this work we present a performance-portable version of VPIC, a kinetic Particle-in-Cell code that has been used to do some of the world's largest plasma physics simulations. This version of VPIC (VPIC 2.0) relies on the Kokkos framework for portability. We capture the VPIC 2.0 performance on nine different platforms (including three of the top 10 TOP500 systems) and scale our tests on up to 7,200 GPUs. We demonstrate super-linear strong scaling on the NVIDIA A100 architecture and provide insights on the scientific impacts of VPIC 2.0 on scientific discovery for the next generation of exascale supercomputers.
ACKNOWLEDGMENTS
Work performed under the auspices of the U.S. DOE byTriad National Security, LLC, and Los Alamos NationalLaboratory(LANL). This work was supported by the LANLASC and Experimental Sciences programs. Approved forpublic release: LA-UR-21-21453. The UTK authors acknowl-edge the support of LANL under contract R EFERENCES [1] K. J. Bowers, B. J. Albright, B. Bergen, L. Yin, K. J. Barker, andD. J. Kerbyson, “0.374 pflop/s trillion-particle kinetic modeling oflaser plasma interaction on roadrunner,” in
SC’08: Proceedings ofthe 2008 ACM/IEEE conference on Supercomputing . IEEE, 2008, pp.1–11.[2] K. J. Bowers, B. Albright, L. Yin, B. Bergen, and T. Kwan, “Ultra-high performance three-dimensional electromagnetic relativistickinetic plasma simulation,”
Physics of Plasmas , vol. 15, no. 5, p.055703, 2008.[3] K. J. Bowers, B. J. Albright, L. Yin, W. Daughton, V. Roytershteyn,B. Bergen, and T. Kwan, “Advances in petascale kinetic plasmasimulation with vpic and roadrunner,” in
Journal of Physics: Con-ference Series , vol. 180, no. 1. IOP Publishing, 2009, p. 012055.[4] D. J. Stark, L. Yin, B. J. Albright, W. Nystrom, and R. Bird, “Adetailed examination of laser-ion acceleration mechanisms in therelativistic transparency regime using tracers,”
Physics of Plasmas ,vol. 25, no. 4, p. 043114, 2018.[5] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos:Enabling manycore performance portability through polymorphicmemory access patterns,”
Journal of Parallel and DistributedComputing
Parallel Computing , vol. 34, no.4-5, pp. 261–277, 2008.[8] C.-Y. Cher and M. Gschwind, “Cell gc: using the cell synergisticprocessor as a garbage collection coprocessor,” in
Proceedings ofthe fourth ACM SIGPLAN/SIGOPS international conference on Virtualexecution environments , 2008, pp. 141–150.[9] I. A. Surmin, S. I. Bastrakov, A. A. Gonoskov, E. S. Efimenko, andI. B. Meyerov, “Particle-in-cell plasma simulation using intel xeonphi coprocessors,”
Numerical methods and programming , vol. 15, pp.530–536, 2014.[10] V. V. Kindratenko, J. J. Enos, G. Shi, M. T. Showerman, G. W.Arnold, J. E. Stone, J. C. Phillips, and W.-m. Hwu, “Gpu clus-ters for high-performance computing,” in . IEEE, 2009, pp.1–8.[11] G. Chrysos, “Intel xeon phi coprocessor-the architecture,” Intel,Tech. Rep., 2014.[12] J. Lin, Z. Xu, L. Cai, A. Nukada, and S. Matsuoka, “Evaluatingthe sw26010 many-core processor with a micro-benchmark suitefor performance optimizations,”
Parallel Computing
SC18: International Conference for High Performance Computing,Networking, Storage and Analysis . IEEE, 2018, pp. 685–696.[14] C. R. Trott, S. J. Plimpton, and A. P. Thompson, “Solving theperformance portability issue with kokkos.” Sandia NationalLab.(SNL-NM), Albuquerque, NM (United States), Tech. Rep.,2017.[15] L. Yin, B. Albright, B. Hegelich, K. J. Bowers, K. Flippo, T. Kwan,and J. Fern´andez, “Monoenergetic and gev ion acceleration fromthe laser breakout afterburner using ultrathin targets,”
[16] L. Yin, B. Albright, K. Bowers, D. Jung, J. Fernández, and B. Hegelich, "Three-dimensional dynamics of breakout afterburner ion acceleration using high-contrast short-pulse laser and nanoscale targets," Physical Review Letters, vol. 107, no. 4, p. 045003, 2011.
[17] A. Levinson and B. Cerutti, "Particle-in-cell simulations of pair discharges in a starved magnetosphere of a kerr black hole," Astronomy & Astrophysics, vol. 616, p. A184, 2018.
[18] M. Hoshino, "Angular momentum transport and particle acceleration during magnetorotational instability in a kinetic accretion disk," Physical Review Letters, vol. 114, no. 6, p. 061101, 2015.
[19] B. Albright, L. Yin, K. Bowers, and B. Bergen, "Multi-dimensional dynamics of stimulated brillouin scattering in a laser speckle: Ion acoustic wave bowing, breakup, and laser-seeded two-ion-wave decay," Physics of Plasmas, vol. 23, no. 3, p. 032703, 2016.
[20] L. Yin, B. J. Albright, D. J. Stark, W. D. Nystrom, R. F. Bird, and K. J. Bowers, "Saturation of cross-beam energy transfer for multi-speckled laser beams involving both ion and electron dynamics," Physics of Plasmas, vol. 26, no. 8, p. 082708, 2019.
[21] S. Du, F. Guo, G. P. Zank, X. Li, and A. Stanier, "Plasma energization in colliding magnetic flux ropes," The Astrophysical Journal, vol. 867, no. 1, p. 16, 2018.
[22] M. Yamada, J. Yoo, J. Jara-Almonte, H. Ji, R. M. Kulsrud, and C. E. Myers, "Conversion of magnetic energy in the magnetic reconnection layer of a laboratory plasma," Nature Communications, vol. 5, no. 1, pp. 1–8, 2014.
[23] L.-J. Chen, M. Hesse, S. Wang, N. Bessho, and W. Daughton, "Electron energization and structure of the diffusion region during asymmetric reconnection," Geophysical Research Letters, vol. 43, no. 6, pp. 2405–2412, 2016.
[24] J. Villasenor and O. Buneman, "Rigorous charge conservation for local electromagnetic field solvers," Computer Physics Communications, vol. 69, no. 2-3, pp. 306–316, 1992.
[25] T. Z. Esirkepov, "Exact charge conservation scheme for particle-in-cell simulation with an arbitrary form-factor," Computer Physics Communications, vol. 135, no. 2, pp. 144–153, 2001.
[26] A. Taflove and S. C. Hagness, Computational Electrodynamics. Artech House Publishers, 2000.
[27] K. S. Yee et al., "Numerical solution of initial boundary value problems involving maxwell's equations in isotropic media," IEEE Trans. Antennas Propag., vol. 14, no. 3, pp. 302–307, 1966.
[28] C. K. Birdsall and A. B. Langdon, Plasma Physics via Computer Simulation. CRC Press, 2004.
[29] J. D. Blahovec, L. A. Bowers, J. W. Luginsland, G. E. Sasser, and J. J. Watrous, "3-d icepic simulations of the relativistic klystron oscillator," IEEE Transactions on Plasma Science, vol. 28, no. 3, pp. 821–829, 2000.
[30] R. Strzodka, "Abstraction for aos and soa layout in c++," in GPU Computing Gems Jade Edition. Elsevier, 2012, pp. 429–441.
[31] C. Tierney, P. Savla, S. Ellis, and R. Sohigian, "Nvidia dgx superpod: Scalable infrastructure for ai leadership," NVIDIA, Tech. Rep. RA-09950-001, 11 2020.
[32] G. McCartney, T. Hacker, and B. Yang, "Empowering Faculty: A Campus Cyberinfrastructure Strategy for Research Communities," Educause Review, 2014. [Online]. Available: https://er.educause.edu/articles/2014/7/empowering-faculty-a-campus-cyberinfrastructure-strategy-for-research-communities
[33] D. Stanzione, J. West, R. T. Evans, T. Minyard, O. Ghattas, and D. K. Panda, "Frontera: The evolution of leadership computing at the national science foundation," in Practice and Experience in Advanced Research Computing, ser. PEARC '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 106–111. [Online]. Available: https://doi.org/10.1145/3311790.3396656
[34] M. Sato, Y. Ishikawa, H. Tomita, Y. Kodama, T. Odajima, M. Tsuji, H. Yashiro, M. Aoki, N. Shida, I. Miyoshi, K. Hirai, A. Furuya, A. Asato, K. Morita, and T. Shimizu, "Co-design for a64fx many-core processor and "fugaku"," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '20. IEEE Press, 2020.
[35] F. Guo, X. Li, W. Daughton, P. Kilian, H. Li, Y.-H. Liu, W. Yan, and D. Ma, "Determining the dominant acceleration mechanism during relativistic magnetic reconnection in large-scale systems," The Astrophysical Journal Letters, vol. 879, no. 2, p. L23, 2019.
[36] I. Surmin, S. Bastrakov, Z. Matveev, E. Efimenko, A. Gonoskov, and I. Meyerov, "Co-design of a particle-in-cell plasma simulation code for intel xeon phi: a first look at knights landing," in International Conference on Algorithms and Architectures for Parallel Processing. Springer, 2016, pp. 319–329.
[37] H. Nakashima, "Manycore challenge in particle-in-cell simulation: how to exploit 1 tflops peak performance for simulation codes with irregular computation," Computers & Electrical Engineering, vol. 46, pp. 81–94, 2015.
[38] A. Khan, H. Sim, S. S. Vazhkudai, A. R. Butt, and Y. Kim, "An analysis of system balance and architectural trends based on top500 supercomputers," in The International Conference on High Performance Computing in Asia-Pacific Region, 2021, pp. 11–22.
[39] H. Burau, R. Widera, W. Hönig, G. Juckeland, A. Debus, T. Kluge, U. Schramm, T. E. Cowan, R. Sauerbrey, and M. Bussmann, "Picongpu: a fully relativistic particle-in-cell code for a gpu cluster," IEEE Transactions on Plasma Science, vol. 38, no. 10, pp. 2831–2839, 2010.
[40] E. Zenker, R. Widera, A. Huebl, G. Juckeland, A. Knüpfer, W. E. Nagel, and M. Bussmann, "Performance-portable many-core plasma simulations: Porting picongpu to openpower and beyond," in International Conference on High Performance Computing. Springer, 2016, pp. 293–301.
[41] S. Fatemi, A. R. Poppe, G. T. Delory, and W. M. Farrell, "Amitis: A 3d gpu-based hybrid-pic model for space and plasma physics," in Journal of Physics: Conference Series, vol. 837, no. 1. IOP Publishing, 2017, p. 012017.
[42] I. Demeshko, J. Watkins, I. K. Tezaur, O. Guba, W. F. Spotz, A. G. Salinger, R. P. Pawlowski, and M. A. Heroux, "Toward performance portability of the albany finite element analysis code using the kokkos library," The International Journal of High Performance Computing Applications, vol. 33, no. 2, pp. 332–352, 2019.
[43] J. Eichstädt, M. Vymazal, D. Moxey, and J. Peiró, "A comparison of the shared-memory parallel programming models openmp, openacc and kokkos in the context of implicit solvers for high-order fem," Computer Physics Communications, vol. 255, p. 107245, 2020.
[44] J. A. Ellis and S. Rajamanickam, "Scalable inference for sparse deep neural networks using kokkos kernels." IEEE, 2019, pp. 1–7.
[45] R. Gayatri, S. Moore, E. Weinberg, N. Lubbers, S. Anderson, J. Deslippe, D. Perez, and A. P. Thompson, "Rapid exploration of optimization strategies on advanced architectures using testsnap and lammps," arXiv preprint arXiv:2011.12875, 2020.
[46] S. J. Pennycook, J. D. Sewall, and J. R. Hammond, "Evaluating the impact of proposed openmp 5.0 features on performance, portability and productivity." IEEE, 2018, pp. 37–46.
[47] G. Womeldorff, J. Payne, and B. Bergen, "Taking lessons learned from a proxy application to a full application for snap and partisn," Procedia Computer Science, vol. 108, pp. 555–565, 2017.
[48] H. Kaiser, P. Diehl, A. S. Lemoine, B. A. Lelbach, P. Amini, A. Berge, J. Biddiscombe, S. R. Brandt, N. Gupta, T. Heller et al., "Hpx - the c++ standard library for parallelism and concurrency," Journal of Open Source Software, vol. 5, no. 53, p. 2352, 2020.
[49] M. Martineau, S. McIntosh-Smith, M. Boulton, W. Gaudin, and D. Beckingsale, "A performance evaluation of kokkos & raja using the tealeaf mini-app," in The International Conference for High Performance Computing, Networking, Storage and Analysis, SC15, 2015.
[50] R. O. Kirk, G. R. Mudalige, I. Z. Reguly, S. A. Wright, M. J. Martineau, and S. A. Jarvis, "Achieving performance portability for a heat conduction solver mini-application on modern multi-core systems." IEEE, 2017, pp. 834–841.

AUTHOR BIOGRAPHIES
Robert Bird is a Computational Scientist at Los Alamos National Laboratory, in the Applied Computer Science group. His research interests are performance portability and low-level code optimization. He is the computer science lead for the VPIC project. Dr. Bird received his Ph.D. in Computer Science from the University of Warwick, England.
Nigel Tan is a Ph.D. student in Computer Science under Dr. Michela Taufer at the University of Tennessee, Knoxville. He earned his B.S. in both Computer Science and Applied Math at the University of California, Merced before earning an M.S. in Computational and Applied Math at Rice University. Nigel's research interests lie in high performance computing with an emphasis on performance portability and optimization across multiple architectures.
Scott V. Luedtke is a Postdoctoral Research Associate at Los Alamos National Laboratory in the Plasma Theory and Applications group. His research interests include high energy density physics, high-intensity short-pulse laser-plasma interactions, and large-scale physical simulation. Dr. Luedtke received his Ph.D. in physics from the University of Texas at Austin in 2020.
Stephen Lien Harrell is an Engineering Scientist at the Texas Advanced Computing Center in the HPC Performance and Architectures group. His research interests include performance portability, performance modeling, benchmarking, and HPC metric capture. Before his current appointment, Stephen worked as an HPC System Administrator and HPC Support Staff for twelve years. He received his bachelor's degree in Computer Science from Purdue University.
Michela Taufer (Senior Member, IEEE) is an ACM Distinguished Scientist and holds the Jack Dongarra professorship in high performance computing with the Department of Electrical Engineering and Computer Science, University of Tennessee Knoxville. Her research interests include high-performance computing, volunteer computing, scientific applications, scheduling and reproducibility challenges, and in situ data analytics. Dr. Taufer received her Ph.D. in Computer Science from the Swiss Federal Institute of Technology (ETH) in 2002.