Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond
Erik Zenker, René Widera, Axel Huebl, Guido Juckeland, Andreas Knüpfer, Wolfgang E. Nagel, Michael Bussmann
Helmholtz-Zentrum Dresden – Rossendorf, Dresden, Germany
Technische Universität Dresden, Dresden, Germany
{e.zenker, r.widera, a.huebl, g.juckeland, m.bussmann}@hzdr.de
{andreas.knuepfer, wolfgang.nagel}@tu-dresden.de

Abstract.
With the appearance of the heterogeneous OpenPower platform, many-core accelerator devices have been coupled with Power host processors for the first time. Towards utilizing their full potential, it is worth investigating performance-portable algorithms that allow choosing the best-fitting hardware for each domain-specific compute task. Suiting even the high level of parallelism on modern GPGPUs, our presented approach relies heavily on abstract meta-programming techniques, which are essential to focus on fine-grained tuning rather than code porting. With this in mind, the CUDA-based open-source plasma simulation code PIConGPU is currently being abstracted to support the heterogeneous OpenPower platform using our fast porting interface cupla, which wraps the abstract parallel C++11 kernel acceleration library Alpaka. We demonstrate how PIConGPU can benefit from the tunable kernel execution strategies of the Alpaka library, achieving portability and performance with single-source kernels on conventional CPUs, Power8 CPUs, and NVIDIA GPUs.
Keywords:
OpenPower, heterogeneous computing, HPC, C++11, CUDA, OpenMP, particle-in-cell, platform portability, performance portability
* This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 654220.

PIConGPU [2] is a fully-relativistic, multi-GPU, 3D3V particle-in-cell (PIC) code. As such, it models the mutual interaction between electromagnetic fields and charged particles, including effects of retardation in special relativity (SRT) and the collective motion of collisionless plasmas, by solving Maxwell's equations self-consistently for charged particles and electromagnetic fields. Besides satisfying the demand for large scales and high resolutions by computing the whole PIC cycle on GPUs, simulations of laser-ion acceleration from overdense targets [19] introduce further complexity into the plasma dynamics through collisional excitation and ionization processes. As the free electron density resulting from ionization determines intrinsic observables such as the plasma wavelength, the underlying quantum processes need to be modeled, which is not covered by the plain electrodynamics of PIC. Our approach to enhance the PIC algorithm is therefore to add a Monte Carlo step to the simulation with 0-D atomic physics from SCFLY [4]. This method requires calculating the transition rate matrix, which represents the likelihood that the atomic configuration of each ion changes from one time step to the next. Each of the quantum processes has its own individual model, calibrated with experimental and theoretical estimates. Even when reducing the number of possible transitions by using an effective number of states and removing physically forbidden and very unlikely transitions, the total number of transitions can grow quadratically with the number of considered configurations. Combined with the dependency of the transition matrix elements on local quantities, such as the energy distribution of neighboring electrons and photons of each individual ion in the plasma, the required amount of memory can easily grow to several dozen gigabytes for a non-equilibrium system.

None of the accelerators that are currently available or announced for the near future fulfill these memory requirements. However, the accelerator's host system provides access to fast and large main memory and file systems. The host's CPUs are therefore used as a first computing stage that reduces the full transition matrix to smaller lookup tables; CPUs excel at this task, since they typically provide better performance on trigonometric functions and implicit solvers. Accordingly, only relevant data needs to be streamed to the GPU.

The OpenPower platform couples various advanced hardware technologies in the same system [11], such as Power CPUs, NVIDIA GPUs, and a fast CPU–GPU interconnect [7]. To fully utilize the compute power of this platform, it is currently necessary to use several programming models, such as CUDA for the GPU and OpenMP for the CPU. However, this style of programming makes the code difficult to maintain and requires additional work to switch algorithms between GPU and CPU implementations. A uniform programming model allows the kernel execution hardware to be chosen selectively, depending on the algorithmic requirements. These requirements follow from the models of the individual physical processes: some are memory bound, some compute bound, and the user decides, based on domain knowledge and the relevance of the process, on which hardware each of them should execute.

Currently, widely used uniform parallel programming models such as OpenCL [17] do not fulfill all of our requirements for a sustainable, open, maintainable, testable, optimizable, and single-source programming model. Loop- and container-based approaches such as RAJA [9], Kokkos [5], and OpenMP 4.0 [15] would require a complete redesign of the CUDA-based PIConGPU code. Alpaka [20], in contrast, provides an interface for parallel kernel acceleration that enables the programmer to compile single-source C++ kernels for various architectures while fulfilling all of the requirements mentioned above.
As a first step towards selective kernel acceleration on the OpenPower platform, PIConGPU has been ported to Alpaka with the CUDA-like interface cupla [16], which currently allows execution either on the CPU or on the GPU.

This paper is structured as follows. In Section 2, we give a brief overview of PIConGPU, Alpaka, and cupla. In Section 3, we report our experiences in porting PIConGPU with cupla from CUDA to Alpaka. Finally, the ported prototype is evaluated on various architectures in Section 4.
PIConGPU is a multi-GPU particle-in-cell (PIC) code for three-dimensional field–particle interaction with high spatial resolution. The code decomposes its global simulation domain into a grid of cells. Cells are grouped into a cuboid volume called a super cell, and multiple of these super cells are in turn grouped into a cuboid volume that defines the local simulation domain of a single GPU.

Additionally, there is a second, spatially continuous domain for finite-size macro particles such as ions and electrons. They are able to move through the cells and interact with them, making PIC a particle-mesh algorithm [3]. Macro particles are grouped in frames, where each frame contains as many macro particles as there are cells in a super cell. Frames are stored in a doubly linked list and correspond to a particular super cell.

Most of the operations on the cells are local stencils that involve only a few neighboring cells and are therefore well suited to the CUDA programming model of a multidimensional grid. PIConGPU is mapped to this model as follows: the local simulation domain is mapped to the grid of a single GPU. A super cell is mapped to a block that contains as many threads as there are cells, which in our simulations usually amounts to 256 cells. A thread calculates the field of one cell and its share of the particles of its super cell.
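To make this mapping concrete, the following sketch shows how a local simulation domain could be translated into a CUDA launch configuration. This is our own illustration, not PIConGPU source code: the super cell shape (8 x 8 x 4 = 256 cells) and all identifiers are hypothetical, chosen only to match the block size of 256 threads mentioned above.

// Illustrative sketch (not PIConGPU code): mapping the local simulation
// domain to a CUDA grid, with one block per super cell and one thread per cell.
#include <cuda_runtime.h>

// Hypothetical super cell shape with 8 * 8 * 4 = 256 cells.
constexpr unsigned superCellX = 8;
constexpr unsigned superCellY = 8;
constexpr unsigned superCellZ = 4;

// Derive grid and block dimensions from the cell count of the local
// (per-GPU) simulation domain; the domain is assumed to be an integer
// multiple of the super cell shape.
inline void makeWorkDivision(unsigned cellsX, unsigned cellsY, unsigned cellsZ,
                             dim3& grid, dim3& block)
{
    block = dim3(superCellX, superCellY, superCellZ);  // 256 threads: one per cell
    grid  = dim3(cellsX / superCellX,
                 cellsY / superCellY,
                 cellsZ / superCellZ);                 // one block per super cell
}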
Alpaka provides a uniform, abstract C++ interface to a range of parallel programming models. It can express multiple levels of parallelism and allows for generic programming of kernels either for a single accelerator device or for a single address space with multiple CPU cores. The Alpaka abstraction of parallelization is influenced by and based on the groundbreaking CUDA abstraction of a multidimensional grid of blocks of threads. The four main execution hierarchies introduced by Alpaka are called the grid, block, thread, and element levels. The element level denotes an amount of work a thread needs to process sequentially. Together, these levels describe an index space which is called the work division. Other programming models name these levels differently, e.g. OpenCL has work-groups of work-items, OpenMP has teams of threads, and OpenACC has gangs, workers, and vectors.

Separating the parallelization abstraction from specific hardware capabilities allows for an explicit mapping of these levels to hardware. The current implementation includes mappings to programming models, called back-ends, such as OpenMP, CUDA, C++ threads, and Boost fibers [14]. Nevertheless, mapping implementations are not limited to these choices and can be extended or adapted for application-specific optimizations. Which back-end and work division to utilize is parameterized per kernel within the user code.

A fast approach to porting CUDA code to Alpaka is provided by the CUDA-like Alpaka interface cupla. Cupla leaves most CUDA API calls unchanged, yet performs Alpaka calls in the background. Thus, cupla provides a simple and fast porting path by reducing the number of lines of the original CUDA code a programmer needs to modify.

In this section we discuss the steps necessary to port the CUDA-accelerated code of PIConGPU from GPU to CPU hardware. Our approach is to replace CUDA by the CUDA-like interface cupla. Afterwards, we can utilize Alpaka's CUDA and OpenMP 2.0 back-ends to execute our kernels on both GPUs and CPUs. Cupla leaves most parts of the CUDA code unchanged, such as memory allocations, memory copies, stream handling, device handling, and index queries. The programmer is still required to handle three porting steps. Firstly, the cuda_runtime.hpp include has to be replaced by cuda_to_cupla.hpp and all .cu files renamed to .cpp. Secondly, the __host__, __device__, and __global__ keywords need to be replaced by equivalent cupla macros, and CUDA global functions need to be rewritten as parenthesis operators of C++ functors. The accelerator object of the accelerator template type has to be passed to these operators and to the underlying device functions. Finally, each shared memory allocation has to be replaced by an equivalent cupla macro. Figure 1 shows equivalent CUDA and cupla code snippets of a kernel function initializing an array with a constant value. In contrast to the CUDA kernel, each thread of the cupla kernel loops over the x dimension of the element level.

The native PIC code consists of about forty thousand lines of code, a mixture of C++11 and platform-dependent CUDA code. Applying the cupla porting steps described above, R. Widera worked for about two days and touched most of our nine hundred device functions and forty kernels, amounting to two thousand lines of code, to provide the first Alpaka-based prototype. Although this prototype did not yet utilize the element level, it was already executable both on a Power8 device using the OpenMP 2.0 back-end and on an NVIDIA device using the CUDA back-end.
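For completeness, the sketch below shows how such a functor kernel could be launched from host code. The CUPLA_KERNEL launch macro and the unchanged CUDA-style memory calls follow the usage described in the cupla repository [16], but the kernel name, the sizes, and the surrounding program are our own hypothetical example, not code from PIConGPU.

// Hypothetical launch example for a cupla functor kernel in the style of
// Fig. 1; the kernel name and all sizes are illustrative only.
#include <cuda_to_cupla.hpp>

struct initKernel                      // element-aware functor as in Fig. 1
{
    template<typename Acc>
    ALPAKA_FN_ACC void operator()(Acc const& acc, int* data) const
    {
        int id = blockDim.x * blockIdx.x * elemDim.x + threadIdx.x * elemDim.x;
        for(int elem = 0; elem < elemDim.x; ++elem)
            data[id + elem] = 42;
    }
};

int main()
{
    int* data = nullptr;
    int const n = 100 * 256;                       // 100 blocks of 256 cells
    cudaMalloc((void**)&data, n * sizeof(int));    // CUDA-style calls stay unchanged

    // grid of 100 blocks, 256 threads per block, no shared memory, default stream
    CUPLA_KERNEL(initKernel)(dim3(100), dim3(256), 0, 0)(data);

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}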
The number of threads in a block was left unchanged. Accordingly, the domain of a super cell is processed by a block consisting of 256 threads. This block size leads to inefficient communication between threads on the Power8 when the element level is omitted, resulting in more frequent cache misses and a decrease in performance. Integrating the element level, in contrast, enables a work division of blocks with a single thread and multiple elements to calculate the entire domain of a super cell. This provides a more efficient mapping of Alpaka threads to hardware threads and, therefore, improved vectorization and cache utilization by the compiler. Integrating the element level required looping over the fixed-size element index space for each sequential kernel part. These sequential parts were wrapped in lambda functions. Furthermore, single-element variables were expanded to multidimensional fixed-size arrays. This change, touching three thousand lines of code, took our developer about ten days.

To sum up, our developer modified about five thousand lines of code within two weeks, after which the entire forty-thousand-line PIConGPU code could be compiled and run efficiently on CPU and GPU devices. It was not necessary to modify the core data structures or algorithms of PIConGPU; the element level was added to enable a single thread to process the domain of a super cell. In the following section we evaluate the performance of our Alpaka-based PIC simulation on both kinds of architecture.

// CUDA kernel
__global__ void kernel(int* data)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    data[id] = 42;
}

// cupla kernel
struct kernel
{
    template<typename Acc>
    ALPAKA_FN_ACC void operator()(Acc const& acc, int* data) const
    {
        int id = blockDim.x * blockIdx.x * elemDim.x + threadIdx.x * elemDim.x;
        for(int elem = 0; elem < elemDim.x; ++elem)
            data[id + elem] = 42;
    }
};

Fig. 1. CUDA and cupla kernels that initialize each element of the input array data with the value 42. The cupla kernel at the bottom was created by wrapping the CUDA kernel at the top in a C++ functor. Each thread of the cupla kernel processes multiple elements by looping over the dimensions of the additional element level. In the cupla kernel, blockDim, blockIdx, threadIdx, and elemDim are preprocessor macros accessing the acc variable.

Table 1. Compute nodes used for the evaluation (core counts in parentheses are hardware threads).
Vendor                | AMD            | Intel           | IBM             | NVIDIA
Architecture          | Interlagos [1] | Haswell [10]    | Power8 [6]      | Kepler [12]
Model                 | Opteron 6276   | Xeon E5-2698 v3 | Power8 8247-42L | K80 GK210
Used devices per node |                |                 |                 |
Cores per device      | 16             | 16 (32)         | 10 (80)         | 2496
Base clock frequency  |                |                 |                 |
Release date          | Q4/2011        | Q3/2014         | Q1/2014         | Q4/2014
Peak performance (SP) | 960 GFLOPS     | 2354 GFLOPS     | 1120 GFLOPS     | 4350 GFLOPS
Peak performance (DP) | 480 GFLOPS     | 1177 GFLOPS     | 560 GFLOPS      | 1450 GFLOPS
This section provides the evaluation of the PIConGPU code [18] ported to the various compute architectures listed in Table 1. We measured the runtime and performance of the memory-bound PIC algorithm as implemented in PIConGPU for a simulation of the Kelvin-Helmholtz instability [3] over one thousand time steps, in double and single precision, and compared these results across the architectures. The simulation was parameterized with the Boris pusher, the Esirkepov current solver, the Yee field solver, trilinear interpolation in field gathering, three spatial dimensions (3D3V), 128 cells in each dimension, electron and ion species with sixteen particles per cell each, and quadratic-spline interpolation (TSC) [8]. On all CPU devices the OpenMP 2.0 back-end was used, with a block consisting of a single thread with 256 elements. On NVIDIA GPUs the CUDA back-end was used, with a block consisting of 256 threads with a single element. All GPU evaluations were compiled with nvcc using the flags --use_fast_math --ftz=false -g0 -O3 -m64; the CPU evaluations used the host compiler flags -g0 -O3 -m64 -funroll-loops -march=native --param max-unroll-times=512 -ffast-math. In the Alpaka version, atomicAdd is implemented using atomicCAS instead of the slower atomicExch used by the native PIConGPU implementation; nevertheless, this small optimization could easily have been introduced into the native PIConGPU code to achieve the same runtime results.

Fig. 2. Runtime and floating point efficiency of the PIConGPU Kelvin-Helmholtz instability simulation, measured in single and double precision on the architectures of Table 1 as an example of a memory-bound PIC code. (The figure shows two bar charts, runtime in seconds and floating point efficiency in percent, for the native CUDA version on Kepler and the Alpaka version on Kepler, Interlagos, Haswell, and Power8.)

According to these measurements, Alpaka keeps its promise of zero-overhead abstraction on the same architecture even for rather complex applications such as PIConGPU. The runtimes of the GPU and CPU implementations differ by one order of magnitude in single precision. However, the results need to be evaluated in relation to the theoretical peak performance of the particular architecture; this metric is denoted as floating point efficiency in Figure 2. In terms of floating point efficiency, CPU and GPU differ by a factor of three to four in single precision and by a factor of two in double precision. Thus, Alpaka provides not just portability between GPU and CPU, but decent performance on both. All evaluated CPU architectures show similar runtime and efficiency characteristics. The Intel architecture offers the lowest runtime and the highest theoretical peak performance of all evaluated CPU devices; however, there is still potential to increase performance, as it reaches only five percent floating point efficiency in double precision. The IBM and AMD architectures fare slightly better, with about eight percent double precision efficiency, but there remains considerable headroom compared to the GPU efficiency. By refining the Alpaka back-ends and tuning the work division, this potential can be exploited to increase the performance of the CPU architectures even further.
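For reference, the floating point efficiency used above can be read as the ratio of the sustained floating point rate to the theoretical peak of the device. The formulation and symbols below are ours, with N_FLOP denoting the number of floating point operations executed during the measured runtime t_run and P_peak the peak rate from Table 1:

\eta_{\mathrm{FP}} = \frac{P_{\mathrm{sustained}}}{P_{\mathrm{peak}}} = \frac{N_{\mathrm{FLOP}} / t_{\mathrm{run}}}{P_{\mathrm{peak}}}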
We have presented the current progress in porting the particle-in-cell simulation PIConGPU to the OpenPower platform by utilizing the CUDA-like Alpaka interface cupla. The core routines of the forty thousand lines of mixed C++ and CUDA code have been ported from CUDA to Alpaka within two weeks. Through this abstraction, the ported PIConGPU implementation is executable on AMD, IBM, Intel, and NVIDIA architectures. The code was not just ported, but has been moved to a generic single-source multi-platform programming model; thus, PIConGPU never needs to be ported again.

The native CUDA version and the Alpaka version show no significant differences in runtime or performance on NVIDIA hardware, which demonstrates the zero-overhead abstraction capability of Alpaka. GPU and CPU devices differ by a factor of about two in double precision efficiency, providing decent performance across the evaluated architectures.

Future work will focus on evaluating each kernel separately on CPU and GPU hardware. Based on these measurements, we want to provide a static mapping of kernels to heterogeneous hardware to achieve the best possible overall performance on a particular HPC system. Furthermore, we want to complete the porting of the remaining simulation plugins within PIConGPU and add a more fine-grained element level implementation.

The code is ready for the upcoming Power9 and NVIDIA Volta-based heterogeneous systems such as Summit [13] at the Oak Ridge National Laboratory. By using Alpaka we have the possibility to optimize and adapt our back-ends to these systems once they are fully specified and available for evaluation.

References
[1] AMD. AMD Opteron 6200 Series Processor Quick Reference Guide. [Online; accessed April 11, 2016].
[2] Heiko Burau, René Widera, Wolfgang Hönig, Guido Juckeland, Alexander Debus, Thomas Kluge, Ulrich Schramm, Tomas E. Cowan, Roland Sauerbrey, and Michael Bussmann. "PIConGPU: A fully relativistic particle-in-cell code for a GPU cluster". In: IEEE Transactions on Plasma Science 38.10 (2010), pp. 2831–2839.
[3] Michael Bussmann et al. "Radiative signatures of the relativistic Kelvin-Helmholtz instability". In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013, p. 5.
[4] H.-K. Chung, M. H. Chen, and R. W. Lee. "Extension of atomic configuration sets of the Non-LTE model in the application to the Kα diagnostics of hot dense matter". In: High Energy Density Physics.
[5] H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. "Kokkos: Enabling manycore performance portability through polymorphic memory access patterns". In: Journal of Parallel and Distributed Computing 74.12 (2014), pp. 3202–3216.
[6] Eric J. Fluhr et al. "POWER8: A 12-core server-class processor in 22nm SOI with 7.6Tb/s off-chip bandwidth". In: Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International. IEEE, 2014, pp. 96–97.
[7] Denis Foley. "NVLink, Pascal and Stacked Memory: Feeding the Appetite for Big Data". In: Nvidia.com (2014).
[8] Roger W. Hockney and James W. Eastwood. Computer Simulation Using Particles. CRC Press, 1988.
[9] R. Hornung and J. Keasler. "The RAJA portability layer: overview and status". Lawrence Livermore National Laboratory, Livermore, USA (2014).
[10] Intel. Intel Xeon Processor E5-2698 v3 Specification. http://ark.intel.com/de/products/81060/Intel-Xeon-Processor-E5-2698-v3-40M-Cache-2_30-GHz. [Online; accessed April 11, 2016].
[11] Mauricio Faria de Oliveira. NVIDIA CUDA on IBM POWER8: Technical overview, software installation, and application development.
[12] NVIDIA. Tesla K80 GPU Accelerator Board Specification. http://images.nvidia.com/content/pdf/kepler/Tesla-K80-BoardSpec-07317-001-v05.pdf. [Online; accessed April 11, 2016].
[13] Oak Ridge National Laboratory. Summit. Scale new heights. Discover new solutions. Oak Ridge National Laboratory's next High Performance Supercomputer. [Online; accessed April 10, 2016].
[14] Oliver Kowalke. boost.fiber. https://github.com/olk/boost-fiber. [Online; accessed April 12, 2016].
[15] OpenMP. OpenMP Application Program Interface Version 4.0. 2013.
[16] René Widera. cupla - C++ User Interface for the Platform Independent Library Alpaka. https://github.com/ComputationalRadiationPhysics/cupla. [Online; accessed March 14, 2016].
[17] John E. Stone, David Gohara, and Guochun Shi. "OpenCL: A parallel programming standard for heterogeneous computing systems". In: Computing in Science & Engineering 12.3 (2010), pp. 66–73.
[18] PIConGPU, Alpaka, and cupla software bundle for IWOPH 2016 submission. May 2016. doi: 10.5281/zenodo.53761. url: http://dx.doi.org/10.5281/zenodo.53761.
[19] Karl Zeil, Josefine Metzkes, Thomas Kluge, Michael Bussmann, Thomas E. Cowan, Stephan D. Kraft, Roland Sauerbrey, and Ulrich Schramm. "Direct observation of prompt pre-thermal laser ion sheath acceleration". In: Nature Communications 3 (2012), 874.
[20] Erik Zenker, Benjamin Worpitz, René Widera, Axel Huebl, Guido Juckeland, Andreas Knüpfer, Wolfgang E. Nagel, and Michael Bussmann. "Alpaka - An Abstraction Library for Parallel Kernel Acceleration". In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2016.