K-Athena: a performance portable structured grid finite volume magnetohydrodynamics code
Philipp Grete, Forrest W. Glines, and Brian W. O’Shea
Abstract—Large scale simulations are a key pillar of modern research and require ever-increasing computational resources. Different novel manycore architectures have emerged in recent years on the way towards the exascale era. Performance portability is required to prevent repeated non-trivial refactoring of a code for different architectures. We combine Athena++, an existing magnetohydrodynamics (MHD) CPU code, with Kokkos, a performance portable on-node parallel programming paradigm, into K-Athena to allow efficient simulations on multiple architectures using a single codebase. We present profiling and scaling results for different platforms including Intel Skylake CPUs, Intel Xeon Phis, and NVIDIA GPUs. K-Athena achieves > 10^8 cell-updates/s on a single V100 GPU for second-order double precision MHD calculations, and a speedup of 30 on up to 24,576 GPUs on Summit (compared to 172,032 CPU cores), reaching 1.94 × 10^12 total cell-updates/s at 76% parallel efficiency. Using a roofline analysis we demonstrate that the overall performance is currently limited by DRAM bandwidth, and we calculate a performance portability metric of 83.1%. Finally, we present the implementation strategies used and the challenges encountered in maximizing performance. This will provide other research groups with a straightforward approach to prepare their own codes for the exascale era. K-Athena is available at https://gitlab.com/pgrete/kathena.
P. Grete, F. W. Glines, and B. W. O'Shea are with the Department of Physics and Astronomy and the Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824 USA. E-mail: [email protected]. B. W. O'Shea is also with the National Superconducting Cyclotron Laboratory, Michigan State University, East Lansing, MI 48824 USA.

1 INTRODUCTION

The era of exascale computing is approaching. Different projects around the globe are working on the first exascale supercomputers, i.e., supercomputers capable of conducting 10^18 floating point operations per second. This includes, for example, the Exascale Computing Initiative working with Intel and Cray on Aurora as the first exascale computer in the US in 2021, the EuroHPC collaboration working on building two exascale systems in Europe by 2022/2023, Fujitsu and RIKEN in Japan working on the Post-K machine to launch in 2021/2022, and China, which targets 2020 for its first exascale machine. While the exact architectural details of these machines have not been announced yet and/or are still under active development, the overall trend in recent years has been towards manycore architectures. Here, manycore refers to an increasing number of (potentially simpler) cores on a single compute node and includes CPUs (e.g., Intel's Xeon Scalable Processor family or AMD's Epyc family), accelerators (e.g., the now discontinued Intel Xeon Phi line), and GPUs for general purpose computing.

MPI+OpenMP has been the prevailing parallel programming paradigm in many areas of high performance computing for roughly two decades. It is questionable, however, whether this generic approach will be capable of making efficient use of available hardware features such as parallel threads and vectorization across different manycore architectures and between nodes.
In addition to extensions of the MPI standard such as shared-memory parallelism, several approaches besides MPI+OpenMP exist and are being actively developed to address either on-node, inter-node, or both types of parallelism. These include, for example, partitioned global address space (PGAS) programming models such as UPC++ [1], or parallel programming frameworks such as Charm++ or Legion, which are based on message-driven migratable objects [2], [3].

Our main goal is a performance portable version of the existing MPI+OpenMP finite volume (general relativity) magnetohydrodynamics (MHD) code Athena++ [4], [5]. This goal includes enabling GPU-accelerated simulations while maintaining CPU performance using a single code base. More generally, performance portability refers to achieving consistent levels of performance across heterogeneous platforms using as little architecture-dependent code as possible [6]. In order to keep the code changes minimal, and given the MPI+OpenMP basis of Athena++, we decided to keep MPI for inter-node parallelism and focus on on-node performance portability.

For on-node performance portability several libraries and programming language extensions exist. With version 4.5, OpenMP [7] has been extended to support offloading to devices such as GPUs, but support and maturity are still highly compiler and architecture dependent. This similarly applies to OpenACC, which has been designed from the beginning to target heterogeneous platforms. While these two directives-based programming models are generally less intrusive with respect to the code base, they only expose a limited fraction of the various platform-specific features. OpenCL [8] is much more flexible and allows fine-grained control over hardware features (e.g., threads), but this, on the other hand, adds substantial complexity to the code.
Kokkos [9] and RAJA [10] try to combine the strength of flexibility with ease of use by providing abstractions in the form of C++ templates. Both Kokkos and RAJA focus on abstractions of parallel regions in the code, and Kokkos additionally provides abstractions of the memory hierarchy. At compile time the templates are translated to different (native) backends, e.g., OpenMP on CPUs or CUDA on NVIDIA GPUs.

We chose Kokkos for the refactoring of Athena++ for several reasons. Kokkos offers the highest level of abstraction without forcing the developer to use it, by setting reasonable implicit platform defaults. Moreover, the Kokkos core developer team actively works on integrating the programming model into the C++ standard. New, upcoming features, e.g., in OpenMP, will replace manual implementations in the Kokkos OpenMP backend over time. Kokkos is already used in several large projects to achieve performance portability, e.g., the scientific software building block collection Trilinos [11] or the computational framework for simulating chemical and physical reactions Uintah [12]. In addition, Kokkos is part of the DOE's Exascale Computing Project and we thus expect a backend for Aurora's new Intel Xe architecture when the system launches. Finally, the Kokkos community, including core developers and users, is very active and supportive with respect to handling issues and questions and offering workshops.

The resulting K-Athena code successfully achieves performance portability across CPUs (Intel, AMD, and IBM), Intel Xeon Phis, and NVIDIA GPUs. We demonstrate weak scaling at 76% parallel efficiency on 24,576 GPUs on OLCF's Summit, reaching 1.94 × 10^12 total cell-updates/s for a double precision MHD calculation. Moreover, we calculate a performance portability metric of 83.1% across Xeon Phis, 5 CPU generations, and 4 GPU generations. We make the code available as an open source project at https://gitlab.com/pgrete/kathena.

The paper is organized as follows. In Section 2 we introduce Kokkos and Athena++ and describe the changes made and the approach chosen in creating K-Athena. In Section 3 we present profiling, scaling, and roofline analysis results. Finally, we discuss current limitations and future enhancements in Sec. 4 and make concluding remarks in Sec. 5.
2 METHOD
2.1 Kokkos

Kokkos is an open source C++ performance portability programming model [9]. It is implemented as a template library and offers abstractions for parallel execution of code and for data management (see https://github.com/kokkos for the library itself, associated tools, tutorials, and a wiki). The core of the programming model consists of six abstractions. First, execution spaces define where code is executed. This includes, for example, OpenMP on CPUs or Intel Xeon Phis, CUDA on NVIDIA GPUs, or ROCm on AMD GPUs (currently experimental). Second, execution patterns are parallel patterns, e.g., parallel_for or parallel_reduce, and are the building blocks of any application that uses Kokkos. These parallel regions are often also referred to as kernels, as they can be dispatched for execution on execution spaces (such as GPUs). Third, execution policies determine how an execution pattern is executed. There exist simple range policies that only specify the indices of the parallel pattern and the order of iteration (i.e., the fastest changing index for multidimensional arrays). More complicated policies, such as team policies, can be used for more fine-grained control over individual threads and for nested parallelism. Fourth, memory spaces specify where data is located, e.g., in host/system memory or in device space such as GPU memory. Fifth, the memory layout determines the logical mapping of multidimensional indices to actual memory locations, cf. C-family row-major order versus Fortran column-major order. Sixth, memory traits can be assigned to data and specify how data is accessed, e.g., atomic access, random access, or streaming access.

These six abstractions offer substantial flexibility in fine-tuning applications, but the application developer is not always required to specify all details. In general, architecture-dependent defaults are set at compile time based on the information on devices and architecture provided. For example, if CUDA is defined as the default execution space at compile time, all Kokkos::Views, which are the fundamental multidimensional array structure, will be allocated in GPU memory. Moreover, the memory layout is set to column-major so that consecutive threads in the same warp access consecutive entries in memory.
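To make these abstractions concrete, the following minimal sketch (illustrative only, not code from K-Athena) allocates a View in the default memory space and dispatches a simple kernel on the default execution space; with a CUDA backend the same source runs on the GPU, with an OpenMP backend it runs on CPU threads.

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        const int nk = 64, nj = 64, ni = 64;
        // 3D View allocated in the default memory space (e.g., GPU memory when
        // compiled with the CUDA backend) using the default memory layout.
        Kokkos::View<double***> u("u", nk, nj, ni);

        // Simple range policy over the slowest index; the same lambda body is
        // compiled for whichever execution space is selected at build time.
        Kokkos::parallel_for("init_u", nk, KOKKOS_LAMBDA(const int k) {
          for (int j = 0; j < nj; j++)
            for (int i = 0; i < ni; i++)
              u(k, j, i) = 0.0;
        });
      }
      Kokkos::finalize();
      return 0;
    }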
2.2 Athena++

Athena++ is a radiation general relativistic magnetohydrodynamics (GRMHD) code focusing on astrophysical applications [4], [5]. It is a rewrite in modern C++ of the widely used Athena C version [13]. Athena++ offers a wide variety of compressible hydro- and magnetohydrodynamics solvers including support for special and general relativistic (M)HD, flexible geometries (Cartesian, cylindrical, or spherical), and mixed parallelization with OpenMP and MPI. Apart from the overall feature set, the main reasons we chose Athena++ are a) its excellent performance on CPUs and KNLs due to a focus on vectorization in the code design, b) a generally well written and documented code base in modern C++, c) publicly available point releases that contain many (but not all) features (our code changes are based on the public version, Athena++ 1.1.1, see https://github.com/PrincetonUniversity/athena-public-version), and d) a flexible task-based execution model that allows for a high degree of modularity.

Athena++'s parallelization strategy revolves around so-called meshblocks. The entire simulation grid is divided into smaller meshblocks that are distributed among MPI processes and/or OpenMP threads. Each MPI process (or OpenMP thread) owns one or more meshblocks that can be updated independently after boundary information has been communicated. If hybrid parallelization is used, each MPI process runs one or more OpenMP threads that are each assigned one or more meshblocks. This design choice is often referred to as coarse-grained parallelization because threads are used at a block (here meshblock) level and not over loop indices. In general, Athena++ uses persistent MPI communication handles to realize asynchronous communication. Moreover, each thread makes its own MPI calls to exchange boundary information. As a result, using more than one thread per MPI process may increase overall on-node performance due to hyperthreading, but it also increases both the number of MPI messages sent and the total amount of data sent. The latter may result in overall worse parallel performance and efficiency, as demonstrated in Sec. 3.3.2.
Listing 1. Example triple for loop for a typical operation in a finite volume method on a structured mesh such as in a code like Athena++, where ks, ke, js, je, is, and ie are loop bounds and u is an athena_array object of, for example, an MHD variable.

    for (int k = ks; k < ke; k++) {
      for (int j = js; j < je; j++) {
        #pragma omp simd
        for (int i = is; i < ie; i++) {
          /* Loop Body */
          u(k,j,i) = ...
        }
      }
    }

Given the coarse-grained OpenMP approach over meshblocks, the prevalent structures in the code base are triple (or quadruple) nested for loops that iterate over the content of each meshblock (and over variables in the quadruple case). A prototypical nested loop is illustrated in Listing 1. Generally, all loops (or kernels) in Athena++ have been written so that OpenMP simd pragmas are used for the innermost loop. This helps the compiler to automatically vectorize the loops, resulting in a more performant application.
2.3 K-Athena = Kokkos + Athena++

In order to combine Athena++ and Kokkos, four major changes in the code base were required: 1) making Kokkos::Views the fundamental data structure, 2) converting nested for loop structures to kernels, 3) converting "support" functions, such as the equation of state, to inline functions, and 4) converting communication buffer filling functions into kernels.

First, Views are Kokkos' abstraction of multidimensional arrays. Thus, the multidimensional arrays originally used in Athena++, e.g., the MHD variables for each meshblock, need to be converted to Views so that these arrays can transparently be allocated in arbitrary memory spaces such as device (e.g., GPU) memory or system memory. Athena++ already implemented an abstract athena_array class for all multidimensional arrays with an interface similar to the interface of a View. Therefore, we only had to add View objects as member variables and to modify the functions of athena_arrays to transparently use functions of those member Views. This included using View constructors to allocate memory, using Kokkos::deep_copy or Kokkos::subview for copy constructors and shallow slices, and creating public member functions to access the Views. The latter is required in order to properly access the data from within compute kernels.
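For illustration, a minimal sketch of such a wrapper (hypothetical names and a reduced interface, not K-Athena's actual athena_array) could look as follows:

    #include <Kokkos_Core.hpp>
    #include <string>

    // Hypothetical wrapper in the spirit of athena_array: the data lives in a
    // Kokkos::View, so it can be allocated in any memory space selected at
    // compile time.
    class SimpleArray3D {
     public:
      SimpleArray3D(const std::string &label, int nk, int nj, int ni)
          : data_(label, nk, nj, ni) {}

      // Deep copy, e.g., for an explicit copy operation.
      void DeepCopyFrom(const SimpleArray3D &other) {
        Kokkos::deep_copy(data_, other.data_);
      }

      // Shallow slice of the slowest-varying dimension via Kokkos::subview.
      auto Slice(int k) const {
        return Kokkos::subview(data_, k, Kokkos::ALL, Kokkos::ALL);
      }

      // Public accessor so kernels can capture the underlying View by value.
      Kokkos::View<double***> KokkosView() const { return data_; }

     private:
      Kokkos::View<double***> data_;
    };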
Second, all nested for loop structures (see Listing 1) need to be converted into so-called kernels, i.e., parallel regions that can be dispatched for execution by an execution space. As described in Sec. 2.1, multiple execution policies are possible, such as a multidimensional range policy (see Listing 2) or a team policy that allows for more fine-grained control and nested parallelism (see Listing 3).
Listing 2. Example for loop using Kokkos. The loop body is reformulated into a lambda function and passed into Kokkos::parallel_for to execute on the target architecture. The class Kokkos::MDRangePolicy specifies the loop bounds. The array u is now a Kokkos::View, a Kokkos building block that allows transparent access to CPU and GPU memory. The loop body, i.e., the majority of the code, remains mostly unchanged.

    using namespace Kokkos;
    parallel_for(MDRangePolicy<Rank<3>>({ks, js, is}, {ke, je, ie}),
      KOKKOS_LAMBDA (const int k, const int j, const int i) {
        /* Loop Body */
        u(k,j,i) = ...
    });
Listing 3. Another approach using Kokkos' nested team-based parallelism through the Kokkos::TeamThreadRange and Kokkos::ThreadVectorRange classes. This interface is closer to the underlying parallelism used by the backend, such as CUDA blocks on GPUs and SIMD vectors on CPUs. Again, the loop body remains mostly unchanged and u is now a Kokkos::View.

    using namespace Kokkos;
    parallel_for(team_policy(nk, AUTO),
      KOKKOS_LAMBDA (member_type thread) {
        const int k = thread.league_rank() + ks;
        parallel_for(TeamThreadRange<>(thread, js, je),
          [&] (const int j) {
            parallel_for(ThreadVectorRange<>(thread, is, ie),
              [=] (const int i) {
                /* Loop Body */
                u(k,j,i) = ...
            });
        });
    });

Generally, the loop body remained mostly unchanged. Given that it is not a priori clear what kind of execution policy yields the best performance for a given implementation of an algorithm, we decided to implement a flexible loop macro. That macro allows us to easily change the execution policy for performance tests; see the profiling results in Sec. 3.3.1 and the discussion in Sec. 4.
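As an illustration of the idea only (the macro name and exact form are hypothetical; K-Athena's actual macro differs), a compile-time switchable loop abstraction could look like this:

    #include <Kokkos_Core.hpp>

    // Hypothetical sketch of a compile-time switchable loop abstraction; the
    // names (PAR_FOR_3D, LOOP_MDRANGE) are illustrative and not K-Athena's.
    #if defined(LOOP_MDRANGE)
    #define PAR_FOR_3D(NAME, KS, KE, JS, JE, IS, IE, BODY)                      \
      Kokkos::parallel_for(NAME,                                                \
          Kokkos::MDRangePolicy<Kokkos::Rank<3>>({KS, JS, IS}, {KE, JE, IE}),   \
          KOKKOS_LAMBDA(const int k, const int j, const int i) BODY)
    #else  /* plain nested loops with a simd pragma, e.g., for CPU builds */
    #define PAR_FOR_3D(NAME, KS, KE, JS, JE, IS, IE, BODY)                      \
      for (int k = (KS); k < (KE); k++)                                         \
        for (int j = (JS); j < (JE); j++) {                                     \
          _Pragma("omp simd")                                                   \
          for (int i = (IS); i < (IE); i++) BODY                                \
        }
    #endif

    // Usage: the loop body is written once and dispatched according to the
    // policy selected at compile time, e.g.,
    // PAR_FOR_3D("update_u", ks, ke, js, je, is, ie, { u(k, j, i) = 0.0; });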
Third, all functions that are called within a kernel need to be converted into inline functions (here, more specifically, using the KOKKOS_INLINE_FUNCTION macro). This is required because, if the kernels are executed on a device such as a GPU, the functions need to be compiled for the device (e.g., with a __device__ attribute when compiling with CUDA). In Athena++, this primarily concerned functions such as the equation of state and coordinate system-related functions.
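A minimal, hypothetical example (not K-Athena's actual equation of state) of such a conversion:

    #include <Kokkos_Core.hpp>

    // A helper called from inside kernels carries KOKKOS_INLINE_FUNCTION so it
    // is compiled for the device (e.g., gains __device__ under the CUDA backend).
    KOKKOS_INLINE_FUNCTION
    double PressureFromInternalEnergy(const double rho, const double eint,
                                      const double gamma) {
      return (gamma - 1.0) * rho * eint;  // ideal gas closure
    }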
Fourth, Athena++ uses persistent communication buffers (and MPI handles) to exchange data between processes. Originally, these buffers resided in system memory and were filled directly from arrays residing in system memory. In the case where a device (such as a GPU) is used as the primary execution space and the arrays should remain on the device to reduce data transfers, the buffer filling functions need to be converted too. Thus, we changed all buffers to be Views and converted the buffer filling functions into kernels that can be executed on any execution space. In addition, this allows for CUDA-aware MPI: GPU buffers can be directly copied between the memories of GPUs (both on the same node and on different nodes) without an implicit or explicit copy of the data to system memory.
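The following sketch illustrates the idea under simplified assumptions (function and variable names are hypothetical, error handling is omitted, and CUDA-aware MPI is assumed to be available):

    #include <Kokkos_Core.hpp>
    #include <mpi.h>

    // Pack one boundary slab of a device-resident array into a device-resident
    // buffer with a kernel, then hand the buffer directly to MPI; CUDA-aware
    // MPI can read it from GPU memory without staging through the host.
    void SendLowerKBoundary(Kokkos::View<double***> u, Kokkos::View<double*> buffer,
                            const int neighbor_rank, MPI_Request *req) {
      const int nj = static_cast<int>(u.extent(1));
      const int ni = static_cast<int>(u.extent(2));
      Kokkos::parallel_for("pack_lower_k", nj * ni, KOKKOS_LAMBDA(const int idx) {
        const int j = idx / ni;
        const int i = idx % ni;
        buffer(idx) = u(0, j, i);
      });
      Kokkos::fence();  // ensure packing finished before MPI touches the buffer
      MPI_Isend(buffer.data(), nj * ni, MPI_DOUBLE, neighbor_rank, 0,
                MPI_COMM_WORLD, req);
    }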
In general, the first three changes above are required in refactoring any legacy code to make use of Kokkos. The fourth change was required for Athena++ due to the existing MPI communication patterns.

Finally, for the purpose of the initial proof of concept, we only refactored the parts required for running hydrodynamic and magnetohydrodynamic simulations on static and adaptive Cartesian meshes. Running special and general relativistic simulations in spherical or cylindrical coordinates is currently not supported. However, the changes required to allow for these kinds of simulations are straightforward, and we encourage and support contributions to re-enable this functionality.
Overall, the development process was driven and accompanied by several key decisions. First, we wanted to be able to continuously track potential overhead and changes in performance. Therefore, we added so-called Kokkos profiling regions to the original codebase covering all relevant parts of the code, e.g., the main loop, communication, reconstruction, and the Riemann solver. This allowed us to obtain detailed timings (across MPI processes) with the Kokkos profiling tools for all parts of the code, including those that had not yet been converted to make use of Kokkos. In addition to manual regions, all kernels are automatically profiled once a parallel execution pattern is used.
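Manual regions of this kind can be expressed with the Kokkos profiling hooks; a minimal sketch (region names are illustrative):

    #include <Kokkos_Core.hpp>

    // Kokkos::Profiling::pushRegion / popRegion bracket a code section so that
    // tools from kokkos-tools can report its timing, even for code that has not
    // been converted to kernels yet.
    void TimeStep() {
      Kokkos::Profiling::pushRegion("Reconstruct");
      // ... reconstruction loops (converted or not) ...
      Kokkos::Profiling::popRegion();

      Kokkos::Profiling::pushRegion("Riemann");
      // ... Riemann solver ...
      Kokkos::Profiling::popRegion();
    }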
In order to ensure that the code keeps producing correct results, we employ automated regression testing using GitLab's continuous integration features. Athena++ already provides a regression test suite covering many aspects of the code. For K-Athena we extended this suite to specifically address changes related to Kokkos. For example, we test different execution policies for the kernels and Kokkos' partition_master for multi-threaded operation, and we run tests both on CPUs and GPUs. The tests are automated and are triggered for every Git merge request. Moreover, we required that any merge request that added functionality or converted parts of the code to make use of Kokkos be accompanied by a new test case addressing that change. NVIDIA's unified memory allowed us to run tests on GPUs without having converted the entire code. With unified memory, a virtual memory space is available that can be accessed from both the CPU and the GPU. Page migrations occur automatically and transparently, i.e., no manual data movement is required.

In general, unified memory eased our refactoring experience. Originally, we used Kokkos::DualViews in the transition. DualViews store data in two different memory spaces, e.g., in system memory and in GPU memory. However, synchronization between the memory spaces requires manually specifying which memory space is needed and marking when the data has been modified at different points in the algorithm. This introduced additional code overhead. For this reason, we decided to discontinue using DualViews and instead used normal Views with default allocations in unified memory space. In addition, being able to change the default memory space from unified memory to pure GPU memory at compile time allowed us to easily identify all parts of the code that still needed to be converted by tracking segmentation faults with a debugger.
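For illustration, a minimal sketch of the DualView bookkeeping this refers to (using the current Kokkos convenience API; not code from K-Athena):

    #include <Kokkos_Core.hpp>

    // Every time one side modifies the data, the code must record it and later
    // request a sync before the other side reads, which adds boilerplate.
    void DualViewExample() {
      Kokkos::DualView<double*> u("u", 1000);

      // Fill on the host side.
      auto h_u = u.view_host();
      for (int i = 0; i < 1000; i++) h_u(i) = 1.0;
      u.modify_host();          // mark the host copy as the most recent one

      u.sync_device();          // explicit copy to the device before device use
      auto d_u = u.view_device();
      Kokkos::parallel_for("scale", 1000, KOKKOS_LAMBDA(const int i) {
        d_u(i) *= 2.0;
      });
      u.modify_device();        // and the dance repeats in the other direction
      u.sync_host();
    }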
3 RESULTS
If not noted otherwise, all results in this section have been obtained using a double precision, shock-capturing, unsplit, adiabatic MHD solver consisting of Van Leer integration, piecewise linear reconstruction, a Roe Riemann solver, and constrained transport for the integration of the induction equation (see, e.g., [14] for more details). The test problem is a linear fast magnetosonic wave on a static, structured, three-dimensional grid. Unified memory was not used in the GPU runs, i.e., there was no data transfer between system and GPU memory after the problem initialization. Generally, we used the Intel compilers on Intel platforms, and gcc and nvcc on other platforms, as we found that (recent) Intel compilers are more effective in automatic vectorization than (recent) gcc compilers. We used identical software environments and compiler flags for both K-Athena and Athena++ where possible. Details are listed in Table 1. We used Athena++ version 1.1.1 and a fixed K-Athena commit for the scaling tests. Additional information on how to run K-Athena on different machines can be found in the documentation.
In order to evaluate the effect on performance of the different loop structures presented in Sec. 2.3, we compare the timings of different regions within the main loop of the code. The results using both an NVIDIA V100 GPU and an Intel Skylake CPU for a selection of the computationally most expensive regions are shown in Fig. 1. The one-dimensional range policy refers to a loop over a single index that is explicitly unpacked to the multidimensional indices in the code. While this is the fastest loop structure for all regions on the GPU, it is the slowest for all regions on the CPU. We suspect that this is related to the inability of the compiler to vectorize the loops on CPUs. All other loop structures tested, i.e., simd-for, MDRange, and TeamPolicy, logically separate the nested loops and, thus, make it easier for the compiler to automatically vectorize the innermost loop. This also explains why the results for simd-for, MDRange, and TeamPolicy are very close to each other for all regions except the Riemann solver. Given that the Riemann solve is the most complex loop/kernel, we again suspect that the compiler is unable to vectorize that loop when Kokkos templates add additional complexity to the loop structure. We expect that these differences will become less pronounced with continued compiler development.

On the GPU, MDRange is the slowest loop structure, being several times slower than the one-dimensional range policy across all regions. TeamPolicy is on par with the one-dimensional range policy for half of the regions shown. As discussed in more detail in Sec. 4, we expected these non-optimized raw loop structures not to cause any major differences in performance.

The results shown here for V100 GPUs and Skylake CPUs apply equally to other GPU generations and other CPUs (and Xeon Phis), respectively. For all tests conducted in the following, we use the loop structure with the highest performance on each architecture, i.e., the one-dimensional range policy on GPUs and simd-for on CPUs and Xeon Phis.
TABLE 1
Software environment and compiler flags used in the scaling tests.

Summit GPU: GCC 6.4.0 & CUDA 9.2.148; flags: -O3 -std=c++11 -fopenmp -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -expt-extended-lambda -arch=sm_70 -Xcompiler; MPI: Spectrum MPI 10.2.0.11
Summit CPU: GCC 8.1.1; flags: -O3 -std=c++11 -fopenmp-simd -fwhole-program -flto -ffast-math -fprefetch-loop-arrays -fopenmp -mcpu=power9 -mtune=power9; MPI: Spectrum MPI 10.2.0.11
Titan GPU: GCC 6.3.0 & CUDA 9.1.85; flags: -O3 -std=c++11 -fopenmp -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -expt-extended-lambda -arch=sm_35 -Xcompiler; MPI: Cray MPICH 7.6.3
Titan CPU: GCC 6.3.0; flags: -O3 -std=c++11 -fopenmp; MPI: Cray MPICH 7.6.3
Theta: ICC 18.0.0; flags: -O3 -std=c++11 -ipo -xMIC-AVX512 -inline-forceinline -qopenmp-simd -qopenmp; MPI: Cray MPICH 7.7.3
Electra: ICC 18.0.3; flags: -O3 -std=c++11 -ipo -inline-forceinline -qopenmp-simd -qopt-prefetch=4 -qopenmp -xCORE-AVX512; MPI: HPE MPT 2.17
HPE MPT 2.17 R i e m a nn C o m p u t e C o r n e r E R e c o n s t r u c t Z R e c o n s t r u c t X F l u x D i v e r g e n c e C T W e i g h t e d A v e U w a ll t i m e r e l a t i v e t o R i e m a nn NVidia Volta V100 GPU R i e m a nn C o m p u t e C o r n e r E R e c o n s t r u c t Z R e c o n s t r u c t X F l u x D i v e r g e n c e C T W e i g h t e d A v e U Intel Xeon Gold 6148 Skylake CPU
Fig. 1. Profiling results on a GPU (left) and CPU (right) for selected regions (x-axis) within the main loop of an MHD timestep using the algorithmdescribed in Sec. 3. The different lines correspond to different loop structures, see Sec. 2.3 and the timings are normalized to the fastest Riemannregion in each panel. the regions shown. As discussed in more detail in Sec. 4,we expected these non-optimized raw loop structures to notcause any major differences in performance.The results shown here for V100 GPUs and SkylakeCPUs equally apply to other GPU generations and otherCPUs (and Xeon Phis), respectively. For all tests conductedin the following, we use the loop structure with the highestperformance on each architecture, i.e., on GPUsand simd-for on CPUs and Xeon Phis.
Our main objective in writing K-Athena is an MHD code that runs efficiently on any current supercomputer and possibly any future machine. A code that runs efficiently on many architectures is said to be performance portable. Determining what is meant by "efficient code" can be vague, especially when comparing performance across different architectures. The specific memory space sizes, bandwidths, instruction sets, and groupings of cores on different architectures can all affect how efficiently a code can utilize the hardware.

In order to make fair comparisons of K-Athena's performance on different machines, we used the roofline model [15] to quantify the application performance of K-Athena, i.e., the fraction of the performance obtained out of the maximum theoretical performance as limited by the arithmetic intensity of the code and the bandwidth and throughput of the processor. With the roofline model, we can take into account the different hardware configurations as we measure K-Athena's performance.

For quantifying the performance portability from the application performance on each machine, we used the metric developed by [16]. This performance portability metric emphasizes high efficiency on all machines as opposed to only some machines.
The roofline model is a tool to determine the theoretical peak performance of a code or algorithm on a specific architecture, given the bandwidths of the different memory spaces and the arithmetic intensity, i.e., the number of operations relative to the memory usage of the algorithm for the different spaces. The roofline plot is a condensation of the performance limits imposed by the bandwidth of each memory space. For any given architecture, the maximum obtainable floating point operations per second (FLOP/s) is limited by

P_max ≤ min_{m ∈ M} { min( T_Peak , B_m × I_m ) },    (1)

where P_max [FLOP/s] is the maximum possible FLOP/s, T_Peak [FLOP/s] is the peak throughput, i.e., the maximum number of FLOP/s the device can achieve, M is the set of all memory spaces on the device (L1 cache, L2 cache, DRAM, etc.), B_m [Byte/s] is the bandwidth of memory space m, and I_m [FLOP/Byte] is the arithmetic intensity of the specific algorithm or code for memory space m, i.e., the number of FLOP executed per number of bytes written to and read from m.

In the roofline model, throughputs and bandwidths are plotted on a log FLOP/s versus log FLOP/Byte scale, so that throughputs appear as horizontal lines and bandwidths as P ∝ I lines (since bandwidth-limited P = B_m × I). The arithmetic intensities of each memory space for a specific algorithm appear as vertical lines, extending up to their bandwidth-limited performance, P_{m,max} = min(T_Peak, B_m × I_m). The maximum achievable performance of a given algorithm or code is the minimum of these bandwidth-limited performances, although optimizations of the code can change the arithmetic intensities of the different memory spaces. We can also mark the actual performance of K-Athena with a horizontal dashed line, indicating the average FLOP/s achieved. Figures 2a and 2b show roofline models for an Intel Skylake CPU node on NASA's Electra and an NVIDIA Volta V100 GPU on ORNL's Summit, respectively.

Fig. 2. Roofline models of a 2-socket Intel Xeon Gold 6148 "Skylake" CPU node on NASA's Electra (a: peak 1.50 TFlop/s, K-Athena 0.11 TFlop/s) and a single NVIDIA Tesla V100 "Volta" GPU on ORNL's Summit (b: peak 7.83 TFlop/s, K-Athena 0.61 TFlop/s). For both cases shown here and all other architectures we tested, DRAM bandwidth (or MCDRAM bandwidth for KNLs) is the limiting bandwidth for K-Athena's performance.

Although numbers for the different bandwidths and throughputs can be obtained from vendor specifications and theoretical arithmetic intensities can be computed by hand, empirical testing is required to more accurately reflect the performance of actual hardware and code. This required a variety of different performance profiling tools on the different architectures and machines, depending on what was available. For gathering the bandwidths and throughputs on both CPUs and GPUs, we used the Empirical Roofline Toolkit [17]. However, the publicly available version of the Empirical Roofline Toolkit does not measure L1 bandwidth for GPUs, for which we used GPUMEMBENCH [18]. For computing arithmetic intensities on GPUs, we were able to use NVIDIA's nvprof to measure both the total FLOP per finite volume cell and the total reads and writes from the L1 cache, L2 cache, and DRAM/HBM per cell. For Intel CPUs, we used Intel's Software Development Emulator (SDE) [19] to capture the total FLOP per cell and the L1 reads and writes. To measure the reads and writes from L2 and L3 on Intel CPUs, we used LIKWID [20]. For reads and writes to DRAM, we used Intel's VTune [21]. With the data combined from SDE, LIKWID, and VTune, we could compute the arithmetic intensities for K-Athena running on Intel CPUs.

We did find that the L1 memory usage reported by LIKWID was about half of what SDE reported, when we expected the difference to be a few percent. We attribute this to restricted permissions (as a non-root user) to model-specific registers that are used for performance profiling. Given that we obtained all L2 arithmetic intensities with LIKWID, these numbers must be interpreted with care. However, our primary focus is on performance relative to the DRAM performance.

In all cases, we found that K-Athena's performance is limited by the smallest memory space that will accommodate the data required by a single MPI task. For GPUs this is the on-device DRAM/HBM, for CPUs this is the DDR3/DDR4 DRAM, and for KNLs this was the MCDRAM. This is expected, due to the current construction of K-Athena: the finite volume MHD method is currently implemented as a series of simple kernels for different tasks such as reconstruction or flux calculations. Each kernel is a tightly nested triple or quadruple for loop. Within each iteration, data is loaded from DRAM up to the registers, then discarded. Future improvements can be made to K-Athena to explicitly cache data in smaller 1D arrays kept in higher level caches. This would raise the DRAM arithmetic intensity while lowering the caches' arithmetic intensities, overall raising the performance ceiling. Similar improvements have already been implemented upstream in Athena++. A more complete solution would involve fusing consecutive kernels into one kernel to reduce DRAM accesses.
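As a purely illustrative evaluation of Eq. (1) with assumed (not measured) numbers, the attainable performance is the minimum over the per-level ceilings:

    #include <algorithm>
    #include <cstdio>

    // Hypothetical bandwidths and intensities; these are not the measured
    // values from this work.
    int main() {
      const double t_peak = 7.8e12;                        // FLOP/s, assumed peak
      const double bandwidth[] = {14e12, 2.2e12, 0.9e12};  // Byte/s: L1, L2, HBM
      const double intensity[] = {1.0, 2.5, 0.8};          // FLOP/Byte per level

      double p_max = t_peak;
      for (int m = 0; m < 3; m++)
        p_max = std::min(p_max, std::min(t_peak, bandwidth[m] * intensity[m]));

      std::printf("Roofline ceiling: %.2e FLOP/s\n", p_max);  // limited by HBM here
      return 0;
    }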
Performance portability is at present nebulously defined. It is generally held that a performance portable application can execute on a wide variety of architectures and achieve acceptable performance, preferably while maintaining a single code base for all architectures. In order to make valid comparisons between codes, an objective metric of performance portability is needed.

The metric proposed by [16] quantifies performance portability by the harmonic mean of the performance efficiencies achieved on each platform, so that

P(a, p, H) = |H| / ( Σ_{i ∈ H} 1/e_i(a, p) )  if i is supported ∀ i ∈ H, and 0 otherwise,    (2)

where a is an application, p is a problem to be solved by the application, H is the space of all relevant platforms, and e_i(a, p) is the performance efficiency of application a solving problem p on platform i. If an application does not support a platform, then it is not performance portable across the platforms and is assigned a metric of 0. The performance efficiency can be defined in several ways, such as the application efficiency and the architectural efficiency used in [16]. Application efficiency is the achieved performance as a fraction of the performance of the fastest application that can solve the problem on the platform. Architectural efficiency is the achieved fraction of the theoretical peak performance of the hardware, accounting for bandwidths, throughputs, and arithmetic intensities. We did not have publicly available MHD codes implementing the same method as K-Athena with which to directly compute the application efficiency, and so we used the architectural efficiency obtained from the roofline model.

Since by its implementation K-Athena is always limited by the DRAM bandwidth in the roofline model, we used the DRAM architectural efficiency, i.e., the fraction of the theoretical peak performance of K-Athena that it would achieve if it could use 100% of the peak DRAM bandwidth, to define the architectural efficiency. More explicitly, the architectural efficiency on platform i is

e_i(a, p) = ε_{a,p,i} / ( B_{i,DRAM} × I_{a,p,i,DRAM} ),    (3)

where ε_{a,p,i} is the achieved performance of K-Athena for solving problem p on platform i, B_{i,DRAM} is the peak DRAM bandwidth on the platform, and I_{a,p,i,DRAM} is the arithmetic intensity of K-Athena for solving the problem on that platform (which is compiler dependent).
Fig. 3. Performance portability plot across several CPU and GPU machines (Pleiades, Electra, Stampede 2, MSU HPCC, and Summit) and architectures (Sandy Bridge, Ivy Bridge, Haswell, Broadwell, and Skylake CPUs, Knights Landing Xeon Phis, three Kepler GPU models, and Volta V100 GPUs). Individual bars show the architectural efficiency of K-Athena relative to the theoretical peak performance limited by the DRAM and L1 bandwidths. The performance portability metrics (DRAM: 83.1%, L1: 11.6%) are computed from the harmonic mean of the two sets of efficiencies.
THENA achieves . TFLOP/s while the DRAM/HBM bandwidth ceiling is at . TFLOP/s, leading to an architectural performancecompared to HBM bandwidth. For completeness, we alsocomputed the architectural efficiency as measured againstthe L1 cache bandwidth.For our purposes, we used a problem size roughlyproportional to the memory space and number of coresavailable on the device, maximizing the cell updates persecond achievable on each device. On CPUs, we used meshblocks of , using one MPI task per CPU core and onemesh block per task. We used all sockets on the node forCPU tests, which was a single KNL socket on Stampede 2and two sockets on all other machines. On KNLs, we alsoused two O PEN
MP threads per core instead of one. ForGPUs, we used a single mesh block with the largest sizethat would fit in GPU DRAM, starting at × × on K20’s up to on V100’s, using just one GPU on anode. If we used a constant problem size across all the CPUand GPU platforms, the problem would be too large to fiton small GPUs, too small to occupy all of the capabilitieson other GPUs, and would not divide evenly across all theCPUs.In Fig. 3, the architectural efficiencies as measuredagainst the DRAM bandwidth and L1 cache bandwidth areshown with the computed performance portability metrics.K-A THENA achieved . DRAM performance portabil-ity and . L1 cache performance portability, measuredacross a number of CPU and GPU architectures. In general,K-A
THENA achieved higher efficiencies on newer CPUs andolder GPUs. This could be due to changes in instructionssets or trends in changing hardware details.
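To illustrate how Eq. (2) behaves, the following toy calculation uses made-up efficiencies rather than the measured K-Athena values; because the metric is a harmonic mean, a single poorly supported platform dominates the result:

    #include <cstdio>

    int main() {
      const double e[] = {0.90, 0.85, 0.20};  // assumed architectural efficiencies
      double inv_sum = 0.0;
      for (double ei : e) inv_sum += 1.0 / ei;
      const double pp = 3.0 / inv_sum;        // |H| divided by sum of reciprocals
      std::printf("Performance portability: %.1f%%\n", 100.0 * pp);  // ~41.2%
      return 0;
    }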
Fig. 4. Raw performance for double precision MHD (algorithm described in Sec. 3) of K-Athena, Athena++, and GAMER on a single GPU (left) or CPU (right) for varying problem sizes. Volta refers to an NVIDIA V100 GPU, Pascal refers to an NVIDIA P100 GPU, BDW (Broadwell) refers to a 14-core Xeon E5-2680 CPU, and SKX (Skylake) refers to a 20-core Xeon Gold 6148 CPU. The GAMER numbers were reported in [22] for the same algorithm used here.
In order to compare the degree to which the refactoring of Athena++ affected performance, we first compare Athena++ and K-Athena on a single CPU. The right panel of Fig. 4 shows the cell-updates/s achieved on an Intel Broadwell and an Intel Skylake CPU for both codes for varying problem size. Overall, the achieved cell-updates/s are practically independent of the problem size on both the Broadwell and the Skylake CPU. Moreover, without any additional performance optimizations (see discussion in Sec. 4), K-Athena is virtually on par with Athena++. For comparison, we also show the results of GAMER [22]. It is another recent (astrophysical) MHD code with support for CPU and (CUDA-based) GPU accelerated calculations and has been directly compared to Athena++ in [22]. We also find that Athena++ (and thus K-Athena) is about 1.5 times faster than GAMER on the same CPU.

A slightly smaller difference (a factor of ≈ 1.3) is observed when comparing results for GPU runs, as shown in the left panel of Fig. 4. On a P100 Pascal GPU, K-Athena is about 1.3 times faster than GAMER, suggesting that the difference in performance is related to the fundamental code design and not to the implementation of specific computing kernels. On a single V100 Volta GPU, K-Athena reaches a peak performance of greater than 10^8 cell-updates/s for large problem sizes. In general, the achieved performance in cell-updates/s is strongly dependent on the problem size. For small grids the performance is more than one order of magnitude lower than what is achieved for the largest permissible grid sizes that still fit into GPU memory. This emphasizes that GPUs are better suited for compute-intensive tasks.
Weak scaling results (same test problem and algorithm as in Sec. 3.3.1) for K-Athena and the original Athena++ version on different systems and architectures are shown in Fig. 5. Overall, the differences between K-Athena and Athena++ on CPUs and Xeon Phis are marginal. This is expected as K-Athena employs simd-for loops for all kernels that are similar to the ones already in Athena++. Therefore, the parallel efficiency is also almost identical between both codes, both on NASA's Electra system with Skylake CPUs (first column in Fig. 5) and on ALCF's Theta system with Knights Landing Xeon Phis (second column in Fig. 5) at 2,048 nodes each. Using multiple hyperthreads per core on Theta has no significant influence on the results given the intrinsic variations observed on that system (according to the ALCF support staff, system variability contributes around 10% to the fluctuations in performance between identical runs).

The first major difference is observed on OLCF's Titan (third column in Fig. 5), where results for K-Athena on GPUs are included. While the parallel efficiency for both codes remains at 94% up to 8,192 nodes using only CPUs, it drops to 72% when using GPUs with K-Athena. However, the majority of the loss in parallel efficiency already occurs going from 1 to 8 nodes using GPUs, and afterwards the efficiency remains almost flat. This behavior is equally present for CPU runs but less visible due to the higher parallel efficiency in general. The differences in parallel efficiency between CPU and GPU runs can be attributed to the vastly different raw performance of each architecture. On a single node, the single Kepler K20X GPU is about 7 times faster than the 16-core AMD Opteron CPU. Given that the interconnect is identical for GPU and CPU communication, the effective ratio of computation to communication is worse for GPUs. Despite the worse parallel efficiency on GPUs, the raw per-node performance using GPUs is still about 5.5 times faster than using CPUs at 8,192 nodes.
Fig. 5. Weak scaling for double precision MHD (exact algorithm described in Sec. 3) on different supercomputers and architectures for K-Athena and the original Athena++ version. Numbers correspond to the 80th percentile of individual cycle performances of several runs in order to reduce effects of network variability. The top row shows the raw performance in number of cell-updates per second per node and can directly be compared between different systems and architectures. The bottom row shows the parallel efficiency normalized to the individual single node performance. The first column contains results for workloads of 64^3 and 128^3 cells per core on NASA's Electra system using two 20-core Intel Xeon Gold 6148 processors per node. The second column shows results for a fixed workload per core on ALCF's Theta system with one 64-core Intel Xeon Phi 7230 (Knights Landing) per node; HT-1, HT-2, and HT-4 refer to using 1, 2, and 4 hyperthreads per core, respectively. The third column shows results for a workload of 128^3 cells per CPU core and 192^3 cells per GPU on OLCF's Titan system with one 16-core AMD Opteron 6274 CPU and one NVIDIA K20X (Kepler) GPU per node. The last column contains results for a workload of 64^3 cells per CPU core and 256^3 cells per GPU on OLCF's Summit system with two 21-core IBM POWER9 CPUs and six NVIDIA V100 (Volta) GPUs per node. On all systems the GPU runs used the one-dimensional range policy and the CPU runs used simd-for loops, with the exception of the dashed purple line on Summit that used Kokkos nested parallelism; see Sec. 2.3 for more details.
K-Athena on OLCF's Summit system (last column in Fig. 5), with two 21-core IBM POWER9 CPUs and six NVIDIA V100 GPUs per node, shows a similar GPU weak scaling behavior: most of the loss in parallel efficiency occurs on the first few nodes, after which the efficiency remains almost flat, reaching 76% parallel efficiency on 4,096 nodes (24,576 GPUs). The CPU runs on Summit, in contrast, show a steep drop in parallel efficiency at large node counts, which sets in earlier when two hyperthreads per core are used, i.e., when more threads make MPI calls and more (smaller) MPI messages are sent. This suggests that the interconnect handles fewer but larger messages better than many small messages, which likely cause network congestion. Naturally, this issue is tightly related to the existing communication pattern in Athena++, i.e., coarse-grained threading over meshblocks with each thread performing its own MPI calls.

Without making additional changes to the code base, this issue can directly be addressed using Kokkos nested parallelism in K-Athena. More specifically, we use the triple nested construct illustrated in Listing 3, allowing multiple threads to handle a single meshblock. As a proof of concept, the results for using 1 MPI process per 2 cores, with one thread per core, are shown as the purple dashed line in the last column of Fig. 5. While the raw performance on a single node is slightly lower (by about 16%), the improved communication pattern results in a higher overall performance at large node counts. Similarly, the sharp drop in parallel efficiency has been shifted to first occur at 2,048 nodes.

Finally, the raw per-node performance is overall comparable between Intel Skylake CPUs, Intel Knights Landing Xeon Phis, IBM POWER9 CPUs, and a single NVIDIA Kepler GPU. The latest NVIDIA Volta GPU is a notable exception, reaching more than 10^8 cell-updates/s per GPU. This performance, in combination with six GPUs per node on Summit and a high parallel efficiency, results in a total performance of 1.94 × 10^12 cell-updates/s on 4,096 nodes.
Strong scaling results for K-Athena on Summit on both CPUs and GPUs are shown in Fig. 6 (same test problem and algorithm as in Sec. 3.3.1). Overall, strong scaling in terms of parallel efficiency is better on CPUs than on GPUs. For example, for a 1,408^3 domain the parallel efficiency using CPUs remains high going from 32 to 512 nodes, whereas it drops substantially for the comparable GPU case (a 1,536^3 domain using 36 to 576 nodes). This is easily explained by comparison to the single CPU/GPU performance discussed in Sec. 3.3.1, which effectively corresponds to on-node strong scaling. The more pronounced decrease in parallel efficiency on the GPUs is a direct result of the decreased raw performance of GPUs with smaller problem sizes per GPU. The increased communication overhead of the strong scaling test plays only a secondary role. As a result, additional performance improvements, as discussed in the following section, will greatly benefit the strong scaling behavior of GPUs in general. Nevertheless, the raw performance of the GPUs still outperforms the CPUs by a large multiple despite the worse strong scaling parallel efficiency. For example, in the case discussed above on Summit, the per-node performance of GPUs over CPUs is still about 14 times higher at the largest node counts tested.

Fig. 6. Strong scaling for double precision MHD (algorithm described in Sec. 3) of K-Athena on NVIDIA V100 GPUs (6 GPUs per node; green solid lines) and IBM POWER9 CPUs (42 cores per node; orange/red dash-dotted lines) on Summit, for 1,536^3 and 3,072^3 domains in the GPU runs and 1,408^3 and 2,944^3 domains in the CPU runs. The top panel shows the raw performance in cell-updates per second per node and the bottom panel shows the parallel efficiency. The effective workload per GPU and per POWER9 CPU (21 cores) decreases correspondingly with increasing node count; the resulting effective workloads per node are comparable (within a few percent) between GPU and CPU runs.

4 CURRENT LIMITATIONS AND FUTURE ENHANCEMENTS
Our primary goal for the current version of K-Athena was to make GPU-accelerated simulations possible while maintaining CPU performance, and to do so with the smallest amount of code changes necessary. Naturally, this resulted in several trade-offs and leaves room for further (performance) improvements in the future.

For example, we are currently not making use of more advanced hardware features such as scratch spaces on GPUs. Scratch space can be shared among threads of a TeamPolicy and allows for efficient reuse of memory. We could use scratch space to reduce the number of reads from DRAM in stenciled kernels (like the reconstruction step). We could also fuse consecutive kernels to further reduce reads and writes to DRAM, although this would also increase register and possibly spill-store usage. Moreover, complex kernels such as a Riemann solver could be broken down further by using TeamThreadRange and ThreadVectorRange structures that are closer to the structure of the algorithm. This is in contrast to our current approach, where all kernels are treated equally, with the same execution policies independent of the individual algorithms within the kernels. The Riemann solver could also be split into separate kernels to reduce the number of registers needed, eliminate the use of spill stores on the GPU, and allow higher occupancy on the GPU.
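A minimal sketch of how team scratch memory can be used for such a stenciled kernel (illustrative only; this is not K-Athena code and the kernel is deliberately simplified):

    #include <Kokkos_Core.hpp>

    using team_policy = Kokkos::TeamPolicy<>;
    using member_type = team_policy::member_type;
    using scratch_pad = Kokkos::View<double*,
        Kokkos::DefaultExecutionSpace::scratch_memory_space,
        Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

    // Each team stages one pencil of data in fast scratch space and reuses it
    // for a stencil, instead of reloading the same values from DRAM.
    void StencilWithScratch(Kokkos::View<double**> u, Kokkos::View<double**> du) {
      const int nk = static_cast<int>(u.extent(0));
      const int ni = static_cast<int>(u.extent(1));
      const size_t bytes = scratch_pad::shmem_size(ni);
      Kokkos::parallel_for("stencil",
          team_policy(nk, Kokkos::AUTO).set_scratch_size(0, Kokkos::PerTeam(bytes)),
          KOKKOS_LAMBDA(const member_type &team) {
            const int k = team.league_rank();
            scratch_pad pencil(team.team_scratch(0), ni);
            // Stage one row of u into scratch once per team.
            Kokkos::parallel_for(Kokkos::TeamThreadRange(team, ni),
                [&](const int i) { pencil(i) = u(k, i); });
            team.team_barrier();
            // Reuse the staged values for a simple finite-difference stencil.
            Kokkos::parallel_for(Kokkos::TeamThreadRange(team, 1, ni - 1),
                [&](const int i) { du(k, i) = pencil(i + 1) - pencil(i - 1); });
          });
    }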
Similarly, on CPUs and Xeon Phis we are currently not using a Kokkos parallel execution pattern. The macro we introduced to easily exchange parallel patterns replaces the parallel region on CPUs and Xeon Phis with a simple nested for loop including a simd pragma, as shown in Listing 1. This is required for maximum performance, as current compilers are less effective in automatically vectorizing the more complex Kokkos patterns compared to simple for loops. On the one hand, we expect that this will become less of an issue with future compiler improvements. On the other hand, more kernels and more fine-tuned kernels, as discussed in the previous paragraph, will also automatically address this issue.

Another possible future improvement is an increase in parallel efficiency by overlapping communication and computation. While Athena++ is already built for asynchronous communication and a task-based execution model, more fine-grained optimizations are possible. For example, the spatial dimensions in the variable reconstruction step that occurs after the exchange of boundary information could be split, so that the kernel in the first dimension could run while the boundary information of the second and third dimensions is still being exchanged. In addition, the next major Kokkos release will contain more support for architecture-dependent task-based execution and, for example, will allow for the transparent use of CUDA streams.

CUDA streams may also help in addressing another current limitation of K-Athena on GPUs. Our minimal implementation approach currently requires all meshblocks to be allocated in a fixed memory space. This means that the total problem size that can currently be addressed with K-Athena is limited by the total amount of GPU memory available. An alternative approach is keeping the entire mesh in system memory, which is still several times larger than the GPU memory on many (if not all) current machines. For the execution of kernels, individual meshblocks would be copied back and forth between system memory and GPU memory. Here, CUDA streams could be used to hide these expensive memory transfers, as they would occur in the background while the GPU is executing different kernels. Theoretically, meshes larger than the GPU memory could already be used right now with the help of unified memory. However, given that the code is not optimized for efficient page migrations, the resulting performance degradation is large (more than a factor of 10). Thus, using unified memory with meshes larger than the GPU memory is not recommended.
5 CONCLUSIONS
We presented K-Athena, a Kokkos-based performance portable version of the finite volume MHD code Athena++. Kokkos is a C++ template library that provides abstractions for on-node parallel regions and the memory hierarchy. Our main goal was to enable GPU-accelerated simulations while maintaining Athena++'s excellent CPU performance using a single code base and with minimal changes to the existing code.

Generally, four main changes were required in the refactoring process. We changed the underlying memory management in Athena++'s multidimensional array class to transparently make use of Kokkos's equivalent multidimensional arrays, i.e., Kokkos::Views. We exchanged all (tightly) nested for loops with the equivalent Kokkos parallel region, e.g., a Kokkos::parallel_for, which are now kernels that can be launched on any supported device. We inlined all support functions (e.g., the equation of state) that are called within kernels. We changed the communication buffers to be Views so that MPI calls between GPU buffers are directly possible without going through system memory.

With all changes in place we performed both profiling and scaling studies across different platforms, including NASA's Electra system with Intel Skylake CPUs, ALCF's Theta system with Intel Xeon Phi Knights Landing, OLCF's Titan with AMD Opteron CPUs and NVIDIA Kepler GPUs, and OLCF's Summit machine with IBM POWER9 CPUs and NVIDIA Volta GPUs. Using a roofline model analysis, we demonstrated that the current implementation of the MHD algorithms is memory bound by either the DRAM, HBM, or MCDRAM bandwidths on CPUs and GPUs. Moreover, we calculated a performance portability metric of 83.1% across Xeon Phis, and 5 CPU and 4 GPU generations.

Detailed Kokkos profiling revealed that there is currently no universal Kokkos execution policy (i.e., how a parallel region is executed) that achieves optimal performance across different architectures. For example, a one-dimensional loop with manual index matching from 1D to 3D/4D is fastest on GPUs (achieving > 10^8 double precision MHD cell-updates/s on a single NVIDIA V100 GPU), whereas tightly nested for loops with simd directives are fastest on CPUs. This is a result of both a lack in compiler capabilities (e.g., with respect to automated vectorization) and Kokkos's specific implementation details. Both are expected to improve in the future, and Kokkos allows for the flexibility to easily exchange execution policies in general.

Strong scaling on GPUs is currently predominantly limited by individual GPU performance and not by communication. In other words, insufficient GPU utilization outweighs additional performance overhead with decreasing problem size per GPU.

Weak scaling is generally good, with parallel efficiencies of 70% and higher for more than 1,000 nodes across all machines tested. Notably, on Summit K-Athena achieves a total calculation speed of 1.94 × 10^12 cell-updates/s on 24,576 V100 GPUs at a speedup of 30 compared to using the available 172,032 CPU cores.

Finally, there is still a great deal of untapped potential left, e.g., using more advanced hardware features such as fine-grained nested parallelism, scratch pad memory (i.e., fast memory that can be shared among threads), or CUDA streams. Nevertheless, we achieved our primary performance portability goal of enabling GPU-accelerated simulations while maintaining CPU performance using a single code base. Moreover, we consider the current results highly encouraging and will continue with further development on the project's GitLab repository at https://gitlab.com/pgrete/kathena. Contributions of any kind are welcome!

ACKNOWLEDGMENTS
The authors would like to thank the Kokkos developers, particularly Christian Trott and Steve Bova, and the organizers of the 2018 Performance Portability with Kokkos Bootcamp for their help using Kokkos in Athena++. We thank Kristian Beckwith for inspiring discussions on Kokkos. We thank the Athena++ team for making their code public and for their well designed code. We acknowledge funding by a NASA Astrophysics Theory Program grant.

REFERENCES

[1] Y. Zheng, A. Kamil, M. B. Driscoll, H. Shan, and K. Yelick, "UPC++: A PGAS extension for C++," in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, May 2014, pp. 1105-1114.
[2] L. V. Kale and S. Krishnan, "Charm++: A portable concurrent object oriented system based on C++," Champaign, IL, USA, Tech. Rep., 1993.
[3] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, "Legion: Expressing locality and independence with logical regions," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '12. Los Alamitos, CA, USA: IEEE Computer Society Press, 2012, pp. 66:1-66:11. [Online]. Available: http://dl.acm.org/citation.cfm?id=2388996.2389086
[4] C. J. White, J. M. Stone, and C. F. Gammie, "An extension of the Athena++ code framework for GRMHD based on advanced Riemann solvers and staggered-mesh constrained transport," The Astrophysical Journal Supplement Series, vol. 225, no. 2, p. 22, 2016. [Online]. Available: http://stacks.iop.org/0067-0049/225/i=2/a=22
[5] J. Stone, K. Tomida, K. G. Felker, and C. White, "Athena++: a new code for radiation GRMHD," in prep., 2019.
[6] T. P. Straatsma, K. B. Antypas, and T. J. Williams, Exascale Scientific Applications: Scalability and Performance Portability, 1st ed. Chapman & Hall/CRC, 2017.
[7] L. Dagum and R. Menon, "OpenMP: An industry-standard API for shared-memory programming," IEEE Comput. Sci. Eng., vol. 5, no. 1, pp. 46-55, Jan. 1998. [Online]. Available: https://doi.org/10.1109/99.660313
[8] J. E. Stone, D. Gohara, and G. Shi, "OpenCL: A parallel programming standard for heterogeneous computing systems," Computing in Science Engineering, vol. 12, no. 3, pp. 66-73, May 2010.
[9] H. C. Edwards, C. R. Trott, and D. Sunderland, "Kokkos: Enabling manycore performance portability through polymorphic memory access patterns," Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3202-3216, 2014.
[10] R. D. Hornung and J. A. Keasler, "The RAJA portability layer: Overview and status," Lawrence Livermore National Laboratory, Tech. Rep. LLNL-TR-661403, 2014.
[11] M. A. Heroux et al., "An overview of the Trilinos project," ACM Trans. Math. Softw., vol. 31, no. 3, pp. 397-423, Sep. 2005. [Online]. Available: http://doi.acm.org/10.1145/1089014.1089021
[12] J. K. Holmen, A. Humphrey, D. Sunderland, and M. Berzins, "Improving Uintah's scalability through the use of portable Kokkos-based data parallel tasks," in Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, ser. PEARC17. New York, NY, USA: ACM, 2017, pp. 27:1-27:8. [Online]. Available: http://doi.acm.org/10.1145/3093338.3093388
[13] J. M. Stone, T. A. Gardiner, P. Teuben, J. F. Hawley, and J. B. Simon, "Athena: A new code for astrophysical MHD," The Astrophysical Journal Supplement Series, vol. 178, no. 1, p. 137, 2008. [Online]. Available: http://stacks.iop.org/0067-0049/178/i=1/a=137
[14] J. M. Stone and T. Gardiner, "A simple unsplit Godunov method for multidimensional MHD," New Astronomy, vol. 14, no. 2, pp. 139-148, 2009.
[15] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65-76, Apr. 2009.
[16] S. Pennycook, J. Sewall, and V. Lee, "Implications of a metric for performance portability," Future Generation Computer Systems.
[17] Y. J. Lo et al., "Roofline model toolkit: A practical tool for architectural and program analysis," in High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, S. A. Jarvis, S. A. Wright, and S. D. Hammond, Eds. Springer International Publishing, 2015, pp. 129-148.
[18] E. Konstantinidis and Y. Cotronis, "A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling," Journal of Parallel and Distributed Computing, vol. 107, pp. 37-56, Sep. 2017.
[19] "Intel Software Development Emulator," https://software.intel.com/en-us/articles/intel-software-development-emulator, accessed May 1, 2019.
[20] J. Treibig, G. Hager, and G. Wellein, "LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments," in Proceedings of the 39th International Conference on Parallel Processing Workshops, Sep. 2010, pp. 207-216.
[21] "Intel VTune Amplifier," https://software.intel.com/en-us/vtune, accessed May 1, 2019.
[22] U.-H. Zhang, H.-Y. Schive, and T. Chiueh, "Magnetohydrodynamics with GAMER," The Astrophysical Journal Supplement Series, vol. 236, no. 2, p. 50, 2018. [Online]. Available: http://stacks.iop.org/0067-0049/236/i=2/a=50
Philipp Grete
After receiving a B.Sc. in Computer Science in 2008 from the University of Cooperative Education in Stuttgart, Germany, Philipp Grete worked for Hewlett-Packard before studying Physics (B.Sc.) and Computer Science (M.Sc.) at the University of Göttingen, Germany, from 2010 to 2013. In his Ph.D. thesis (2014-2016, University of Göttingen, Germany) he worked on subgrid-scale modeling of compressible magnetohydrodynamic turbulence. Since October 2016, he has been a postdoctoral research associate at Michigan State University. His current research interests include fundamental processes involving magnetic fields in astrophysical fluids, numerical methods in computational fluid dynamics, and high-performance computing with an emphasis on performance portability.
Forrest W. Glines
Forrest Glines is a PhD student at Michigan State University working towards a dual degree in Astrophysics and Computational Mathematics, Science, and Engineering. He received his B.S. in Physics from Brigham Young University in 2016, publishing work on magnetohydrodynamics methods using GPUs. In 2015 and 2016 he also spent six months collectively at Los Alamos National Laboratory, simulating galaxy clusters and developing simulations coupling radiative transfer and hydrodynamics. Forrest's current research interests include the coupling of galaxy clusters with active galactic nuclei, plasma turbulence, the evolution of galaxies with magnetic fields, and developing simulations for the exascale era.