Portable high-order finite element kernels I: Streaming Operations
Noel Chalmers∗ and Tim Warburton†

September 24, 2020

∗ AMD Research, Advanced Micro Devices Inc., 7171 Southwest Pkwy, Austin, TX, 78735.
† Department of Mathematics, 225 Stanger Street, Virginia Tech, VA 24061.
Abstract
This paper is devoted to the development of highly efficient kernels performing vector operations relevant in linear system solvers. In particular, we focus on the low arithmetic intensity operations (i.e., streaming operations) performed within the conjugate gradient iterative method, using the parameters specified in the CEED benchmark problems [2] for high-order hexahedral finite elements. We propose a suite of new Benchmark Streaming tests to focus on the distinct streaming operations which must be performed. We implemented these new tests using the OCCA abstraction framework [13] to demonstrate portability of these streaming operations on different GPU architectures, and propose a simple performance model for such kernels which can accurately capture data movement rates as well as kernel launch costs.
Keywords: High-order, finite element, GPU, AMD HIP, CUDA
Introduction

The current outlook for scientific applications in high performance computing (HPC) is a heterogeneous one, where high performance systems consist of a single or dual socket multi-core CPU paired with several GPU accelerator devices. This landscape has become the de facto norm since the introduction of the Summit and Sierra supercomputers, both of which utilize Nvidia V100 GPU accelerators. In these systems, effective use of the accelerators on each node is required to harness the system's high bandwidth and floating-point operation (FLOP) rates. Currently on the horizon for HPC is the Aurora supercomputer, which will use Intel-manufactured GPU accelerators, and the Frontier and El Capitan supercomputers, which will use AMD GPU accelerators. There is therefore a need for HPC applications currently implemented with Nvidia's CUDA programming model to be portable to other device architectures, and to understand the performance characteristics of these new platforms as we approach exascale computing.

A recently formed co-design center within the U.S. Department of Energy (DOE) Exascale Computing Project (ECP) is the Center for Efficient Exascale Discretizations (CEED). One of this co-design center's research thrusts has been to propose a suite of Benchmark Problems (BPs) [2] to serve as performance tests for comparison of several software implementations of high-order finite element methods. Each benchmark problem consists of measuring the performance of solving a linear system with a matrix originating from different high-order finite element operators, via a Conjugate Gradient (CG) iterative method. In [8] the benchmark problems are discussed in detail, along with performance results from several software implementations. In this paper, we focus on the low arithmetic intensity streaming operations required by these benchmarks and propose a suite of Benchmark Streaming (BS) tests, which we evaluate on four GPU accelerators: an Nvidia Titan V, an Nvidia V100, an AMD Radeon™ VII, and an AMD Radeon Instinct™ MI60. We also propose a simple performance model for the BS tests which we show can accurately model the observed bandwidths of each GPU considered.

It should be noted that memory bandwidth benchmarks for GPU accelerators are not a new proposal. Modern benchmark suites such as Babelstream [6], Mixbench [12], and SHOC [5] contain streaming kernel tests similar to some of the streaming tests we propose here. In particular, Babelstream considers 'add', 'triad', and 'dot' tests which are similar to the AXPY and inner product tests we propose. Triad and a reduction test are also considered in SHOC. We include each test in the BS suite, even those similar to tests in other suites, for completeness, since every test we propose is relevant to the CG iterative method.

The remainder of this paper is organized as follows: we first give some brief details of the formulation of the CEED BPs in order to motivate the proposed gather and scatter streaming tests. We then detail our proposed Benchmark Streaming tests, and we propose a simple performance model of these streaming tests. Afterwards, we give some details of our open-source implementation of our proposed tests. Finally, we present performance data for each of the BS tests on four separate GPU accelerator devices, and conclude with summarizing remarks.
The CEED Benchmark Problems

Here we give brief details regarding the formulation of the CEED BPs. While we do not consider the full benchmark problems in this paper, we introduce the basic finite element formulation in order to detail two specific streaming operations, namely the gather and scatter operations. To begin, we consider an unstructured mesh of a domain
$D \subset \mathbb{R}^3$ into $K$ hexahedral elements $D^e$, where $e = 1, \ldots, K$, such that
$$ D = \bigcup_{e=1}^{K} D^e. \qquad (1) $$
We assume that each hexahedral element $D^e$ is the image of the reference bi-unit cube $\hat{D} = \{(r,s,t) : -1 \le r, s, t \le 1\}$ under a bijective map $\Gamma^e : \hat{D} \to D^e$. On each element $D^e$ we consider a set of nodes $\mathcal{N}^e_p = \{x^e_i, y^e_j, z^e_k\}_{i,j,k=0}^{p}$, defined as the image under $\Gamma^e$ of the tensor product of $p+1$ Gauss-Lobatto-Legendre (GLL) nodes on the reference cube, denoted $\hat{\mathcal{N}}_p = \{r_i, s_j, t_k\}_{i,j,k=0}^{p}$.

On each element, we consider the space of polynomials of degree at most $p$ in each spatial dimension, and denote this space $Q^p(D^e)$. As a polynomial basis of $Q^p(D^e)$, we choose Lagrange interpolation polynomials interpolating the nodal set $\mathcal{N}^e_p$. On the reference cube $\hat{D}$ we find that the basis of $Q^p(\hat{D})$ can be written as a tensor product of the one-dimensional Lagrange interpolation polynomials $\{l_i\}_{i=0}^{p}$ interpolating the GLL nodes on the interval $[-1,1]$:
$$ Q^p(\hat{D}) = \mathrm{span}\left(\{l_{ijk}(r,s,t)\}_{i,j,k=0}^{p}\right), \quad \text{where} \quad l_{ijk}(\mathbf{r}) = l_i(r)\, l_j(s)\, l_k(t), $$
for all $0 \le i, j, k \le p$, and $\mathbf{r} = (r,s,t)$. Moreover, due to the construction of $\mathcal{N}^e_p$, the polynomial basis of $Q^p(D^e)$ can be written in terms of these reference basis polynomials. That is, given $u^e \in Q^p(D^e)$, we can write $u^e$ as
$$ u^e(\mathbf{x}) = \sum_{i=0}^{p} \sum_{j=0}^{p} \sum_{k=0}^{p} u^e_{ijk}\, l_{ijk}\!\left((\Gamma^e)^{-1}(\mathbf{x})\right), \qquad (2) $$
where $\mathbf{x} = (x,y,z)$. From the definition of the Lagrange polynomials $l_i$ and the element mapping $\Gamma^e$, we find that the basis coefficients $u^e_{ijk}$ satisfy the interpolation property $u^e(x_i, y_j, z_k) = u^e_{ijk}$ for all $(x_i, y_j, z_k) \in \mathcal{N}^e_p$. We denote by $\mathbf{u}^e \in \mathbb{R}^{(p+1)^3}$ the vector of interpolation values $u^e_{ijk}$ on $D^e$.

On the full domain $D$ we define the finite element space $V$ to be
$$ V = \left\{ v \in H^1(D) \;\middle|\; v|_{D^e} \in Q^p(D^e),\ e = 1, \ldots, K \right\}. $$
From the interpolation property of the polynomial basis of $Q^p(D^e)$ on each element $D^e$, we see that a function $v \in V$ is completely determined by its nodal values at each of the unique interpolation nodes in the mesh of $D$. Let us denote this set of all the unique interpolation nodes in $D$ by $\mathcal{N}_G$ and define $N_G = |\mathcal{N}_G|$ to be the total number of interpolation nodes in the mesh. For a given $v \in V$ we can therefore represent $v$ as the vector $\mathbf{v}_G \in \mathbb{R}^{N_G}$ of the values of $v$ at each of the interpolation nodes in $\mathcal{N}_G$.

It is also useful to represent a function $v \in V$ as a larger vector of length $N_L = K(p+1)^3$ whose entries are ordered by the elements $D^e$ in the mesh of $D$. That is, we represent $v$ as a vector $\mathbf{v}_L \in \mathbb{R}^{N_L}$, such that $\mathbf{v}_L = (\mathbf{v}^1, \ldots, \mathbf{v}^K)^T$. Each entry of $\mathbf{v}_L$ therefore corresponds to a distinct entry in $\mathbf{v}_G$, and the two vectors can be connected via a scatter operator $Z$:
$$ \mathbf{v}_L = Z \mathbf{v}_G. \qquad (3) $$
The matrix $Z$ is therefore an $N_L \times N_G$ Boolean matrix with each row containing precisely one nonzero. The transpose of the scatter operator, $Z^T$, is often referred to as the gather operator. Both operators are described in more detail in [1], [7], and [10].

The specific finite element operators considered in each of the CEED benchmark problems consist of some variant of the problem of finding $u \in V$ such that
$$ \int_D \nabla v^T \nabla u + \lambda v u \, d\mathbf{x} = \int_D v f \, d\mathbf{x} \qquad (4) $$
for all $v \in V$, given a function $f \in V$.
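Before discretizing, it is worth noting that, because each row of $Z$ contains a single unit entry, $Z$ never needs to be stored as an explicit sparse matrix: it is fully described by one global column index per local row. The following minimal serial C++ sketch (our own illustration, with hypothetical names, not code from the paper's repository) shows the scatter $\mathbf{v}_L = Z\mathbf{v}_G$ and the gather $Z^T \mathbf{v}_L$ under this assumption.

#include <cstddef>
#include <vector>

// Scatter: v_L = Z v_G. Each local node n copies the value of its unique
// global node localToGlobal[n] (the column index of the single nonzero in
// row n of Z).
void scatter(const std::vector<std::size_t>& localToGlobal,
             const std::vector<double>& vG,   // global DOFs, length N_G
             std::vector<double>& vL) {       // local DOFs,  length N_L
  for (std::size_t n = 0; n < vL.size(); ++n)
    vL[n] = vG[localToGlobal[n]];
}

// Gather: v_G = Z^T v_L. Since Z^T has one unit entry per column, its
// action sums all local copies of each global node.
void gather(const std::vector<std::size_t>& localToGlobal,
            const std::vector<double>& vL,
            std::vector<double>& vG) {
  for (double& v : vG) v = 0.0;
  for (std::size_t n = 0; n < vL.size(); ++n)
    vG[localToGlobal[n]] += vL[n];
}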
After discretizing via the finite element method, the details of which we omit here, the global problem (4) can be written as the following system for a vector of global degrees of freedom $\mathbf{u}_G$,
$$ S_G \mathbf{u}_G + \lambda M_G \mathbf{u}_G = \mathbf{b}_G, \qquad (5) $$
where the actions of the global stiffness matrix $S_G$ and mass matrix $M_G$ are formed by the composition of scattering, local stiffness/mass action, and gathering operations. That is, the following forms hold:
$$ S_G = Z^T S_L Z, \qquad (6) $$
$$ M_G = Z^T M_L Z, \qquad (7) $$
where $S_L$ and $M_L$ are block diagonal matrices whose blocks contain non-zeros only on the degrees of freedom local to each element.

It is common in some high-order finite element software to consider the system (5) instead in a scattered form, by applying the scatter operator $Z$ to both sides of (5) and solving for the scattered degrees of freedom, recalling the definition of the scattered vector of degrees of freedom in (3). However, storing the full finite element problem in the scattered degrees of freedom $\mathbf{u}_L$ requires significantly more data movement compared with the considerably smaller vector $\mathbf{u}_G$. Since we do not consider the solution of the full finite element problem (5) in this paper, we instead investigate the performance of both the gather and scatter operators in order to quantify the cost of moving between gathered and scattered degree of freedom storage.

Benchmark Streaming Tests

Each of the CEED benchmark problems involves performing a Conjugate Gradient iterative method. We list the pseudo-code for this iterative method in Algorithm 1.

Algorithm 1: Conjugate Gradient Method
 1: Input: initial guess x, right-hand-side vector b, linear operator A, tolerance ε
 2: Output: solution x, iteration count j
 3: Set j = 0
 4: Set r_0 = b − A x
 5: Set p = r_0
 6: while (r_j · r_j > ε) do
 7:   α = (r_j · r_j) / (p · A p)
 8:   x = x + α p
 9:   r_{j+1} = r_j − α A p
10:   β = (r_{j+1} · r_{j+1}) / (r_j · r_j)
11:   p = r_{j+1} + β p
12:   j = j + 1
13: end while

The method consists of the following simple vector operations:

• Vector addition: y = α x + β y
• Dot product: α = x · y
• Norm: α = x · x
• Matrix-vector product: y = A x

The first three operations perform only a small number of floating point operations per vector value loaded, and therefore their performance is limited by the rate at which vector data can be streamed to and from main memory. Likewise, though the element-local blocks of the finite element operators $S_L$ and $M_L$ are dense, the transition between global and local degrees of freedom via the scatter operator $Z$ in (6)-(7) is of similarly low arithmetic intensity. We therefore refer to these operations as streaming operations.

To help investigate the performance and portability of the relevant streaming operations in the CEED benchmarks, we propose a suite of Benchmark Streaming (BS) problems, listed in Table 1. These BS problems consist of individual tests for the following vector operations: copy, AXPY, norm, inner product, fused CG update, gather, and scatter. Each of the BS problems is motivated by vector operations performed in the CG iterative method.

Test   Name                   Description
BS1    Vector Copy            y = x
BS2    Vector AXPY            y = α x + β y
BS3    Vector Norm            α = x · x
BS4    Vector Inner Product   α = x · y
BS5    Fused CG Update        x = x + α p,  r = r − α A p,  β = r · r
BS6    Gather                 x_G = Z^T x_L
BS7    Scatter                x_L = Z x_G

Table 1: Summary of Benchmark Streaming problems.
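To make the correspondence concrete, the sketch below (serial reference code of our own, with hypothetical helper names, not the GPU kernels from the benchmark suite) expresses the loop body of Algorithm 1 entirely in terms of the BS operations. The norm check in step 6 corresponds to BS3, the setup copy in step 5 to BS1, and only the operator application A p falls outside the streaming tests.

#include <cstddef>
#include <vector>
using Vec = std::vector<double>;

// BS4: inner product, alpha = x . y
double bs4_innerProduct(const Vec& x, const Vec& y) {
  double a = 0.0;
  for (std::size_t n = 0; n < x.size(); ++n) a += x[n]*y[n];
  return a;
}

// BS2: AXPY, y = a x + b y
void bs2_axpy(double a, const Vec& x, double b, Vec& y) {
  for (std::size_t n = 0; n < y.size(); ++n) y[n] = a*x[n] + b*y[n];
}

// BS5: fused CG update, x += alpha p, r -= alpha Ap, returns r . r
double bs5_fusedUpdate(double alpha, const Vec& p, const Vec& Ap,
                       Vec& x, Vec& r) {
  double rdotr = 0.0;
  for (std::size_t n = 0; n < x.size(); ++n) {
    x[n] += alpha*p[n];
    r[n] -= alpha*Ap[n];
    rdotr += r[n]*r[n];
  }
  return rdotr;
}

// One pass through the loop body of Algorithm 1 (steps 7-12), given the
// operator application Ap = A p computed elsewhere. rdotr holds r_j . r_j
// on entry and r_{j+1} . r_{j+1} on exit.
void cgStep(const Vec& Ap, Vec& x, Vec& r, Vec& p, double& rdotr) {
  const double alpha    = rdotr / bs4_innerProduct(p, Ap); // step 7  (BS4)
  const double rdotrNew = bs5_fusedUpdate(alpha, p, Ap, x, r); // steps 8-10 (BS5)
  const double beta     = rdotrNew / rdotr;                // step 10
  bs2_axpy(1.0, r, beta, p);                               // step 11 (BS2)
  rdotr = rdotrNew;                                        // step 12 bookkeeping
}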
Note that the fused CG update in BS5 originates from noticing that steps 8-10 of the CG method in Algorithm 1 can be fused into a single operation, which saves the data movement of an additional read and write of r.

The relevant figure of merit in each of the BS problems is data throughput, measured in GB/s. To eliminate sources of variance between different vendors' implementations of device timers in heterogeneous compute models, we measure all time with respect to the host process with a timer such as MPI_Wtime.

The first five BS problems are simple vector operations and are thus more flexible, in that they can be performed in the absence of any knowledge of an underlying finite element mesh. BS6 and BS7, however, require knowledge of the finite element mesh via the scatter operator $Z$. For these problems, we specify the same configuration as detailed for the CEED BPs above. We note that although the gather/scatter operators in BS6 and BS7 could have a non-trivial parallel implementation with MPI, we restrict our attention solely to local computations on a single MPI rank in this paper. The OKL kernel sources for the BS1, BS2, BS3 (partial reduction), BS6, and BS7 tests are shown in Listings 1-5, and are discussed in detail in the Implementation section below.

// BS1: Copy an array
@kernel void BS1(const int N,                 // number of entries in array
                 @restrict const double *a,   // input:  device array
                 @restrict double *b) {       // output: device array

  // parallel loop blocked over thread-blocks of size p_BLOCKSIZE
  for (int n = 0; n < N; ++n; @tile(p_BLOCKSIZE, @outer, @inner)) {
    b[n] = a[n];
  }
}

Listing 1: BS1: entry-wise copy of an array.

// BS2: AXPY
@kernel void BS2(const int N,                 // number of entries in array
                 const double alpha,          // input: scalar
                 @restrict const double *x,   // input: device array
                 const double beta,           // input: scalar
                 @restrict double *y) {       // output: device array

  // parallel loop blocked over thread-blocks of size p_BLOCKSIZE
  for (int n = 0; n < N; ++n; @tile(p_BLOCKSIZE, @outer, @inner)) {
    y[n] = alpha*x[n] + beta*y[n];
  }
}

Listing 2: BS2: scaled in-place addition of two arrays.

// BS3_1: first reduction kernel for computing vector norm
@kernel void BS3_1(const int Nblocks,            // num entries in partially-reduced array
                   const int N,                  // number of entries in array
                   @restrict const double *a,    // input:  device array
                   @restrict double *norm_tmp) { // output: partially-reduced norm

  // parallel loop over Nblocks work-groups
  for (int b = 0; b < Nblocks; ++b; @outer(0)) {

    @shared volatile double s_norm[p_BLOCKSIZE];

    // each thread strides through the array, accumulating a partial sum
    for (int t = 0; t < p_BLOCKSIZE; ++t; @inner(0)) {
      s_norm[t] = 0.0;
      for (int id = t + b*p_BLOCKSIZE; id < N; id += p_BLOCKSIZE*Nblocks) {
        s_norm[t] += a[id]*a[id];
      }
    }

    // binary tree reduction in shared memory
    for (int k = p_BLOCKSIZE/2; k >= 2; k >>= 1) {
      for (int t = 0; t < p_BLOCKSIZE; ++t; @inner(0)) {
        if (t < k) s_norm[t] += s_norm[t+k];
      }
    }

    // combine the last two entries and write out this block's partial norm
    for (int t = 0; t < p_BLOCKSIZE; ++t; @inner(0)) {
      if (t < 1) norm_tmp[b] = s_norm[0] + s_norm[1];
    }
  }
}

Listing 3: BS3: simple partial reduction kernel for computing the vector norm.

// BS6: Gather
@kernel void BS6(const int Nblocks,                // number of blocks in grid
                 @restrict const int *blockStarts, // offset array blocking rows
                 @restrict const int *rowStarts,   // row starts in CSR data structure
                 @restrict const int *colIds,      // column ids in CSR data structure
                 @restrict const double *q,        // input:  device array
                 @restrict double *gather_q) {     // output: gathered device array

  for (int b = 0; b < Nblocks; ++b; @outer(0)) {

    @exclusive int blockStart, blockEnd;
    @shared double temp[p_NODES_PER_BLOCK];

    // read all data needed to gather a block of rows
    for (int n = 0; n < p_BLOCKSIZE; ++n; @inner(0)) {
      blockStart = blockStarts[b];
      blockEnd   = blockStarts[b+1];

      const int row = rowStarts[blockStart] + n;
      // checking for the last block lets the compiler unroll the load
      if (b != Nblocks-1)
        for (int i = 0; i < p_NODES_PER_BLOCK; i += p_BLOCKSIZE)
          temp[i+n] = q[colIds[i+row]];
      else
        for (int i = 0; row+i < rowStarts[blockEnd]; i += p_BLOCKSIZE)
          temp[i+n] = q[colIds[i+row]];
    }

    // sum the entries of each row in this block
    for (int n = 0; n < p_BLOCKSIZE; ++n; @inner(0)) {
      for (int row = n + blockStart; row < blockEnd; row += p_BLOCKSIZE) {
        const int colStart = rowStarts[row]   - rowStarts[blockStart];
        const int colEnd   = rowStarts[row+1] - rowStarts[blockStart];
        double gq = 0.0;
        for (int i = colStart; i < colEnd; i++)
          gq += temp[i];

        gather_q[row] = gq;
      }
    }
  }
}

Listing 4: BS6: gather operator.

// BS7: Scatter
@kernel void BS7(const int N,                     // number of entries in scattered array
                 @restrict const int *scatterIds, // ids of scattered values
                 @restrict const double *gather_q,// input:  gathered array
                 @restrict double *q) {           // output: scattered array

  for (int n = 0; n < N; ++n; @tile(p_BLOCKSIZE, @outer(0), @inner(0))) {
    const int id = scatterIds[n];
    if (id >= 0)
      q[n] = gather_q[id];
  }
}

Listing 5: BS7: scatter operator.

Performance Model

The goal of the streaming tests proposed in this paper is not simply to obtain a measure of a processor's maximum bandwidth in these simple vector operations, but to provide valuable insight into modeling such operations when estimating factors such as weak- and strong-scaling efficiency. To this end, we propose here a simple performance model of these streaming kernels, within which are two useful performance parameters.

In most modern fork & join programming paradigms, for example offloading to a GPU accelerator, it is common for there to be some measure of 'launch latency'. That is, there is some non-zero time required simply to offload work to an accelerator, or to fork and join a group of threads. Let us denote the time it takes to stream $B$ bytes of data as $T(B)$. A simple but effective model for $T(B)$ consists of this minimum time to launch a streaming kernel, $T_0$, plus the time it takes to stream $B$ bytes. Assuming that the time to stream $B$ bytes is simply a linear function of the amount of data $B$, a simple Amdahl-type model for $T(B)$ can be written as
$$ T(B) = T_0 + \frac{B}{W_{\max}}. \qquad (8) $$
The parameter $W_{\max}$ is therefore the asymptotic maximum bandwidth that can be attained by the kernel for streaming data to and from the device's main memory. The effective bandwidth achieved by the device when streaming $B$ bytes of data is therefore modeled as
$$ W_{\mathrm{eff}}(B) = \frac{B}{T(B)} = \frac{B}{T_0 + B/W_{\max}}. \qquad (9) $$
Fitting this model to the measured bandwidth in each of the streaming tests therefore yields several useful estimates. First, $T_0$ gives us an estimate of the fixed launch cost. Second, $W_{\max}$ gives us an estimate of the asymptotic peak bandwidth which the device can effectively achieve. Together, these parameters used in (9) give effective timing/bandwidth estimates for different sized buffers being streamed. For example, it can be useful to know precisely how much data must be moved by a kernel for the device to be operating at 80% of its theoretical maximum throughput. This 80% efficiency size in bytes, which we denote by $B_{0.8}$, has been termed the Fischer-Brown point. This value can be predicted by our performance model as follows:
$$ W_{\mathrm{eff}}(B_{0.8}) = 0.8\, W_{\max} \;\Rightarrow\; \frac{B_{0.8}}{T_0 + B_{0.8}/W_{\max}} = 0.8\, W_{\max} \;\Rightarrow\; B_{0.8} = 4\, T_0\, W_{\max}. \qquad (10) $$
Since the product $T_0 W_{\max}$ is a measure of how much data the device could theoretically have moved in the time taken to offload and sync, this result matches intuition in saying that, to reach 80% efficiency, we should move four times this amount of data in the kernel.
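To illustrate how the two model parameters can be recovered from measurements, the following minimal C++ sketch (our own illustration, not code from the paper's repository) performs an ordinary least-squares fit of the linear model (8) to a set of (B, T) samples and reports the derived Fischer-Brown point (10). The key observation is that (8) is linear in the unknowns T0 and 1/Wmax, so a standard linear regression of time against bytes suffices.

#include <cstdio>
#include <vector>

// Fit T(B) = T0 + B/Wmax by least squares on (B_i, T_i) samples. The model
// is linear in the unknowns a0 = T0 (intercept) and a1 = 1/Wmax (slope).
void fitStreamingModel(const std::vector<double>& B,  // bytes moved per trial
                       const std::vector<double>& T,  // measured times (seconds)
                       double& T0, double& Wmax) {
  const double n = (double) B.size();
  double sB = 0, sT = 0, sBB = 0, sBT = 0;
  for (std::size_t i = 0; i < B.size(); ++i) {
    sB += B[i]; sT += T[i]; sBB += B[i]*B[i]; sBT += B[i]*T[i];
  }
  const double a1 = (n*sBT - sB*sT) / (n*sBB - sB*sB); // slope     = 1/Wmax
  const double a0 = (sT - a1*sB) / n;                  // intercept = T0
  T0   = a0;
  Wmax = 1.0/a1;
}

int main() {
  // Synthetic samples generated from T0 = 3 us, Wmax = 800 GB/s.
  std::vector<double> B, T;
  for (double b = 1.e6; b <= 1.e9; b *= 2.0) {
    B.push_back(b);
    T.push_back(3.e-6 + b/800.e9);
  }
  double T0, Wmax;
  fitStreamingModel(B, T, T0, Wmax);
  std::printf("T0 = %.2f us, Wmax = %.0f GB/s, B_0.8 = %.1f MB\n",
              1.e6*T0, 1.e-9*Wmax, 1.e-6*4.0*T0*Wmax); // B_0.8 from (10)
  return 0;
}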
Implementation

We implement each of the benchmark streaming tests described above in C++, using a stripped-down version of the open-source finite element library libParanumal [3] as a base to perform mesh generation and handle data structures. The source code for each of our BS tests is freely available in the streamParanumal GitHub repository [4].

The libParanumal core of our benchmark implementation uses the Open Concurrent Compute Abstraction (OCCA) library [13] to abstract between different parallel languages such as OpenMP, OpenCL™, CUDA, and HIP. OCCA allows us to implement the parallel kernel code for each of the BS tests in a (slightly decorated) C++ language, called OKL. At runtime, the user can specify which parallel programming model to target, after which OCCA translates the OKL source code into the desired target language and Just-In-Time (JIT) compiles kernels for the user's target hardware architecture.

In the OKL language, parallel loops and variables in special memory spaces are described with simple attributes. For example, iterations of nested parallel for loops in the kernel are annotated with @outer and @inner to describe how they are to be mapped to a grid of work-groups and work-items in OpenCL™, or thread-blocks and threads in CUDA and HIP. All iterations that are annotated with @outer or @inner are assumed to be free of loop-carried dependencies.

The compute kernels for BS1 and BS2 are relatively short, given the simple nature of both benchmarks. We list the OKL source for BS1 and BS2 in Listings 1 and 2, respectively. These kernels can be written compactly using OCCA's @tile attribute, which instructs the loop to be parallelized over a grid of blocks of size p_BLOCKSIZE, which we typically take to be 256 or 512.

The BS3-BS5 benchmarks, on the other hand, each require performing some sort of reduction. These benchmarks are implemented using two compute kernels each: the first performs a partial reduction of the needed array, while the second completes the reduction to a single value. We list a particularly simple example of a partial reduction for computing the vector norm in BS3 in Listing 3. This OKL kernel reduces the norm of the entire vector to an array of Nblocks entries, where Nblocks is typically some relatively small value such as 512. The second kernel, which reduces this partially reduced array to a single value, is written analogously.

The reduction implementation in Listing 3 performs quite well on each GPU architecture we consider in our computational tests below. However, it is not necessarily the only implementation of a reduction used in our BS tests. Often, small improvements in the asymptotic maximum achieved bandwidth ($W_{\max}$ in our performance model) can be obtained by completing the last warp/wavefront-wide stage of the reduction in register memory using intrinsic shfl instructions (or shfl_sync in the case of CUDA 9+). These specialized reduction implementations can be selected based on the target device at JIT compilation time. For brevity, we do not list each of these specialized reductions in this paper, nor the OKL kernel source code for the BS4 and BS5 tests, but they can be found in the source code for each of the BS tests in [4].

To implement the BS6 and BS7 tests, we require several data structures to be constructed during mesh setup. In particular, we require some sort of sparse matrix representation of the scatter and gather operators, $Z$ and $Z^T$, respectively, described above.
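For illustration, a minimal sketch of how such structures might be assembled from a local-to-global node numbering follows; the names and layout are our own and only approximate the actual streamParanumal setup code. The gather operator Z^T is stored in CSR form without a values array, and the scatter operator Z reduces to the index map scatterIds used in Listing 5.

#include <numeric>
#include <vector>

// Build CSR-style arrays for the gather operator Z^T (rowStarts, colIds)
// and the index map for the scatter operator Z (scatterIds), given
// localToGlobal[n] = global node id of local node n.
void buildGatherScatter(const std::vector<int>& localToGlobal, int Nglobal,
                        std::vector<int>& rowStarts,    // length Nglobal+1
                        std::vector<int>& colIds,       // length N_L
                        std::vector<int>& scatterIds) { // length N_L
  const int NL = (int) localToGlobal.size();

  // count the local copies of each global node (nonzeros per row of Z^T)
  rowStarts.assign(Nglobal + 1, 0);
  for (int n = 0; n < NL; ++n) rowStarts[localToGlobal[n] + 1]++;
  std::partial_sum(rowStarts.begin(), rowStarts.end(), rowStarts.begin());

  // fill the column ids: the local nodes contributing to each global node
  colIds.assign(NL, 0);
  std::vector<int> counter(rowStarts.begin(), rowStarts.end() - 1);
  for (int n = 0; n < NL; ++n)
    colIds[counter[localToGlobal[n]]++] = n;

  // the scatter is the transpose: each local node reads one global node
  // (a negative entry would mark a node masked out by boundary conditions)
  scatterIds = localToGlobal;
}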
In the libParanumal mesh setup code, we simplify how these sparse matrices are stored by noting that each entry of $Z$ is 1, and that $Z$ has one non-zero entry per row. This allows us to implement the gather and scatter operators for BS6 and BS7 with very specialized sparse matrix multiplication compute kernels.

We list the OKL kernel source for BS6 in Listing 4. This kernel's implementation is inspired by the study of optimized sparse CSR matrix-vector multiplication on GPU accelerators in [9]. In particular, in addition to representing the non-zero entries of the gather operator $Z^T$ in sparse CSR format (omitting the array of matrix values, since they are all ones), we construct an array of offsets into the array of row starts, which we call the blockStarts array. We populate this array such that the number of rows to be processed by each thread-block is approximately balanced. Specifically, rows are blocked in such a way that each thread-block reads approximately p_NODES_PER_BLOCK entries of the sparse CSR matrix into shared memory, where the actual matrix action is computed. We typically select p_BLOCKSIZE to be 256 and p_NODES_PER_BLOCK to be some small multiple of p_BLOCKSIZE, such as 512. This allows each thread-block to read and write approximately the same amount of data in an efficient manner. Since the matrix $Z^T$ has at most one non-zero per column, there is relatively little opportunity for effective cache utilization when reading the sparse entries of q, but spatial locality of the local DOF storage in q may help somewhat.

Finally, we list the OKL kernel source for BS7 in Listing 5. In comparison to the gather kernel, this kernel is quite simple. This is because we can leverage the fact that each row of the scatter operator $Z$ has one non-zero entry in order to compute a mapping array scatterIds, which maps each of the local storage DOFs to its index in the global DOF array. We reserve negative values in the scatterIds array for the case where degrees of freedom are masked out of the scatter operator action, say through enforcement of boundary conditions.
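To give a flavor of the host-side driver, the following sketch shows one plausible way a BS kernel can be JIT compiled and timed through OCCA's C++ API. It is our own simplified illustration, not a verbatim excerpt from streamParanumal: the file name bs1.okl and the property setup are assumptions, and we time with std::chrono rather than the MPI_Wtime timer used by the benchmarks, to keep the example dependency-free.

#include <chrono>
#include <cstdio>
#include <vector>
#include <occa.hpp>

int main() {
  const int N = 1 << 24;
  std::vector<double> a(N, 1.0);

  // Select a backend at runtime ("Serial", "OpenMP", "CUDA", "HIP", and
  // "OpenCL" are among the modes OCCA supports).
  occa::device device("{mode: 'HIP', device_id: 0}");

  // JIT compile the BS1 kernel from its OKL source ("bs1.okl" is a
  // placeholder path), defining the block size used by @tile.
  occa::properties props;
  props["defines/p_BLOCKSIZE"] = 256;
  occa::kernel bs1 = device.buildKernel("bs1.okl", "BS1", props);

  occa::memory o_a = device.malloc(N*sizeof(double), a.data());
  occa::memory o_b = device.malloc(N*sizeof(double));

  bs1(N, o_a, o_b); // warm-up launch
  device.finish();

  // Time 20 launches from the host, as in the benchmark methodology.
  auto start = std::chrono::high_resolution_clock::now();
  for (int trial = 0; trial < 20; ++trial) bs1(N, o_a, o_b);
  device.finish(); // wait for all launches to complete
  auto end = std::chrono::high_resolution_clock::now();

  const double time  = std::chrono::duration<double>(end - start).count();
  const double bytes = 20.0*2.0*N*sizeof(double); // one read + one write per entry
  std::printf("BS1 bandwidth: %.1f GB/s\n", 1.e-9*bytes/time);
  return 0;
}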
Results

In this section, we present results from our implementation of the BS benchmarks on four current generation GPU accelerators from two manufacturers. We demonstrate that each of our tests is able to obtain reasonably high throughput of device global HBM memory, and that the performance model detailed above fits the observed performance on each of the considered devices very well.

The four contemporary GPU accelerator devices considered in this paper are the Titan V and V100 GPUs manufactured by Nvidia, and the Radeon™ VII and Radeon Instinct™ MI60 GPUs manufactured by AMD. In Table 2 we show some physical specifications for each of the considered GPUs, including their thermal design power (TDP), number of streaming multiprocessors/compute units (CUs), aggregate register space (REG), aggregate shared memory/local data storage (LDS), aggregate default lowest level cache (L1), aggregate second level cache size (L2), and amount of second generation high bandwidth memory (HBM2).

Model              TDP     CUs   REG (MB)  LDS (MB)  L1 (MB)  L2 (MB)  HBM2
AMD Radeon™ VII    250 W   60    15        3.75      0.9      4        16 GB
AMD MI60           225 W   64    16        4         1        4        32 GB
Nvidia Titan V     250 W   80    20        7.5       2.5      4.5      12 GB
Nvidia V100        300 W   80    20        7.5       2.5      6        32 GB

Table 2: Specifications for four contemporary graphics processing units. Here, CU stands for the number of compute units on AMD GPUs or the number of streaming multiprocessors on Nvidia GPUs, REG is the aggregate size of the register files, LDS is the aggregate amount of local shared cache, L1 is the aggregate amount of lowest level cache, L2 is the total amount of second level cache, and HBM2 is the amount of high-bandwidth memory.

The most important metric for each GPU's overall performance in the BS benchmarks is the delivered bandwidth to global HBM memory. While the physical specifications of the Nvidia V100 and Titan V GPUs differ only in their TDP and HBM2 amounts, the former has four 8 GB HBM2 memory stacks while the latter has only three 4 GB HBM2 memory stacks. This leads to a difference in theoretical peak HBM bandwidth between the two devices, with the V100 having a theoretical peak HBM bandwidth of 900 GB/s and the Titan V having a theoretical peak HBM bandwidth of 653 GB/s. Since each memory stack has associated L2 cache, this also leads to the difference in L2 cache size.

By contrast, the AMD Radeon™ VII and Radeon Instinct™ MI60 each have four HBM2 memory stacks, the former using 4 GB stacks and the latter 8 GB stacks. Both GPUs therefore have a theoretical peak HBM bandwidth of 1000 GB/s. Despite this advantage in theoretical bandwidth over the Nvidia GPUs, it is important to note that the AMD GPUs also have comparatively fewer compute units, a lower aggregate register count, smaller aggregate LDS space, smaller aggregate L1 cache, and a smaller L2 cache. These variations between the different GPU architectures add complexity when attempting to directly compare the measured performance of a compute task.

The Nvidia Titan V test system used in this paper contained two Intel Xeon Gold 6128 @ 3.40 GHz CPUs, running Ubuntu 18.04 and CUDA 9.1. The Nvidia V100 system was run via an Amazon Web Services (AWS) instance on a system containing an Intel Xeon E5-2686 v4 @ 2.30 GHz CPU, running Ubuntu 16.04 and CUDA 10.2. The AMD Radeon™ VII system used here contained an Intel Xeon E5-1650 v3 @ 3.50 GHz CPU, running Ubuntu 18.04 and ROCm 3.5. The AMD Radeon Instinct™ MI60 system was accessed as a node of the Lawrence Livermore National Laboratory (LLNL) Corona cluster and contained an AMD EPYC™ CPU.

Figure 1: Results for BS1. Colored solid lines show the measured bandwidth (GB/s) for the Nvidia Titan V, Nvidia V100, AMD Radeon™ VII, and AMD Radeon Instinct™ MI60; colored dashed lines show the performance models for each of the GPU accelerators.

Figures 2-5: Results for BS2, BS3, BS4, and BS5, respectively, with the same legend as Figure 1.

We begin with the BS1-5 test results shown in Figures 1-5. In each figure, we plot the effective bandwidth achieved by each GPU device on the y axis in GB/s, with the total data moved by the test on the x axis. To collect the data for each BS1-5 test, we record the achieved bandwidth by running the relevant compute kernel 20 times and dividing the total amount of data moved by the total time taken by the 20 kernel invocations. This helps to reduce small variations in individual kernel run times. For each plot, we run the relevant benchmark for over 400 different array sizes on each GPU device and plot the achieved bandwidth in GB/s. The data for each GPU appears as a generally smooth curve in each of the plots.

In each of the BS1-5 plots in Figures 1-5 we also plot the model fit (9) for the measured bandwidth of each GPU device as a colored dashed line, though it is often difficult to perceive: the model fits the experimental data extremely well and often lies directly on top of the curve made by the experimental measurements. To fit the performance model (9) to the experimental data, we perform a simple linear least squares fit of the elapsed time of each trial, effectively finding the $T_0$ and $W_{\max}$ parameters through the linear model (8). We summarize each of the model parameters for each of the BS1-7 tests and each device in Table 3 at the end of this section.

Some general trends can be observed across Figures 1-5. First, both the Nvidia V100 and Titan V are observed to have a spike in measured bandwidths at small transfer sizes. The peaks of these spikes correspond precisely to the size of the L2 cache on both devices, and manifest because of the consecutive runs of the kernel in each test. It is interesting to note that this cache effect is most pronounced in BS1 and BS2, and is significantly less pronounced in BS3, BS4, and BS5, indicating that there may be some sort of cache invalidation caused by BS3-5 launching two distinct kernels in each iteration. By contrast, despite having L2 caches of comparable size to the Nvidia Titan V's, the AMD Radeon™ VII and Radeon Instinct™ MI60 show no analogous cache effects. This suggests the L2 cache is invalidated between each kernel launch on the AMD GPUs.

In general, each of the Nvidia V100, AMD Radeon™ VII, and AMD Radeon Instinct™ MI60 achieves an asymptotic maximum bandwidth near 800 GB/s, while the Nvidia Titan V achieves slightly more than 600 GB/s due to its smaller number of HBM2 memory stacks. The curious exception to this is BS4, where the Nvidia V100 achieves a significantly higher asymptotic maximum bandwidth than it does in either BS3 or BS5, which are similar kernels to BS4. The AMD Radeon™ VII also performs slightly worse in BS4 and BS5, achieving an asymptotic maximum bandwidth of only near 750 GB/s.

While the AMD Radeon™ VII and Radeon Instinct™ MI60 achieve asymptotic maximum bandwidths near or higher than the Nvidia V100 in each of our BS tests, they require significantly more data to be moved to approach their maximum bandwidths. This is captured in our performance model (9) by a larger fixed launch cost $T_0$. In the summarized model parameters given in Table 3, we see that the fit for $T_0$ in BS1 and BS2 for the AMD Radeon™ VII and Radeon Instinct™ MI60 is roughly 9 µs, nearly triple the Nvidia V100 model's $T_0$ values. Recalling the definition of the Fischer-Brown point $B_{0.8}$ in (10) (i.e., the amount of data necessary to transfer to achieve 80% of maximum bandwidth), we see that the larger $T_0$ parameter directly impacts the size of $B_{0.8}$, requiring more data to be moved to achieve 80% efficiency. For example, for the norm calculation in BS3, the Fischer-Brown point for the Nvidia V100 is $B_{0.8} = 24$ MB, which for BS3 implies the vector being streamed must be at least 3.16M doubles. By comparison, the Fischer-Brown point for the AMD Radeon Instinct™ MI60 is approximately 55 MB.

In Figures 6 and 7 we show the results of the BS6 and BS7 tests, respectively, for polynomial degrees N = 1, . . . , 7 over several mesh sizes. The meshes are chosen to be structured K × K × K meshes of hexahedral elements, for various values of K, beginning at K = 2. Though the meshes are structured, we make no use of that knowledge in the mesh data structures, nor in the scatter and gather operators, $Z$ and $Z^T$, respectively. For each mesh size and polynomial order, we plot the measured bandwidth for the test as a function of the total amount of data moved in the kernel on the x axis. We also plot in each figure the fit performance model for the N = 7 data. We see in each figure that the performance of the gather and scatter kernels follows roughly the same performance trend for each considered polynomial order.

Figure 6: Results for BS6 on (a) the AMD Radeon Instinct™ MI60, (b) the Nvidia V100, (c) the AMD Radeon™ VII, and (d) the Nvidia Titan V. Colored solid lines in each sub-figure show the measured bandwidth for polynomial degrees N = 1, . . . , 7; the single dashed line in each plot shows the performance model fit for N = 7.

Figure 7: Results for BS7, with the same layout and legend as Figure 6.
The exception to this is the case of N = 1 and N = 2 in BS6, where the performance of the AMD Radeon™ VII and Radeon Instinct™ MI60, along with the Nvidia Titan V, appears to be suboptimal. The Nvidia V100, on the other hand, performs consistently for all considered polynomial orders, potentially due to its large cache sizes compared with the other GPUs.
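As a worked example of using the fitted parameters, take the BS3 row for the Nvidia V100 from Table 3 below ($T_0 = 7.62$ µs, $W_{\max} = 809$ GB/s). Substituting into (10) reproduces the Fischer-Brown point quoted above:
$$ B_{0.8} = 4\, T_0\, W_{\max} = 4 \times \left(7.62 \times 10^{-6}\,\text{s}\right) \times \left(809 \times 10^{9}\,\text{B/s}\right) \approx 2.47 \times 10^{7}\,\text{B} \approx 24.7\,\text{MB}, $$
consistent with the roughly 24 MB figure cited in the discussion above.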
        AMD Radeon™ VII   Nvidia Titan V   AMD MI60         Nvidia V100
Test    T_0     W_max     T_0     W_max    T_0     W_max    T_0     W_max
BS1     8.87    870       2.49    601      7.95    846      2.90    811
BS2     9.09    836       2.71    620      8.73    821      2.88    846
BS3     17.73   796       5.24    619      16.99   843      7.62    809
BS4     17.62   747       5.96    639      16.69   831      7.23    879
BS5     13.70   741       5.07    608      18.35   810      5.56    810
BS6*    12.97   765       5.32    595      15.32   724      6.00    769
BS7*    9.08    778       1.77    563      14.19   735      3.70    787

Table 3: Summary of model fitting results for fixed kernel launch cost ($T_0$, given in µs) and asymptotic bandwidth ($W_{\max}$, given in GB/s). The * on BS6 and BS7 indicates that the model fit for these tests is for polynomial degree N = 7 only.

Conclusion

In this paper, we have proposed several new benchmark tests relevant to iterative linear solvers used in high-order finite element methods. In particular, our tests are inspired by several operations necessary in a Conjugate Gradient iterative method for hexahedral finite element meshes, as such problems are proposed and considered by the CEED co-design center [2] as part of the Exascale Computing Project. We name these tests the CEED Benchmark Streaming tests, after the CEED Benchmark Problems from which they are inspired. Each of these new benchmarks is of low arithmetic intensity, and is thus ideally limited only by how quickly data can be read from and written to a compute device's main memory.

To demonstrate the utility of these benchmarks, we have implemented them using the OCCA abstraction framework [13] in order to test the performance behavior of four separate GPU accelerators from two major manufacturers, namely the Titan V and V100 GPUs manufactured by Nvidia, and the Radeon™ VII and Radeon Instinct™ MI60 GPUs manufactured by AMD. In our computational tests, we find that, despite their architectural differences, we are able to obtain a high percentage of the peak bandwidth of each device's HBM memory in each of our tests. We also propose a simple model for each device's performance in each test, which we demonstrate captures the realized performance behavior very accurately. Fitting this performance model using the Benchmark Streaming tests provides a method for predicting real performance of streaming operations at different problem sizes, which is helpful for predicting the behavior of applications when scaling to many processors.

This paper has focused only on low arithmetic intensity streaming operations. In a following work, we will consider the full CEED Benchmark Problems and demonstrate both high performance as well as portability between multiple GPU architectures and manufacturers.
Acknowledgments
This research was supported in part by the Exascale Computing Project, a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative.

This research was also supported in part by the John K. Costain Faculty Chair in Science at Virginia Tech.

AMD, the AMD Arrow logo, Radeon, Radeon Instinct, EPYC, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a trademark of Apple Inc. used by permission by Khronos Group, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
References

[1] Claudio Canuto, M. Yousuff Hussaini, Alfio Quarteroni, and Thomas A. Zang. Spectral Methods in Fluid Dynamics. Springer Science & Business Media, 2012.

[2] Center for Efficient Exascale Discretizations (CEED) Benchmark Problems, 2017.

[3] N. Chalmers, A. Karakus, A. P. Austin, K. Swirydowicz, and T. Warburton. libParanumal: a performance portable high-order finite element library, 2020. Release 0.3.2.

[4] N. Chalmers and T. Warburton. streamParanumal: portable CEED benchmark streaming tests, 2020. Release 1.0.0.

[5] Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 63-74, 2010.

[6] Tom Deakin, James Price, Matt Martineau, and Simon McIntosh-Smith. Evaluating attainable memory bandwidth of parallel programming models via Babelstream. International Journal of Computational Science and Engineering, 17(3):247-262, 2018.

[7] Michel O. Deville, Paul F. Fischer, and Ernest H. Mund. High-Order Methods for Incompressible Fluid Flow, volume 9. Cambridge University Press, 2002.

[8] Paul Fischer, Misun Min, Thilina Rathnayake, Som Dutta, Tzanio Kolev, Veselin Dobrev, Jean-Sylvain Camier, Martin Kronbichler, Tim Warburton, Kasia Swirydowicz, and Jed Brown. Scalability of high-performance PDE solvers. arXiv preprint arXiv:2004.06722, 2020.

[9] Joseph L. Greathouse and Mayank Daga. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. In SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 769-780. IEEE, 2014.

[10] Ronald D. Henderson and George Em Karniadakis. Unstructured spectral element methods for simulation of turbulent flows. Journal of Computational Physics, 122(2):191-217, 1995.

[11] Martin Karp, Niclas Jansson, Artur Podobas, Philipp Schlatter, and Stefano Markidis. Optimization of tensor-product operations in Nekbone on GPUs. arXiv preprint arXiv:2005.13425, 2020.

[12] Elias Konstantinidis and Yiannis Cotronis. A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling. Journal of Parallel and Distributed Computing, 107:37-56, 2017.

[13] David S. Medina, Amik St-Cyr, and Tim Warburton. OCCA: A unified approach to multi-threading languages. arXiv preprint arXiv:1403.0968, 2014.

[14] Kasia Świrydowicz, Noel Chalmers, Ali Karakus, and Tim Warburton. Acceleration of tensor-product operations for high-order finite element methods. The International Journal of High Performance Computing Applications, 33(4):735-757, 2019.