Scalability of High-Performance PDE Solvers
Paul Fischer, Misun Min, Thilina Rathnayake, Som Dutta, Tzanio Kolev, Veselin Dobrev, Jean-Sylvain Camier, Martin Kronbichler, Tim Warburton, Kasia Swirydowicz, Jed Brown
Abstract
Performance tests and analyses are critical to effective HPC software development and are central components in the design and implementation of computational algorithms for achieving faster simulations on existing and future computing architectures for large-scale application problems. In this paper, we explore performance and space-time trade-offs for important compute-intensive kernels of large-scale numerical solvers for PDEs that govern a wide range of physical applications. We consider a sequence of PDE-motivated bake-off problems designed to establish best practices for efficient high-order simulations across a variety of codes and platforms. We measure peak performance (degrees of freedom per second) on a fixed number of nodes and identify effective code optimization strategies for each architecture. In addition to peak performance, we identify the minimum time to solution at 80% parallel efficiency. The performance analysis is based on spectral and p-type finite elements but is equally applicable to a broad spectrum of numerical PDE discretizations, including finite difference, finite volume, and h-type finite elements.

Keywords
High-Performance Computing, Strong-Scale Limit, n_0.8, High-Order Discretizations, PDEs

One characteristic common to many science and engineering applications in high-performance computing (HPC) is their enormous range of spatial and temporal scales, which can manifest as billions of degrees of freedom (DOFs) to be updated over tens of thousands of timesteps. Typically, the large number of spatial scales is addressed through parallelism (processing all elements of the spatial problem simultaneously), while temporal complexity is most efficiently addressed sequentially, particularly in the case of highly nonlinear problems such as fluid flow, which defy linear transformations that might expose additional parallelism (e.g., through frequency-domain analysis). Simulation campaigns for these problems can require weeks or months of wall-clock time on the world's fastest supercomputers. One of the principal objectives of HPC is to reduce these runtimes to manageable levels.

In this paper, we explore performance and space-time trade-offs for computational model problems that typify the important compute-intensive kernels of large-scale numerical solvers for partial differential equations (PDEs) that govern a wide range of physical applications. We are interested in peak performance (degrees of freedom per second) on a fixed number of nodes and in minimum time to solution at reasonable parallel efficiencies, as would be experienced by computational scientists in practice. While we consider matrix-free implementations of p-type finite and spectral element methods as the principal vehicle for our study, the performance analysis presented here is relevant to a broad spectrum of numerical PDE solvers, including finite difference, finite volume, and h-type finite elements, and thus is widely applicable.

Performance tests and analyses are critical to effective HPC software development and are central components of the recently formed U.S. Department of Energy Center for Efficient Exascale Discretizations (CEED)*. One of the foundational components of CEED is a sequence of PDE-motivated bake-off problems designed to establish best practices for efficient high-order methods across a variety of platforms.

Author affiliations: Mathematics and Computer Science, Argonne National Laboratory, Lemont, IL 60439; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801; Department of Mechanical Science and Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801; Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA 94550; Institute for Computational Mechanics, Technical University of Munich, 85748 Garching b. Muenchen, Germany; Department of Computer Science, University of Colorado, Boulder, CO 80309; National Renewable Energy Laboratory, Lakewood, CO 80401; Department of Mathematics, Virginia Tech, Blacksburg, VA 24061; Mechanical & Aerospace Engineering, Utah State University, UT 84322
Corresponding author:
Misun Min, Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL 60439. Email: [email protected]

* CEED: https://ceed.exascaleproject.org

The idea is to pool the efforts of several high-order development groups to identify effective code optimization strategies for each architecture. Our first round of tests features comparisons from the software development projects Nek5000†, MFEM‡, deal.II§, and libParaNumal¶. Principal findings are that high-order operator evaluation and solvers can in fact be less expensive per degree of freedom than their low-order counterparts and that strong scaling of CPU-only platforms can yield lower time per iteration than even highly tuned kernels on GPU nodes. We note, however, that other important metrics, such as energy and capital costs, are not considered in this evaluation.

The rest of this paper is organized as follows. Section 2 provides an overview of the performance metrics that we use throughout the paper and also gives motivation for the suite of bake-off problems. Section 3 describes the bake-off problem (BP) specifications. Sections 4 and 5 discuss the detailed mathematical formulations. Sections 6 and 7 demonstrate performance results and analysis for the three production codes on IBM's BG/Q at the Argonne Leadership Computing Facility (ALCF). In Section 8, we present results using NVIDIA V100s on Summit at the Oak Ridge Leadership Computing Facility (OLCF), together with discussion and further analysis of these results. We summarize our findings in Section 9. Details regarding the code implementations are provided in the Appendix.

Scalability is an important metric when assessing performance of parallel computing applications. An open question remains, however, of how one should quantify scalability. Is it best to study strong scaling, in which the problem size n (say, number of grid points) is fixed and the number of cores P_c is increased? Or is weak scaling, in which n/P_c is fixed while P_c increases, better? The concern with each of these options is that they often fail to reveal performance at the level of granularity that is of paramount interest to HPC users, namely, the point where parallel efficiency starts to drop below a tolerable fraction η ∈ [0, 1]. We refer to this tolerable limit as the η-performance or strong-scale limit, with typical values of η = 0.5 or 0.8 corresponding respectively to 50% and 80% parallel efficiency.

As an example, we show in Figures 1 and 2 strong-scale performance of Nek5000 using 32 MPI ranks per node on Cetus, the development-level IBM BG/Q at ALCF. The model problem is a 3D Poisson equation solved at different spectral element resolutions, which are described in Section 3 as bake-off problem number 5 (BP5). With polynomial order p = 7 and numbers of elements E = 2^16 and E = 2^14, the numbers of points are n = 22M and n = 5.6M, respectively. Here, n = E p^3 throughout the paper, without redundancy and without boundary condition effects. In other words, n is the total number of parallel unique degrees of freedom, including all boundary unknowns. The polynomial order p is fixed for each direction in the reference element Ω̂ = [-1, 1]^3.

Figure 1(a) shows the time versus the number of MPI ranks, P_c, for each case of fixed n = 22M and n = 5.6M, along with ideal linear scaling plots scaling as P_c^{-1}.
The lower curve, corresponding to the smaller problem, exhibits linear scaling out to P_c = 2048, beyond which the time decreases more slowly. In contrast, scaling for the larger problem is linear out to P_c = 8192 before the performance drop-off is observed. In isolation, one might conclude that the n = 22M performance is indicative of good strong-scale behavior. By contrast, the n = 5.6M graph might be viewed with skepticism regarding the ability of the code to strong-scale beyond P_c = 2048.

In fact, if we change the x-axis in Figure 1(a) to be n/P_c, we see that the large and small problem results collapse to a single curve, as evident in Figure 1(b). In this case, for n/P_c ≥ 2744 the solution time is directly proportional to the amount of work on each rank, namely,

    t_wall ∼ n / (S P_c),                                              (1)

where S is a constant that scales as the rate of work per rank in this work-saturated limit. From this result, we can infer that we have order-unity parallel efficiency for n/P_c ≥ 2744, meaning that for any case with n/P_c > 2744 one can effectively increase the number of processors with a corresponding decrease in time to solution, all at fixed energy and cost, assuming that the machine has enough processors to do so. This is the important promise of distributed-memory parallel computing. We often refer to such a break point (e.g., n/P_c = 2744 in Figure 1(b)) as the strong-scale limit towards which users will naturally gravitate. Running with n/P_c significantly exceeding this limiting value implies unnecessarily long runtimes. In practice, users might choose a value of n/P_c below this break point if they are willing to tolerate a certain level of inefficiency. To this end, we define parallel efficiency,

    η(P) := (T_{P_min} × P_min) / (T_P × P),                           (2)

where P_min is the minimum number of processors used in the strong-scale study (usually, the memory-saturation-bound limit) and T_P is the wall-clock time for P processors. (Here, processors means cores, nodes, GPUs, etc., according to the sensible metric for the given platform.)

† https://github.com/Nek5000/Nek5000
‡ https://github.com/mfem/mfem
§ https://github.com/dealii/dealii
¶ https://github.com/paranumal/libparanumal
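To make definition (2) concrete, the following is a minimal sketch (Python/NumPy, with made-up timing numbers rather than measured BP data) that computes η(P) for a strong-scaling sweep at fixed n and reports the largest processor count that still meets a chosen efficiency tolerance:

```python
import numpy as np

# Hypothetical strong-scaling data for a fixed problem size n
# (placeholder values only, not measurements from the BP runs).
n = 22_000_000                                   # total grid points
P = np.array([512, 1024, 2048, 4096, 8192, 16384])
T = np.array([12.0, 6.1, 3.2, 1.9, 1.3, 1.1])    # seconds for a fixed iteration count

# Parallel efficiency relative to the smallest processor count, Eq. (2):
#   eta(P) = (T_{Pmin} * Pmin) / (T_P * P)
eta = (T[0] * P[0]) / (T * P)

# Largest P that still meets an efficiency tolerance (e.g., eta >= 0.8),
# i.e., the strong-scale limit expressed as points per rank.
tol = 0.8
P_limit = P[eta >= tol].max()
print("eta:", np.round(eta, 2))
print(f"strong-scale limit: P = {P_limit}, n/P = {n // P_limit}")
```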
Figure 1. Strong-scale study for BP5-Nek5000 with n = 22M and 5.6M: (a) time vs. P_c; (b) time vs. n/P_c. n/P_c is the problem size per core, and the strong-scale limit is observed at n/P_c = 2744.
Figure 2. Strong-scale study for BP5-Nek5000: (a) parallel efficiency vs. n/P_c; (b) DOFs vs. n/P_c. n/P_c is the problem size per core. Order-unity parallel efficiency can be achieved for n/P_c ≥ 2744.

Figure 2(a) shows the corresponding efficiencies for the pair of strong-scale studies from Figure 1. We see that η ≈ 0.8 for n/P_c = 1372 and η ≈ 0.5 for n/P_c = 686. Here, we denote n_0.8 = 1372 and n_0.5 = 686 as the values of n/P_c that correspond to 80% and 50% efficiency, respectively.

We note that in leadership-class computing environments the granularity of the simulation, n/P, is a runtime parameter selectable by the user unless P is bounded by the total number of processors accessible in the system. To minimize solution time, users are generally interested in operating as far to the left as possible in Figure 2(a) while staying within a tolerable parallel efficiency, η. Beyond the strong-scale limit n/P_c = 2744 discussed previously, one could have a 1.6× speedup by doubling the number of processors (which might be reasonable for long simulations) or a 2× speedup by quadrupling the number of processors (which likely is not acceptable to most users).

For cross-code comparisons, one must be able to compare rates of work between differing implementations. Here, we retain the independent variable, n/P, but change the dependent variable to be

    work / resources = (iterations × DOFs) / (nodes × seconds) =: DOFs,   (3)

where iterations is the number of conjugate gradient (CG) iterations (see the BP definitions in Section 3); DOFs is the number of degrees of freedom (here, grid points); nodes is the number of BG/Q nodes; and seconds refers to the amount of wall-clock time required to perform the given number of CG iterations. (For the bake-off comparisons, the independent variable, n/P, is taken to be the number of points per BG/Q node.)

With this background, we have developed a sequence of bake-off problems to generate performance figures similar to Figure 2(b) for a variety of polynomial orders, element counts, operators, and codes (over 2,000 simulations in total). The principal objective is to identify the fastest implementations for each operator at the given resolution (e.g., polynomial order p). Equally important, however, is to identify the peak performance value and the strong-scale limit where performance efficiency drops to a fraction η = 0.8 of this peak value. For the bake-off problems, the number of MPI ranks is fixed to be a large value (generally, P = 512 nodes, for a total of 16,384 ranks with -c32 mode on BG/Q, using 2 MPI processes per core and 16 cores per node); and the independent variable is the number of degrees of freedom per node, n/P = (p^3 E)/P.

The first suite of problems is focused on runtime performance, measured in DOFs, for bake-off kernels (BKs) and bake-off problems (BPs). The BKs are defined as the application of a local (unassembled) finite-element operator to a scalar or vector field, without the communication overhead associated with assembly. These tests essentially demonstrate the vectorization performance of the PDE operator evaluation (i.e., matrix-free matrix-vector products, or matvecs). As they represent the most computationally intensive operations in the bake-off problems, the BKs provide an upper bound on realizable floating-point performance (DOFs or MFLOPS) for an application based on such implementations. The BPs are mock-up solvers that use the BKs inside a diagonally preconditioned conjugate gradient (PCG) iteration.
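For reference, a minimal sketch of the diagonally (Jacobi) preconditioned CG iteration that the BPs wrap around the bake-off kernels is given below. The operator is passed as a matrix-free callback, as it would be for the BK evaluations; the small tridiagonal test system is a stand-in for illustration only, not one of the actual BP operators.

```python
import numpy as np

def pcg(apply_A, b, diag_A, n_iter=100, tol=1e-8):
    """Diagonally preconditioned conjugate gradients.
    apply_A : callable implementing the matrix-free product v -> A v
    diag_A  : diagonal of A, used as the Jacobi preconditioner
    """
    x = np.zeros_like(b)
    r = b.copy()                    # r = b - A*0
    z = r / diag_A                  # preconditioned residual
    p = z.copy()
    rz = r @ z
    for _ in range(n_iter):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = r / diag_A
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Stand-in SPD system (1D tridiagonal Laplacian), purely illustrative.
N = 100
A = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
b = np.ones(N)
x = pcg(lambda v: A @ v, b, np.diag(A).copy())
print("residual:", np.linalg.norm(b - A @ x))
```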
The point of testing the BKs inside PCG is to establish realistic memory-access patterns, to account for communication overhead of assembly (i.e., nearest-neighbor exchange), and to include some form of global communication (i.e., dot products) as a simple surrogate for global coarse-grid solves that are requisite for large-scale linear system solves.

To meet these goals in a measurable way, we decided to run the initial BPs on 16,384 MPI ranks (i.e., 512 nodes in -c32 mode) of the IBM BG/Q Cetus at the ALCF, which, by virtue of its convex network partitions and provision for a 17th "OS" core on each node, realizes timings that are repeatable to 3 digits from one job submission to the next. The number of MPI ranks was deemed large enough to reveal obvious strong-scale deficiencies yet small enough to allow minimal queue times for all the cases that needed to be submitted. The range of problem sizes was chosen to span from the performance-saturated limit (a lot of work per node) to beyond the strong-scale limit (very little work per node).

To date, there are six BKs and six BPs. Kernels BK1, BK3, and BK5 operate on scalar fields, while BK2, BK4, and BK6 are corresponding operators applied to vector fields having three components each. Each BPj, j = 1, ..., 6, corresponds to PCG applied to the assembled BKj system, using a matrix-free implementation if that is faster. The even-numbered (vector-oriented) kernels allow amortization of matrix-component memory references across multiple operands and provide a realistic optimization for vector-based applications such as fluid dynamics (three fields) and electromagnetics (six fields). The vector-based implementations can also benefit from coalesced messaging, thus reducing the overhead of message latency, which is important when running in the strong-scale (fast) limit.

The BPs include solution of the mass matrix system, Bu = r (BP1–BP2), and the stiffness matrix system, Au = r (BP3–BP6). Approximation orders are p = 1, ..., 16. For each problem, q-point quadrature is used in each direction within the reference element (for a total of q^3 E quadrature points throughout the domain). For BK1–BK4, Gauss-Legendre (GL) quadrature is used with q = p + 2. For BK5–BK6, Gauss-Lobatto-Legendre (GLL) quadrature is used with q = p + 1 quadrature points corresponding to the nodal points of the underlying Lagrangian bases. The nodal and quadrature point choices as well as the boundary conditions (BCs) are summarized in Table 1. The number of elements is E = 2^k, for k = 14, ..., 21 (E = 16,384 up to 2,097,152), so that at least one element per MPI rank is ensured. The elements are arranged in a tensor-product array with the number of elements in each direction differing by at most a factor of 2. Since the tests are designed to mimic real-world applications, the benchmark codes assume that the elements are full curvilinear elements, approximated by the same order of polynomial, p, with iso-parametric mappings, and are not allowed to exploit the global tensor-product structure of the element layout. We also note that the geometric matrices are all precomputed in the codes.

We can cast the BP specifications in the context of the following scalar Helmholtz equation,

    −∇ · μ ∇u + β u = f   in Ω,                                        (4)

where β and μ may be nonnegative functions of x ∈ Ω.
With X^p a finite-dimensional subset of H^1, the Sobolev space comprising square-integrable functions on Ω whose gradient is also square integrable, the discrete variational formulation of (4) is: Find u ∈ X^p such that

    a(v, u) + (v, u)_β = (v, f)   ∀ v ∈ X^p,                           (5)

where, for sufficiently regular v, u, and f, we have

    a(v, u)   = ∫_Ω μ ∇v · ∇u dx,                                      (6)
    (v, u)_β  = ∫_Ω v β u dx,                                          (7)
    (v, f)    = ∫_Ω v f dx.                                            (8)
Table 1. Bake-Off Kernel/Problem Summary
System     Form         BCs                             Quadrature Points   Nodal Points
BK1/BP1    Bu = r       scalar, homogeneous Neumann     (p+2)^3 GL          (p+1)^3 GLL
BK2/BP2    Bu_i = r_i   vector, homogeneous Neumann     (p+2)^3 GL          (p+1)^3 GLL
BK3/BP3    Au = r       scalar, homogeneous Dirichlet   (p+2)^3 GL          (p+1)^3 GLL
BK4/BP4    Au_i = r_i   vector, homogeneous Dirichlet   (p+2)^3 GL          (p+1)^3 GLL
BK5/BP5    Au = r       scalar, homogeneous Dirichlet   (p+1)^3 GLL         (p+1)^3 GLL
BK6/BP6    Au_i = r_i   vector, homogeneous Dirichlet   (p+1)^3 GLL         (p+1)^3 GLL

We approximate the scalar functions u and v by finite expansions of nodal (Lagrangian) basis functions,

    u(x) = Σ_{l=1}^{n} u_l φ_l(x),                                     (9)
    v(x) = Σ_{l̂=1}^{n} v_l̂ φ_l̂(x).                                    (10)

Insertion of (9)–(10) into (6)–(8) leads to the inner products defined as

    a(v, u)  = v^T A u,                                                (11)
    (v, u)_β = v^T B u,                                                (12)
    (v, f)   = v^T b.                                                  (13)

Here, we have introduced the stiffness matrix, A, the (weighted) mass matrix, B, and the right-hand side, b,

    A_{l̂ l} = a(φ_l̂, φ_l),                                            (14)
    B_{l̂ l} = (φ_l̂, φ_l)_β,                                           (15)
    b_{l̂}   = (φ_l̂, f),                                               (16)

where the index sets are l, l̂ ∈ {1, ..., n}. The system to be solved is thus

    Hu = b,  where  H := A + B.                                        (17)

BPs 1–2 correspond to μ = 0 (A = 0). BPs 3–6 correspond to β = 0 (B = 0).

The system (17) denotes a generic Galerkin discretization of (5). The point of departure for the finite element/spectral element formulation is in the choice of basis functions, φ_i, and the choice of quadrature rules for the inner products (6)–(8). With the preceding definitions, we now consider the fast tensor-product evaluations of the inner products introduced in (11)–(12). To begin, we assume Ω = ∪_{e=1}^{E} Ω^e, where the non-overlapping subdomains (elements) Ω^e are images of the reference domain, Ω̂ = [−1, 1]^3, given by

    x|_{Ω^e} = x^e(r, s, t) = Σ_{k=0}^{p} Σ_{j=0}^{p} Σ_{i=0}^{p} x^e_{ijk} h_i(r) h_j(s) h_k(t).   (18)

Here, the h_i's are assumed to be the Lagrange interpolation polynomials based on the Gauss-Lobatto-Legendre (GLL) quadrature points, ξ_j ∈ [−1, 1], j = 0, ..., p. This choice of points yields well-conditioned operators and affords the option of direct GLL quadrature, if desired.

All functions are assumed to have expansions similar to (18). For example, the solution u(x) on Ω^e takes the form

    u|_{Ω^e} = Σ_{k=0}^{p} Σ_{j=0}^{p} Σ_{i=0}^{p} u^e_{ijk} h_i(r) h_j(s) h_k(t),                   (19)

for the (p + 1)^3 unknown basis coefficients in Ω^e. To ensure interelement continuity (u, v ∈ X^p ⊂ H^1), one must constrain coefficients at shared element interfaces to be equal. That is, for any given sets of index coefficients (i, j, k, e) and (î, ĵ, k̂, ê),

    x^e_{ijk} = x^ê_{îĵk̂}  →  u^e_{ijk} = u^ê_{îĵk̂}.                 (20)

The statement (20) leads to the standard finite element processes of matrix assembly and assembly of the load and residual vectors. Fast matrix-free algorithms that use iterative solvers require assembly only of vectors, since the matrices are never formed. To implement the constraint (20), we introduce a global-to-local map, formally expressed as a sparse matrix-vector product, u_L = Q u, which takes (uniquely defined) degrees of freedom u_l from the index set l ∈ {1, ..., n} to their (potentially multiply defined) local counterparts u^e_{ijk}. We further define A_L = block-diag(A^e), and arrive at the assembled stiffness matrix
    A = Q^T A_L Q.                                                     (21)

We refer to A_L as the unassembled stiffness matrix and its constituents A^e as local stiffness matrices. With this factored form, a matrix-vector product can be evaluated as w = Q^T A_L Q u, which allows parallel evaluation of the work-intensive step of applying A^e to basis coefficients in each element Ω^e. Application of the Boolean matrices Q and Q^T represents the communication-intensive phases of the process.‖

‖ In the case of nonconforming elements, Q is not Boolean but can be factored into a Boolean matrix times a local interpolation matrix, Deville et al. (2002).
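A minimal sketch of the factored product w = Q^T A_L Q u for a 1D, two-element mesh is shown below. The Boolean matrix Q is represented by a global-to-local index map, and the local matrices A^e are arbitrary SPD stand-ins rather than actual spectral element stiffness matrices.

```python
import numpy as np

# Two 1D elements of order p sharing one interface node.
p = 4
n_loc = p + 1                          # nodes per element
# global-to-local map: local node (e, i) -> global index (node 4 is shared)
gl = np.array([[0, 1, 2, 3, 4],
               [4, 5, 6, 7, 8]])
n_glob = gl.max() + 1

# Stand-in local "stiffness" matrices A^e (SPD, for illustration only).
rng = np.random.default_rng(0)
A_loc = []
for e in range(2):
    M = rng.standard_normal((n_loc, n_loc))
    A_loc.append(M @ M.T + n_loc * np.eye(n_loc))

def apply_A(u):
    """w = Q^T A_L Q u: scatter to elements, apply local matrices, gather (sum)."""
    u_L = u[gl]                                        # Q u : copy global dofs to local storage
    w_L = np.stack([A_loc[e] @ u_L[e] for e in range(2)])
    w = np.zeros(n_glob)
    np.add.at(w, gl, w_L)                              # Q^T : sum local contributions at shared dofs
    return w

u = rng.standard_normal(n_glob)
print(apply_A(u))
```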
Q^p Formulations

Efficient matrix-free formulations of Q^p tensor-product spaces date back to Fourier-based spectral methods pioneered by Orszag and others in the early 1970s. In a landmark paper, Orszag (1980) showed that the principal advantage of Fourier spectral methods in R^3 derives from the tensor-product forms (e.g., (19)), which reduce the cost of applying differential operators for p^3 degrees of freedom from O(p^6) to O(p^4). Orszag further noted that if one exploits symmetries in the one-dimensional operators, the constant can potentially be improved; and, with enough symmetry, such as in the Fourier and Chebyshev cases, the asymptotic complexity can be reduced to O(p^3 log p). With the introduction of the spectral element method (SEM), Patera (1984) extended Orszag's work to a variational framework using Chebyshev bases, and Rønquist and Patera (1987) developed the nodal-bases approach using Gauss-Lobatto-Legendre quadrature that is now commonplace in the SEM and other high-order Q^p implementations.

With the use of iterative solvers, the bake-off problems amount to implementing fast matrix-vector products. As noted earlier, matrix-free evaluations are effected with a communication phase for assembly and a local work-intensive phase to evaluate the physics. For the stiffness matrix, A, the local matrix-vector products w^e := A^e u^e can be implemented in the matrix-free form outlined in Section 4.4.7 of Deville et al. (2002),

    w^e = [D_1^T  D_2^T  D_3^T] [G^e_{mm'}] [D_1; D_2; D_3] u^e
        = Σ_{m=1}^{3} Σ_{m'=1}^{3} D_m^T G^e_{mm'} D_{m'} u^e.         (22)

Here, the derivative matrices D_j involve tensor products of the one-dimensional (q × p̃) interpolation (Ĵ) and derivative (D̂) matrices: D_1 = Ĵ ⊗ Ĵ ⊗ D̂, D_2 = Ĵ ⊗ D̂ ⊗ Ĵ, and D_3 = D̂ ⊗ Ĵ ⊗ Ĵ, where p̃ = p + 1 is the number of nodal points in each direction within an element and q is the corresponding number of quadrature points. For BP5 and BP6, we use GLL quadrature on the q = p̃ nodal points, so Ĵ becomes the q × q identity matrix, which yields considerable savings in computational cost.

Each of the geometric factors is a diagonal matrix of size q^3 × q^3 (i.e., comprising only q^3 nontrivial entries). For each quadrature point (η_i, η_j, η_k) ∈ Ω̂,

    [G^e_{mm'}]_{ijk} = [ Σ_{l=1}^{3} (∂r_m/∂x_l)(∂r_{m'}/∂x_l) ]_{ijk} J^e_{ijk} ρ_i ρ_j ρ_k,       (23)

where J^e_{ijk} is the geometric Jacobian evaluated at the quadrature points and the ρ_j's are the GL quadrature weights for BP3–BP4 and GLL weights for BP5–BP6. G^e is a symmetric tensor, G^e_{mm'} = G^e_{m'm}, so only 6q^3 memory references per element are required for (22), in addition to the p̃^3 reads required to load u^e. Ĵ and D̂ require only 2qp̃ reads, which are amortized over multiple elements and therefore discounted in the analysis.

The majority of the computational effort in (22) is in application of the D_j's. Thus, it is worthwhile to explore the complexity for these tensor contractions, which are central to fast matrix-free formulations. We consider two of the many possible approaches.
First, we need to evaluate, for each element Ω^e,

    u^e_r = D_1 u^e = (Ĵ ⊗ Ĵ ⊗ D̂) u^e,                                (24)
    u^e_s = D_2 u^e = (Ĵ ⊗ D̂ ⊗ Ĵ) u^e,                                (25)
    u^e_t = D_3 u^e = (D̂ ⊗ Ĵ ⊗ Ĵ) u^e,                                (26)

each of which is a map from p̃^3 nodes to q^3 = γ^3 p̃^3 quadrature points, where γ := q/p̃. While the differentiation matrices D_j are full, with γ^3 p̃^6 nonzeros, they can be applied in factored form as tensor contractions with only O(p̃^2) storage and O(p̃^4) work. For example, (24) is typically implemented as

    u^e_r = (Ĵ ⊗ Î ⊗ Î)(Î ⊗ Ĵ ⊗ Î)(Î ⊗ Î ⊗ D̂) u^e,                    (27)

with Î an appropriately sized identity matrix. More explicitly, (27) is evaluated for i, j, k ∈ {1, ..., q} as

    (u^e_r)_{ijk} = Σ_{k̂=0}^{p} Ĵ_{k k̂} Σ_{ĵ=0}^{p} Ĵ_{j ĵ} ( Σ_{î=0}^{p} D̂_{i î} u^e_{î ĵ k̂} ),   (28)

which has computational complexity

    W = 2(q p̃^3 + q^2 p̃^2 + q^3 p̃) = 2 p̃^4 (γ^3 + γ^2 + γ)            (29)

operations. This cost is nominally repeated for (25)–(26), save for the common term (Î ⊗ Î ⊗ Ĵ) u^e, which does not need to be re-evaluated. We note that [D_1^T D_2^T D_3^T] can be applied in a similar fashion, so the total work for (22) with this approach is

    W = 4 p̃^4 (3γ^3 + 3γ^2 + 2γ) + 15 γ^3 p̃^3.                        (30)

The O(p̃^3) term arises from application of the 3 × 3 G^e_{mm'} tensor.

An alternative approach to (24)–(26) is to first interpolate u to the quadrature points, ũ^e = J u^e = (Ĵ ⊗ Ĵ ⊗ Ĵ) u^e, followed by differentiation

    u^e_r = (Ĩ ⊗ Ĩ ⊗ D̃) ũ^e,                                          (31)
    u^e_s = (Ĩ ⊗ D̃ ⊗ Ĩ) ũ^e,                                          (32)
    u^e_t = (D̃ ⊗ Ĩ ⊗ Ĩ) ũ^e,                                          (33)

where Ĩ and D̃ are respectively q × q identity and derivative matrices on the one-dimensional array of quadrature points. This strategy leads to a complexity of W = 2 p̃^4 (3γ^4 + γ^3 + γ^2 + γ) for the gradient and an overall complexity for (22) of

    W = 4 p̃^4 (3γ^4 + γ^3 + γ^2 + γ) + 15 γ^3 p̃^3.

For γ = 1, the second approach yields a reduction in work of ≈ 25% over (28). By contrast, for γ = 3/2 (commonly used in evaluating nonlinear advection terms), the second approach incurs roughly a 12% increase in operation count.

The mass matrix has a similar tensor-product form. Specifically,

    w^e = B^e u^e = J^T B̃^e J u^e,                                     (34)

where B̃^e = diag(ρ_i ρ_j ρ_k J^e_{ijk}) is the diagonal mass matrix on the quadrature points and J = Ĵ ⊗ Ĵ ⊗ Ĵ is the interpolation operator that maps the basis functions to the quadrature points. Application of (34) has a complexity of 4 p̃^4 (γ^3 + γ^2 + γ) operations and p̃^3 (1 + γ^3) reads from memory. (We remark that the spectral element method uses a diagonal mass matrix on the GLL points, for overall work and memory complexity of only O(p̃^3) per element for either the forward or inverse mass matrix application.)

While the tensor contractions have complexity of O(p̃^4), every other operation is of order O(p̃^3) or lower (e.g., O(p̃^2) for surface operations). On traditional architectures, the tensor contractions are effectively implemented as dense matrix-matrix products (e.g., Deville et al. (2002)). On highly threaded architectures, such as GPUs, other approaches that better exploit the tensor structure are often more effective because the straightforward BLAS3 implementations do not expose sufficient parallelism and data reuse. We revisit this point in Section 8.

We further note that the operation count for all tensor contractions can be roughly halved by exploiting the symmetry of the GLL and GL point distributions, as noted by Solomonoff (1992) and Kopriva (2009). Such an approach is used in deal.II and in some of our libParanumal results for the NVIDIA V100, as detailed in the Appendix. In addition to operation counts, one must recognize that the kernel (BK) performance can be heavily influenced by whether q or p + 1 match cache-line sizes or vector-lane widths, which are typically 4 or 8 words wide. Moreover, application of the work-intensive operators, D_i in (22) and J in (34), is identical for each element (of the same order, p) because these operators are applied in the reference domain Ω̂. One can therefore vectorize over blocks of elements, rather than just applying the operators to a single element at a time. If the number of elements per rank is a multiple of (say) 4 or 8, one can easily arrange the data to exploit this level of SIMD parallelism, assuming that there are enough elements per MPI rank.

We close this section by noting that, implemented properly, the tensor-product-based matrix-free formulation requires the optimal amount of memory transfers (with respect to the polynomial order) and near-optimal FLOPs for operator evaluation. The importance of these overhead costs is illustrated in Figure 3, which contrasts the matrix-vector product costs with fully assembled matrices (in block compressed sparse row format) with matrix-free approaches.
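The following sketch illustrates the factored contractions (27) and the operator form (22) for a single element, in the collocation setting (q = p + 1, Ĵ = I) used by BP5/BK5. The 1D derivative matrix and the geometric factors here are random stand-ins; a production code would build D̂ from the GLL points and G^e from (23).

```python
import numpy as np

p = 7
q = p + 1                          # collocation (BP5/BK5): q = p + 1 and J_hat = identity
rng = np.random.default_rng(1)

D = rng.standard_normal((q, q))    # stand-in for the 1D GLL derivative matrix D_hat
G = rng.standard_normal((3, 3, q, q, q))
G = G + G.transpose(1, 0, 2, 3, 4) # symmetric geometric factors G_{mm'} per point, cf. Eq. (23)
u = rng.standard_normal((q, q, q)) # basis coefficients u^e for one element

def local_stiffness(u):
    """w^e = D^T G D u^e, Eq. (22), evaluated as factored tensor contractions, Eq. (27)."""
    ur = np.einsum('ia,ajk->ijk', D, u)              # derivative in r (first index)
    us = np.einsum('ja,iak->ijk', D, u)              # derivative in s (second index)
    ut = np.einsum('ka,ija->ijk', D, u)              # derivative in t (third index)
    grad = np.stack([ur, us, ut])
    Ggrad = np.einsum('mnijk,nijk->mijk', G, grad)   # 3x3 G applied at each quadrature point
    w  = np.einsum('ia,ijk->ajk', D, Ggrad[0])       # D^T in r
    w += np.einsum('ja,ijk->iak', D, Ggrad[1])       # D^T in s
    w += np.einsum('ka,ijk->ija', D, Ggrad[2])       # D^T in t
    return w

print(local_stiffness(u).shape)                      # (q, q, q)
```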
Figure 3. Memory transfer and floating-point operations per degree of freedom for a PDE in 3D with b components and variable coefficients. The "tensor" variant computes metric terms on the fly and stores a compact representation of the coefficients at quadrature points; the "tensor-qstore" variant pulls the metric terms into the stored representation as in (23); the "assembled" variant uses a (block) CSR format.

In this section, we present results for BP1–6 using Nek5000, MFEM, and deal.II. Nek5000 is an F77 code developed at Argonne National Laboratory and originating from the Nekton 2 spectral element code written at M.I.T. by Rønquist (1988), Fischer (1989), and Ho (1989). MFEM is a general-purpose finite element library written in C++ at Lawrence Livermore National Laboratory with the goal of enabling research and development of scalable finite element discretizations (MFEM). The deal.II library of Arndt et al. (2017) is also a C++ code that originally emerged from work at the Numerical Methods Group at Universität Heidelberg, Germany. The main focus of deal.II is the computational solution of partial differential equations using adaptive finite elements. Further algorithmic details for each code are described in the Appendix.

All simulations were performed on the ALCF BG/Q Cetus in -c32 mode. The BG/Q system configuration is shown in Table 2. The initial runs were based on the GNU-based compiler, or briefly gcc. On BG/Q, however, the performance of gcc is much lower than that of the native IBM XL (xlc/xlf) compiler, or briefly xlc, so the battery of tests was rerun using xlc, save for deal.II, which does not link properly with the xlc compiler on Cetus. We present both the gcc and xlc results.

We measured the rate of work in DOFs (3). The principal metrics of interest are

• r_max, the peak rate of work per unit resource,
• n_0.8, the local problem size n/P on the node required to realize 80% of r_max, and
• t_0.8, the time per iteration when running at 80% of r_max.

Note that n_0.8 is defined in terms of points and that r_max (for any given p) is the peak rate of work across all the implementations. The importance of n_0.8, r_max, and
the scaled ratio, t_0.8 = 1.25 n_0.8 / r_max, is discussed in Section 7.

Table 2. System Configuration

                    ALCF BG/Q (Cetus & Mira)            OLCF Summit
Processor           16-core 1.6 GHz IBM PowerPC A2      IBM Power9 (2/node)
Nodes               2,048 (Cetus), 49,152 (Mira)        4,608
CPU Cores/node      16                                  42
Total CPU Cores     32,768 (Cetus), 786,432 (Mira)      193,536
Total GPUs          --                                  27,648 NVIDIA Volta V100s (6/node)
Memory/node         16 GB RAM                           512 GB DDR4 + 96 GB HBM2
Peak Performance    10 PF (Mira)                        42 TF
Interconnect        5D Torus Proprietary Network        Mellanox EDR 100G InfiniBand, Non-blocking Fat Tree

All results are plotted as the rate of work per unit resource (3) versus the number of points per compute node, n/P. We used a fixed count of 100 CG iterations. To simplify the notation, we will refer to the performance variable (the y-axis) as DOFs (or MDOFs, for millions of DOFs). Choosing the number of points per node as the independent variable on the x-axis, rather than the number of DOFs per node, leads to a data collapse in the case where components of vector fields are computed independently: one obtains a single curve for any number of independent components. When the component systems are solved simultaneously, as in BP2, BP4, and BP6, benefits such as increased data reuse or amortized messaging overhead should manifest as shifts up and to the left in the performance curves. We note that for the scalar problems (BP1, BP3, and BP5), the number of DOFs is equal to the number of grid points.

Figures 4–11 present the BP results using the gcc and xlc compilers for Nek5000, MFEM, and deal.II. In each figure, each line represents a different polynomial order. In all cases, performance is strongly tied to the number of gridpoints per node, n/P. In the case of the gcc compilers, Nek5000 and MFEM generally exhibit a performance plateau as n/P increases, whereas deal.II shows a distinct peak.

In the discussion that follows, we focus primarily on the saturated (i.e., peak observable) performance toward the right side of the graphs. On the left side, performance levels drop off to uninteresting values that users would generally never experience. This low-performance regime corresponds to relatively few points per node and is easily avoided on distributed-memory platforms by using fewer processors. While the definition is not precise, the point of rapid performance roll-off represents the strong-scale limit to which most users will gravitate in order to reduce time per iteration. Operating to the right of this point would incur longer run times. This transition point is thus the most significant part of the graph, and its identification is an important part of the BP exercise. A convenient demarcation is n_0.8, which indicates the number of points per node where the performance is 80% of the realizable peak for the given polynomial order p. (We reiterate that the peak is taken to be the peak across all codes in the test suite, for each BP.)

Figures 4–5 present the BP results for the mass matrix problem, BP1. Figure 4 uses the gcc compilers, while Figure 5 is based on xlc for Nek5000 and MFEM. Nek5000+gcc sustains 27–33 million degrees of freedom per second (MDOFs, or mega-DOFs) for the higher polynomial orders, save for p = 14 and 15, which saturate around 25 MDOFs. For MFEM+gcc, a peak performance of 42 MDOFs is realized for p = 3, which corresponds to 4 × 4 × 4 bricks for each element.
MFEM realizes > 32 MDOFs for p = 2–4 and lower rates for the majority of the higher-order cases. With gcc, deal.II delivers an impressive 54–64 MDOFs at the higher polynomial orders. The highest values are attained at the largest values of n/P. The n_0.8 for deal.II is also high, however. For example, for n/P = 100,000, performance is below 30 MDOFs for essentially all p. The rapid fall-off is related to the way deal.II distributes elements to MPI ranks. The partitioner insists on having at least 8 elements per rank, rather than insisting on a balanced load. Having 8 elements per rank guarantees that 8-wide vector instructions can be issued for any polynomial order but inhibits strong scaling.

Figure 5 again shows the BP1 results but now using the xlc compiler for Nek5000 and MFEM. In addition to the xlc compiler, the Nek5000 results use intrinsic-directed matrix-matrix product routines for tensor contractions when the inner-product loop lengths are multiples of 4. Separate timings (not shown) indicate that most of the performance gains derive from the xlc compiler, with an additional 5 to 30% coming from the intrinsics, depending on p. We see the advantage of xlc and the intrinsics, which boost the peak for Nek5000 to 59 MDOFs for p = 7 at n/P = 180,000 and for MFEM to 54 MDOFs at n/P = 190,000 for p = 9. An interesting observation is that with xlc, p = 3 is the lowest performer (30 MDOFs) for MFEM (ignoring p = 1), whereas it was the highest (43 MDOFs) with gcc.

Inspired by the deal.II results on x86, the MFEM team also investigated the use of xlc/x86 intrinsics, which allow vectorization over blocks of elements. Panel (c) of Figure 5, denoted as MFEM xlc/x86, shows the dramatic improvement resulting from this change.
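The xlc/x86 and deal.II gains derive from vectorizing the contractions over blocks of elements rather than element by element. A schematic NumPy analogue of this batching is shown below (the actual implementations use architecture-specific SIMD intrinsics; here the element block simply becomes the leading array dimension):

```python
import numpy as np

p, q, EB = 7, 8, 8                      # polynomial order, 1D points, elements per block
rng = np.random.default_rng(2)
D = rng.standard_normal((q, q))         # shared 1D derivative matrix (same for every element)
u = rng.standard_normal((EB, q, q, q))  # a block of EB elements processed together

# One element at a time: a small dense contraction per element.
ur_seq = np.stack([np.einsum('ia,ajk->ijk', D, u[e]) for e in range(EB)])

# Blocked: the same contraction applied across the whole element block at once,
# so the innermost loop length is a multiple of the SIMD width.
ur_blk = np.einsum('ia,eajk->eijk', D, u)

print(np.allclose(ur_seq, ur_blk))      # True: identical result, batched evaluation
```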
Figure 4. BP1 results with gcc compiler using 16,384 MPI ranks on 512 nodes of BG/Q; varying polynomial order (p = 1, ..., 16) and quadrature points (q = p + 2): (a) Nek5000 gcc; (b) MFEM gcc; (c) deal.II gcc.
Figure 5. BP1 results with xlc compiler using 16,384 MPI ranks on 512 nodes of BG/Q; varying polynomial order (p = 1, ..., 16) and quadrature points (q = p + 2): (a) Nek5000 xlc; (b) MFEM xlc; (c) MFEM xlc/x86.

For p = 6, 7, and 10, MFEM xlc/x86 peaks at around 75 MDOFs, and all polynomial orders except p = 2 peak at over 50 MDOFs.

Figures 6–7 present the BP results for the stiffness-matrix problem, Au = r, with integration based on q = p + 2 GL points in each direction for each element. The results are similar to those with BP1. With gcc, Nek5000 realizes 20 MDOFs at the higher polynomial orders; MFEM achieves a peak of 18 MDOFs for p = 3, and deal.II reaches a peak of 28 MDOFs. With xlc, Nek5000 reaches a peak of 30–35 MDOFs for p = 6–8 and 10, and MFEM reaches 20–22 MDOFs for p = 7–10. The Nek5000 peak for p = 6 corresponds to q = 8 quadrature points, for which the intrinsic, BLAS3-based tensor contractions are highly optimized. For MFEM, the largest gains once again derive from the switch to using intrinsics, which lift the MFEM xlc/x86 peak to around 30 MDOFs for p = 6, 7, and 10.

BP5 (Figures 8–9) solves a Poisson problem using the standard spectral element stiffness matrix in which quadrature is based on the q = p + 1 GLL points, which obviates the need for interpolation, resulting in a shift from (24) to the simpler form (31). The net result is a nearly twofold increase in MDOFs across all cases. Respectively for Nek5000, MFEM, and deal.II, BP5-gcc realizes peaks of 32, 30, and 60 MDOFs, in contrast to 18, 18, and 28 MDOFs for BP3-gcc. For BP5-xlc, the corresponding peaks are 80 and 25 MDOFs for Nek5000 and MFEM. We note that the MFEM xlc/x86 code is not optimized for BP5 because it does a redundant interpolation from mesh nodes to quadrature nodes in this case. The xlc/x86 performance, Figure 9(c), is nonetheless marginally improved over the results of Figure 7(c) because of the slight reduction in the number of quadrature points.
Results for the vector-oriented BPs are shown in Figure 10 for Nek5000 only. For each of these cases, we solve three systems with three different right-hand sides.**

** In practice, the systems may differ slightly because of the nature of the boundary conditions for each equation, e.g., as is the case when solving the Navier-Stokes equations with slip conditions for velocity.
Figure 6. BP3 results with gcc compiler on 16,384 MPI ranks on 512 nodes of BG/Q; varying polynomial order (p = 1, ..., 16) and quadrature points (q = p + 2): (a) Nek5000 gcc; (b) MFEM gcc; (c) deal.II gcc.

Figure 7. BP3 results with xlc compiler on 16,384 MPI ranks on 512 nodes of BG/Q; varying polynomial order (p = 1, ..., 16) and quadrature points (q = p + 2): (a) Nek5000 xlc; (b) MFEM xlc; (c) MFEM xlc/x86.
Figure 8. BP5 results with gcc compiler on 16,384 MPI ranks on 512 nodes of BG/Q; varying polynomial order (p = 1, ..., 16) and quadrature points (q = p + 1): (a) Nek5000 gcc; (b) MFEM gcc; (c) deal.II gcc.

The performance results for BP2, 4, and 6 are similar to the corresponding scalar results for BP1, 3, and 5, with the most evident change being that BP6 realizes only 76 MDOFs for p = 7 instead of the 80 MDOFs attained for BP5.

To better illustrate the potential gain of the vector approach, we plot in Figure 11 the ratio of Nek5000 results in MDOFs for BP2/BP1, BP4/BP3, and BP6/BP5. The gains for very low n/P must be discounted because users will generally not run there, especially for p = 1, where Nek5000 is not performant.
For large n/P, the ratios are generally slightly in excess of unity, implying that the Nek5000 multicomponent solver is able to effectively amortize the memory accesses for the geometric components in (23). These ratios trend upwards for smaller n/P values. Near the strong-scale limit, which for Nek5000 lies below n/P ≈ 100,000, we see up to a 1.25-fold increase in MDOFs, which implies that the vector-oriented problems could potentially run with half as many processors with the same efficiency as their scalar counterparts.
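The amortization exploited by the vector BPs can be sketched as follows: the geometric factors are loaded once per quadrature point and applied to all components, so the G-related memory traffic is shared across the three fields (array shapes and data below are illustrative stand-ins only):

```python
import numpy as np

q, nc = 8, 3                                   # quadrature points per direction, components
rng = np.random.default_rng(3)
G = rng.standard_normal((3, 3, q, q, q))       # geometric factors, loaded once (stand-ins)
grad = rng.standard_normal((nc, 3, q, q, q))   # reference-space gradients of the 3 components

# Single pass over G for all components: the 6 unique G entries are read once
# per quadrature point instead of once per component.
Ggrad = np.einsum('mnijk,cnijk->cmijk', G, grad)
print(Ggrad.shape)                             # (nc, 3, q, q, q)
```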
Figure 9. BP5 results with xlc compiler on 16,384 MPI ranks on 512 nodes of BG/Q; varying polynomial order (p = 1, ..., 16) and quadrature points (q = p + 1): (a) Nek5000 xlc; (b) MFEM xlc; (c) MFEM xlc/x86.

Figure 10. Nek5000 results with xlc compiler on 16,384 MPI ranks on 512 nodes of BG/Q with varying polynomial order (p = 1, ..., 16): (a) BP2 xlc; (b) BP4 xlc; (c) BP6 xlc.
Figure 11. Nek5000 results with xlc compiler for the performance ratios of BP2/BP1, BP4/BP3, and BP6/BP5 with varying polynomial order (p = 1, ..., 16): (a) BP2/BP1; (b) BP4/BP3; (c) BP6/BP5.

The results of the preceding sections clearly demonstrate the importance of vector intrinsics in realizing high processing rates. We further remark on some of the cross-code performance variations. One of the optimizations used by deal.II is to exploit the bilateral symmetry of the GL and GLL points in order to cut the number of operations in the tensor contractions by a factor of 2 by using an even-odd decomposition (Solomonoff (1992); Kopriva (2009)). The other, as previously mentioned, organizes the data into 8-element blocks in order to combine favorable vector sizes that help in realizing improved peak performance. Moreover, since the objective of CEED is to develop polyalgorithmic back-ends that will deliver the best performance to the users, our objective is to realize the optimal performance hull over the various implementations. All scaling analysis therefore is based on the best-realized values, not on the values realized by a particular implementation.

The broad trends evident in Figures 4–10 also illustrate the importance of the local workload (n/P) in realizing peak performance. As noted earlier, for a given problem size n, users will generally run on as many processors as possible until the efficiency drops to a tolerable level. This P-fold performance multiplier is the most significant factor in realizing large (i.e., thousand-fold or million-fold) speedups in HPC, so understanding inhibitors to increasing P is of paramount importance. For BG/Q, the strong-scale inhibitor for linear system solves of the type considered here is generally internode latency and not internode bandwidth, as discussed in Fischer et al. (2015) and Raffenetti et al. (2017). We return to this observation shortly.

We see in Figure 9(a) that, for p = 7, Nek5000 realizes 80 MDOFs for n/P ≈ 175,000, whereas deal.II achieves a peak of 60 MDOFs for n/P ≈ 700,000. The net result is that for approximately the same work rate (DOFs), whether measured in total energy consumed or node-hours spent, the ability to strong-scale implies that the Nek5000 implementation will run fully 4× faster because, for a fixed n, one can use four times the number of processors. That is, acceptable performance is realized with n/P = 175,000 rather than the much larger n/P = 700,000. The point here is not that one code is superior to another. In fact, we do not yet have the xlc data for deal.II and expect that it could be improved substantially. Furthermore, the results show that deal.II's current mesh partitioning into 8-element blocks clearly inhibits the strong scaling. The point is to stress the importance of achieving reasonable performance at a relatively low value of n/P; that is, to be able to strong-scale.

The importance of strong scaling is that it has a direct impact on time to solution when solving problems on large HPC systems. For problems such as considered here, where the work scales linearly with n, the time per iteration at 80% efficiency will take the form

    t_0.8 = C n_0.8 / (0.8 S),                                         (35)

where C is a problem-dependent constant, 0.8 S is 80% of the peak realizable speed per node (e.g., the peak MDOFs shown in Figures 4–10), and n_0.8 is the value of n/P where performance matches 80%. The smaller the value of n_0.8, the more processors that can be used and the faster the calculation will run.
We stress that reduction of time to solution (at the fast, strong-scale limit) depends critically on minimization of n_0.8/S and not solely on increasing the speed of the node, S.

Another way of quantifying minimum time per iteration is to plot parametric curves of DOFs vs. time per iteration, where the independent parameter is n/P. Figures 12–13 show a series of such plots for BP1, BP3, and BP5 at orders p = 6 and 13. In addition, we plot the lines corresponding to 80% of peak, which allow identification of the minimum time at which 80% or greater performance is realized.

In the case of p = 6, the peak for each of the BPs is attained by a different code: MFEM for BP1, Nek5000 for BP3, and deal.II for BP5. The minimum time at η = 0.8 is realized for the same BP/code pairings: 1.2 ms for MFEM–BP1, 1.5 ms for Nek5000–BP3, and 1.2 ms for deal.II–BP5. Each time is determined as the intercept of the DOFs-time curve with the 0.8 × DOFs_peak horizontal line in Figure 12. By contrast, for p = 13, the peak performance is realized by deal.II for all three BPs, whereas the minimum times at η = 0.8 are realized by different codes. For BP1, MFEM-xlc/x86 realizes the minimum time of 5.5 ms; this is the left-most point in Figure 13 that is above the 80% line. For BP3, deal.II has the minimum time per iteration at 20 ms; none of the other codes are above the 80% mark. For BP5, Nek5000 realizes a minimum time of 2.2 ms at η = 0.8.

For both the p = 6 and 13 cases, we see that the minimum-time graphs for the MFEM-xlc/x86 and deal.II cases exhibit vertical asymptotes as the times are reduced. That is, they cannot deliver smaller times once n/P reaches a critical value. This behavior results from their approach to performance, namely, vectorizing over multiple elements within a single MPI rank. For Nek5000, the minimum times are governed either by communication overhead, as in Figure 12 for p = 6, or by granularity, as in Figure 13 for p = 13, where the DOFs vs. time curve is essentially flat. For larger values of p, the Nek5000 work rates are nearly p-independent because there is sufficient work to dominate communication overhead, even at the minimum n/P values where there is only one element per MPI rank.

We have extended the preceding analysis to all polynomial orders p and summarize the results in Figures 14–16. Figure 14 shows the minimum time per iteration that sustains (at least) 80% of peak. The symbol indicates which code realizes this minimum time for the given p (D = deal.II, M = MFEM, N = Nek5000). Where two symbols appear at a given data point, the first indicates the code realizing the minimum time while the second indicates the code that established the peak rate at the given p. Figure 15 shows the number of points per node, n_0.8, associated with each minimum-time point from Figure 14. Figure 16 indicates the corresponding work rate (DOFs) for these points. This value will generally be r_0.8 = 0.8 × DOFs_peak unless the performance at the minimum time point exceeds that value. The graphs show results for the odd (scalar) cases as solid lines and for the even (vector) cases as dashed lines.
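The connection between n_0.8, r_max, and the minimum time per iteration quoted in these figures is just the arithmetic of (35); a small sketch with placeholder values (not measured BP data):

```python
# Minimum time per iteration at 80% parallel efficiency, Eq. (35) with C = 1:
#   t_0.8 = n_0.8 / (0.8 * r_max) = 1.25 * n_0.8 / r_max
# The values below are placeholders, not measured bake-off results.
n_08  = 1.0e5        # points per node needed to sustain 80% of peak
r_max = 8.0e7        # peak work rate per node, [DOF x CG iters] / [node x sec]
t_08  = 1.25 * n_08 / r_max
print(f"t_0.8 = {t_08 * 1e3:.2f} ms per iteration")
```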
Figure 12. DOFs vs. time per iteration for BP1, BP3, and BP5 on BG/Q with p = 6 for Nek5000, MFEM, and deal.II: (a) BP1 t_0.8; (b) BP3 t_0.8; (c) BP5 t_0.8.
Figure 13. DOFs vs. time per iteration for BP1, BP3, and BP5 on BG/Q with p = 13 for Nek5000, MFEM, and deal.II: (a) BP1 t_0.8; (b) BP3 t_0.8; (c) BP5 t_0.8.
Figure 14. Minimum time per iteration at 80% of peak, t_0.8, for BP1, BP3, and BP5 (solid) and BP2, BP4, and BP6 (dashed) vs. polynomial order p: (a) BP1 & BP2 t_0.8; (b) BP3 & BP4 t_0.8; (c) BP5 & BP6 t_0.8. A/B indicates the associated codes, where code A realizes the minimum time and code B realizes the peak performance (N: Nek5000, D: deal.II, M: MFEM).

We begin with a discussion of the scalar cases. For polynomial orders p = 1–11, the minimum time per iteration is in the range 0.6 to 2 ms. The implication of Figure 14 is that, for BG/Q, no simulation will be able to execute PCG iterations at rates faster than 0.5 ms per iteration for BP5 or 1 ms for BP1 and BP3 unless that simulation runs at η < 0.8. Adding more processors does not help reduce the time per iteration. These are the strong-scale bounds, and they are essentially P-independent, following the logic that initiates with Figures 1–2.

The range and near-uniformity of these minimal iteration times are not surprising. In the strong-scale limit, communication is on par with local work. Moreover, the size of the local messages is relatively small: m ≤ (p + 1)^2 words, far below the 845-word message size that costs 2α on BG/Q, where α = 3.6 μs is the latency (half round-trip message-passing time) for a single 64-bit word, as noted by Fischer et al. (2015).
Figure 15. Points per node, n_0.8, corresponding to the t_0.8 values of Figure 14, with the associated codes: (a) BP1 & BP2 n_0.8; (b) BP3 & BP4 n_0.8; (c) BP5 & BP6 n_0.8.
Figure 16. Work rate r_0.8 = [DOFs x CG Iters.] / [Nodes x Sec.] corresponding to the values in Figures 14–15, with the associated codes: (a) BP1 r_0.8; (b) BP3 r_0.8; (c) BP5 r_0.8. (The even BP results are the same as the odd.)

At the finest granularity of a single element per processor, there are 26 messages to be exchanged at all polynomial orders, each having a cost between α and 2α. On BG/Q, vector reductions are performed in hardware with all-reduce times of approximately 4α for all P (Fischer et al. (2015)). Given the two vector reductions required for PCG, we conclude that message-passing costs for a single element per rank on BG/Q lie between (26 + 8)α ≈ 0.12 ms and (52 + 8)α ≈ 0.22 ms, which is within the range of 20% of 0.6 ms (the minimum time observed) to 2.0 ms. Larger overheads for the larger times (e.g., 20% of 2.0 ms = 0.4 ms) are likely due to the fact that there are typically more than 26 messages when there is more than one element per rank.††

†† One of the attractions of high-order discontinuous Galerkin methods is that they need only 6 (possibly 12) data exchanges per element, resulting in considerably lower message-latency overhead in the strong-scale limit.
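The communication bound quoted above follows from a simple latency model; the sketch below reproduces the (26 + 8)α to (52 + 8)α estimate using the α value from the text (bandwidth and overlap effects are ignored):

```python
# Latency-dominated cost model for one PCG iteration at one element per rank on BG/Q.
alpha = 3.6e-6            # latency for a single 64-bit word, seconds (Fischer et al. 2015)
n_msgs = 26               # nearest-neighbor exchanges for a single hexahedral element
n_allreduce = 2           # dot products per PCG iteration, ~4*alpha each in hardware
lo = (n_msgs * 1 + n_allreduce * 4) * alpha   # each exchange costs between alpha ...
hi = (n_msgs * 2 + n_allreduce * 4) * alpha   # ... and 2*alpha
print(f"communication per iteration: {lo * 1e3:.3f} - {hi * 1e3:.3f} ms")
```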
The minima in Figure 14 for BP3 and BP5 correspond primarily to locations where smaller problems are running faster, rather than where the processing rates are higher. Overall, we note that the minimum times in the range p = 2 to 11 correspond to relatively small values of n_0.8. We also note that the reduced operation count associated with the factored geometric-factor form G_mn in (22) is not the leading source of performance improvement; if it were, it would be manifest in the performance-saturated limit, which is not the case. Note that we elected in Figures 14–15 to base the 80% performance threshold on the scalar peak. The rationale for doing so is that we view the switch to vector solvers as an improvement on a baseline scalar solver and consequently measure performance gains against this scalar baseline.

With multinode scaling issues addressed through the BP studies, we turn now to node scaling on next-generation accelerator-based architectures. Specifically, we consider the performance of BK5 and BP5 implemented on the Nvidia V100 GPUs of the OLCF Summit system using libParanumal, which is an SEM-based high-performance high-order library written in the OCCA kernel language (Medina et al. (2014)). The Summit system configuration can be found in Table 2.
BK5 amounts to evaluation of the matrix-vector product w_L = A_L u_L, where A_L = block-diag(A_e), e = 1, ..., E. This kernel is fully parallel because A_L represents the unassembled stiffness matrix using the formulation in (22). For performance tuning on the V100, we initially consider only the kernel (BK5), not the full miniapp (BP5), which means we are ignoring communication and device-to-host transfer costs. Our intent is to explore the potential and the limits of GPU-based implementations of the SEM. We seek to determine what is required to get a significant fraction of the V100 peak performance in the context of distributed-memory parallel computing architectures such as OLCF's Summit.

Figure 17(a) shows the performance for (22) on a single GPU of Summit through a sequence of OCCA tuning steps, referred to as kernels K = 1, ..., 10, for BK5. The number of elements is E = 4096, and the polynomial order p varies from 1 to 15 (q = 2 to 16). Each tuned version K is run multiple times, and the time for the kernel is taken by dividing total time by the total number of iterations. This procedure is done to smooth out the noise and to be sure that we are not misguided by clock resolution limitations.

In the OCCA implementation, each spectral element is assigned to a separate thread block on the GPU with a 2D thread structure. Threads within a block are assigned in a 2D layout, with an i-j column of the i-j-k spectral element data assigned to each thread. The individual kernel formulations, K = 1 to 10, constitute successive improvements in memory management that are described briefly in the Appendix and in detail in Świrydowicz et al. (2019). Kernel 1 sustains 500 GFLOPS for p = 8–15, while Kernels 9 and 10 reach a peak of 2 TFLOPS for the highest polynomial orders. While the V100 is formally capable of 7 TFLOPS in 64-bit arithmetic, it is well known that memory bandwidth demands constrain the realizable peak to a modest fraction of this value. Figure 17(a) also shows a graph of the empirical roofline as a function of p, which is the theoretical maximum performance that can be realized for BK5 based on the number of operations and bytes transferred. We note that the most highly tuned kernels, K = 9 and 10, attain a significant fraction of the roofline model. The results here and below thus establish upper bounds on performance to be expected in production simulations.

The tuning curves of Figure 17(a) were for E = 4096, which is the largest case considered. Because GPUs are themselves parallel computers, they have an intrinsic strong-scale limit when the number of points per GPU, n_local = n/P, is insufficient to sustain high work rates. (In this section, we define P to be the number of GPUs, i.e., the number of V100s, used on Summit.) To determine this intrinsic limit, we performed a sequence of BK5 timings with element counts increasing from E = 1 to E = 4096 on a single GPU, P = 1. Figure 17(b) shows the V100 TFLOPS for BK5 as a function of n_local ≈ p^3 E for p = 1 to 15. The tight data collapse demonstrates that n_local is a leading indicator of performance, but the polynomial order is also seen to be important, with p = 15 realizing a peak of 2 TFLOPS. Figure 17(c) shows the execution time for the cases of Figure 17(b). Linear dependence is evident at the right side of these graphs, while at the left the execution time approaches a constant for small n_local. In the linear regime, the time scales as O(n_local), independent of p, despite the O(p n_local) scaling of the FLOPS count.
This behavior is expected from the roofline analysis: the performance of BK5 is bandwidth limited even for the largest values of p considered. The lower-bound plateaus in Figure 17(c) are readily understood. At the smaller values of n_local, the number of elements is less than 80, which is the number of streaming multiprocessors (SMs, individual compute units) on the V100.
Figure 17.
Single GPU performance: (a) TFLOPS for different kernel tunings. (b) TFLOPS versus problem size n for different polynomial orders, p. (c) Execution time versus n for varying p. (d) Execution time per point versus number of points, n.

Element counts below this value constitute situations in which SMs are idled and the time per iteration is consequently not reduced.

Further insight into the (p, E, P) performance trade-offs can be gained by looking at the execution time per point, shown in Figure 17(d), which also shows the minimal time and the 2× line, which is twice the minimal execution time per point. For a fixed total problem size, n, moving horizontally on this plot corresponds to increasing P and reducing n_local such that n = n_local P. In the absence of communication overhead, one gains a full P-fold reduction in execution time if the time per point does not increase when moving to the left. We see in this plot that p = 7 appears to offer the best potential for high performance: its time per point remains near the minimum down to small values of n_local, in sharp contrast with the p = 14 and 15 cases, which cross the 2× line at much larger n_local. The p = 7 case thus affords a potential 200/30 ≈ 7-fold advantage in strong scalability over the highest-order cases. Of course, this analysis must be tempered by consideration of a full solver that includes communication, particularly for Poisson problems, which require communication-intensive multilevel solvers for algorithmic efficiency. In the next section, we take a step in that direction by analyzing the BP5 performance on Summit.

The BP5 implementation on the V100s employs the optimally performing BK5 kernel of the preceding section. All vectors are stored in their local form, following the Nek5000 storage approach described in Appendix A1. The vector operations for PCG, including the diagonal preconditioner, are straightforward streaming operations with a FLOPS count of n or 2n and data reads involving only one or two 64-bit words per operation. Assembly for the matrix-vector products (the QQ^T operation in (36)) is invoked in two stages. Values corresponding to vertices shared within a GPU are condensed on the GPU. Values for vertices shared between GPUs are sent through the host and then condensed through a pairwise exchange and sum across the corresponding MPI ranks.
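To show where these pieces sit within an iteration, the sketch below outlines a diagonally preconditioned CG loop over locally stored vectors. The callables local_op, gather_scatter, and allreduce are placeholders standing in for the BK5-style kernel, the QQ^T exchange-and-sum, and a global MPI reduction; none of these names are actual library APIs, and the weighted dot product accounts for redundantly stored shared values.

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Weighted dot product over the local (redundant) storage; the weights
// compensate for values stored multiple times on shared faces/edges/vertices.
double wdot(const Vec& a, const Vec& b, const Vec& wt,
            const std::function<double(double)>& allreduce) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += wt[i] * a[i] * b[i];
    return allreduce(s);  // one of the two PCG reductions per iteration
}

// Sketch of diagonally preconditioned CG; assumes the initial guess x = 0.
void pcg(Vec& x, const Vec& b, const Vec& inv_diag, const Vec& wt,
         const std::function<void(const Vec&, Vec&)>& local_op,
         const std::function<void(Vec&)>& gather_scatter,
         const std::function<double(double)>& allreduce,
         int max_iter, double tol) {
    const std::size_t n = x.size();
    Vec r = b, z(n), p(n), w(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = inv_diag[i] * r[i];   // z = M^{-1} r
    p = z;
    double rz = wdot(r, z, wt, allreduce);
    for (int it = 0; it < max_iter && std::sqrt(rz) > tol; ++it) {
        local_op(p, w);                      // element-local, communication-free work
        gather_scatter(w);                   // neighbor exchange: w_L <- QQ^T w_L
        const double alpha = rz / wdot(p, w, wt, allreduce);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * w[i]; }
        for (std::size_t i = 0; i < n; ++i) z[i] = inv_diag[i] * r[i]; // diagonal precond.
        const double rz_new = wdot(r, z, wt, allreduce);
        const double beta = rz_new / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rz_new;
    }
}

The two wdot calls per iteration correspond to the two vector reductions discussed earlier, and gather_scatter is the only point of neighbor communication.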
Figure 18. Multi-GPU performance using 24 V100s: (a) BP5 DOFS vs. grid points per node for libParanumal (libP) on Summit with p = 1–10. (b) DOFS vs. time per iteration comparison on Summit and BG/Q for t_0.8 with p = 8. (c) DOFS vs. time per iteration comparison on Summit and BG/Q for t_0.8 with p = 10.
Figure 19. BP5 time (t_0.8), grid points (n_0.8), and rate (r_0.8) from libParanumal (libP) on Summit and from Nek5000, deal.II, and MFEM on BG/Q.

For all of the studies, we use a total of P = 24 GPUs: six GPUs on each of four nodes of Summit.

Figure 18(a) shows DOFS vs. n/P curves analogous to Figures 8–9. At 10,000 MDOFS, the peak V100 performance is substantially higher than the peak of 80 MDOFS realized for a single node of BG/Q. The n_0.8 value is also substantially higher on the V100: around 15 million for one node of Summit vs. a maximum of 80,000 for BG/Q. In Figures 18(b) and (c), we plot DOFS vs. time for p = 4 and 10 for the V100 and for the bake-off codes on BG/Q. We see that, for p = 4, the minimum time for Summit at 80% of its realized peak‡‡ is smaller than the t_0.8 value realized on BG/Q. For p = 10, Summit can also realize a smaller time per iteration than BG/Q, but not at the realized 80% efficiency value. At that value, the minimum time is about three times larger than the t_0.8 attained for p = 10 on BG/Q.

‡‡ For technical reasons, these tests do not take the problem size for the lower p values to saturation, which slightly skews the discussion. In practice, however, it is not clear that any user will be able or willing to run the V100s in saturation mode. These results are thus indicative rather than definitive.

Using plots similar to Figures 18(b) and (c), we generate time (t_0.8), size (n_0.8), and DOFS (r_0.8) curves for libParanumal on Summit and compare these with their BG/Q counterparts in Figure 19. For p < 7, Summit is able to deliver relatively small values of t_0.8, subject to the caveat that the r_0.8 values for Summit in Figure 19(b) are not fully saturated. For midrange polynomial orders, p = 7–10, the minimum times on the V100 are roughly a factor of two to three higher than on BG/Q. While the V100 execution rates are higher (Figure 19(c)), they are not sufficiently high to make up for the increased problem size required to sustain 80% of the observed peak. This imbalance results in increased time per iteration, as indicated in (35).

We close this section by noting that the strong-scale time per iteration is not the only performance metric of interest in evaluating exascale platforms. Cost and energy consumption, which are not assessed in the present studies, are also important concerns in the drive to exascale, and accelerators have been shown to offer advantages with respect to these important metrics.

Conclusion

We have presented performance results for highly optimized matrix-free implementations of high-order finite element problems on large-scale parallel platforms and state-of-the-art accelerator nodes. The data are the results of an initial suite of benchmark problems that will be extended in the future to include preconditioning comparisons, nonsymmetric operators, and other geometric configurations. Four teams participated in the study: three running on the BG/Q, Cetus, at ALCF, and one running on the OLCF V100-based platform, Summit.

For BG/Q, no single strategy was best across all the BPs. Vectorization across elements (i.e., in cases with more than one element per MPI rank) could in some cases yield greater than 1.25× speedup over single-element performance, implying that 80% efficiency could not be realized at the granularity limit of a single element per processor. Reduction in floating-point operation counts can be realized by exploiting the bilateral symmetry of the nodal and quadrature point distributions.
Effective time reductions for this approach require careful local memory management.

For p < 11, the minimum time per iteration at 80% efficiency is fairly uniform, on the order of a millisecond, with n/P values of 30,000–70,000 points per node for the midrange polynomial orders. For BP3 and BP5, the midrange processing rates were substantially higher than for the lower-order polynomial cases, implying that, in the same time, the higher-order solutions could process more points (i.e., higher resolution) and achieve higher accuracy (higher-order polynomial approximation). In several cases, a twofold reduction in t_0.8 could be realized by solving block-diagonal systems simultaneously in order to amortize communication latency at the strong-scale limit, as demonstrated in the comparison of BP2, BP4, and BP6 with their scalar counterparts.

On Summit, we found that highly tuned OCCA kernels could deliver 2 TFLOPS, which is a significant fraction of the peak 7 TFLOPS cited for a single V100 GPU. Moreover, the OCCA performance is near the bounds established by the bandwidth-limited roofline performance model for BK5 (and, for other kernels, as shown by Świrydowicz et al. (2019)). For BP5, Summit realizes in excess of 10,000 MDOFS per node, a factor of 125 larger than a single node of Cetus. The n_0.8 values, however, are also much larger: approximately 15 million on Summit vs. 80,000 on Cetus. Consequently, t_0.8 values are superior to those realized on Cetus for p < 7 but are larger for p = 7–11, with Summit times 2 to 3 times larger than the t_0.8 found on Cetus. For p > 11 the times are again comparable, but are significantly larger (on both platforms) than those for p ≤ 11.

Acknowledgments
This research is supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative. The research used resources of the Argonne Leadership Computing Facility, which is supported by the U.S. Department of Energy, Office of Science, under Contract DE-AC02-06CH11357. This research also used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Funding
This material is based upon work supported by the U.S. Department of Energy, Office of Science, under Contract DE-AC02-06CH11357. This work is also performed under the auspices of the U.S. Department of Energy under Contract DE-AC52-07NA27344 (LLNL-JRNL-782135). Kronbichler is supported by the German Research Foundation (DFG) under the project "High-order discontinuous Galerkin for the EXA-scale" (ExaDG) within the priority program "Software for Exascale Computing" (SPPEXA), grant agreement no. KR4661/2-1, and the Bayerisches Kompetenznetzwerk für Technisch-Wissenschaftliches Hoch- und Höchstleistungsrechnen (KONWIHR).
Appendix: Algorithmic Description
Algorithmic approaches and their implementationsare discussed for Nek5000, MFEM, deal.II, andlibParanumal.
Appendix A1: Nek5000
A key feature of Nek5000 is that all data are stored in local form. That is, for any function u in X^N ⊂ C^0, there exists a local vector u_L = Qu. (In fact, the local form, u_L, also exists for discontinuous functions, but the global form, u, does not.) If the function is in C^0, then shared face and edge values are stored redundantly, which means that computations of inner products, daxpys, and other vector operations are doing roughly a factor of ((p+1)/p)^3 extra work. Moreover, inner products need to be weighted to account for repetition of the redundantly stored values unless they are computed prior to residual assembly. The advantage of the local storage format is that the near-neighbor data exchange can be effected in a single communication phase as follows. First, we note that the matrix-vector product w = Bv = Q^T B_L Q v in local form becomes

w_L = Qw = Q(Q^T B_L Q)v = QQ^T B_L v_L.   (36)

The QQ^T operation, often referred to as direct-stiffness summation (see Patera (1984)), corresponds to an exchange and sum of values between shared vertices and is implemented in the general-purpose communication library gslib, which supports other associative/commutative operations (e.g., min, max, multiplication) for scalar and vector fields (see Fischer et al. (2008)). gslib implements the communication and local operator evaluation for several Boolean matrix types other than the symmetric QQ^T form. An advantage of using the form in (36) is that data that is interior to elements or on domain boundaries is never moved. Unlike Q or Q^T in isolation, the symmetric form QQ^T avoids reshuffling of data between global and local layouts.

For Nek5000, (22) is applied element by element, with each local matrix-vector product w_e = A_e u_e applied in three phases: gradient of u, application of G, followed by gradient-transpose. These operations are detailed in subroutine axe in Algorithm 1. The tensor-product contractions for the gradient and gradient-transpose operations are expressed as BLAS3 matrix-matrix products (though rarely as dgemm, since that routine is typically optimized for much larger values of p than encountered in FEM/SEM applications). Fast contractions are typically best realized by writing different code for each polynomial order, illustrated for the case p = 8 shown in mxm8 in Algorithm 1 and the BG/Q intrinsic version mxm_bgq_8 in Algorithm 2. For application of G, all six elements of the tensor are stored for unit-stride access, as seen in axe.
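As a minimal illustration of the QQ^T operation in (36), the toy program below applies direct-stiffness summation to two 1D elements that share one vertex: the redundantly stored copies of the shared value are summed and the sum is written back to both copies, while interior values are untouched. The connectivity array here is purely illustrative and this is not the gslib interface.

#include <cstdio>
#include <vector>

// Toy QQ^T for two 1D elements sharing a vertex.
// Local storage: element 0 holds global nodes {0,1}, element 1 holds {1,2},
// so global node 1 is stored twice.
int main() {
    std::vector<int> glo = {0, 1, 1, 2};            // global index of each local entry
    std::vector<double> uL = {1.0, 2.0, 5.0, 3.0};  // local (redundant) values

    std::vector<double> u(3, 0.0);                                   // u = Q^T u_L (sum)
    for (std::size_t i = 0; i < uL.size(); ++i) u[glo[i]] += uL[i];
    for (std::size_t i = 0; i < uL.size(); ++i) uL[i] = u[glo[i]];   // u_L = Q u (copy back)

    for (double v : uL) std::printf("%g ", v);      // prints: 1 7 7 3
    std::printf("\n");
    return 0;
}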
Appendix A2: MFEM

MFEM supports a set of HPC extensions targeting CPU architectures. This functionality is implemented as template classes that require the specification of parameters, such as the solution's polynomial order, the type of the mesh element (quad, hex, etc.), and the order of the quadrature rule, to be given at compile time. This allows the compiler to optimize the compute-intensive inner loops by using loop unrolling, inlining, and even autovectorization in some cases. In addition to the template approach, the HPC extensions implement better algorithms, taking full advantage of the tensor-product structure of the finite element basis on tensor-product mesh elements: quads and hexes. Furthermore, the HPC extensions support various levels of operator assembly and evaluation. These levels include full matrix assembly to CSR format, partial assembly (where data at quadrature points is computed and stored during the assembly and later used in the operator evaluation), and an "unassembled" approach where assembly is a no-op and all computations are performed on the fly during the operator evaluation.

A recent addition to MFEM is the introduction of CPU SIMD intrinsics to the HPC extension classes, where vectorization is explicitly applied across a fixed number of elements; for example, on CPUs with 256-bit SIMD instructions, four elements are processed simultaneously by using double-precision arithmetic.
Algorithm 1. Nek5000 implementation pseudocode for (22). mxm8 is called for tensor contraction lengths of 8. On BG/Q, mxm8 is replaced by mxm_bgq_8 of Algorithm 2.

      subroutine ax(rho,wL,uL,g,mask,ur,us,ut,n,nel,gsh)
!     Compute wL=A*uL and rho = wL'*uL
      real wL(n*n*n,nel),uL(n*n*n,nel),g(6*n*n*n,nel)
      real mask(n*n*n*nel),ur(n*n*n),us(n*n*n),ut(n*n*n)
      integer gsh
      rho = 0
      do e=1,nel
         call axe(rho,wL(1,e),uL(1,e),g(1,e),ur,us,ut,n)
      enddo
      rho = glsum(rho,1)        ! sum over all P
      call gs_op(gsh,wL,'+')    ! wL = QQ^T wL
      do i=1,n*n*n*nel          ! Restrict Dirichlet
         wL(i,1)=mask(i)*wL(i,1) ! boundary values to 0
      enddo
      end
c--------------------------------------------------
      subroutine axe(rho,w,u,g,ur,us,ut,n)
      common /derivatives/ D(lnn),Dt(lnn)
      call loc_grad3 (ur,us,ut,u,n,D,Dt)
      do i=1,n*n*n
         wr = g(1,i)*ur(i) + g(2,i)*us(i) + g(3,i)*ut(i)
         ws = g(2,i)*ur(i) + g(4,i)*us(i) + g(5,i)*ut(i)
         wt = g(3,i)*ur(i) + g(5,i)*us(i) + g(6,i)*ut(i)
         ur(i) = wr
         us(i) = ws
         ut(i) = wt
      enddo
      call loc_grad3t (w,ur,us,ut,n,D,Dt)
      do i=1,n*n*n
         rho = rho + w(i)*u(i)
      enddo
      end
c--------------------------------------------------
      subroutine loc_grad3(ur,us,ut,u,n,D,Dt)
      real ur(n,n,n),us(n,n,n),ut(n,n,n),u(n,n,n)
      real D(n,n),Dt(n,n)
      call mxm(D ,n,u,n,ur,n*n)
      do k=1,n
         call mxm(u(1,1,k),n,Dt,n,us(1,1,k),n)
      enddo
      call mxm(u,n*n,Dt,n,ut,n)
      end
c--------------------------------------------------
      subroutine mxm8(a,n1,b,n2,c,n3)
      real a(n1,8),b(8,n3),c(n1,n3)
      do j=1,n3
         do i=1,n1
            c(i,j) = a(i,1)*b(1,j) + a(i,2)*b(2,j)
     $             + a(i,3)*b(3,j) + a(i,4)*b(4,j)
     $             + a(i,5)*b(5,j) + a(i,6)*b(6,j)
     $             + a(i,7)*b(7,j) + a(i,8)*b(8,j)
         enddo
      enddo
      end
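As a plain-C++ illustration of what loc_grad3 computes through the mxm calls in Algorithm 1, the following sketch evaluates the reference-space gradient of one element's nodal data with three sum-factorized contractions. The function and variable names are illustrative only; a production implementation expresses these loops as blocked, vectorized matrix-matrix products, as Algorithms 1 and 2 do.

#include <vector>

// Sum-factorized reference-space gradient for one spectral element.
// u holds q*q*q nodal values (q = p+1) and D is the q x q 1D differentiation
// matrix stored column-major as D[i + q*m] = D(i,m). Total cost is O(q^4).
void local_gradient(int q, const std::vector<double>& D, const std::vector<double>& u,
                    std::vector<double>& ur, std::vector<double>& us, std::vector<double>& ut) {
    auto idx = [q](int i, int j, int k) { return i + q * (j + q * k); };
    for (int k = 0; k < q; ++k)
        for (int j = 0; j < q; ++j)
            for (int i = 0; i < q; ++i) {
                double sr = 0.0, ss = 0.0, st = 0.0;
                for (int m = 0; m < q; ++m) {
                    sr += D[i + q * m] * u[idx(m, j, k)];  // derivative in r
                    ss += D[j + q * m] * u[idx(i, m, k)];  // derivative in s
                    st += D[k + q * m] * u[idx(i, j, m)];  // derivative in t
                }
                ur[idx(i, j, k)] = sr;
                us[idx(i, j, k)] = ss;
                ut[idx(i, j, k)] = st;
            }
}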
In this paper, we present results based on the MFEM HPC extensions both with and without the use of SIMD intrinsics.
Algorithm 2. QPX intrinsics routine for C = AB with inner-product length 8.

      subroutine mxm_bgq_8(a,n1,b,n2,c,n3)
      real(8) a(n1,8),b(8,n3),c(n1,n3)
      vector(real(8)) av1,av2,av3,av4,av5,av6,av7,av8
      vector(real(8)) bv1,bsv1,bsv2,bsv3,bsv4
      vector(real(8)) bv2,bsv5,bsv6,bsv7,bsv8,cv
      call alignx(32, a(1,1))
      call alignx(32, b(1,1))
      call alignx(32, c(1,1))
      do i = 1, n1, 4
         av1 = vec_ld(0,a(i,1))
         av2 = vec_ld(0,a(i,2))
         av3 = vec_ld(0,a(i,3))
         av4 = vec_ld(0,a(i,4))
         av5 = vec_ld(0,a(i,5))
         av6 = vec_ld(0,a(i,6))
         av7 = vec_ld(0,a(i,7))
         av8 = vec_ld(0,a(i,8))
         do j = 1, n3
            bv1 = vec_ld(0,b(1,j))
            bv2 = vec_ld(0,b(5,j))
            bsv1 = vec_splat(bv1, 0)
            bsv2 = vec_splat(bv1, 1)
            bsv3 = vec_splat(bv1, 2)
            bsv4 = vec_splat(bv1, 3)
            bsv5 = vec_splat(bv2, 0)
            bsv6 = vec_splat(bv2, 1)
            bsv7 = vec_splat(bv2, 2)
            bsv8 = vec_splat(bv2, 3)
            cv = vec_mul (av1, bsv1)
            cv = vec_madd(av2, bsv2, cv)
            cv = vec_madd(av3, bsv3, cv)
            cv = vec_madd(av4, bsv4, cv)
            cv = vec_madd(av5, bsv5, cv)
            cv = vec_madd(av6, bsv6, cv)
            cv = vec_madd(av7, bsv7, cv)
            cv = vec_madd(av8, bsv8, cv)
            call vec_st(cv, 0, c(i,j))
         enddo
      enddo
      end

Implementation Details.
In MFEM, the subdomain restriction, P, can be represented by two classes: (1) ConformingProlongationOperator, when the finite element space has no hanging nodes, or (2) HypreParMatrix, in all other cases. All tests in this paper use the first class, since we do not consider AMR meshes. This class allows for an optimized implementation because all nonzero entries of P are equal to 1.

In the HPC extension of MFEM, the element restriction matrix G is represented by an implicit template class concept that allows flexible optimized implementations. The specific class used in the numerical tests is H1_FiniteElementSpace, which represents G through an indirection array of integer offsets: every row of G has exactly one nonzero entry that is equal to 1, and therefore only its column index has to be stored. The class concept assumes, and H1_FiniteElementSpace implements, two main methods, VectorExtract and VectorAssemble, which implement the action of G and G^T for one element or a small batch of elements.

The basis evaluator operator B is also represented in MFEM by a template class concept. In the tests considered here, the specific class template we used is TProductShapeEvaluator, instantiated in 3D (hex element) for a fixed number of DOFs and quadrature points in each spatial dimension. The main methods in the class are Calc and CalcT, for the case of a mass matrix operator, and CalcGrad and CalcGradT, for the case of a diffusion operator.

The combined action of G and B for one element, or a small batch of elements, is represented in the template class FieldEvaluator, which defines the methods Eval and Assemble for the action of BG and G^T B^T, respectively.

The discretization operator is represented by the template class concept of a physics "kernel." The kernel is responsible for defining the specific form of B that is required (e.g., B is a different operator for mass and diffusion) and the specific operations for defining (assembling) and using the quadrature point data, D. In the numerical tests, we used the kernel class templates TMassKernel and TDiffusionKernel.

Putting all these components together, MFEM defines the processor-local version of the discretization operator, A_L = G^T B^T D B G, as the template class TBilinearForm. The main class methods are Assemble, for partial assembly, and Mult, for operator evaluation.

As an example of the implementation, the method Mult of TBilinearForm can be written roughly (ignoring the template parameters and some of the method parameters) as follows.
void TBilinearForm::Mult(const Vector &x, Vector &y)
{
   y = 0.0;
   FieldEvaluator solFEval(..., x.GetData(), y.GetData());
   const int NE = mesh.GetNE();           // num. elements
   for (int el = 0; el < NE; el++)
   {
      // declare 'R' with appropriate type
      solFEval.Eval(el, R);
      kernel_t::MultAssembled(0, D[el], R);
      solFEval.Assemble(R);
   }
}
Clearly, a lot of the details are hidden in the calls solFEval.Eval() and solFEval.Assemble(). Note that the compiled method Mult has no function calls because all methods inside its call tree are force-inlined by adding the __attribute__((always_inline)) specification to their definition.
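The following self-contained sketch illustrates, in one space dimension, the partial-assembly pattern described above: an assemble step that stores only quadrature-point data D_{e,q} = w_q |J^e|, and a mult step that applies B^T diag(D_e) B element by element with a simple gather/scatter analogous to G and G^T. The class, member, and variable names here are illustrative and are not MFEM's.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Partial assembly of a 1D mass matrix: linear elements, 2-point Gauss rule.
struct PAMass1D {
    int ne;                 // number of elements
    double B[2][2];         // B[q][i]: basis function i at quadrature point q
    std::vector<double> D;  // D[2*e+q] = w_q * h_e ("assembled" quadrature data)
    std::vector<double> h;  // element sizes

    explicit PAMass1D(const std::vector<double>& sizes) : ne((int)sizes.size()), h(sizes) {
        const double g = 0.5 / std::sqrt(3.0);
        const double xq[2] = {0.5 - g, 0.5 + g};   // Gauss points on [0,1]
        const double wq[2] = {0.5, 0.5};
        for (int q = 0; q < 2; ++q) { B[q][0] = 1.0 - xq[q]; B[q][1] = xq[q]; }
        D.resize(2 * ne);
        for (int e = 0; e < ne; ++e)
            for (int q = 0; q < 2; ++q) D[2 * e + q] = wq[q] * h[e];
    }

    // y = M x for a contiguous 1D mesh with nodes 0..ne (conforming assembly).
    void mult(const std::vector<double>& x, std::vector<double>& y) const {
        std::fill(y.begin(), y.end(), 0.0);
        for (int e = 0; e < ne; ++e) {
            double xe[2] = {x[e], x[e + 1]};              // gather (G)
            double q0 = B[0][0] * xe[0] + B[0][1] * xe[1]; // B x_e
            double q1 = B[1][0] * xe[0] + B[1][1] * xe[1];
            q0 *= D[2 * e]; q1 *= D[2 * e + 1];            // diag(D_e)
            y[e]     += B[0][0] * q0 + B[1][0] * q1;       // B^T, scatter (G^T)
            y[e + 1] += B[0][1] * q0 + B[1][1] * q1;
        }
    }
};

int main() {
    PAMass1D M({0.5, 0.5});                       // two elements on [0,1]
    std::vector<double> x = {1.0, 1.0, 1.0}, y(3);
    M.mult(x, y);                                 // y = M*1 gives row sums of M
    std::printf("%g %g %g\n", y[0], y[1], y[2]);  // expect ~0.25 0.5 0.25
    return 0;
}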
Appendix A3: deal.II
The deal.II finite element library (Arndt et al. 2017),written in the C++ programming language, aims toenable rapid development of modern finite element codes. The library provides support for adaptivemeshes with hanging nodes, a large number offinite elements, and delivers functions and classesimplementing code patterns in typical matrix assemblytasks with backends to PETSc and Trilinos; see Arndtet al. (2017) and references therein. For high-ordercomputations on elements with tensor product shapefunctions, there is a separate “matrix-free” module(Kronbichler and Kormann 2012) optimized for cache-based architectures, exemplified in the step-37, step-48,and step-59 tutorial programs of the library.The matrix-free module lets the user specify opera-tions on quadrature points in several ways. The mostgeneric interfaces are convenience access operations,organized through a class called FEEvaluation , interms of the values and gradients of the tentative solu-tion and test functions, respectively. The geometric fac-tors are precomputed for a given mesh configuration—which can be deformed through an arbitrary high-ordermapping—and get implicitly applied when accessinga field, such as the gradient of the input vector via
FEEvaluation::get_gradient(). For the case of the 3D Laplacian, this approach would access 10 data fields for quadrature points, namely, nine for the inverse of the Jacobian (J^e)^{-1} and one field for the determinant of the Jacobian times the quadrature weight ρ. Depending on the operators called at the user level, only the necessary data is loaded from memory. For specialized operators, such as the Laplacian, an alternative entry point is to access gradients and values in the reference coordinates and apply the geometry and the operator by an effective tensor ρ (J^e)^{-1} (J^e)^{-T}, which reduces the data access to 6 fields for the symmetric rank-2 tensor. This effective tensor format is used for the BP3–BP6 problems with deal.II.

The global input and output vectors in the matrix-free operator evaluation connect to the tentative solution and test function slots on the element level, respectively. The vectors are natively implemented in deal.II and stored and treated in fully distributed format outside the integration tasks. Their local arrays contain some space beyond the end of the locally owned range in order to support an in-place transformation via MPI_Isend and
MPI_Irecv without a deep copy of all vector entries. Furthermore, the extraction of the local vector for the cell operations is done immediately prior to the interpolation operations B and B^T to ensure that the data remains in the caches of the processor.

The actions of G, B, B^T, and G^T are provided by the read_dof_values, evaluate, integrate, and distribute_local_to_global functions of the class FEEvaluation. Since the interpolation operations B and B^T are arithmetically heavy even when using optimal sum-factorization algorithms (Deville et al. (2002)), deal.II uses SIMD instructions on supported architectures, with vectorization over several cells (Kronbichler and Kormann 2012). The SIMD array "leaks" into the user code in the form of a tensor of SIMD type called VectorizedArray.
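To make the vectorization-across-cells idea concrete, the sketch below uses a plain struct of four doubles as a stand-in for an intrinsics-backed type such as VectorizedArray: each lane holds data from a different cell, so a single arithmetic expression in the quadrature-point loop advances four cells in lockstep. The type, sizes, and data here are illustrative only.

#include <cstdio>

// SIMD-like value type: lane l belongs to cell l of a batch of four cells.
struct Vec4 {
    double v[4];
    Vec4 operator*(const Vec4& o) const {
        Vec4 r;
        for (int l = 0; l < 4; ++l) r.v[l] = v[l] * o.v[l];
        return r;
    }
};

int main() {
    const int nq = 3;             // quadrature points per cell (illustrative)
    Vec4 grad[nq], geo[nq];       // reference gradients and geometric factors
    for (int q = 0; q < nq; ++q)
        for (int l = 0; l < 4; ++l) { grad[q].v[l] = q + 1.0; geo[q].v[l] = 0.5 * (l + 1); }

    for (int q = 0; q < nq; ++q) {
        Vec4 flux = geo[q] * grad[q];   // one statement advances four cells
        std::printf("q=%d: %g %g %g %g\n", q, flux.v[0], flux.v[1], flux.v[2], flux.v[3]);
    }
    return 0;
}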
Appendix A4: libParanumal

The GPU results of Section 8 were based on the libParanumal library, which is an experimental testbed for exploring high-performance kernels that can be integrated into existing FEM/SEM codes as accelerator modules. The kernel code is written in OCCA (Medina et al. (2014)), which can produce performant CUDA, OpenMP, or OpenCL code. All the current cases were run on OLCF Summit's Nvidia V100s using CUDA.

Here, we briefly describe the libParanumal development for BK5 through the ten successive kernels indicated in Figure 17(a). Each kernel K includes the optimizations in kernel K−1 unless otherwise indicated. These kernels are described in detail by Świrydowicz et al. (2019).

• Kernel 1 corresponds to a straightforward implementation of (22). Each spectral element is assigned to a separate thread block on the GPU with a 2D thread structure. Threads within a block are assigned to an i-j column in the i-j-k spectral element data structure. Entries of u_e are fetched from global memory as needed, and the partial sums in the last loop are added directly to the global memory array to store the final result. This kernel achieves 500 GFLOPS, which is one-fifth of the predicted empirical roofline.
• In Kernel 2, all variables except for the final result are declared using const. The gains are marginal at higher polynomial orders.
• In Kernel 3, all of the k loops are unrolled, leading to significant gains at higher polynomial orders.
• In Kernel 4, the k loop is placed exterior to the i and j thread loops, making k the slowest running index.
• In Kernel 5, an auxiliary shared-memory array is replaced by two smaller ones.
• In Kernel 6, the input vector u_e is fetched into shared memory.
• In Kernel 7, global memory transactions are reduced by caching data at the beginning of the kernel and writing the output variable only once. This optimization improves performance by about 40%.
• In Kernel 8, arrays are padded for the cases p = 7 and 15 to avoid shared-memory bank conflicts.
• In Kernel 9, u_e is fetched to registers, p entries at a time.
• In Kernel 10, three two-dimensional shared-memory arrays are used for partial results.

The results for Kernel 10 are aligned with the empirical roofline model and achieve a peak of 2 TFLOPS for p = 15. Algorithm 3 gives the pseudocode.

References
Arndt D, Bangerth W, Davydov D, Heister T, Heltai L,Kronbichler M, Maier M, Pelteret JP, Turcksin B andWells D (2017) The deal.II library, version 8.5.
Journal of Numerical Mathematics .Bello-Maldonado P and Fischer P (2019) Scalable low-orderfinite element preconditioners for high-order spectralelement Poisson solvers.
SIAM J. Sci. Comp.
41: S2–S18.
CEED (2019) Center for Efficient Exascale Discretizations, Exascale Computing Project, DOE. https://ceed.exascaleproject.org.

Algorithm 3. libParanumal pseudocode for BK5, w_L = A_L u_L, corresponding to Kernel 10 of Figure 17(a).
 1: for e ∈ {1, ..., E} do
 2:   for i, j, k ∈ {0, ..., p} do
 3:     If k = 0, load D̂ to shared memory
 4:     Declare register variables r_qr, r_qs, r_qt
 5:   end for
 6:   for i, j, k ∈ {0, ..., p} do
 7:     Load J (Jacobian)
 8:     Declare q_r, q_s, q_t := 0
 9:     q_r = Σ_n D̂_in u^e_kjn
10:     q_s = Σ_n D̂_jn u^e_kni
11:     q_t = Σ_n D̂_kn u^e_nji
12:     Set r_qr = q_r, r_qs = q_s, r_qt = q_t
13:     w^e_kji = β J^e_kji u^e_kji
14:   end for (synchronize threads)
15:   for i, j, k ∈ {0, ..., p} do
16:     Load G_rr, G_rs, G_rt
17:     Sqtemp_kji = G_rr r_qr + G_rs r_qs + G_rt r_qt
18:   end for (synchronize threads)
19:   for i, j, k ∈ {0, ..., p} do
20:     w^e_kji = w^e_kji + Σ_n D̂_ni Sqtemp_kjn
21:   end for
22:   for i, j, k ∈ {0, ..., p} do
23:     Load G_rs, G_ss, G_st
24:     Sqtemp_kji = G_rs r_qr + G_ss r_qs + G_st r_qt
25:   end for (synchronize threads)
26:   for i, j, k ∈ {0, ..., p} do
27:     w^e_kji = w^e_kji + Σ_n D̂_nj Sqtemp_kni
28:   end for
29:   for i, j, k ∈ {0, ..., p} do
30:     Load G_rt, G_st, G_tt
31:     Sqtemp_kji = G_rt r_qr + G_st r_qs + G_tt r_qt
32:   end for (synchronize threads)
33:   for i, j, k ∈ {0, ..., p} do
34:     w^e_kji = w^e_kji + Σ_n D̂_nk Sqtemp_nji
35:   end for
36: end for (end of e loop)

Deville M, Fischer P and Mund E (2002)
High-order methodsfor incompressible fluid flow . Cambridge: CambridgeUniversity Press.Fischer P (1989)
Spectral element solution of the Navier-Stokes equations on high performance distributed-memory parallel processors . PhD Thesis, MassachusettsInstitute of Technology. Cambridge, MA.Fischer P, Heisey K and Min M (2015) Scaling limitsfor PDE-based simulation (invited). In: . AIAA2015-3049.Fischer P, Lottes J, Pointer W and Siegel A (2008) Petascalealgorithms for reactor hydrodynamics.
J. Phys. Conf.Series
A Legendre spectral element method forsimulation of incompressible unsteady viscous free-surface flows . PhD Thesis, Massachusetts Institute ofTechnology. Cambridge, MA.Kopriva D (2009)
Implementing spectral methods for partialdifferential equations . Berlin: Springer. Kronbichler M and Kormann K (2012) A generic interfacefor parallel cell-based finite element operator application.
Computers & Fluids
63: 135–147. DOI:10.1016/j.compfluid.2012.04.012.Kronbichler M and Kormann K (2019) Fast matrix-freeevaluation of discontinuous Galerkin finite elementoperators.
ACM Trans. Math. Softw.
Lottes J and Fischer P (2005) Hybrid multigrid/Schwarz algorithms for the spectral element method. J. Sci. Comput. 24: 45–78.
Medina DS, St-Cyr A and Warburton T (2014) OCCA: A unified approach to multi-threading languages. arXiv preprint arXiv:1403.0968.
MFEM (2019) Modular finite element methods. https://mfem.org.
Orszag S (1980) Spectral methods for problems in complex geometry.
J. Comput. Phys.
37: 70–92.Patera A (1984) A spectral element method for fluid dynamics: laminar flow in a channel expansion.
J. Comp. Phys.
Proc. Supercomp. 2017 .IEEE.Rønquist E (1988)
Optimal Spectral Element Methods forthe Unsteady Three-Dimensional Incompressible Navier-Stokes Equations . PhD Thesis, Massachusetts Institute ofTechnology. Cambridge, MA.Rønquist E and Patera A (1987) A Legendre spectral elementmethod for the Stefan problem.
Int. J. Numer. Meth. Eng.
24: 2273–2299.Solomonoff A (1992) A fast algorithm for spectraldifferentiation.
J. Comp. Phys.
98(1): 174–177.
Świrydowicz K, Chalmers N, Karakus A and Warburton T (2019) Acceleration of tensor-product operations for high-order finite element methods.
Int. J. of HighPerformance Comput. App.
Author Biographies
Paul Fischer is a Blue Waters Professor of ComputerScience and Mechanical Science and Engineering atthe University of Illinois at Urbana-Champaign andan Argonne senior scientist. He is a Fellow of theAmerican Association for the Advancement of Science(AAAS). He is the chief architect of the fluid thermalsimulation code Nek5000, which scales to over amillion processors and has been recognized with theGordon Bell Prize in high-performance computing.Nek5000 is used by over 400 researchers worldwide in a variety of thermal fluids applications. He wasthe deputy director for the DOE-ASCR Co-DesignCenter for Exascale Simulation of Advanced Reactor(CESAR) and is currently the deputy director for theDOE-ECP Co-Design Center for Efficient ExascaleDiscretizations (CEED). Misun Min is a computational scientist in the Math-ematics and Computer Science Division at ArgonneNational Laboratory. Her research focuses on devel-oping scalable high-order algorithms and software forsolving electromagnetic, drift-diffusion, and fluid sys-tems on advanced high-performance computing archi-tectures. She is the author of the spectral elementdiscontinuous Galerkin (SEDG) electromagnetics sim-ulation code, NekCEM, which scales to over millionCPU cores and tens of thousands GPUs. She wonan R&D100 award on “NekCEM/Nek5000: ScalableHigh-Order Simulation Codes.” She is a PI on the DOEApplied Mathematics Research project “High-OrderMethods for High-Performance Multiphysics Simula-tions” and the Argonne PI of the DOE-ECP Co-Designproject, CEED.
Thilina Rathnayake is a Ph.D. candidate in theDepartment of Computer Science at the Universityof Illinois at Urbana-Champaign. Thilina worked asan intern at Lawrence Livermore National Laboratoryin 2017 and a Givens Associate in the Mathematicsand Computer Science Division at Argonne NationalLaboratory in 2018–2019. His research has focused onthe library package libCEED for the DOE Co-DesignCenter for Efficient Exascale Discretizations, the CFDsolver Nek5000, and the communication library gslib.
Som Dutta is currently an assistant professor in theDepartment of Mechanical and Aerospace Engineeringat Utah State University (USU). He contributed tothe paper during his post-doctoral training in theDepartment of Computer Science at the Universityof Illinois at Urbana-Champaign (UIUC), where hewas part of the DOE-funded CEED project. Beforejoining USU, he was a postdoctoral researcher in theDepartment of Mathematics at College of Staten Island,City University of New York (CUNY). Som received aPh.D. from the Department of Civil and EnvironmentalEngineering at UIUC. His research interests are instudying environmental and turbulent multiphase flowsusing high-order spectral element methods.
Tzanio Kolev is a computational mathematician atthe Center for Applied Scientific Computing (CASC)in Lawrence Livermore National Laboratory (LLNL),where he works on finite element discretizations andsolvers for problems in compressible shock hydrody-namics, multi-material arbitrary Lagrangian Eulerianmethods, radiation hydrodynamics, and computationalelectromagnetics. He won an R&D100 award as amember of the hypre team. Tzanio is leading the high-order finite element discretization research and devel-opment efforts in the MFEM and BLAST projects in CASC and is the director of the Center for EfficientExascale Discretization in DOE’s Exascale ComputingProject.
Veselin Dobrev is a computational mathematicianin the numerical analysis and simulations group inthe Center for Applied Scientific Computing. Hisresearch interests are in the areas of numerical methodsfor solving PDEs, which include finite element anddiscontinuous Galerkin methods, shock hydrodynamicssimulations, and iterative and multigrid methods.Veselin received his Ph.D. in mathematics from TexasA&M University in 2007. He is currently working onhigh-order curvilinear finite elements for Lagrangianhydrodynamics (BLAST project).
Jean-Sylvain Camier works in the Center for AppliedScientific Computing (CASC), Lawrence LivermoreNational Laboratory. His current research focus ison computer architecture; parallel computing; andcomputing in mathematics, natural science, engineeringand medicine.
Martin Kronbichler is a senior researcher at theInstitute for Computational Mechanics, TechnicalUniversity of Munich, Germany. He is a principaldeveloper of the deal.II finite element library and leadsthe high-order as well as high-performance activities inthis project. His current research focus is on efficienthigh-order discontinuous Galerkin schemes and fastiterative solvers for flow problems as a PI in theexascale project ExaDG within the German priorityprogram SPPEXA.
Tim Warburton is the John K. Costain FacultyChair in the College of Science and a professor ofmathematics at Virginia Tech. He also currently holdsan appointment in the Department of Computationaland Applied Mathematics at Rice University. Hedeveloped the first high-order nodal discontinuousGalerkin solver for time-domain electromagnetics onunstructured grids and led a decadal project toaccelerate these methods by devising parallel localtime-stepping methods, GPU acceleration, co-volumefiltering techniques, novel rational bases for curvilinearelements, and numerous other innovations . He co-authored the first comprehensive book on discontinuousGalerkin methods. He created the Open ConcurrentCompute Abstraction (OCCA) as part of the CESARco-design center at Argonne. The OCCA frameworkenables domain scientists and computational scientiststo write portable threaded code. The OCCA library hasbeen used as a foundational layer for higher-order finiteelement, spectral element, discontinuous Galerkin, andfinite difference PDE solvers for industrial applicationsand lab miniapps.