hyper.deal: An efficient, matrix-free finite-element library for high-dimensional partial differential equations
Peter Munch†‡, Katharina Kormann§, Martin Kronbichler†

February 20, 2020
Abstract
This work presents the efficient, matrix-free finite-element library hyper.deal for solving partial differential equations in two to six dimensions with high-order discontinuous Galerkin methods. It builds upon the low-dimensional finite-element library deal.II to create complex low-dimensional meshes and to operate on them individually. These meshes are combined via a tensor product on the fly, and the library provides new special-purpose highly optimized matrix-free functions exploiting domain decomposition as well as shared memory via MPI-3.0 features. Both node-level performance analyses and strong/weak-scaling studies on up to 147,456 CPU cores confirm the efficiency of the implementation. Results of the library hyper.deal are reported for high-dimensional advection problems and for the solution of the Vlasov–Poisson equation in up to 6D phase space.
Key words. matrix-free operator evaluation, discontinuous Galerkin methods, high-dimensional, high-order, Vlasov–Poisson equation, MPI-3.0 shared memory.
Three-dimensional problems are today simulated in great detail based on domain-decomposition codes up to supercomputer scale. With the increase in computational power, it becomes feasible to simulate also higher-dimensional problems. With this contribution, we target the case of moderately high-dimensional (≤ 6) problems.∗

∗ This work was supported by the German Research Foundation (DFG) under the project "High-order discontinuous Galerkin for the exa-scale" (ExaDG) within the priority program "Software for Exascale Computing" (SPPEXA), grant agreement no. KO5206/1-1 and KR4661/2-1. The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. for funding this project by providing computing time on the GCS Supercomputer SuperMUC-NG at Leibniz Supercomputing Centre (LRZ) through project id pr83te.
† Institute for Computational Mechanics, Technical University of Munich, Boltzmannstr. 15, 85748 Garching, Germany ({munch,kronbichler}@lnm.mw.tum.de).
‡ Institute of Materials Research, Materials Mechanics, Helmholtz-Zentrum Geesthacht, Max-Planck-Str. 1, 21502 Geesthacht, Germany ([email protected]).
§ Max Planck Institute for Plasma Physics, Boltzmannstr. 2, 85748 Garching, Germany, and Department of Mathematics, Technical University of Munich, Boltzmannstr. 3, 85748 Garching, Germany ([email protected]).

With hyper.deal, we are targeting these applications: For a phase-space simulation, two separate, possibly complex, meshes describing configuration and velocity space are defined based on the capabilities of a base low-dimensional finite-element library (in our case the library deal.II [3; 5]) and combined by taking their tensor product on the fly. The equations that we have in mind are advection-dominated, which is why we mostly focus on the discretization of the advection equation in this paper.

The general concept presented is not limited to the description of the phase space. Possible applications of a high-dimensional FEM library could for instance include three-dimensional problems that involve a low-dimensional parameter space or low-dimensional Fokker–Planck-type equations.

While the decomposition of complex domains in up to three dimensions is a well-studied problem, partial differential equations on complex domains in dimensions higher than three are not tackled by generic FEM libraries to date. Some libraries have started to extend their capabilities to higher dimensions on structured grids. An example is the
YaspGrid module of the finite-element library
DUNE, which implements structured grids in arbitrary dimensions [8]. Also, higher-dimensional problems are solved based on sparse grids [15], reduced basis methods, or low-rank tensors [6]. On the other hand, in the plasma community some specialized codes exist that target gyrokinetic or kinetic equations. The Gkeyll code [16] is a discontinuous Galerkin solver for plasma-physics applications, and a fully kinetic version for Cartesian grids was presented in [18]. The authors use Serendipity elements to reduce the number of degrees of freedom. On the other hand, such a reduced basis does not form a tensor product and is therefore not amenable to the algorithms presented in this work. The specifics of domain decomposition in higher dimensions have only been studied recently. In [22], a six-dimensional domain decomposition for a semi-Lagrangian solver of the Vlasov–Poisson system on a Cartesian grid was studied, highlighting the high demands on memory transfer between neighboring processes due to the increased surface-to-volume ratio with increasing dimensionality. The parallelization of a similar algorithm has also been addressed in [34]. However, the domain decomposition is limited to configuration space in that work, which poses a strong limit to the scalability of the implementation.
The Vlasov equation is given as

$$\frac{\partial}{\partial t} f_s(t,\vec{x},\vec{v}) + \vec{v}\cdot\nabla_{\vec{x}} f_s(t,\vec{x},\vec{v}) + \frac{q_s}{m_s}\left(\vec{E}(t,\vec{x}) + \vec{v}\times\vec{B}(t,\vec{x})\right)\cdot\nabla_{\vec{v}} f_s(t,\vec{x},\vec{v}) = 0, \quad (1)$$

where $f_s(t,\vec{x},\vec{v})$ denotes the probability density of a particle species $s$ with charge $q_s$ and mass $m_s$ as a function of the phase space, $\vec{E}$ the electric field, and $\vec{B}$ the magnetic field. The phase space is defined as the tensor product of configuration space, $\vec{x}\in\Omega_{\vec{x}}\subset\mathbb{R}^{d_{\vec{x}}}$, and velocity space, $\vec{v}\in\mathbb{R}^{d_{\vec{v}}}$. This equation is coupled either to the Maxwell or to the Poisson equation (see Section 6) for the self-consistent fields. If we define the gradient operator as $\nabla^\top := (\nabla^\top_{\vec{x}}, \nabla^\top_{\vec{v}})$, Equation (1) can be rewritten as

$$\frac{\partial}{\partial t} f_s(t,\vec{x},\vec{v}) + \nabla\cdot\left(\begin{pmatrix}\vec{v}\\ \frac{q_s}{m_s}\left(\vec{E}(t,\vec{x})+\vec{v}\times\vec{B}(t,\vec{x})\right)\end{pmatrix} f_s(t,\vec{x},\vec{v})\right) = 0. \quad (2)$$

This is a non-linear, high-dimensional ($d = d_{\vec{x}} + d_{\vec{v}}$), hyperbolic partial differential equation. To simplify the presentation in the following, we will consider the advection equation defined on the arbitrary $d$-dimensional domain $\Omega$, denoting the independent variable again by $\vec{x}$,

$$\frac{\partial f}{\partial t} + \nabla\cdot\left(\vec{a}(f,t,\vec{x})\, f\right) = 0 \quad \text{on } \Omega\times[0, t_{\text{final}}], \quad (3)$$

with $\vec{a}(f,t,\vec{x})$ being the solution-, time-, and space-dependent advection coefficient. The system is closed by an initial condition $f(0,\vec{x})$ and suitable boundary conditions. In the following, we concentrate on periodic boundary conditions. We return to the Vlasov–Poisson equation in Section 6, where we combine an advection solver from the library hyper.deal and a Poisson solver based on deal.II.

Discontinuous Galerkin discretization of the advection equation

High-order discontinuous Galerkin methods are attractive methods for solving hyperbolic partial differential equations like Equation (3) due to their high accuracy in terms of dispersion and dissipation, while maintaining geometric flexibility through unstructured grids. The discontinuous Galerkin (DG) discretization of these equations can be derived by the following steps: The partial differential equation is multiplied with the test function $g$, integrated over the element domain $\Omega^{(e)}$ with $\biguplus_e \Omega^{(e)} = \Omega$, and the divergence theorem is applied:

$$\left(g, \frac{\partial f}{\partial t}\right)_{\Omega^{(e)}} - \left(\nabla g, \vec{a} f\right)_{\Omega^{(e)}} + \left\langle g, \vec{n}\cdot(\vec{a}f)^*\right\rangle_{\Gamma^{(e)}} = 0, \quad (4)$$

with $(\vec{a}f)^*$ being the numerical flux, like a central or an upwind flux. Integration $\int_\Omega \mathrm{d}\Omega$ and derivation $\nabla$ are not performed in real space but in the reference space $\Omega_0^{(e)}$ and $\Gamma_0^{(e)}$ and require a mapping to the reference coordinates:

$$\left(g, |J|\frac{\partial f}{\partial t}\right)_{\Omega_0^{(e)}} - \left(J^{-T}\nabla_{\vec{\xi}}\, g, |J|\,\vec{a} f\right)_{\Omega_0^{(e)}} + \left\langle g, |J|\,\vec{n}\cdot(\vec{a}f)^*\right\rangle_{\Gamma_0^{(e)}} = 0, \quad (5)$$

where $J$ is the Jacobian matrix of the mapping from reference to real space and $|J|$ its determinant. To discretize this equation in space, we consider nodal polynomials with nodes in the Gauss–Lobatto points. The integrals are evaluated numerically by weighted sums.
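To make the link to the mass-matrix structure discussed next explicit, the element mass matrix resulting from Equation (5) and numerical quadrature reads (a standard identity, spelled out here for convenience):

$$M_{ij}^{(e)} = \left(\phi_i, |J|\, \phi_j\right)_{\Omega_0^{(e)}} \approx \sum_q \phi_i(\vec{\xi}_q)\, \phi_j(\vec{\xi}_q)\, |J_q|\, w_q.$$

If the quadrature points coincide with the Gauss–Lobatto nodes of the basis (the collocation setup mentioned below), $\phi_i(\vec{\xi}_q) = \delta_{iq}$ and the matrix reduces to the diagonal entries $|J_q| w_q$.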
We consider both the usual Gauss(–Legendre) quadrature rules and the integration directly in the Gauss–Lobatto points without the need of interpolation (collocation setup [13]). The resulting semi-discrete system has the following form:

$$M \frac{\partial \vec{f}}{\partial t} = A(\vec{f}, t) \quad\Leftrightarrow\quad \frac{\partial \vec{f}}{\partial t} = M^{-1} A(\vec{f}, t), \quad (6)$$

where $\vec{f}$ is the vector containing the coefficients for the polynomial approximation of $f$, $M$ the mass matrix, and $A$ the discrete advection operator. This system of ordinary differential equations can be solved with classical time integration schemes such as explicit Runge–Kutta methods. They require the right-hand side $M^{-1} A(\vec{f}, t)$ to be evaluated efficiently. The particular structure of the mass matrix $M$ should be noted: It is diagonal in the collocation case and block-diagonal in the case of the normal Gauss quadrature, which leads to a simple inversion of the mass matrix [27].

For 2D and 3D high-order DG methods, efficient matrix-free operator evaluations [24; 25] for the individual operators $M^{-1}$, $A$ as well as for the merged operator $M^{-1} A$ are common in the context of fluid mechanics [23], structural mechanics [11], and acoustic wave propagation [32]. These methods do not assemble matrices, but evaluate the effect of the (linear or non-linear) operator on element vectors on the fly, often exploiting the tensor-product structure of the shape functions on hexahedral meshes. Since generally the access to main memory is the major limiting factor on modern CPU hardware, the fact that no matrix has to be loaded for matrix-free operator evaluation directly translates into a significant performance improvement. Discontinuous Galerkin methods and matrix-free operator evaluation kernels are part of many low-dimensional general-purpose FEM libraries nowadays.

This work shows that an extension of low-dimensional general-purpose high-order matrix-free FEM libraries to high dimensions is possible. We discuss how to cope with possible performance deteriorations due to the "curse of dimensionality" and present special-purpose concepts exploiting the structure of the phase space, using domain decomposition and shared memory, all taking hardware characteristics into account.

All concepts that we describe in this work have been implemented and are available under the LGPL 3.0 license as the library hyper.deal hosted at https://github.com/hyperdeal/hyperdeal. It extends, as an example, the open-source FEM library deal.II [3] to high dimensions. Both node-level performance and strong/weak-scaling analyses, conducted for this library, confirm the suitability of the described concepts for solving partial differential equations in high dimensions.

Figure 1: Generation of a low-dimensional mesh in the base library deal.II: a) coarse grid; b-c) refinement of a coarse grid; d) enumeration of cells along a space-filling curve (sfc); e) partitioning of the space-filling curve; f) local view of process 4 (showing only active local cells and active ghost cells).

The remainder of this work is organized as follows. In Section 2, we present the concepts of a generic modern low-dimensional finite-element library; special focus is put on mesh generation, parallelization, and efficient matrix-free operator evaluation as well as on why these concepts are only partly applicable for higher-dimensional problems. Sections 3-4 give an introduction into the concept of a tensor product of partitioned low-dimensional meshes and detail its implementation for phase space.
Section 5 presents performance results for the advection equation, confirming the efficiency of the design decisions made during the implementation. Section 6 explains how hyper.deal can efficiently be combined with a deal.II-based Poisson solver and shows scaling results for a benchmark problem from plasma physics. Finally, Section 7 summarizes our conclusions and points to further research directions.
In the following, we describe building blocks commonly used in modern general-purpose FEM libraries with a special focus on triangulations, parallelization, and efficient matrix-free methods. The concept of this work is generic and could in principle be built upon any general-purpose FEM library such as MFEM [4], DUNE [12], FEniCS [2], or Firedrake [9]. Our description is, however, specialized to the implementation in hyper.deal that is constructed on top of the deal.II library and uses some of the naming conventions of that project.
A triangulation $\mathcal{T}$ covers the computational domain $\Omega$ and is defined by a set of point coordinates $P$ as well as by a set of cells $C$. Each cell $c \in C$ consists of a fixed number of vertices. Since we consider unstructured quad/hex-only meshes, cells may only consist of 2, 4, or 8 vertices.

The triangulation $\mathcal{T}$ can be the result of an external mesh generator or of the repeated refinement of the cells of a coarse mesh in an octree manner (see Figures 1a-c). The latter way to create the actual triangulation fits into the concept of adaptive mesh refinement and geometric multigrid methods (used in Section 6 for the Poisson problem), since the refinement levels can also be used as multigrid levels. The cells in the mesh are continuously enumerated, which leads to a space-filling curve (see also Figure 1d).

Distributed-memory parallelization of finite-element codes is generally implemented based on a domain decomposition via the Message Passing Interface (MPI). For this purpose, the space-filling curve of cells mentioned above can be partitioned (uniformly) among all processes within an MPI communicator (generally MPI_COMM_WORLD) so that a process $i$ is only responsible for a subset of (locally owned) cells, $\biguplus_i L_i = C$. For the management of large-scale distributed meshes, libraries based on space-filling curves like p4est [7; 10], which uses the Morton order/z-curves, or Peano [35], which uses Peano curves, and graph-partitioning algorithms as implemented in Metis [19] and Zoltan [17] are utilized.

Besides locally owned cells $L_i$, each process has a halo of ghost cells $G_i \subseteq C$ with $\forall i: L_i \cap G_i = \emptyset$. Regularly, the degrees of freedom in these ghost cells have to be exchanged between neighbors, e.g., to evaluate fluxes in DG methods, requiring communication (see also Subsection 2.3).

Cells owned by a process can be processed in parallel by threads. Threads can exploit shared memory; however, they have to be synchronized to prevent race conditions. For this purpose, the libraries TBB and OpenMP are integrated in the deal.II library. In Section 4, we present a novel hybrid parallelization scheme, which exploits the shared-memory capabilities of MPI-3.0 with the advantage that no threads have to be created explicitly.

As the last type of parallelization, explicit vectorization for SIMD (single instruction stream, multiple data streams) can be used. A detailed explanation of SIMD in the context of matrix-free methods follows in Subsection 2.6.
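To make the mesh workflow above concrete, the following sketch (not taken from the paper; the 3D unit cube, refinement level, and communicator are arbitrary illustrative choices, and p4est support in deal.II is assumed) creates a distributed low-dimensional deal.II triangulation, refines it globally, and reports the locally owned cells of each MPI rank:

```cpp
// Minimal sketch: build a partitioned low-dimensional deal.II triangulation
// as described above (assumes deal.II configured with MPI and p4est).
#include <deal.II/base/mpi.h>
#include <deal.II/distributed/tria.h>
#include <deal.II/grid/grid_generator.h>

#include <iostream>

int main(int argc, char **argv)
{
  dealii::Utilities::MPI::MPI_InitFinalize mpi(argc, argv, 1);

  // Coarse grid, refined in an octree manner; the cells are enumerated along
  // a space-filling curve and partitioned among the MPI ranks.
  dealii::parallel::distributed::Triangulation<3> tria(MPI_COMM_WORLD);
  dealii::GridGenerator::hyper_cube(tria, 0.0, 1.0);
  tria.refine_global(3);

  std::cout << "rank "
            << dealii::Utilities::MPI::this_mpi_process(MPI_COMM_WORLD)
            << " owns " << tria.n_locally_owned_active_cells() << " of "
            << tria.n_global_active_cells() << " cells\n";
}
```

In hyper.deal, two such low-dimensional triangulations (one for configuration space, one for velocity space) are later combined via their tensor product.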
Unknowns are assigned for all cells $c \in C$. We use $d$-dimensional scalar, discontinuous tensor-product shape functions of polynomial degree $k$, based on Gauss–Lobatto support points (see also Figure 2):

$$P_k^d = \underbrace{P_k \otimes \ldots \otimes P_k}_{\times d}. \quad (7)$$

In DG, unknowns are not shared between cells so that each cell holds $(k+1)^d$ unknowns. The total number of degrees of freedom (DoF) is:

$$N = |C| \cdot (k+1)^d. \quad (8)$$

The unknowns are coupled via the fluxes of the DG scheme (see also Section 1). To be able to compute the fluxes, the degrees of freedom of neighboring cells have to be accessed. The dependency region for computing all contributions of a cell is, in the case of advection,

$$\underbrace{(k+1)^d}_{\text{cell}} + \underbrace{2 \cdot d \cdot (k+1)^{d-1}}_{\text{faces}}, \quad (9)$$

i.e., the union of all unknowns of the cell and the unknowns residing on the faces of the $2 \cdot d$ neighboring cells, due to the nodal polynomials with nodes on the element boundary (see Figure 2).

Due to the selection of nodal polynomials with nodes at the element faces, the amount of data (in doubles) to be transferred during ghost-value exchanges can be estimated as follows:

$$|L_i'|^{\frac{d-1}{d}} \le \frac{(\text{data amount})_i}{2 \cdot d \cdot (k+1)^{d-1}} \le |L_i|, \quad i = 0, \ldots, p-1, \quad (10)$$

where $p$ is the number of partitions. The lower bound is given by the data to be transferred on an idealized hypercube partition with $|L_i'| \approx \hat{N}/(k+1)^d$ cells, where $\hat{N} \approx N/p$ is the averaged number of locally owned degrees of freedom of process $i$, and the upper bound is defined by the case that all locally owned cells are unconnected. The total amount of data to be transferred,

$$2 \cdot d \cdot N^{\frac{d-1}{d}} \cdot p^{\frac{1}{d}} \le \sum_i (\text{data amount})_i, \quad (11)$$

can be minimized by well-shaped (hypercube-like) partitions and by subdomains containing a large number of cells. The latter approach competes with the domain-decomposition parallelization approach for a given mesh size. However, the amount of data exchange becomes smaller when using shared memory (see also Section 4).

The partitioning does not only influence the data volume to be transferred but also the memory overhead resulting from the need to allocate memory for at least two buffers of a size bounded by Equation (11) for sending and receiving the data. Generally, one buffer is part of the actual distributed solution vector, which makes the data access to the degrees of freedom of neighboring cells simpler during operator evaluations. At least on one side of the transfer, data has to be packed and unpacked before or after each communication step, leading to the requirement of the second buffer [25].

Similar to the shape functions, arbitrary-dimensional quadrature rules are expressed as a tensor product of 1D quadrature rules (with $n_q$ points). For the cell integral, we get

$$Q_{n_q}^d = \underbrace{Q_{n_q} \otimes \ldots \otimes Q_{n_q}}_{\times d} \quad (12)$$

with the evaluation points given as $\vec{x}_{n_q}^d = x_{n_q} \otimes \ldots \otimes x_{n_q}$ and the quadrature weights as $w_{n_q}^d = w_{n_q} \cdot \ldots \cdot w_{n_q}$. For the face integrals of the faces $2f$ and $2f+1$, we have

$$Q_{n_q}^{d-1} = \underbrace{Q_{n_q} \otimes \cdots \otimes Q_{n_q}}_{\times (f-1)} \otimes\, Q_f \otimes \underbrace{Q_{n_q} \otimes \cdots \otimes Q_{n_q}}_{\times (d-f)} \quad \text{with } Q_f \in \{0, 1\}. \quad (13)$$

A widespread approach is the Gauss–Legendre family of quadrature rules (see also Figure 2), which are exact for polynomials of degree $2 n_q - 1$. The most efficient implementations of these rules perform a non-trivial interpolation operation from the Gauss–Lobatto to the Gauss–Legendre points. Subsection 2.6 discusses an efficient approach to perform this basis change and to compute gradients at the quadrature points. An alternative is the Gauss–Lobatto family of quadrature rules, which do not require a basis change but are only exact for polynomials of degree $2 n_q - 3$. We will consider the Gauss–Legendre family of quadrature rules to quantify the overhead resulting from the basis change.

To be able to evaluate Equation (5), the Jacobian matrix $J_q \in \mathbb{R}^{d \times d}$, which can be derived from the element geometry, and its determinant $|J_q| \in \mathbb{R}$ are needed at each cell quadrature point. At face quadrature points, the Jacobian determinant and the face normal $\vec{n}_q \in \mathbb{R}^d$ are needed. These quantities are generally precomputed once during initialization. This leads to an additional memory consumption per final unknown of

$$\frac{n_q^d \cdot (d^2 + 1) + 2 \cdot d \cdot n_q^{d-1} \cdot (d+1)}{(k+1)^d} = O(d^2). \quad (14)$$

For affine meshes, only one set of mapping quantities has to be precomputed and cached, since they are the same for all quadrature points. As we are considering complex non-affine meshes in this paper, we will use simulations with these optimization techniques specific for affine or Cartesian grids only to quantify the quality of our implementation.

Let us now turn to the cell integral of the advection operator $A$ (see Equation (5)) as an example:

$$-\left(J^{-T} \nabla_{\vec{\xi}}\, v, |J|\, \vec{a} f\right)_{\Omega_0^{(e)}} \approx -\sum_q \left(\nabla_{\vec{\xi}}\, v_q, J_q^{-1} w_q |J_q|\, \vec{a}_q f_q\right)_{\Omega_0^{(e)}}. \quad (15)$$

As visualized in Figure 2, the steps of the evaluation of cell integrals are as follows: 1) gather the $(k+1)^d$ cell-local values $f_i$, 2) interpolate the values to the quadrature points $f_q$, 3) perform the operations $J_q^{-1} w_q |J_q|\, \vec{a}_q f_q$ at each quadrature point, 4) test with the gradient of the shape functions, and 5) write back the local contributions into the global vector.
The most efficient implementations of the basis change (from Gauss–Lobatto to Gauss–Legendre points) perform a sequence of $d$ 1D interpolation sweeps, utilizing the tensor-product form of the shape functions in an even-odd decomposition fashion, which exploits the symmetry of the 1D interpolation matrices to roughly halve the number of multiplications (here for the case $k+1 = n_q$) [25]. This operation is known as sum factorization and has its origin in the spectral-element community [13; 29; 30]. Similarly, the testing with the gradient of the shape functions can be performed efficiently with $2d$ sweeps [25].

Figure 2: Visualization of the 5 steps of a matrix-free cell-integral evaluation for polynomial degree $k = 2$ and number of quadrature points $n_q = 3$.

The same five steps can be performed for all faces $f \in F$ to compute the flux between two neighboring cells once (with the difference that steps 1–2 and 4–5 are performed for both sides of a face) in a separate loop known as a "face-centric loop" (FCL). However, recently the benefits of processing a cell and, in direct succession, all its $2d$ faces (i.e., visiting all faces twice), as in the case of "element-centric loops" (ECL), have become clear for modern CPU processor architectures in the literature [25], although this kind of loop implies that fluxes have to be computed twice (once for each side of an interior face). The reasons for the advantage of ECL are twofold: On the one hand, entries in the solution vector are written back to main memory exactly once in the case of ECL, while in the case of FCL they are written more than once (despite cache-efficient scheduling of cell and face loops) due to cache capacity misses. On the other hand, since each entry of the solution vector is accessed exactly once, no synchronization between threads is needed while accessing the solution vector in the case of ECL. This absence of race conditions during writing into the destination vector makes ECL particularly suitable for shared-memory parallelization. Since the exploitation of shared memory is a key ingredient for the reduction of the communication and, as a consequence, of the memory consumption (as already discussed in Subsection 2.4), we believe that ECL is a suitable approach to enable computations with larger problem sizes despite the possible increase in computations.

One should also note that although fluxes are computed twice in the case of ECL, this does not automatically translate into a doubling of the computations, since values already interpolated to the cell quadrature points can be interpolated to a face with a single 1D sweep.

It is common practice to process a batch of $v_{\text{len}}$ cells by one instruction, known as vectorization over elements. This kind of processing can be accomplished by not operating directly on the primitive types double/float but on structs built around intrinsic instructions, with each vector lane dedicated to a separate cell of the mesh. Depending on the given hardware, up to $v_{\text{len}}$ = 2 (SSE2), = 4 (AVX), or = 8 doubles (AVX-512 instruction-set extension, as most modern Intel-based processors have) can be processed by a single instruction. Although the working-set size increases by a factor of $v_{\text{len}}$ [20], this vectorization strategy showed a better performance than alternative vectorization strategies in 2D and 3D (for their description see [25]) and than the auto-vectorization by the compiler.

As the conclusion of this section, Table 1 gives a rough estimate of the working sets for the matrix-free evaluation of the advection operator at different stages of the algorithm.
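To illustrate the 1D sweeps described above, the following sketch (generic code, not the hyper.deal kernel; the plain variant without the even-odd optimization, with an assumed lexicographic array layout) applies one 1D interpolation matrix along a single direction of a d-dimensional tensor of nodal values; a full basis change applies it once per direction.

```cpp
// Minimal sketch of one sum-factorization sweep (not the hyper.deal kernel):
// apply a 1D matrix S (n x n, e.g., Gauss-Lobatto -> Gauss-Legendre) along
// direction 'dir' of a d-dimensional tensor with n points per direction,
// stored lexicographically (index = i_0 + n*i_1 + n^2*i_2 + ...).
#include <cstddef>
#include <vector>

void sweep_1d(const std::vector<double> &S,   // n*n entries, row-major: S[q*n + i]
              const std::vector<double> &in,  // n^d values
              std::vector<double>       &out, // n^d values
              const std::size_t n, const unsigned int d, const unsigned int dir)
{
  std::size_t stride = 1; // distance between neighbors along 'dir'
  for (unsigned int e = 0; e < dir; ++e)
    stride *= n;

  std::size_t n_total = 1;
  for (unsigned int e = 0; e < d; ++e)
    n_total *= n;

  out.assign(n_total, 0.0);
  for (std::size_t idx = 0; idx < n_total; ++idx)
    {
      const std::size_t i    = (idx / stride) % n;  // coordinate along 'dir'
      const std::size_t base = idx - i * stride;    // same index with that coordinate zeroed
      // out(..., q, ...) += S(q, i) * in(..., i, ...)
      for (std::size_t q = 0; q < n; ++q)
        out[base + q * stride] += S[q * n + i] * in[idx];
    }
}
```

Production code additionally exploits the even-odd symmetry of S and vectorizes over batches of cells, but the data-access pattern (one full pass over the tensor per direction) is the same as sketched here.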
Table 2 shows the estimated minimal memory consumption of the complete simulation. The number of ghost degrees of freedom also gives an estimate for the amount of data to be communicated.

Table 1: Estimated working set of different stages of a matrix-free evaluation of the advection operator for ECL and vectorization over elements: (1) includes both the source and the destination element vector, (2) includes the buffers needed during testing and face evaluation, (3) includes the degrees of freedom of neighboring cells, needed during flux computation.

  stage                    working set
  (1) sum factorization    $> v_{\text{len}} \cdot \max\left((k+1)^d, n_q^d\right)$
  (2) derived quantities   $> (d+1) \cdot v_{\text{len}} \cdot n_q^d$
  (3) flux computation     $\gg (2 \cdot d + 1) \cdot v_{\text{len}} \cdot n_q^d$

Table 2: Estimated minimal memory consumption for $d$-dimensional simulations with $N$ degrees of freedom on $p$ processes.

  reason                      times   amount
  (1) ghost DoFs (+ buffer)   2       $2 \cdot d \cdot N^{\frac{d-1}{d}} \cdot p^{\frac{1}{d}}$
  (2) vector                  2+      $N$
  (3) mapping                 1+      $d^2 \cdot N$

The descriptions in Sections 1 and 2 regarding finite-element methods have been general for $d$-dimensional space $\Omega$. As a consequence, all algorithms also work in high dimensions. However, we face two major challenges: The first one is the lack of libraries designed for high dimensions. For example, the libraries deal.II and p4est are limited to dimensions up to three. If we would "naïvely" extend these libraries to high dimensions, the second problem would be that some algorithms do not scale to higher dimensions. The following specific difficulties would arise:

(1) Significant memory overhead due to ghost values and mapping: In high dimensions, solution vectors (2–3 are needed, depending on the selected time discretization scheme) are huge ($O(N_D^d)$, with $N_D$ the number of degrees of freedom in each direction), which are needed to achieve the required resolution. Also the ghost values and the mapping have significant memory requirements in high dimensions: The evaluation of the advection cell integral on complex geometries needs among other things the Jacobian matrix of size $O(d^2)$ at each quadrature point. If precomputed, this implies an at least 36-fold memory consumption compared to the actual vector in 6D. For high-dimensional problems, this is not feasible, as only little memory would remain for the actual solution vectors and only problems with significantly smaller resolutions could be solved.

(2) Increased ghost-value exchange due to increased surface-to-volume ratio: The communication amount scales, according to Equation (11), with $2 \cdot d \cdot N^{(d-1)/d} \cdot p^{1/d}$. According to [25], the MPI ghost-value exchange already leads to a noticeable share of time in purely MPI-parallelized applications (30% for the Laplacian) in comparison to the highly efficient matrix-free operator evaluations if computations are performed on a single compute node. For high dimensions, the situation is even worse: an estimation with $d = 6$, $N = 10^{12}$, $p = 1024 \cdot 48$ (1024 compute nodes with 48 processes each) shows that the size of the ghost values is at least 72% of the size of the actual solution vector (see the worked example below). The usage of shared memory is inevitable to decrease the memory consumption and the time spent in communication: For the given example, the size of the ghost values could be halved to 37% if all 48 processes on a compute node shared their locally owned values.
(3) Decreased efficiency of the operator evaluation due to exponentially increasing size of the working sets: The working set of a cell is at least $O(\max((k+1)^d, n_q^d))$ (cf. Table 1) so that for high order and/or dimension the data eventually drops out of the cache during each sum-factorization sweep of one cell.

While this work cannot solve all these difficulties, we will show how it is possible to mitigate them: We address problem (1) by restricting ourselves to the tensor product of two grids in 1-3D. This reduces the size of the mapping data and makes it, moreover, possible to reuse much of the infrastructure available in the baseline library deal.II. We will describe in the next section how such a tensor product can be formed. Problem (2) demonstrates that it is essential to exploit shared-memory parallelism, in particular in high dimensions. In Section 4, we propose a novel shared-memory implementation that is based on MPI-3.0. To mitigate problem (3), we try to minimize the number of cache misses due to increased working-set sizes by reorganizing the loops. In this respect, we will demonstrate the benefit of vectorizing over fewer elements than the given instruction-set extensions allow. We accomplish this by explicitly using narrower instructions, such as AVX2 or SSE2, instead of AVX-512, or by working directly with doubles and relying on the auto-vectorization by the compiler. We defer the investigation of explicit vectorization within elements to future work.
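For concreteness, the ghost-value share quoted in difficulty (2) follows directly from Equation (11); this is a worked check of those numbers, not additional material from the paper:

$$\frac{\sum_i (\text{data amount})_i}{N} \;\ge\; \frac{2 \cdot d \cdot N^{\frac{d-1}{d}} \cdot p^{\frac{1}{d}}}{N} \;=\; 2 \cdot d \cdot \left(\frac{p}{N}\right)^{\frac{1}{d}} \;=\; 2 \cdot 6 \cdot \left(\frac{49152}{10^{12}}\right)^{\frac{1}{6}} \;\approx\; 0.72.$$

Replacing $p$ by the number of compute nodes in the same expression yields the reduced share obtained when all processes of a node share their locally owned values.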
In this section, we explain how a tensor product of two meshes can be used to solve problems in up to six dimensions. In our sample application of an advection equation in phase space, separating the meshes in configuration space as well as in velocity space is natural, which is why we use the indices $\vec{x}$ and $\vec{v}$ for the two parts of the dimensions.

In the following, we assume that our computational domain is given as a tensor product of two domains $\Omega := \Omega_{\vec{x}} \otimes \Omega_{\vec{v}}$. The boundary of the high-dimensional domain is then described by $\Gamma := (\Gamma_{\vec{x}} \otimes \Omega_{\vec{v}}) \cup (\Omega_{\vec{x}} \otimes \Gamma_{\vec{v}})$. Let us reformulate and simplify the discretized advection equation (see Equation (5)) for the phase space, exploiting the fact that the Jacobian matrix $J$ is a block-diagonal matrix in phase space,

$$J = \begin{pmatrix} J_{\vec{x}} & \\ & J_{\vec{v}} \end{pmatrix}, \quad (16)$$

with the blocks being the respective matrices of the $\vec{x}$-space and the $\vec{v}$-space. As a consequence, the inverse $J^{-1}$ is the inverse of each of its blocks and its determinant $|J|$ is the product of the determinants of each of its blocks. Exploiting the block-diagonal structure of $J$, some operations in Equation (5) can be simplified:

$$J^{-1} (\vec{a} f) = \begin{pmatrix} J_{\vec{x}}^{-1} & \\ & J_{\vec{v}}^{-1} \end{pmatrix} \begin{pmatrix} \vec{a}_{\vec{x}} \\ \vec{a}_{\vec{v}} \end{pmatrix} f = \begin{pmatrix} J_{\vec{x}}^{-1} \vec{a}_{\vec{x}} \\ J_{\vec{v}}^{-1} \vec{a}_{\vec{v}} \end{pmatrix} f. \quad (17)$$

Furthermore, face integrals over the faces in phase space, $\Gamma^{(e)} = (\Gamma^{(e)}_{\vec{x}} \otimes \Omega^{(e)}_{\vec{v}}) \cup (\Omega^{(e)}_{\vec{x}} \otimes \Gamma^{(e)}_{\vec{v}})$, can be split into integration over $\vec{x}$-space faces $\Gamma^{(e)}_{\vec{x}} \otimes \Omega^{(e)}_{\vec{v}}$ and integration over $\vec{v}$-space faces $\Omega^{(e)}_{\vec{x}} \otimes \Gamma^{(e)}_{\vec{v}}$. Due to $\vec{n}^\top = (\vec{n}^\top_{\vec{x}}, \vec{0}^\top)$ for $\vec{x}$-space faces, $\vec{n} \cdot \vec{a} = \vec{n}_{\vec{x}} \cdot \vec{a}_{\vec{x}}$ holds there; the same is true for $\vec{v}$-space faces, where $\vec{n} \cdot \vec{a} = \vec{n}_{\vec{v}} \cdot \vec{a}_{\vec{v}}$. Finally, Equation (5), which can be solved efficiently with hyper.deal, becomes:

$$\left(g, \frac{\partial f}{\partial t}\right)_{\Omega^{(e)}} - \left(\nabla_{\xi}\, g, \begin{pmatrix} J_{\vec{x}}^{-1} \vec{a}_{\vec{x}} \\ J_{\vec{v}}^{-1} \vec{a}_{\vec{v}} \end{pmatrix} f\right)_{\Omega^{(e)}} = \left(g, \vec{n}_{\vec{x}} \cdot (\vec{a}_{\vec{x}} f)^*\right)_{\Gamma^{(e)}_{\vec{x}} \otimes \Omega^{(e)}_{\vec{v}}} + \left(g, \vec{n}_{\vec{v}} \cdot (\vec{a}_{\vec{v}} f)^*\right)_{\Omega^{(e)}_{\vec{x}} \otimes \Gamma^{(e)}_{\vec{v}}}. \quad (18)$$

Naturally, both domains $\Omega_{\vec{x}}$ and $\Omega_{\vec{v}}$ can be meshed separately. Note, however, that this restricts the possibilities of mesh refinement to the two spaces separately from each other. As a consequence, the final triangulation results from the tensor product of the two triangulations $\mathcal{T}_{\vec{x}}$ and $\mathcal{T}_{\vec{v}}$ (visualized in Figure 3),

$$\mathcal{T} := \mathcal{T}_{\vec{x}} \otimes \mathcal{T}_{\vec{v}}. \quad (19)$$

In this context, cells $C$, inner faces $I$, and boundary faces $B$ are defined as

$$C := C_{\vec{x}} \otimes C_{\vec{v}}, \quad I := (I_{\vec{x}} \otimes C_{\vec{v}}) \cup (C_{\vec{x}} \otimes I_{\vec{v}}), \quad B := (B_{\vec{x}} \otimes C_{\vec{v}}) \cup (C_{\vec{x}} \otimes B_{\vec{v}}). \quad (20)$$

The cells in both low-dimensional triangulations are enumerated independently along space-filling curves. The enumeration of the cells of $\mathcal{T}$ can be chosen arbitrarily. However, in Subsection 3.2 we propose an order that simplifies the initial setup.

Figure 3: On-the-fly mesh generation of a distributed triangulation in hyper.deal by taking the tensor product of two low-dimensional triangulations from the base library deal.II. Cells are ordered lexicographically within a process (as are the processes themselves), leading to the depicted space-filling curve.

Figure 4: Local view of an arbitrary process: local cells and ghost cells.

Note: By simplifying the $d_{\vec{x}}$-dimensional triangulation $\mathcal{T}_{\vec{x}}$ and the $d_{\vec{v}}$-dimensional triangulation $\mathcal{T}_{\vec{v}}$ to a 1D space-filling curve in each case and by taking the tensor product of these curves, we get, depending on the point of view, a 2D Cartesian grid or a matrix. As a result, we use well-known concepts from these fields of research in the library hyper.deal.
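The tensor-product structure of $C = C_{\vec{x}} \otimes C_{\vec{v}}$ can be made concrete with a small index-mapping sketch (the lexicographic ordering is an illustrative assumption, not the actual hyper.deal data structure); each pair of low-dimensional cell indices identifies one phase-space cell with $(k+1)^{d_{\vec{x}}+d_{\vec{v}}}$ unknowns:

```cpp
// Sketch (assumed lexicographic ordering of cell pairs; not hyper.deal code).
#include <cstddef>
#include <utility>

struct PhaseSpaceCellIndexing
{
  std::size_t n_cells_x; // |C_x|
  std::size_t n_cells_v; // |C_v|

  // Combined index of the cell pair (c_x, c_v), with the x-index running
  // fastest, i.e., traversing the 2D grid of Figure 3 row by row.
  std::size_t combined(const std::size_t c_x, const std::size_t c_v) const
  {
    return c_v * n_cells_x + c_x;
  }

  // Inverse mapping: recover (c_x, c_v) from the combined index.
  std::pair<std::size_t, std::size_t> split(const std::size_t c) const
  {
    return {c % n_cells_x, c / n_cells_x};
  }
};
```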
Figure 5: Four MPI communicators used in hyper.deal (global, row, column, sm comm) for a hypothetical setup of 26 ranks in MPI_COMM_WORLD and of a 6 × 4 process grid.

However, it should be emphasized that the analogy is not immediate. While the neighborship of a cell on a 2D grid is clear, this is not the case for the high-dimensional triangulation that is depicted as a 2D grid: The number of neighbors is significantly larger, and neighboring cells might be disjoint in this case (see Figure 4). What is true for individual cells is also true for partitions, so that the communication patterns are (structured but) significantly more complicated and communication over the boundaries of nodes is unavoidable.

Since we are using a domain decomposition from the base library deal.II for $\mathcal{T}_{\vec{x}}$ and $\mathcal{T}_{\vec{v}}$, we assume these meshes are already split up independently into $p_{\vec{x}}$ and $p_{\vec{v}}$ subpartitions with $\mathcal{T}_{\vec{x}} = \biguplus_{0 \le i < p_{\vec{x}}} \mathcal{T}_{\vec{x}}^{(i)}$.
The resulting pairs of subpartitions are assigned to MPI ranks by a function $f(i, j)$; for this assignment, also graph partitioners like Zoltan might be beneficial.

In Subsection 2.7, we have discussed the importance of the usage of shared memory for solving high-dimensional problems. Placing ranks according to $\lfloor f(i, j) / p_{\text{node}} \rfloor$ onto the same compute node (with $p_{\text{node}}$ being the number of processes per node) leads to a striped partitioning (see the colors of the blocks in Figure 5). This results in a suboptimal shape of the union of subpartitions belonging to the same compute node, leading to a decreased benefit of the usage of shared memory. In order to improve the placing of subpartitions onto the compute nodes without having to change the function $f$, we operate on a virtual topology, as will be presented in Section 4.

Since we are considering shape functions that are derived by the tensor product of 1D shape functions, i.e., $P_k^{d_{\vec{x}}} = \underbrace{P_k \otimes \cdots \otimes P_k}_{\times d_{\vec{x}}}$ and $P_k^{d_{\vec{v}}} = \underbrace{P_k \otimes \cdots \otimes P_k}_{\times d_{\vec{v}}}$, the extension to higher dimensions is trivial:

$$P_k^{d_{\vec{x}} + d_{\vec{v}}} = \underbrace{P_k \otimes \cdots \otimes P_k}_{\times d_{\vec{x}}} \otimes \underbrace{P_k \otimes \cdots \otimes P_k}_{\times d_{\vec{v}}} = P_k^{d_{\vec{x}}} \otimes P_k^{d_{\vec{v}}}. \quad (23)$$

Table 3: Comparison of the memory consumption (in doubles) of the mapping data (per quadrature point) if the phase-space structure is exploited ($J_{\vec{x}}$ and $J_{\vec{v}}$) and if the phase-space structure is not exploited ($J$).

  quantity    $J_{\vec{x}}$ and $J_{\vec{v}}$ (exploited)                                              $J$ (not exploited)                                              example: $k = 3$, $d = 6$
  Jacobian    $(k+1)^{d_{\vec{x}}} \cdot d_{\vec{x}}^2 + (k+1)^{d_{\vec{v}}} \cdot d_{\vec{v}}^2$      $(k+1)^{d_{\vec{x}}+d_{\vec{v}}} \cdot (d_{\vec{x}}+d_{\vec{v}})^2$   $1152 \ll 147456$

This relationship is also true for the quadrature formulas so that a basis change can be performed, as usual, with $d_{\vec{x}} + d_{\vec{v}}$ sum-factorization sweeps. As a consequence, only the mapping data from the lower-dimensional spaces (e.g., Jacobian matrix and its determinant) have to be precomputed and can be reused, which leads to a significantly reduced memory consumption even for complex geometries, as shown in Table 3.

The extension of operator evaluations with sum factorization to higher dimensions is also straightforward. Instead of looping over all cells of a triangulation, it requires an iteration over all possible pairs of cells from both low-dimensional triangulations, $(c_{\vec{x}}, c_{\vec{v}}) \in C_{\vec{x}} \times C_{\vec{v}}$, in the case of ECL. In addition to cell pairs, FCL iterates also over cell-face and cell-boundary-face pairs as expressed by Equation (20). This comes in handy, since the mapping information of the cells $c_{\vec{x}}$ and $c_{\vec{v}}$ as well as of the faces $f_{\vec{x}}$ and $f_{\vec{v}}$ can be queried from the low-dimensional library independently and is only combined on the fly. The separate cell IDs $c_{\vec{x}}$ and $c_{\vec{v}}$ only have to be combined when accessing the solution vector, which is the only data structure set up for the whole high-dimensional space.

Algorithms 1–2 show the pseudocode of a possible matrix-free advection operator evaluation. As an example, Algorithm 1 contains an element-centric loop (ECL) iterating over all cell pairs and calling the function that should be evaluated on the cell pairs.
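The on-the-fly combination of the precomputed low-dimensional mapping data mentioned above can be sketched as follows (illustrative names, not the hyper.deal interface; the quantities are those of Table 3 and Algorithm 2):

```cpp
// Sketch: combine precomputed x-/v-space mapping data on the fly for a
// phase-space quadrature point q = (q_x, q_v); names are illustrative.
#include <vector>

struct MappingDataLowDim
{
  std::vector<double> JxW; // |J_q| * w_q for every low-dimensional quadrature point
};

inline double phase_space_JxW(const MappingDataLowDim &x_space,
                              const MappingDataLowDim &v_space,
                              const unsigned int q_x, const unsigned int q_v)
{
  // |J(q)| w_q = |J_{q_x}| |J_{q_v}| w_{q_x} w_{q_v}, cf. Algorithm 2.
  return x_space.JxW[q_x] * v_space.JxW[q_v];
}
```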
Up to here, the algorithm is independent of the equation to be solved; the equation comes into play (apart from the specific ghost-value update) through the called function, which might be the advection operator in Algorithm 2. Furthermore, we note that lines 2–4, 8, 10, 12 in Algorithm 2 are evaluated with sum factorization. To reduce the working set, we do not compute all $(d_{\vec{x}} + d_{\vec{v}})$ derivatives at once, but first we compute the contributions from $\vec{x}$-space and then the contributions from $\vec{v}$-space. One could reduce the working set even more by loop blocking [25], but we deal with a generic variant here.

The library hyper.deal provides classes that contain, inter alia, utility functions needed in Algorithm 2 and are built around deal.II classes. To enable a smooth start for users already familiar with deal.II, we have chosen the same class and function names, living in the namespace hyperdeal. The relationship between some hyper.deal classes and some deal.II classes is visualized in the UML diagram in Figure 6. The class hyperdeal::MatrixFree is responsible for looping over cells (and faces) as well as for storing precomputed information related to shape functions and precomputed quantities at the quadrature points. The classes hyperdeal::FEEvaluation and hyperdeal::FEFaceEvaluation (not shown) contain utility functions for the current cells and faces: These utility functions include functions to read and write cell-/face-local values from a global vector as well as operations at the quadrature points. As an example, Figure 6 shows the implementation of the hyperdeal::FEEvaluation::submit_gradient() method, which uses for the evaluation of $f(\vec{u}) = J_{c,q}^{-1} |J_q| w_q \vec{u}$, $\vec{u} \in \mathbb{R}^d$, two instances of the deal.II class with the same name, one for $\vec{x}$- and one for $\vec{v}$-space.

We use "vectorization over elements" from deal.II as the vectorization strategy. To be precise, we vectorize only over elements in $\vec{x}$-space, whereas $\vec{v}$-space is not vectorized. Note that in the Vlasov–Maxwell or Vlasov–Poisson model, where parts of the code operate on the full phase space and other parts on the $\vec{x}$-space only, it is important to vectorize over $\vec{x}$. Then, the data structures are already laid out correctly for an efficient matrix-free solution of the lower-dimensional problem.

Algorithm 1: Element-centric loop for arbitrary operators
  update ghost values: import vector values of $\vec{u}$ from MPI processes that are adjacent to locally owned cells
  /* loop over all cell pairs */
  foreach $(e_{\vec{x}}, e_{\vec{v}}) \in C_{\vec{x}} \times C_{\vec{v}}$ do
    process_cell($e_{\vec{x}}$, $e_{\vec{v}}$)   /* e.g., advection cell operator in Algorithm 2 */
  end
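A minimal C++ rendering of Algorithm 1 (a sketch with hypothetical names; the actual hyper.deal loop additionally handles vectorized cell batches and the ghost-value update) could look as follows:

```cpp
// Sketch of an element-centric loop over cell pairs (hypothetical types/names,
// not the hyper.deal interface).
#include <cstddef>
#include <functional>

template <typename VectorType>
void element_centric_loop(
  const std::size_t n_cells_x, // |C_x|
  const std::size_t n_cells_v, // |C_v|
  const std::function<void(std::size_t, std::size_t, const VectorType &, VectorType &)>
    &process_cell, // e.g., the advection cell operator of Algorithm 2
  const VectorType &src,
  VectorType       &dst)
{
  // In the full implementation, ghost values of 'src' would be updated here.
  for (std::size_t e_v = 0; e_v < n_cells_v; ++e_v)
    for (std::size_t e_x = 0; e_x < n_cells_x; ++e_x)
      process_cell(e_x, e_v, src, dst); // each dst entry is written exactly once (ECL)
}
```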
Algorithm 2: DG integration of a cell batch for advection operator evaluation for ECL and for vectorization over elements
  /* step 1: gather values (AoS → SoA) */
  1: gather the local vector values $u_i^{(e)}$ of the cell from the global input vector $\vec{u}$
  /* step 2: apply advection cell contributions */
  2: interpolate the local vector values $\vec{u}^{(e)}$ onto the quadrature points, $u_h^{(e)}(\vec{\xi}_q) = \sum_i \phi_i u_i^{(e)}$
  3: for each quadrature index $q = (q_{\vec{x}}, q_{\vec{v}})$, prepare the integrand at each quadrature point by computing $\vec{t}_q = J^{-1}_{(e_{\vec{x}})}\, \vec{c}_{\vec{x}}\big(\hat{x}^{(e_{\vec{x}})}(\vec{\xi}_{q_{\vec{x}}}), \hat{x}^{(e_{\vec{v}})}(\vec{\xi}_{q_{\vec{v}}})\big)\, u_h^{(e)}(\vec{\xi}_q)\, \underbrace{|J_{q_{\vec{x}}}| |J_{q_{\vec{v}}}| w_{q_{\vec{x}}} w_{q_{\vec{v}}}}_{"|J(q)|\, w_q"}$ and evaluate the local integrals by quadrature, $b_i = \big(\nabla_{\vec{x}} \phi_i^{\text{co}}, \vec{c}_{\vec{x}} u_h^{(e)}\big)_{\Omega^{(e)}} \approx \sum_q \nabla_{\vec{x}} \phi_i^{\text{co}}(\vec{\xi}_q) \cdot \vec{t}_q$   /* buffer $\vec{b}$ */
  4: for each quadrature index $q = (q_{\vec{x}}, q_{\vec{v}})$, prepare the integrand at each quadrature point by computing $\vec{t}_q = J^{-1}_{(e_{\vec{v}})}\, \vec{c}_{\vec{v}}\big(\hat{x}^{(e_{\vec{x}})}(\vec{\xi}_{q_{\vec{x}}}), \hat{x}^{(e_{\vec{v}})}(\vec{\xi}_{q_{\vec{v}}})\big)\, u_h^{(e)}(\vec{\xi}_q)\, |J_{q_{\vec{x}}}| |J_{q_{\vec{v}}}| w_{q_{\vec{x}}} w_{q_{\vec{v}}}$ and evaluate the local integrals by quadrature, $b_i = b_i + \big(\nabla_{\vec{v}} \phi_i^{\text{co}}, \vec{c}_{\vec{v}} u_h^{(e)}\big)_{\Omega^{(e)}} \approx b_i + \sum_q \nabla_{\vec{v}} \phi_i^{\text{co}}(\vec{\xi}_q) \cdot \vec{t}_q$
  /* step 3: apply advection face contributions (loop over all $2d$ faces of $\Omega^{(e)}$) */
  5: foreach $f \in F^{(e)}$ do
  6:   interpolate values from the cell array $\vec{u}^{(e)}$ to the quadrature points of the face
  7:   if not on the boundary, gather values from the neighbor $\Omega^{(e+)}$ of the current face
  8:   interpolate $u^+$ onto the face quadrature points
  9:   compute the numerical flux and multiply by the quadrature weights   /* not shown here */
  10:  evaluate the local integrals related to cell $e$ by quadrature and add them into the cell contribution $b_i$
  11: end
  /* step 4: apply inverse mass matrix */
  12: for each quadrature index $q = (q_{\vec{x}}, q_{\vec{v}})$, prepare the integrand at each quadrature point by computing $t_q = b_i\, w_{q_{\vec{x}}}^{-1} w_{q_{\vec{v}}}^{-1}$ and evaluate the local integrals by quadrature, $y_i^{(e)} = \sum_q \tilde{\phi}_{iq}\, t_q$ with $\tilde{\phi}_{iq} = V^{-1}_{iq}$, where $V_{iq} = \phi_i(\vec{\xi}_q)$
  /* step 5: scatter values (SoA → AoS) */
  13: set all contributions of the cell, $\vec{y}^{(e)}$, into the global result vector $\vec{y}$

The parallelization of the library hyper.deal is purely MPI-based with one MPI process per physical core. Each MPI process possesses its own subpartition and has a halo of ghost faces (see Figures 1 and 4). MPI allows to program for distributed systems, which is crucial for solving high-dimensional problems due to their immense memory requirements. A downside of a purely MPI-based code is that many data structures (incl. ghost values) are created and updated multiple times on the same compute node, although they could be shared. In FEM codes, it is widespread that ghost values filled by standard
MPI_(I)Send / MPI_(I)Recv reside in an additional section of the solution vector. Depending on the MPI implementation, these operations will be replaced by efficient alternatives (e.g., based on memcpy) if the calling peers are on the same compute node. Nevertheless, the allocation of additional memory might be unacceptable if main memory is scarce.

Adding shared-memory libraries like TBB or OpenMP to an existing MPI program would allow to use shared memory; however, this comes with an overhead for the programmer, since all parallelizable code sections have to be found and transformed according to the library used, including the difficulty when some third-party numerical library like an iterative solver package only relies on MPI.
Figure 6: Class diagram of part of the matrix-free infrastructure of hyper.deal. It presents how classes from hyper.deal (namespace hyperdeal, highlighted blue) and from deal.II (namespace dealii, highlighted yellow) relate to each other. Only those hyper.deal methods are shown that are relevant for the evaluation of the advection operator, together with the deal.II methods that are used in them.

Figure 7: Shared-memory domain: a) partitioned distributed triangulation; b-c) locally relevant cells of ranks 4 and 5; d) locally relevant cells of the shared-memory domain (containing ranks 4 and 5) with a shared ghost layer and without ghost cells between the processes in the same shared-memory domain.

Figure 8: Two hybrid ghost-value-update approaches for a hypothetical setup with 2 nodes, each with two cores; only the communication pattern of rank 2 is considered. (a) A hybrid approach using MPI-3.0 shared-memory features (buffered mode). Ghost values are updated via send/recv between nodes or explicitly via memory copy within the node. The similarity to a standard MPI implementation is clear, with the difference that memcpy is called directly by the program, making packing/unpacking of data superfluous. (b) A hybrid approach using MPI-3.0 shared-memory features (non-buffered mode), similar to 8a, with the difference that only ghost values that live in different shared-memory domains are updated. Ghost values living in the same shared-memory domain are directly accessed only when needed. The similarity to a thread-based implementation is clear, with the differences that vectors are non-contiguous, requiring an indirect access to values owned by other processes, and that each process may manage its own ghost values and send/receive its own medium-sized messages needed to fully utilize the network controller.

Figure 9: Renumbering of ranks in the global communicator (via MPI_Comm_split): For a hypothetical setup of 26 ranks in MPI_COMM_WORLD and of a 6 × 4 process grid.

Considering a purely MPI-parallelized FEM application, one can identify that the major time and memory benefit of using shared memory would come from accessing the part of the solution vector owned by the processes on the same compute node without the need to make explicit copies and to buffer them. This is why we propose a new vector class that uses
MPI-3.0 features to allocate shared memory and provides controlled access to it, while retaining the same vector interface and adding only a few new methods.
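The first ingredient for such a vector is the communicator of all ranks on the same compute node (the sm comm of Figures 5 and 9). A minimal sketch using a standard MPI-3.0 call follows; the variable names are illustrative:

```cpp
// Sketch: obtain the shared-memory (node-local) communicator with MPI-3.0.
#include <mpi.h>

MPI_Comm get_node_communicator(const MPI_Comm global_comm)
{
  MPI_Comm sm_comm;
  // Groups all ranks of 'global_comm' that can share memory (i.e., that are
  // placed on the same compute node) into one sub-communicator.
  MPI_Comm_split_type(global_comm, MPI_COMM_TYPE_SHARED,
                      /*key=*/0, MPI_INFO_NULL, &sm_comm);
  return sm_comm;
}
```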
MPI provides the function MPI_Win_allocate_shared to allocate non-contiguous memory that is shared among all processes in the subcommunicator containing all MPI processes on the same compute node. To query the beginning of the local array of each process, it provides the function MPI_Win_shared_query. These two functions enable us to allocate the memory needed by each process for its locally owned cells and for the ghost faces that are not owned by any process on the same compute node, as well as to get hold of the beginning of the arrays of the other processes. Appendix A provides further implementation details of the allocation/deallocation process of the shared memory in the vector class. With basic preprocessing steps, the address of each cell residing on the same compute node can be determined so that the processes have access to all degrees of freedom owned by that compute node (see Figure 7). A natural way to access the solution vector is by specifying vector entry indices and the cell ID for degrees of freedom owned by a cell or by specifying a pair of a cell ID and a face number ($< 2d$) for degrees of freedom owned by faces.

In hyper.deal, the class dealii::LinearAlgebra::SharedMPI::Vector manages the shared memory and also provides access to the values of the degrees of freedom of the local and the ghost cells: It returns either pointers to buffers or to the shared memory, depending on the cell type (see Figure 8b). In this way, the user of the vector class gets the illusion of a pure MPI program, since the new vector has to be added only at a single place and only a few functions querying values from the vector (e.g., read_dof_values and distribute_local_to_global in Figure 6) have to be specialized (oblivious to the user).

We provide two operation modes:

• In the buffered mode (see Figure 8a), memory is allocated also for ghost values owned by the same compute node; these ghost values are updated directly via memcpy, without an intermediate step via MPI. This mode is necessary if ghost values are modified, as is the case in face-centric loops. It promises some performance benefit, since data packing/unpacking can be skipped.

• The non-buffered mode (see Figure 8b) does not allocate any redundant memory for ghost values owned by the same compute node. This mode works perfectly with ECL, since ECL is by design free of race conditions. Due to the harmony of ECL and the non-buffered mode of the shared-memory vector, we rely on this vector mode in the rest of this paper.

In order to make this type of hybrid parallelization approach successful, the union of the subpartitions of all processes on a compute node should build a well-formed subpartition on its own with an improved surface-to-volume ratio. We achieve this via blocking within a Cartesian virtual topology (see also Figure 9), e.g., by 48-process blocks with 8 processes in $\vec{x}$-space and with 6 processes in $\vec{v}$-space. Compute nodes are ordered along a z-curve.

As a final remark, we emphasize that a vector built around MPI-3.0 shared-memory features is not limited to high dimensions, but can be used also in lower dimensions; in particular, if not only a single "ghost layer" has to be exchanged (as in the case of an advection operator) but multiple "ghost layers" (as in the case of a Laplace operator discretized with the interior penalty method [26]).
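The allocation pattern described above can be sketched with plain MPI-3.0 calls as follows; this is a simplified illustration (error handling omitted, names assumed), not the hyper.deal vector implementation:

```cpp
// Sketch: allocate a node-shared array and query the base pointers of all
// node-local ranks (simplified; not the hyper.deal vector class).
#include <mpi.h>
#include <vector>

struct SharedArray
{
  MPI_Win               win        = MPI_WIN_NULL;
  double *              local_data = nullptr; // locally owned part
  std::vector<double *> all_data;             // base pointers of every rank on the node
};

SharedArray allocate_shared(const MPI_Comm sm_comm, const MPI_Aint n_local_doubles)
{
  SharedArray array;

  // Each rank contributes its locally owned portion; the memory of all ranks
  // on the node is accessible through the same (non-contiguous) window.
  MPI_Win_allocate_shared(n_local_doubles * sizeof(double), sizeof(double),
                          MPI_INFO_NULL, sm_comm, &array.local_data, &array.win);

  int size = 0;
  MPI_Comm_size(sm_comm, &size);
  array.all_data.resize(size);
  for (int rank = 0; rank < size; ++rank)
    {
      MPI_Aint segment_size = 0;
      int      disp_unit    = 0;
      // Query where the segment of 'rank' starts in this process' address space.
      MPI_Win_shared_query(array.win, rank, &segment_size, &disp_unit,
                           &array.all_data[rank]);
    }
  return array; // to be released later via MPI_Win_free(&array.win)
}
```

With these base pointers, ghost values owned by other ranks on the same node can be read in place (non-buffered mode) instead of being copied into a separate buffer.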
In the following section, we show results of the solution of a high-dimensional scalar transport problem. These results confirm the suitability of the underlying concepts and of the implementation of the library hyper.deal for high orders and high dimensions. Both node-level performance and parallel performance results are shown, including the findings of strong- and weak-scaling analyses with up to 147,456 processes on 3,072 compute nodes.
The setup of the simulations is as follows. We consider the computational domains $\Omega_{\vec{x}}$ and $\Omega_{\vec{v}}$ with the following decomposition of the dimensions $d = d_{\vec{x}} + d_{\vec{v}}$: 2 = 1 + 1, 3 = 2 + 1, 4 = 2 + 2, 5 = 3 + 2, 6 = 3 + 3. The computational domains are initially meshed separately with subdivided $d_{\vec{x}}$-/$d_{\vec{v}}$-dimensional hyperrectangles, with $(2^{l_1}, \ldots, 2^{l_{d_{\vec{x}}}}) \in \mathbb{N}^{d_{\vec{x}}}$ and $(2^{l_{d_{\vec{x}}+1}}, \ldots, 2^{l_{d_{\vec{x}}+d_{\vec{v}}}}) \in \mathbb{N}^{d_{\vec{v}}}$ hexahedral elements in each direction and with a difference in the mesh size of at most two, i.e., for 4D the mesh is refined according to a sequence of levels $(l_1, l_2, l_3, l_4)$. The size of each "$d$ / polynomial degree $k$" configuration is chosen in such a way that the solution vectors do not fit into the cache. To obtain unique Jacobian matrices at each quadrature point ($J = J(\vec{x})$) and to prevent the usage of algorithms explicitly designed for affine meshes, we deform the Cartesian mesh slightly. The velocity $\vec{a}$ in Equation (3) is set constant and uniform over the whole domain ($\vec{a} \ne \vec{a}(t, \vec{x})$).

The measurement data have been gathered either with user-defined timers or with the help of the script likwid-mpirun from the LIKWID suite and with suitable in-code
LIKWID API annotations [31; 33]. The following metrics are used to quantify the quality of the implementations:

• throughput: processed degrees of freedom per time unit,

$$\text{throughput} = \frac{\text{processed DoFs}}{\text{time}} \overset{\text{Eqn. (8)}}{=} \frac{|C| \cdot (k+1)^d}{\text{time}}. \quad (24)$$

(In Subsections 5.2–5.4, we consider the throughput for the application of the advection operator, while a single Runge–Kutta stage, i.e., the evaluation of the advection operator plus vector updates, is considered in Subsection 5.5.)

• performance: maximum number of floating-point operations per second;

• data volume: the amount of data transferred within the memory hierarchy (here we consider the transfer between the L1, L2, and L3 caches as well as the main memory);

• bandwidth: data volume transferred between the levels in the memory hierarchy per time.

Our main objective is to decrease the time-to-solution and to increase the throughput (at constant error, not considered in this paper). The measured quantities "performance", "data volume", and "bandwidth" are useful, since they show how well the given hardware is utilized and how much additional work or memory transfer is performed compared to the theoretical requirements of the mathematical algorithm.

The focus of this study is on the performance of hyper.deal for high-order and high-dimensional problems that have a large working set $(k+1)^d$. The evaluation of this expression for $2 \le k \le 5$ and $2 \le d \le 6$ is given in Table 4. The $k$-$d$ configurations with working-set size $v_{\text{len}} \cdot (k+1)^d < L1$, fitting into the L1 cache (highlighted in italics in Table 4), are expected to show good performance; the $k$-$d$ configurations with working-set size $L1 \le v_{\text{len}} \cdot (k+1)^d < L2$ (highlighted in bold) are expected to be performance-critical, since each sum-factorization sweep might drop out of the cache. These latter configurations are, however, the most relevant with regard to high-order and high-dimensional problems.

Table 4: Degrees of freedom per cell, $(k+1)^d$. The $k$-$d$ configurations with working-set size $v_{\text{len}} \cdot (k+1)^d < L1$ are highlighted in italics and configurations with working-set size $L1 \le v_{\text{len}} \cdot (k+1)^d < L2$ in bold. Hardware characteristics ($v_{\text{len}}$, $L1$, $L2$) are taken from Table 5.

  k \ d      2      3       4       5       6
  2          9     27      81     243     729
  3         16     64     256    1024    4096
  4         25    125     625    3125   15625
  5         36    216    1296    7776   46656

Table 5: Specification of the hardware system used for evaluation with turbo mode enabled. Memory bandwidth is according to the STREAM triad benchmark (optimized variant without read-for-ownership transfer, involving two reads and one write), and GFLOP/s are based on the theoretical maximum at the AVX-512 frequency. The dgemm performance is measured for large square matrices ($m = n = k$), with the dgemm performance and STREAM bandwidth measured from RAM memory.

  system                            Intel Skylake Xeon Platinum 8174
  cores                             2 × 24
  peak performance (dgemm)          4147 GFLOP/s (2920 GFLOP/s)
  memory interface                  DDR4-2666, 12 channels
  STREAM memory bandwidth           205 GB/s
  empirical machine balance         14.3 FLOP/Byte
  L1-/L2-/L3-/MEM size              32 kB / 1 MB / 66 MB (shared) / 96 GB (shared)
  compiler + compiler flags         g++, version 9.1.0, -O3 -funroll-loops -march=skylake-avx512

All performance measurements have been conducted on the SuperMUC-NG supercomputer (https://doku.lrz.de/display/PUBLIC/SuperMUC-NG). Its compute nodes have 2 sockets (each with 24 cores of Intel Xeon Skylake), a measured bandwidth of 205 GB/s to main memory, and the AVX-512 ISA extension so that 8 doubles can be processed per instruction. A detailed specification of the hardware is given in Table 5. An island consists of 792 compute nodes. The maximum network bandwidth per node within an island is 100 GBit/s = 12.5 GB/s due to the fat-tree network topology. Islands are connected via a pruned tree network architecture (pruning factor 1:4).

The library hyper.deal has been configured in the following way: All processes of a node are grouped, and they build blocks of the size of 48 = 8 ×
All processes in these blocks share their values via the shared-memory vector. We use the highest ISA extension, AVX-512, so that 8 cells are processed at once. The Jacobian matrix and its determinant are precomputed for the $\vec{x}$- and $\vec{v}$-space and combined on the fly. The quadrature is based on the Gauss–Legendre formula with $n_q = k + 1$ points. In the following, we refer to this configuration as the "default configuration".

This subsection takes an in-depth look at the cell-local computations in the element-centric evaluation of the advection operator (see Algorithm 2). Cell-local computations do not access non-cached values of neighboring cells, making it easier to reason about the number of sweeps and the working-set size (see the first two working-set sizes in Table 1), which increase with the dimension.

Figure 10: Node-level analysis of the cell integrals of the advection operator for k = 3 and k = 5. (a) Data volume per degree of freedom. (b) Accumulated bandwidth. (c) Floating-point operations per degree of freedom. (d) Throughput.
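To make the sweep counts used in the following analysis concrete, the listing below gives a minimal, self-contained sketch (deliberately unoptimized and not hyper.deal code; all names are invented for this example) of a single sum-factorization basis change: a one-dimensional interpolation matrix is applied once per coordinate direction, so one basis change corresponds to d sweeps over the cell data.

#include <cstddef>
#include <vector>

// Minimal sketch (not hyper.deal code): apply a 1D interpolation matrix S
// (n_q x n_p, row-major) along each of the dim coordinate directions of a
// dim-dimensional coefficient tensor, i.e., one sum-factorization basis change
// consisting of dim one-dimensional sweeps over the data.
template <int dim>
std::vector<double> basis_change(const std::vector<double> &in, // size n_p^dim
                                 const std::vector<double> &S,  // n_q * n_p entries
                                 const int n_p, const int n_q)
{
  std::vector<double> src = in, dst;
  int n_before = 1; // directions already transformed (length n_q each)
  int n_after  = 1; // directions not yet transformed (length n_p each)
  for (int i = 1; i < dim; ++i)
    n_after *= n_p;

  for (int direction = 0; direction < dim; ++direction)
    {
      dst.assign(static_cast<std::size_t>(n_before) * n_q * n_after, 0.0);
      for (int b = 0; b < n_before; ++b)
        for (int q = 0; q < n_q; ++q)
          for (int p = 0; p < n_p; ++p)
            for (int a = 0; a < n_after; ++a)
              dst[(b * n_q + q) * n_after + a] +=
                S[q * n_p + p] * src[(b * n_p + p) * n_after + a];
      src.swap(dst);
      n_before *= n_q;
      if (direction + 1 < dim)
        n_after /= n_p;
    }
  return src; // size n_q^dim: values at the tensor-product quadrature points
}

For the default configuration with $n_q = k + 1$, e.g., k = 3 and d = 6, each such sweep touches all $4^6 = 4096$ coefficients of a cell, which is exactly the working-set size discussed above.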
We first consider all steps in Algorithm 2 related to the cell integrals (lines 1-4, 12, 13), skipping the loops over faces and ignoring the flux computation. During the cell integrals, values are read from the global vector, a basis change to the Gauss–Legendre quadrature points is performed (with ≈ d data sweeps for reading and ≈ d for writing), the values obtained are multiplied with the velocities at the quadrature points (≈ d) and tested by the gradient of the collocation functions (≈ d), the inverse mass matrix is applied (≈ d), and finally the results are written back to the global vector. A total of ≈ d data sweeps is necessary if reading and writing are counted separately. The working set of sum factorization is $v_\mathrm{len} \cdot (k+1)^{d_{\vec{x}}+d_{\vec{v}}}$, and the working set of the intermediate values is $v_\mathrm{len} \cdot \max(d_{\vec{x}}, d_{\vec{v}}) \cdot (k+1)^{d_{\vec{x}}+d_{\vec{v}}}$. A comparison with the hardware characteristics shows that the working set of sum factorization exceeds the size of the L1 cache for the configurations k = 3/d = 5 and k = 5/d = 4, so that every data sweep has to fetch the data from the L2 cache.

The theoretical considerations made above are supported by the measurement results in Figure 10, which shows the data traffic between the memory-hierarchy levels (data volume per DoF and bandwidth), the floating-point operations per DoF, and the throughput for k = 3 and k = 5 and 2 ≤ d ≤ 6. For the configurations whose sum-factorization working set exceeds the L1 cache (k = 3/d = 5 and k = 5/d = 4), the data has to be fetched from and written back to the L2 cache during every sweep, resulting in a data traffic between the L1 and L2 caches that increases linearly with the number of sweeps. The constant offset of 23 double/DoF is mainly related to the access to the global solution vector. However, data also has to be loaded from the L2 cache for smaller working-set sizes than the ones of the k-d configurations mentioned above; the main reason for this is that the intermediate values do not fit into the cache any more.

For even higher dimensions, the L3 cache and the main memory have to be accessed during the sweeps. While this is negligible for k = 3 (see Figure 10d), it is performance-limiting in the case of k = 5: For k = 5/d = 5, the bandwidth to the L1 cache is limited by the access to the L2 cache (see Figure 10b); for k = 5/d = 6, it is even limited by the main memory. In the latter case, the caches are hardly utilized any more and the data has to be fetched from/written back to main memory during every sweep, leading to a bandwidth close to the values measured for the STREAM benchmark, but also resulting in a significant performance drop. Figure 10c also shows the number of floating-point operations performed per degree of freedom. It increases linearly with the dimension d, with higher polynomial degrees requiring more work. It is clear that the arithmetic intensity will also increase linearly as long as the data stays in the cache (see also Subsection 5.3).

Figure 11: Full operator evaluation without loading values from neighboring cells. (a) Data volume per degree of freedom. (b) Accumulated bandwidth. (c) Floating-point operations per degree of freedom. (d) Throughput.

In this subsection, we consider all computation steps in Algorithm 2, but ignore the data access to neighboring cells (line 7). This means that face values from neighboring cells are not gathered and face buffers for exterior values are left unchanged.
In this way, we are able to demonstrate the effects of the increased working sets (of both face buffers) and of the increased number of sweeps. Additional ≥ d sweeps have to be performed for interpolating values from the cell quadrature points to the quadrature points of the 2d faces (as well as for interpolating the values of the neighboring cells onto the quadrature points and for the flux computation).

Figure 11 shows the data traffic between the memory-hierarchy levels (data volume per DoF and bandwidth), the floating-point operations per DoF, and the throughput for k = 3 and k = 5 and 2 ≤ d ≤ 6.
In comparison to the results of the experiments that only consider the cell integrals in Figure 10, the following observations can be made: As expected, the data volume transferred between the cache levels (Figures 10a and 11a) and the number of floating-point operations (Figures 10c and 11c) approximately double. However, the configurations at which the traffic to the next cache level increases have not changed, indicating that the increase in the working set is not limiting the performance here.

The doubling of the data volume to be transferred for k = 5 and high dimensions naturally leads to half the throughput (see Figure 11d). In the case of k = 3, we can also observe a drop of performance in high dimensions. This has another cause: The memory transfer between the L1 and L2 caches is up to 2,400 GB/s and between the L2 and L3 caches about 400 GB/s. The latter value is half of that observed for the cell-integral-only run, indicating that the data for the computation is mainly delivered from the L2 cache. This leads to an under-utilization of the L3 cache and of the main-memory bandwidth, resulting in a drop of the overall performance by 50% for high dimensions.

Figure 12: Node-level analysis of the application of the full advection operator (k = 2, 3, 4, 5; x-axis: DoFs per cell). (a) Throughput. (b) Performance. (c) Data volume per degree of freedom. (d) Accumulated bandwidth.

This subsection considers the application of the full advection operator as shown in Algorithm 2, including the access to neighboring cells during the computation of the numerical flux. Figure 12 presents the results of parameter studies of the dimension 2 ≤ d ≤ 6 and of the polynomial degree 2 ≤ k ≤ 5. The number of degrees of freedom per cell, $(k+1)^d$, is utilized as the x-axis. This is done because the working set of this size is a suitable indicator of the overall performance of the operator evaluation (see Subsection 5.2). Moreover, results from parameter studies of the polynomial degree k become comparable to results obtained from parameter studies of the dimension d.

The following observations can be made in Figure 12: For working sets that fit into the L1 cache, a higher polynomial degree leads to a higher throughput. For working sets exceeding the size of the L1 cache and only fitting into the L2 cache, the throughput drops. In this region, the curves overlap, and we conclude that the throughput is indeed a function of the working-set size ≈ $(k+1)^d$ and independent of the polynomial degree k and the dimension d individually.

Table 6: Relative performance due to the flux computation (ratio of Figures 12 and 10).

k \ d     2     3     4     5     6
3        58%   58%   51%   45%   41%
4        66%   60%   53%   41%   54%
5        68%   66%   54%   54%   52%

Table 7: Relative performance due to the access to the values of neighboring cells (ratio of Figures 12 and 11).

k \ d     2     3     4     5     6
3        73%   77%   76%   77%   79%
4        79%   75%   77%   67%   80%
5        82%   83%   78%   71%   91%
Figure 13: Roofline model for k = 3 and k = 5 (performance in GFLOP/s over the FLOP/Byte ratio, with ceilings given by the peak double-precision performance and by the memory and cache bandwidths).

In this model, the measured performance is plotted over the measured arithmetic intensity
$$(\text{measured arithmetic intensity})_i = \frac{\text{measured performance}}{(\text{measured bandwidth})_i}, \qquad (25)$$
with $i \in \{\mathrm{L1{\leftrightarrow}L2},\ \mathrm{L2{\leftrightarrow}L3},\ \mathrm{Caches{\leftrightarrow}MEM}\}$. We can compute the arithmetic intensity of each level of the memory hierarchy, as we measure the corresponding bandwidth with LIKWID. The diagram confirms the observation made above: A high arithmetic intensity, and consequently high performance, can only be reached if the caches (L1 and L2) are utilized well. As the working sets get too large, the L1 and L2 caches are under-utilized, the arithmetic intensity on the other levels drops, and new hard (bandwidth) ceilings limit the maximum possible performance.
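As a purely illustrative reading of Eq. (25) (the numbers below are invented for the example and are not measurements from Figure 13): suppose a run sustains 500 GFLOP/s while transferring 2,000 GB/s between the L1 and L2 caches and 100 GB/s between the caches and the main memory. Then
$$(\text{AI})_{\mathrm{L1\leftrightarrow L2}} = \frac{500\ \mathrm{GFLOP/s}}{2000\ \mathrm{GB/s}} = 0.25\ \mathrm{FLOP/Byte}, \qquad (\text{AI})_{\mathrm{Caches\leftrightarrow MEM}} = \frac{500\ \mathrm{GFLOP/s}}{100\ \mathrm{GB/s}} = 5\ \mathrm{FLOP/Byte},$$
i.e., the same run appears at a different position of the roofline diagram for each memory level, and the bandwidth of a level becomes the limiting ceiling exactly when the arithmetic intensity measured at that level falls below the machine balance of that level.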
In the following subsection, we compare the performance of the default setup of the library hyper.deal (tensor product of mappings, ECL, Gauss quadrature, vectorization over elements with the highest ISA extension; see also Subsection 5.1) with the performance of alternative algorithms and/or configurations.

The library hyper.deal has been developed to be able to compute efficiently on complex geometries both in configuration and velocity space. This is achieved by combining the Jacobians of the mappings of the individual spaces on the fly. The upper limit of the performance of this approach is given by the tensor product of two Cartesian grids, which leads to the same constant and diagonal Jacobian matrix at all quadrature points. As a lower limit, one can consider the case that each quadrature point has a unique Jacobian of size d × d. Figure 14a shows that the behavior of the default tensor-product setup is similar to that of a pure Cartesian-grid simulation, with only a small average performance penalty of approx. 4%. This observation matches our expectations expressed in Subsection 3.3 and means that in high dimensions the evaluation of curved meshes via the tensor-product factors is essentially for free compared to storing the full Jacobian matrices.

In the default configuration of hyper.deal, we have been using element-centric loops (ECL). They show significantly better performance than face-centric loops (FCL), as shown in Figure 14b. We have not implemented any advanced blocking schemes (as are available in deal.II) for either ECL or FCL. The fact that ECL nevertheless shows better performance demonstrates the natural cache-friendliness of ECL. Furthermore, ECL is well suited for shared-memory computation, and a single communication step is sufficient.

Figure 14: (a)-(c) Comparison of the performance of different algorithms and the default configuration for k = 3 and k = 5; (d) comparison of the performance of different (SIMD) ISA extensions and the auto-vectorization for k = 3 and k = 4. Panels: (a) mapping strategies (Cartesian/full/tensor); (b) face-centric loop (FCL/ECL); (c) collocation rule (collocation/Gauss quadrature); (d) (SIMD) ISA extension (auto/SSE2/AVX2/AVX-512).

Table 8: Partitioning of the $\vec{x}$- and $\vec{v}$-triangulations on up to 3,072 nodes with 48 cores each: for each node count (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1,024, 2,048, 3,072), the corresponding process counts $p_{\vec{x}}$ and $p_{\vec{v}}$.

Table 9: Strong and weak scaling configurations: left) d = 4/k = 5 (DoF counts and mesh configurations, e.g., 5.4 GDoFs); right) d = 6/k = 3 (DoF counts and mesh configurations, e.g., 2.1 GDoFs).

The benefit of ECL decreases for high dimensions due to the increased number of sweeps, related to the repeated evaluation of the flux terms. Nevertheless, we propose to use ECL for high-dimensional problems because of its suitability for shared-memory computations, which reduce the allocated memory.

We favor the Gauss–Legendre quadrature method over the collocation methods due to its higher numerical accuracy. This benefit comes at the price of a basis change from the Gauss–Lobatto points to the Gauss quadrature points and vice versa. Figure 14c shows a performance drop of on average 15% due to these basis changes, as long as the data to be interpolated remains in cache.

In the library hyper.deal, we currently only support "vectorization over elements".
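Because hyper.deal vectorizes over elements, the cache footprint of one cell batch grows with the number of SIMD lanes. The following small helper (illustrative only; it is not part of the hyper.deal interface, and the function name is made up for this example) quantifies that trade-off, which is the reason why reducing the lane count can pay off in high dimensions, as discussed next.

#include <cstddef>

// Illustrative helper (not the hyper.deal interface): with "vectorization over
// elements", a batch of n_lanes cells is processed together, so that one
// sum-factorization sweep touches n_lanes * (k+1)^d values. Making the lane
// count a compile-time parameter lets the user trade SIMD width against the
// cache pressure of a cell batch.
template <std::size_t n_lanes>
constexpr std::size_t batch_working_set_bytes(const std::size_t k, const std::size_t d)
{
  std::size_t dofs_per_cell = 1;
  for (std::size_t i = 0; i < d; ++i)
    dofs_per_cell *= k + 1;
  return n_lanes * dofs_per_cell * sizeof(double);
}

// k = 3, d = 6: 8 AVX-512 lanes need 8 * 4^6 * 8 B = 256 kB per sweep (well
// beyond the 32 kB L1 cache), whereas 2 SSE2 lanes need only 64 kB.
static_assert(batch_working_set_bytes<8>(3, 6) == 256 * 1024, "AVX-512 batch");
static_assert(batch_working_set_bytes<2>(3, 6) == 64 * 1024, "SSE2 batch");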
As a default, the highest instruction-set extension is selected, i.e., the maximum number of cells is processed at once by a core. Since the number of lanes to be used is templated, the user can reduce the number of elements that are processed at once, as demonstrated in Figure 14d. It can be observed that, in general, the usage of higher instruction-set extensions leads to a better throughput. However, once the working set of a cell batch exceeds the size of the cache, a performance drop can be observed. As a consequence, in 6D with cubic elements the throughputs of AVX-512 and AVX2 are comparable, and in 6D with quartic elements SSE2 shows the best performance.

In this subsection, we have demonstrated that the chosen default configuration of the library hyper.deal has a competitive throughput compared to alternative algorithms that are less memory- and compute-intensive but numerically inferior.

In this subsection, we examine the parallel efficiency of the library hyper.deal. For this study, we consider the advection operator embedded into a low-storage Runge–Kutta scheme of order 4 with 5 stages, which uses two auxiliary vectors besides the solution vector [21]. Of these three vectors, only one (auxiliary) vector is ghosted.

Figure 15 shows strong- and weak-scaling results of runs on SuperMUC-NG with up to 3,072 nodes and a total of 147,456 cores. We consider two configurations: "d = 4/k = 5", an easy configuration, and "d = 6/k = 3", a demanding configuration. As examples, we present for each configuration two strong-scaling and two weak-scaling curves (see Table 9). Table 8 shows the considered process decomposition $p = p_{\vec{x}} \cdot p_{\vec{v}}$, which has not been optimized for the given mesh configurations.

For the "d = 4/k = 5" configuration, we observe excellent weak-scaling behavior with parallel efficiencies of 92% and 89% for up to 2,048 nodes on the large and small setup, respectively. We get more than 70%/80% efficiency for strong scaling up to an increase in the number of nodes by a factor of 128/256. For the "d = 6/k = 3" configuration, we see parallel efficiencies of 38%/56% for weak scaling. These values are lower than the ones in the 4D case; however, they are still very good in light of the immense communication volume in the 6D case: As shown in Figure 16, the ghost data amount to 40% of the solution vector in 6D and only 5% in 4D.

The largest simulation with a curved mesh, which contains 128 · … = 2.… · 10^… degrees of freedom, reaches a total throughput of 1.2 PDoFs/s (= 1.2 · 10^15 DoFs/s) on 2,048 compute nodes and 1.7 PDoFs/s on 3,072 compute nodes. This means that each Runge–Kutta stage is processed in …
Figure 15: Strong and weak scaling of one Runge–Kutta step with the advection operator as right-hand side, with up to 147,456 processes on 3,072 compute nodes (time per Runge–Kutta stage [s] over the number of nodes, × 48 CPUs). (a) Configuration "d = 4/k = 5"; (b) configuration "d = 6/k = 3", including the alternative partitioning (see Fig. 17).
Figure 16: Memory consumption for 48 × … processes (d = 6/k = 3, 1.… · 10^… DoFs). (a) d = 4/k = 5 with 1.… · 10^… DoFs; (b) d = 6/k = 3 with 1.… · 10^… DoFs.

A total of 34.4 TB of main memory out of the available 98 TB is used. The largest amount of memory is occupied by the three solution vectors (23.2% each). The two buffers for MPI communication occupy 9.4% each; one of the buffers is attributed to the ghost-value section of the vector called tmp. The remaining data structures, which include, inter alia, the mapping data, occupy only a small share (11.6%) of the main memory, illustrating the benefit of the tensor-product approach employed by the library hyper.deal in constructing a memory-efficient algorithm for arbitrarily complex geometries in high dimensions. As a reference, the memory consumption of a 4D simulation is shown in Figure 16a. Obviously, the additional memory consumption due to the precomputed mapping slightly reduces the largest possible problem size fitting onto a certain number of nodes. For example, the largest simulation of Figure 15, with 4.4 trillion DoFs, only fits into memory on 3,072 compute nodes for a Cartesian mesh.

Finally, we discuss the drop in parallel efficiency of the weak-scaling runs of the large-scale simulations. For this, we have slightly modified the setup: We start from a configuration of 8 cells per direction with k = 3 on one node (32 degrees of freedom in each direction). When doubling the number of processes, we double the number of cells in one direction, proceeding from direction 1 to direction 6 (and starting over at direction 1). Each time we double the number of cells along an $\vec{x}$-direction, we also double the number of processes $p_{\vec{x}}$, keeping $p_{\vec{v}}$ constant, and vice versa. In this way, the number of DoFs per process along $\vec{x}$ and $\vec{v}$ as well as the number of ghost degrees of freedom per process remain constant once all cells at the boundary have periodic neighbors residing on other nodes (number of nodes ≥ $d_{\vec{x}} + d_{\vec{v}}$). As a consequence, the computational work load and the communication amount of each node are constant. The total time per Runge–Kutta stage is shown in Figure 15 with dashed lines.

The total communication amount of the considered setup increases linearly with the number of processes, as shown in Figure 17a. The measurements in Figure 17b show that the network can handle this increase: the data can be sent with a constant network bandwidth of 10 GB/s per node, which
is close to the theoretical 100 Gbit/s as long as the job stays within one island, thanks to the fat-tree network topology, and drops to a smaller bandwidth once the job stretches over multiple islands due to the pruned-tree network architecture. In the latter case, we observed a bandwidth of 6 GB/s, which is related to the fact that only a small fraction of the messages crosses island boundaries.

Figure 17: Weak scaling of a single Runge–Kutta stage of the solution of the advection equation. (a) Total communication data volume. (b) Accumulated network bandwidth. (c) Averaged runtime distribution (stacked time per stage for: ghost update (irecv+pack+isend), ghost update (waitall), element-centric loop, Runge–Kutta update, total). (d) Averaged minimum and maximum runtime. (e) Histograms of the step runtime (64 nodes).

Figures 17c and 17d show the time spent in different sections (in the following referred to as steps) of the advection operator. We consider the following four steps:
(1) Start a ghost-value update by calling MPI_Irecv as well as packing and sending (via MPI_Isend) messages to each neighboring process residing on remote compute nodes.
(2) Finish the ghost-value update by waiting (with MPI_Waitall) until all messages have been sent and received.
(3) Evaluate the right-hand-side term of the advection equation by performing an element-centric loop over the locally owned cells.
(4) Perform the remaining Runge–Kutta update steps.
Besides the averaged times, the minimum and maximum times encountered on any process are shown for each step. The times have been averaged over all Runge–Kutta stages.

Not surprisingly, the imbalance in the element-centric loop and in the Runge–Kutta update is small; it can be explained by the access to shared resources (mainly RAM). Furthermore, the share of communication in the overall run time increases with an increasing number of nodes: in particular, the time increases for low node counts, while new periodic neighbors are still being added, and for high node counts, when communication to other islands is required.

What is surprising is that the minimum and the maximum time spent in updating the ghost values differ significantly; the difference even increases with an increasing number of nodes. The detailed analysis of the histograms of the times on the different processes, presented in Figure 17e, shows that the communication times follow approximately a Gaussian distribution, resulting in a few processes finishing much earlier and a few much later. Overall, however, this imbalance caused by the MPI communication is not performance-hindering, as demonstrated by the fact that a significant portion of the network bandwidth is used. Nevertheless, one should keep this imbalance in mind and not attribute it accidentally to other sections of the code.
We now study the Vlasov–Poisson equations as an example of an application scenario for our library. We consider the Vlasov equation for electrons in a neutralizing background in the absence of magnetic fields,
$$\frac{\partial f}{\partial t} + \begin{pmatrix} \vec{v} \\ -\vec{E}(t,\vec{x}) \end{pmatrix} \cdot \nabla f = 0, \qquad (26a)$$
where the electric field can be obtained from the Poisson problem
$$\rho(t,\vec{x}) = 1 - \int f(t,\vec{x},\vec{v}) \,\mathrm{d}\vec{v}, \qquad -\nabla^2_{\vec{x}}\, \phi(t,\vec{x}) = \rho(t,\vec{x}), \qquad \vec{E}(t,\vec{x}) = -\nabla_{\vec{x}}\, \phi(t,\vec{x}). \qquad (26b)$$
For the solution of this problem, we use the low-storage Runge–Kutta method of order 4 with 5 stages. Each stage contains the following five steps for evaluating the right-hand side:
(1) Compute the degrees of freedom of the charge density via integration over the velocity space.
(2) Compute the right-hand side of the Poisson equation.
(3) Solve the Poisson equation for φ.
(4) Compute $\vec{E}$ from φ.
(5) Apply the advection operator.
Step (5) is a $(d_{\vec{x}} + d_{\vec{v}})$-dimensional problem, and Steps (2)-(4) are $d_{\vec{x}}$-dimensional problems. Step (1) reduces information from the phase space to the configuration space.
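Step (1) is, as described further below, realized by an all-reduction over the processes that share the same $\vec{x}$-partition. The following self-contained toy program (not hyper.deal code; the process-grid extent and all names are invented for this example) sketches this communication pattern with plain MPI:

#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // assume a p_x x p_v process grid with p_x = 2 (requires an even process count)
  const int p_x  = 2;
  const int my_x = rank % p_x; // x-coordinate in the process grid
  const int my_v = rank / p_x; // v-coordinate in the process grid

  // "column" communicator: all processes with the same x-partition
  MPI_Comm comm_column;
  MPI_Comm_split(MPI_COMM_WORLD, /*color=*/my_x, /*key=*/my_v, &comm_column);

  // dummy partial velocity integral of f on the x-DoFs owned by this column
  std::vector<double> rho_local(100, 1.0 * my_v), rho(100, 0.0);

  // sum the partial integrals so that rho is available on every process of the column
  MPI_Allreduce(rho_local.data(), rho.data(), static_cast<int>(rho.size()),
                MPI_DOUBLE, MPI_SUM, comm_column);

  MPI_Comm_free(&comm_column);
  MPI_Finalize();
  return 0;
}

In hyper.deal itself, the partial sums are the velocity-space quadrature contributions of the locally owned cells, and the resulting ρ is afterwards available on every process of the column, as required for the subsequent Poisson solve.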
Figure 18: Weak scaling of a single Runge–Kutta stage of the solution of the Vlasov–Poisson equations (time per stage for Steps 1-5 and the total, over the number of nodes, × 48 CPUs). The average (dashed) as well as the averaged minimum and averaged maximum runtime of each step are indicated.
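The paragraph below explains how the phase-space transport velocity is composed on the fly from low-dimensional data; the following minimal sketch (a hypothetical helper, not the hyper.deal interface) illustrates the idea for a single quadrature point, assuming $d_{\vec{x}} = d_{\vec{v}} = \mathrm{dim}$:

#include <array>

// Compose a = (v, -E(t,x)) at one quadrature point from the velocity-space
// coordinate v (known from the v-grid) and the precomputed electric field
// E(t,x), instead of storing a full (d_x + d_v)-dimensional velocity field.
template <int dim>
std::array<double, 2 * dim>
compose_transport_velocity(const std::array<double, dim> &v_point,
                           const std::array<double, dim> &E_at_x)
{
  std::array<double, 2 * dim> a{};
  for (int i = 0; i < dim; ++i)
    {
      a[i]       = v_point[i]; // x-part of the transport velocity is v
      a[dim + i] = -E_at_x[i]; // v-part is the acceleration -E(t, x)
    }
  return a;
}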
The advection step (Step (5)) relies on the operator analyzed in Section 5. The constant velocity field $\vec{a}$ is replaced by the function $\vec{a}(t,\vec{x},\vec{v})^\top = \bigl(\vec{v}^\top, -\vec{E}(t,\vec{x})^\top\bigr)$. The following two points should be noted: Firstly, $\vec{v}$ evaluated at a quadrature point corresponds exactly to a quadrature point in the $\vec{v}$-domain and can hence be queried from a low-dimensional FEM library. Similarly, $\vec{E}$ is independent of the velocity $\vec{v}$ and can be precomputed once at each Runge–Kutta stage for all quadrature points in the $\vec{x}$-space. Exploiting these relations, we never compute the $(d_{\vec{x}} + d_{\vec{v}})$-dimensional velocity field, but compose the $\vec{x}$- and $\vec{v}$-information on the fly, just as we did in the case of the mapping. Since the data to be loaded per quadrature point is negligible (see also the reasoning regarding the Jacobian matrices in Subsection 3.3), the throughput of the advection operator is only weakly affected by the need to evaluate the velocity field at every quadrature point.

For the solution of the Poisson problem
$$(\nabla_{\vec{x}} \psi, \nabla_{\vec{x}} \phi)_{\vec{x}} = (\psi, \rho)_{\vec{x}}, \qquad (27)$$
we utilize a matrix-free geometric multigrid solver [14; 28] from deal.II, which uses a Chebyshev smoother [1]. The Poisson problem is solved by each process group with a constant velocity grid (see comm_row in Subsection 3.2). As a result, the solution φ is available on each process without the need for an additional broadcast step. The integration of f over the velocity space is implemented via an MPI_Allreduce operation over all processes with the same $\vec{x}$-grid partition (i.e., comm_column), so that the resulting ρ is available on all processes. We have verified our implementation with a simulation of the Landau damping problem as in [22].

We perform a weak-scaling experiment for the 6D Vlasov equation, starting from a configuration of 8 cells per direction with k = 3 on one node. When doubling the number of processes, we double the number of cells in one direction, proceeding from direction 1 to direction 6 (and starting over at direction 1). Each time we double the number of cells along an $\vec{x}$-direction, we also double the number of processes $p_{\vec{x}}$, keeping $p_{\vec{v}}$ constant, and vice versa. In this way, the number of DoFs per process along $\vec{x}$ and $\vec{v}$ remains constant.

Figure 18 shows the scaling of Steps 1-5 of one Runge–Kutta stage. We can see that the total computing time is dominated by the 6D advection step, which we have analyzed earlier. Step 1, which reduces f to ρ, becomes increasingly important as the problem size and the parallelism increase. In this step, an all-reduction is performed over the process groups with constant $p_{\vec{x}}$ coordinate in the process grid (called comm_column in Subsection 3.2). The amount of data sent by each process corresponds to the number of DoFs in the $\vec{x}$-direction of this process and is thus the same in every experiment. The total amount of data sent/received is thus proportional to the total number of processes, while the size of the subcommunicators, and hence the number of reduction steps, only depends on $p_{\vec{v}}$ ($\log(p_{\vec{v}})$). The scaling experiment shows that the time needed by Step 1 generally increases with the number of nodes and that this step has the worst scaling behavior.
This can be explained by the fact that this step contains a global communication within the subcommunicators, while the other steps only contain nearest-neighbor communication. We also note that the process grid is designed to optimize the communication scheme of the advection step. For Step 1, this means that each subcommunicator might involve all nodes.

Steps 2 to 4 are three-dimensional problems that are mostly negligible; only the Poisson solver (Step 3) has some impact on the total computing time. Let us note that the 3D parts are solved $p_{\vec{v}}$ times, on all subcommunicators with constant $p_{\vec{v}}$ coordinate (called comm_row in Subsection 3.2). Between the first and the fourth data point as well as between the seventh and the tenth data point, the size of the Poisson problem increases, which results in a slight increase of the computing time. Between data points 4 and 7 as well as 10 and 11, the problem size stays constant and only the number of process groups performing these operations increases, which is why the CPU time stays approximately constant between these data points.

We have presented the finite-element library hyper.deal. It efficiently solves high-dimensional partial differential equations on complex geometries with high-order discontinuous Galerkin methods. It constructs a high-dimensional triangulation via the tensor product of distributed triangulations of dimension up to three from the low-dimensional FEM library deal.II and solves the given partial differential equation with sum-factorization-based matrix-free operator evaluation. To reduce the memory consumption and the communication overhead, we use a new vector type, which is built around the shared-memory features of MPI-3.0.

We have compared the node-level performance with alternative algorithms, which are specialized for Cartesian and affine meshes or use collocation integration schemes. These studies revealed that loading less mapping data and reducing the number of sweeps are beneficial for improving the performance. We plan to look into the benefits of computing the low-dimensional mapping information on the fly and of increasing the cache locality during sum-factorization sweeps by a suitable hierarchical, cache-oblivious blocking strategy.

Furthermore, we have studied the reduction of the working set of "vectorization over elements" by processing fewer cells in parallel than the SIMD lanes would allow. Since we observed a benefit of this approach for 6D and polynomial orders higher than three, we intend to investigate "vectorization within an element" as an alternative vectorization approach in the future.

In this work, we have addressed neither the overlapping of communication and computation nor graph-based partitioning of the mesh. Both points will presumably mitigate the scaling limits due to the huge amount of communication that we have encountered, especially in 6D.
APPENDIX
The following code snippets give implementation details of the new vector class, which is built around MPI-3.0 features (see Section 4). A new MPI communicator comm_sm, which consists of the processes of the communicator comm that have access to the same shared memory, can be created via:
MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &comm_sm);
We recommend creating this communicator only once globally during setup and passing it to the vector. The following code snippet shows a simplified vector class. It is a container for the (local and ghosted) data and for the partitioner class, and it also provides linear-algebra operations (not shown here):
class Vector
{
  Vector(std::shared_ptr<Partitioner> partitioner)
    : data_this((T *)malloc(0))
  {
    this->partitioner = partitioner;
    this->partitioner->initialize_dof_vector(data_this, data_others, win);
  }

  void clear()
  {
    MPI_Win_free(&win); // free the shared-memory window
  }

  void update_ghost_values_start() const
  {
    partitioner->update_ghost_values_start(this->data_this, this->data_others);
  }

  void update_ghost_values_finish() const
  {
    partitioner->update_ghost_values_finish(this->data_this, this->data_others);
  }

  std::shared_ptr<Partitioner> partitioner;
  MPI_Win                      win;
  mutable T *                  data_this;   // locally owned and ghosted data of this process
  mutable std::vector<T *>     data_others; // pointers to the data of the other processes on the node
};

The partitioner class stores the information on which cells the current process owns and which ghost faces it needs. Based on this information, it is able to determine which cells/faces are shared and to precompute the communication patterns with other compute nodes. As a consequence, the partitioner class is able to allocate the memory of the vector class:
class Partitioner
{
  void reinit(const vector<unsigned int> &                     local_cells,
              const vector<pair<unsigned int, unsigned int>> & ghost_faces)
  {
    // based on the IDs of the local cells and ghost faces, set up the access to
    // the shared memory and the communication patterns to other compute nodes,
    // and allocate the buffers
  }

  void initialize_dof_vector(T *&data, vector<T *> &data_others, MPI_Win &win)
  {
    // configure shared memory
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "alloc_shared_noncontig", "true");

    // allocate shared memory
    MPI_Win_allocate_shared((_local_size + _ghost_size) * sizeof(T), sizeof(T),
                            info, comm_sm, &data, &win);

    // get pointers to the shared memory of all processes on the node
    data_others.resize(size_sm);
    for (int i = 0; i < size_sm; i++)
      {
        int      disp_unit;
        MPI_Aint ssize;
        MPI_Win_shared_query(win, i, &ssize, &disp_unit, &data_others[i]);
      }
    data = data_others[rank_sm];
  }

  void update_ghost_values_start(T *&data_this, vector<T *> &data_others) const
  {
    MPI_Barrier(comm_sm); // synchronize the processes of the shared-memory communicator
    // start the communication with other nodes and fill the buffers (not shown)
  }

  void update_ghost_values_finish(T *&data_this, vector<T *> &data_others) const
  {
    // finish the communication with other nodes (not shown)
    MPI_Barrier(comm_sm); // synchronize the processes of the shared-memory communicator
  }

  const MPI_Comm &comm;    // global communicator
  const MPI_Comm &comm_sm; // shared-memory communicator
  int             rank_sm, size_sm;
};

The partitioner class only has to be created once and can be shared by multiple vectors. In our simulations, it was shared by three vectors, with only one vector being ghosted.
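For completeness, the following self-contained toy program (a sketch for illustration only; it is not part of hyper.deal) combines the pieces above: it creates the shared-memory communicator, allocates a shared window, queries the pointers of all node-local processes, and accesses a neighbor's data after a barrier.

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  MPI_Comm comm = MPI_COMM_WORLD, comm_sm;
  int rank;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &comm_sm);

  int rank_sm, size_sm;
  MPI_Comm_rank(comm_sm, &rank_sm);
  MPI_Comm_size(comm_sm, &size_sm);

  // allocate 1000 doubles per process in a (possibly non-contiguous) shared window
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "alloc_shared_noncontig", "true");

  double *data = nullptr;
  MPI_Win win;
  MPI_Win_allocate_shared(1000 * sizeof(double), sizeof(double), info, comm_sm,
                          &data, &win);

  // every process can obtain direct pointers to the segments of its node neighbors
  std::vector<double *> data_others(size_sm);
  for (int i = 0; i < size_sm; ++i)
    {
      MPI_Aint ssize;
      int      disp_unit;
      MPI_Win_shared_query(win, i, &ssize, &disp_unit, &data_others[i]);
    }

  data[0] = rank_sm;    // write the own segment
  MPI_Barrier(comm_sm); // synchronize before reading the neighbors' data
  std::printf("rank %d reads %f from local rank 0\n", rank_sm, data_others[0][0]);

  MPI_Win_free(&win);
  MPI_Info_free(&info);
  MPI_Finalize();
  return 0;
}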
References

[1] Mark Adams, Marian Brezina, Jonathan Hu, and Ray Tuminaro. 2003. Parallel multigrid smoothing: polynomial versus Gauss–Seidel. Journal of Computational Physics 188 (2003), 593–610. DOI: http://dx.doi.org/10.1016/S0021-9991(03)00194-3
[2] Martin S. Alnæs, Jan Blechta, Johan Hake, August Johansson, Benjamin Kehlet, Anders Logg, Chris Richardson, Johannes Ring, Marie E. Rognes, and Garth N. Wells. 2015. The FEniCS Project Version 1.5. Archive of Numerical Software 3, 100 (2015), 9–23. DOI: http://dx.doi.org/10.11588/ans.2015.100.20553
[3] Giovanni Alzetta, Daniel Arndt, Wolfgang Bangerth, Vishal Boddu, Benjamin Brands, Denis Davydov, Rene Gassmoeller, Timo Heister, Luca Heltai, Katharina Kormann, Martin Kronbichler, Matthias Maier, Jean-Paul Pelteret, Bruno Turcksin, and David Wells. 2018. The deal.II Library, version 9.0. Journal of Numerical Mathematics 26, 4 (2018), 173–183. DOI: http://dx.doi.org/10.1515/jnma-2018-0054
[4] Robert Anderson, Julian Andrej, Andrew Barker, Jamie Bramwell, Jean-Sylvain Camier, Jakub Cerverny, Johann Dahm, Veselin Dobrev, Yohann Dudouit, Aaron Fisher, Tzanio Kolev, Will Pazner, Mark Stowell, Vladimir Tomov, David Medina, and Stefano Zampini. 2020. MFEM: A Modular Finite Element Library. arXiv preprint (2020).
[5] Computers and Mathematics with Applications, submitted (2020).
[6] Markus Bachmayr, Reinhold Schneider, and André Uschmajew. 2016. Tensor Networks and Hierarchical Tensors for the Solution of High-Dimensional Partial Differential Equations. Foundations of Computational Mathematics 16, 6 (2016), 1423–1472. DOI: http://dx.doi.org/10.1007/s10208-016-9317-9
[7] ACM Trans. Math. Software 38, 2 (2011), 1–28. DOI: http://dx.doi.org/10.1145/2049673.2049678
[8] Peter Bastian, Christian Engwer, Jorrit Fahlke, Markus Geveler, Dominik Göddeke, Oleg Iliev, Olaf Ippisch, René Milk, Jan Mohring, Steffen Müthing, Mario Ohlberger, Dirk Ribbrock, and Stefan Turek. 2016. Hardware-Based Efficiency Advances in the EXA-DUNE Project. In Software for Exascale Computing – SPPEXA 2013–2015, Hans-Joachim Bungartz, Peter Neumann, and Wolfgang E. Nagel (Eds.). Springer International Publishing, Cham, 3–23.
[9] Gheorghe-Teodor Bercea, Andrew T. T. McRae, David A. Ham, Lawrence Mitchell, Florian Rathgeber, Luigi Nardi, Fabio Luporini, and Paul H. J. Kelly. 2016. A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake. arXiv preprint (2016).
[10] SIAM Journal on Scientific Computing 33, 3 (2011), 1103–1133. DOI: http://dx.doi.org/10.1137/100791634
[11] Denis Davydov, Jean-Paul Pelteret, Daniel Arndt, Martin Kronbichler, and Paul Steinmann. 2020. A matrix-free approach for finite-strain hyperelastic problems using geometric multigrid. Internat. J. Numer. Methods Engrg., in press (2020). DOI: http://dx.doi.org/10.1002/nme.6336
[12] Andreas Dedner, Robert Klöfkorn, Martin Nolte, and Mario Ohlberger. 2010. A generic interface for parallel and adaptive scientific computing: Abstraction principles and the DUNE-FEM module. Computing 90 (2010), 165–196. DOI: http://dx.doi.org/10.1007/s00607-010-0110-3
[13] Michel O. Deville, Paul F. Fischer, and Ernest H. Mund. 2002. High-Order Methods for Incompressible Fluid Flow. Vol. 9. Cambridge University Press, Cambridge.
[14] Niklas Fehn, Peter Munch, Wolfgang A. Wall, and Martin Kronbichler. 2019. Hybrid multigrid methods for high-order discontinuous Galerkin discretizations. arXiv preprint arXiv:1910.01900 (2019).
[15] Wei Guo and Yingda Cheng. 2016. A Sparse Grid Discontinuous Galerkin Method for High-Dimensional Transport Equations and Its Application to Kinetic Simulations. SIAM Journal on Scientific Computing 38, 6 (2016), A3381–A3409. DOI: http://dx.doi.org/10.1137/16M1060017
[16] Ammar Hakim, Greg Hammett, Eric L. Shi, and Noah Mandell. 2019. Discontinuous Galerkin schemes for a class of Hamiltonian evolution equations with applications to plasma fluid and kinetic problems. arXiv preprint (2019).
[17] ACM Trans. Math. Software 31, 3 (2005), 397–423. DOI: http://dx.doi.org/10.1145/1089014.1089021
[18] James Juno, Ammar Hakim, Jason TenBarge, Eric L. Shi, and William Dorland. 2018. Discontinuous Galerkin algorithms for fully kinetic plasmas. J. Comput. Phys. 353 (2018), 110–147. DOI: http://dx.doi.org/10.1016/j.jcp.2017.10.009
[19] George Karypis and Vipin Kumar. 1998. METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices.
[20] Dominic Kempf, René Hess, Steffen Müthing, and Peter Bastian. 2018. Automatic Code Generation for High-Performance Discontinuous Galerkin Methods on Modern Architectures. arXiv preprint arXiv:1812.08075 (2018).
[21] Christopher A. Kennedy, Mark H. Carpenter, and R. Michael Lewis. 2000. Low-storage, explicit Runge–Kutta schemes for the compressible Navier–Stokes equations. Applied Numerical Mathematics 35, 3 (2000), 177–219. DOI: http://dx.doi.org/10.1016/s0168-9274(99)00141-5
[22] Katharina Kormann, Klaus Reuter, and Markus Rampp. 2019. A massively parallel semi-Lagrangian solver for the six-dimensional Vlasov–Poisson equation. The International Journal of High Performance Computing Applications 33, 5 (2019), 924–947. DOI: http://dx.doi.org/10.1177/1094342019834644
[23] Benjamin Krank, Niklas Fehn, Wolfgang A. Wall, and Martin Kronbichler. 2017. A high-order semi-explicit discontinuous Galerkin solver for 3D incompressible flow with application to DNS and LES of turbulent channel flow. J. Comput. Phys. (2017). DOI: http://dx.doi.org/10.1016/j.jcp.2017.07.039
[24] Martin Kronbichler and Katharina Kormann. 2012. A generic interface for parallel cell-based finite element operator application. Computers and Fluids 63 (2012), 135–147. DOI: http://dx.doi.org/10.1016/j.compfluid.2012.04.012
[25] Martin Kronbichler and Katharina Kormann. 2019. Fast Matrix-Free Evaluation of Discontinuous Galerkin Finite Element Operators. ACM Trans. Math. Softw. 45, 3, Article 29 (2019), 40 pages. DOI: http://dx.doi.org/10.1145/3325864
[26] Martin Kronbichler, Katharina Kormann, Niklas Fehn, Peter Munch, and Julius Witte. 2019. A Hermite-like basis for faster matrix-free evaluation of interior penalty discontinuous Galerkin operators. arXiv preprint arXiv:1907.08492 (2019).
[27] Martin Kronbichler, Svenja Schoeder, Christopher Müller, and Wolfgang A. Wall. 2016. Comparison of implicit and explicit hybridizable discontinuous Galerkin methods for the acoustic wave equation. Internat. J. Numer. Methods Engrg. 106 (2016), 712–739. DOI: http://dx.doi.org/10.1002/nme.5137
[28] Martin Kronbichler and Wolfgang A. Wall. 2018. A Performance Comparison of Continuous and Discontinuous Galerkin Methods with Fast Multigrid Solvers. SIAM Journal on Scientific Computing 40, 5 (2018), A3423–A3448. DOI: http://dx.doi.org/10.1137/16M110455X
[29] J. Markus Melenk, Klaus Gerdes, and Christoph Schwab. 2001. Fully discrete hp-finite elements: fast quadrature. Computer Methods in Applied Mechanics and Engineering (2001). DOI: http://dx.doi.org/10.1016/S0045-7825(00)00322-4
[30] Steven A. Orszag. 1980. Spectral methods for problems in complex geometries. J. Comput. Phys. 37, 1 (1980), 70–92. DOI: http://dx.doi.org/10.1016/0021-9991(80)90005-4
[31] Thomas Roehl, Jan Treibig, Georg Hager, and Gerhard Wellein. 2014. Overhead Analysis of Performance Counter Measurements. In Proceedings of the International Conference on Parallel Processing Workshops (ICPPW). IEEE, Minneapolis, Minnesota, USA, 176–185. DOI: http://dx.doi.org/10.1109/ICPPW.2014.34
[32] Svenja Schoeder, Katharina Kormann, Wolfgang A. Wall, and Martin Kronbichler. 2018. Efficient explicit time stepping of high order discontinuous Galerkin schemes for waves. SIAM Journal on Scientific Computing 40, 6 (2018), C803–C826. DOI: http://dx.doi.org/10.1137/18M1185399
[33] Jan Treibig, Georg Hager, and Gerhard Wellein. 2010. LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments. In Proceedings of the International Conference on Parallel Processing Workshops (ICPPW), Wang-Chien Lee (Ed.). IEEE, Piscataway, NJ, 207–216. DOI: http://dx.doi.org/10.1109/ICPPW.2010.38
[34] Takayuki Umeda, Keiichiro Fukazawa, Yasuhiro Nariyuki, and Tatsuki Ogino. 2012. A Scalable Full-Electromagnetic Vlasov Solver for Cross-Scale Coupling in Space Plasma. IEEE Transactions on Plasma Science 40, 5 (2012), 1421–1428. DOI: http://dx.doi.org/10.1109/TPS.2012.2188141
[35] ACM Transactions on Mathematical Software (TOMS) 45, 2 (2019), 1–41. DOI: http://dx.doi.org/10.1145/3319797
[36] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM 52, 4 (2009), 65.