Vectorization and Minimization of Memory Footprint for Linear High-Order Discontinuous Galerkin Schemes
Jean-Matthieu Gallard, Leonhard Rannabauer, Anne Reinarz, Michael Bader
Department of Informatics, Technical University of Munich
Email: {gallard,leonhard.rannabauer,reinarz,bader}@in.tum.de

Abstract—We present a sequence of optimizations to the performance-critical compute kernels of the high-order discontinuous Galerkin solver of the hyperbolic PDE engine ExaHyPE – successively tackling bottlenecks due to SIMD operations, cache hierarchies and restrictions in the software design.

Starting from a generic scalar implementation of the numerical scheme, our first optimized variant applies state-of-the-art optimization techniques by vectorizing loops, improving the data layout and using Loop-over-GEMM to perform tensor contractions via highly optimized matrix multiplication functions provided by the LIBXSMM library. We show that memory stalls due to a memory footprint exceeding our L2 cache size hindered the vectorization gains. We therefore introduce a new kernel that applies a sum factorization approach to reduce the kernel's memory footprint and improve its cache locality. With the L2 cache bottleneck removed, we were able to exploit additional vectorization opportunities by introducing a hybrid Array-of-Structure-of-Array data layout that solves the data layout conflict between matrix multiplication kernels and the point-wise functions that implement PDE-specific terms.

With this last kernel, evaluated in a benchmark simulation at high polynomial order, only 2% of the floating point operations are still performed using scalar instructions and 22.5% of the available performance is achieved.
Index Terms—ExaHyPE, Code Generation, High-Order Discontinuous Galerkin, ADER, Hyperbolic PDE Systems, Vectorization, Array-of-Struct-of-Array
I. INTRODUCTION

ExaHyPE is a hyperbolic PDE engine (as in "game engine"): while focusing on a dedicated numerical scheme and on a fixed mesh infrastructure, it provides flexibility in the PDE system to be solved [1]. ExaHyPE employs a high-order discontinuous Galerkin (DG) method combined with ADER (Arbitrary high-order DERivative) time-stepping, first proposed in [2]. A-posteriori Finite-Volume-based limiting addresses the problem of instabilities that may occur for non-linear setups [3]. The ADER method is an explicit one-step predictor-corrector scheme. The solution is first computed element-locally and then corrected using contributions from neighboring elements. A cache-aware, communication-avoiding ADER-DG realization [4] allows high-order accuracy in time with just a single (amortized) mesh traversal, which leads to high arithmetic intensity. This high arithmetic intensity leads to advantages in performance and time-to-solution compared to Runge-Kutta-DG (RK-DG) approaches (cf. [5]). Previous performance studies [6] also showed, however, that these advantages can be impeded by a large memory footprint for the element-local predictor step. ADER-DG hence faces slightly different optimization challenges than the more widespread RK-DG methods.

For flexibility and modularity, ExaHyPE isolates the performance-critical components of the ADER-DG scheme into compute kernels. Multiple variants of each kernel exist, allowing the user to adapt the scheme to a given application's numerical requirements – for example, choosing between a scheme for a linear or a non-linear PDE system. Furthermore, code generation utilities are used before compilation to produce kernels that are tailored toward both the given application and the target architecture, as specified by the user [7].

ExaHyPE's performance is heavily dominated by the Space Time Predictor (STP), which computes an entirely element-local time extrapolation of the solution.
For linear PDE systems, such as in the context of seismic simulations [8], the STP is computed via a Cauchy-Kowalewsky scheme, which requires tensor operations that imply calls to PDE-specific user functions. This generates conflicting requirements on the ExaHyPE API: the data layout needed for optimization of the tensor operations and that needed for vectorization of the user functions differ (AoS vs. SoA).

In this paper, we show the gradual optimization of the linear STP kernel starting from the generic non-optimized algorithm. As each further optimization step also requires more work by the user, each new variant is added as an opt-in feature. Combined with the code generation approach used in ExaHyPE, this ensures continuing sustainability of the code.

We start with a brief overview of the engine, the ADER-DG scheme and its kernels, with a focus on the linear STP kernel's Cauchy-Kowalewsky scheme. In Sec. III we justify the choice of data layout and discuss the optimization of the STP kernel using ExaHyPE's code generation utilities and the LIBXSMM library [9]. Next, we discuss how memory stalls impede the efficiency of these optimizations and how to reformulate the kernel's algorithm to reduce its memory footprint and thus memory stalls. In Sec. V we describe how this removed bottleneck allows us to exploit further vectorization opportunities by using a hybrid data layout that solves the data layout conflict caused by the API requirements. Finally, we evaluate and compare the performance of all STP kernel variants in various benchmarks, focusing on the Intel Skylake architecture.

II. THE EXAHYPE ENGINE
A. ADER-DG solver
The ExaHyPE engine is designed to solve a wide class of systems of linear and non-linear hyperbolic PDEs. However, in this paper we focus on efficiently solving linear equations. In matrix form the equations are as follows:

    Q_t = M ( ∇·F(Q) + B·∇Q ) + δ_x ,    (1)

where Q represents the time- and space-dependent physical quantities of the system. The temporal evolution of the quantities is defined by the conservative flux F(Q) = (F_1 Q, F_2 Q, F_3 Q), as well as the non-conservative flux term B·∇Q = (B_1 ∇_x Q, B_2 ∇_y Q, B_3 ∇_z Q). M models material properties. Point sources are modeled via δ_x.

In a DG scheme the numerical solution of (1) is represented by polynomials within each cell. We use a nodal basis given by the Lagrange polynomials with either Gauss-Legendre or Gauss-Lobatto interpolation points. The hexahedral mesh used in ExaHyPE allows us to use a tensor product basis, i.e. each basis function is composed of one-dimensional basis functions Φ_k(x, y, z) = φ_{k1}(x) φ_{k2}(y) φ_{k3}(z). The multi-index k = (k1, k2, k3) refers to a specific nodal coordinate. The unknown state vector can now be approximated in each element by Q(x, y, z, t) ≈ Σ_k φ_{k1}(x) φ_{k2}(y) φ_{k3}(z) q_k(t).

For brevity we refer to [1], [3] for a more detailed introduction of the semi-discrete scheme, which relies on a strong formulation of (1) and can be summarized as follows.
• Multiply (1) with a test function from the same space as the ansatz function and integrate over each element.
• Integrate the flux terms by parts (twice). After this step we are left with volume and face integrals. To compute the face integrals we introduce numerical fluxes F*. Here we assume that F* is linear in Q and F, which simplifies the algorithm significantly.
• Project the equation from the mesh elements onto the reference unit cube.

To solve linear problems of the form (1) for suitable initial and boundary conditions we implemented the Cauchy-Kowalewsky (CK) method in ExaHyPE. After the discretization in space, the computation of the required volume and face integrals reduces to the computation of cell-local tensor operations. (Here ∇·F denotes the divergence of F and ∇Q the spatial gradient of Q.) To evaluate the necessary integrals we apply a quadrature rule based on the nodal points chosen for the Lagrange polynomials. We require N nodal points per dimension to achieve N-th order convergence.

In each cell we must compute the following matrices:
• the discrete mass matrix M – this results in a diagonal mass matrix with the quadrature weights as entries, saving us the effort of inverting the mass matrix.
• the discrete derivative operator D.
• a projection operator P – this projection is needed to compute the effect of the point source on each node. It projects the point source onto each nodal basis function.
• F_1, F_2, F_3: the physical flux in each direction.
• B_1, B_2, B_3: the non-conservative flux in each direction.

In the resulting scheme we compute two kernels. The first incorporates all integrals in the reference element, which we call the volume term. Each degree of freedom is given by a multi-index k for the basis function and an index for the variable:

    V_{kr,ls} q_{ls} = M_{sr} ( D_{k1 l} F_{1,sr} q_{l k2 k3 r} + B_{1,sr} D_{k1 l} q_{l k2 k3 r}
                              + D_{k2 l} F_{2,sr} q_{k1 l k3 r} + B_{2,sr} D_{k2 l} q_{k1 l k3 r}
                              + D_{k3 l} F_{3,sr} q_{k1 k2 l r} + B_{3,sr} D_{k3 l} q_{k1 k2 l r} ) + P_k δ_x

The second kernel includes all integrals along the boundary of the reference element.
We denote the neighborhood of a given cell with N, and then sum over the adjacent faces f:

    Σ_{f∈N} F*( q_f(t)_{kr}, q̂_f(t)_{kr}, F_f(t)_{kr}, F̂_f(t)_{kr} ),

where q_f denotes the projected degrees of freedom of the face and q̂_f those of the neighboring element. In this notation the time-integrated semi-discrete scheme can be written as:

    q_{kr}(t^{n+1}) = q_{kr}(t^n) + V_{kr,ls} ∫_{t^n}^{t^{n+1}} q_{ls}(t) dt
                      − Σ_{f∈N} ∫_{t^n}^{t^{n+1}} F*( q_f(t)_{kr}, q̂_f(t)_{kr}, F_f(t)_{kr}, F̂_f(t)_{kr} ) dt    (2)

To complete the discretization we need to discretize (2) in time. The underlying assumption in this step is that for a small enough timestep Δt the volume operator is approximately equal to the temporal derivative:

    ∂q_{kr}(t^n)/∂t ≈ V_{kr,ls} q_{ls}(t^n)    (3)

By evolving q_{ls}(t) in a Taylor series of order N and using (3) we can compute (2):

    ∫_{t^n}^{t^{n+1}} q_{kr}(t) dt ≈ Σ_{o=0}^{N} (Δt)^{o+1} / (o+1)! · V^o_{kr,ls} q_{ls}(t^n) =: q̄_{kr}(t^n)    (4)

The summands of (4) can be computed iteratively:

    p^o_{kr} := V^o_{kr,ls} q_{ls}(t^n) = V_{kr,ls} p^{o−1}_{ls} .

The vector p^o_{kr} is called the Space Time Predictor (STP) and its computation is the main focus of this work. It operates only on cell-local information and is considerably more expensive to compute than the corrector step. Since F* is a linear operation, the semi-discrete scheme can be transformed to

    q_{kr}(t^{n+1}) = q_{kr}(t^n) + V_{kr,ls} q̄_{ls}(t^n)
                      − Σ_{f∈N} F*( q̄_f(t^n)_{kr}, ˆq̄_f(t^n)_{kr}, F̄_f(t^n)_{kr}, ˆF̄_f(t^n)_{kr} )    (5)

We call this the corrector step.

B. Kernels and Linear STP
In this paper we discuss various implementations of the STP kernel. For each implementation the output must be the required input of the corrector step, i.e. q̄_{ls}(t^n), q̄_f(t^n)_{kr} and F̄_f(t^n)_{kr}. The projection onto the face of an element is performed by a single matrix-matrix multiplication, leaving no room for optimization. We will thus only present implementations to compute q̄_{ls}(t^n) and F̄_{ls}(t^n).

The generic implementation of the STP follows the mathematical formulation of (4). We iterate over the derivatives of the Taylor series of (4) and store the whole STP (in the array p) and its fluctuations (in the array dF). Using these we can compute the time-averaged degrees of freedom (in the array qavg) and fluctuations (in the array favg). See Fig. 1 for the entire pseudocode.

C. Engine structure
Fig. 2 shows the architecture of ExaHyPE, which follows a strict separation of concerns. It builds on the Peano framework [10] (dark-green box), which provides dynamical adaptive mesh refinement on tree-structured Cartesian meshes together with shared- and distributed-memory parallelization.

Application-specific contributions of users (i.e., developers of applications using the engine) are illustrated as cyan boxes. To write an ExaHyPE application, users provide a specification file, which is passed to the ExaHyPE Toolkit. The Toolkit creates glue code, empty application-specific classes and (most important for this paper) core kernels that are tailored towards application and architecture (light-green box) [7]. Users need to complete the application-specific classes by providing PDE- and scenario-specific implementations (of flux functions, boundary conditions, etc.). Note that these user functions also depend on the requirements of the specification file and the kernel variant used. For example, ExaHyPE's API by default relies on point-wise user functions that operate on a single quadrature node. However, some kernels, like the STP kernel variant described in Sec. V, work on user functions that operate on vectorizable chunks.

In the context of this paper it is also important to note the optional usage of LIBXSMM [9] (red box). This library provides efficient routines for small matrix multiplications, which are used by the optimized ADER-DG kernels.
D. Kernel Generator
To generate application- and architecture-tailored kernels, the Toolkit uses a Kernel Generator submodule. The Kernel Generator is a Python 3 module that can be invoked manually or automatically by the Toolkit. If invoked by the Toolkit, the optimized kernels are called by the generated glue code. The generic kernels used by default only provide minimal customizability toward the application and none toward the architecture.

/* Compute the Space Time Predictor */
p[0,k] = q[k] + pointSource(t=t_n, coordinate=k)
for (o = 0; o < N; o++) {
  for (k = 0; k < N^3; k++) {
    for (d = 0; d < 3; d++) {
      flux[o,d,k] ← computeF(p[o,k], dim=d)
  }}
  for (d = 0; d < 3; d++) {
    for (k = 0; k < N^3; k++) {
      dF[o,d,k] ← derive(flux[o,d,*], coordinate=k, dim=d)
  }}
  for (d = 0; d < 3; d++) {
    for (k = 0; k < N^3; k++) {
      gradQ[o,d,k] ← derive(p[o,*], coordinate=k, dim=d)
  }}
  for (d = 0; d < 3; d++) {
    for (k = 0; k < N^3; k++) {
      dF[o,d,k] ← dF[o,d,k] + computeNcp(gradQ[o,d,k], dim=d)
  }}
  for (k = 0; k < N^3; k++) {
    p[o+1,k] ← p[o,k] + derive(pointSource(t=*, coordinate=k), dim=time, order=o)
    for (d = 0; d < 3; d++) {
      p[o+1,k] ← p[o+1,k] + dF[o,d,k]
  }}
}
/* Compute the time averaged values */
for (o = 0; o < N; o++) {
  for (k = 0; k < N^3; k++) {
    qavg[k] ← qavg[k] + p[o,k] * (dt^(o+1) / (o+1)!)
}}
for (d = 0; d < 3; d++) {
  for (o = 0; o < N; o++) {
    for (k = 0; k < N^3; k++) {
      favg[d,k] ← favg[d,k] + dF[o,d,k] * (dt^(o+1) / (o+1)!)
}}}

Fig. 1. Pseudocode for the generic implementation of the STP.
The Kernel Generator, like the Toolkit, follows the Model-View-Controller (MVC) architectural pattern and uses the template engine Jinja2 to generate the C++ kernels. This facilitates the introduction of new variants of a given kernel as a new template (View) in the module, and decouples the algorithmic optimizations from the low-level optimizations using Jinja2's template macro functionalities. As architecture-specific optimizations are abstracted behind macros and template functionalities, these can easily be extended to support new architectures, e.g. by extending support for AVX-512 instruction sets. A more detailed description of the Kernel Generator and Toolkit design can be found in [7].

Fig. 2. ExaHyPE's structure with kernels (taken from [1]): the ExaHyPE specification file steers the ExaHyPE Toolkit, which generates the user solver (optimized or generic kernels, PDE terms in C/C++ or Fortran); the ExaHyPE core (solver base classes for ADER-DG, FV, ...; algorithms for time stepping, AMR, ...; plotters for various file formats) builds on LIBXSMM and on Peano (grid management and heaps, distributed- and shared-memory parallelization).

In the following sections, we introduce three variants of the generic STP kernel with increasing level of optimization. Each variant is implemented as a new template in the Kernel Generator and reuses existing optimization macros. As the later variants require an increasing amount of work from the end-user regarding the user functions, they are optional and can be enabled by a flag in the specification file.

III. OPTIMIZED IMPLEMENTATION OF THE LINEAR STP KERNEL: VECTORIZATION AND TENSOR OPERATIONS
In this section we describe optimizations employed by the Kernel Generator to improve the STP kernel performance. These optimizations include SIMD vectorization and performing tensor contractions with Loop-over-GEMM (LoG). The algorithm in this LoG variant of the STP kernel remains similar to the generic variant in order to preserve the user API. The use of slicing techniques to perform LoG by mapping tensor contraction to BLAS operations, in particular GEMM (matrix-matrix multiplication), has previously been explored by Di Napoli et al. [11] and Shi et al. [12].
A. SIMD and data layout conflict
SIMD instructions require a unit stride, i.e. the vectorized innermost loop must iterate over the fastest running index. For ExaHyPE, this causes conflicting data layout requirements. The kernels, in particular the STP kernel, perform the same operations for each quantity. The operations can thus be vectorized in the quantity dimension, corresponding to an Array-of-Structure (AoS) data layout. In contrast, the user functions are evaluated pointwise for each quadrature node with the quantities as parameters. Vectorization can therefore only be performed by applying the user functions on multiple quadrature nodes at a time. This requires a Structure-of-Array (SoA) data layout. We expect the user functions to be less complex than the kernels and to be written by users who may be less familiar with code optimization. We therefore chose the AoS data layout for the ExaHyPE engine and thus prioritize optimization of the kernels.

Using the AoS data layout, kernel operations can be vectorized using compiler auto-vectorization with the corresponding pragmas, or specialized code for critical operations. Furthermore, to fully exploit the SIMD capabilities, all tensors and matrices are memory-aligned and their leading dimensions are zero-padded to the next multiple of the SIMD vector length, such that each slice is also aligned in the slower dimensions. It should be noted that the zero-padding increases the number of floating point operations that need to be performed. However, these come for free or even provide a speedup, as a single vector instruction containing the zero-padding replaces one or more scalar instructions and also removes the need for masking or loop peeling.

Since the alignment and padding size depend on the target architecture's supported SIMD instruction set, e.g. AVX-512 on Skylake, the Kernel Generator uses Jinja2's template variables and macros to abstract these optimizations in its templates. These variables are calculated by the Kernel Generator Controller and are used by Jinja2 when rendering the macros and templates. The templates are rendered with the correct alignment and padding size, and future architectures can be added by simply extending the macros' definitions, as was done when adding Knights Landing and Skylake (AVX-512) support from previous Haswell (AVX2) optimizations.
B. Tensor products as Loop-over-GEMM with LIBXSMM
The computation-heavy operations performed in the STP kernel reflect derivatives on tensors along one spatial dimension, as seen in the pseudocode of Fig. 1. These derivatives are computed as a matrix multiplication with the discrete derivative operator D, as described in Sec. II-A. This reduces the computation to tensor contraction operations.

For example, let F_{k,s} be a 4-dimensional tensor with k = (k1, k2, k3) corresponding to the z, y and x spatial dimensions respectively, and s giving the quantity of the state vector q currently being computed. Then the discrete derivatives dF of F along the x dimension take the form:

    ∀k, ∀s: dF_{k,s} = Σ_l D_{k3,l} F_{k1,k2,l,s} ,

where D is specific to the quadrature nodes and order. If we first only consider the loops on k3 and s, we need to compute the following intermediate result:

    ∀k3, ∀s: dF'_{k3,s} = Σ_l D_{k3,l} F'_{l,s} ,

which is a matrix multiplication with F' = F_{k1,k2,:,:}, the tensor F's matrix slice at a given (k1, k2), and similarly dF' = dF_{k1,k2,:,:}. Thus, we can perform the discrete derivative of F more efficiently as a matrix multiplication on matrix slices of the tensors F and dF for fixed k1 and k2 values. The computation of the tensor dF can be reformulated using a sequence of BLAS MatMul operations, as

    ∀k1, ∀k2: MatMul( F_{k1,k2,:,:}, D, dF_{k1,k2,:,:} )

This reformulation of tensor contraction using batched matrix multiplications on matrix slices of the tensors is explored in more detail by Di Napoli et al. [11] and Shi et al. [12], e.g.

Small matrix multiplications (matrix sizes are determined by the polynomial order N of the ADER-DG scheme and the number m of quantities in the PDE) are thus the performance-critical part of the STP kernel and need to be performed as efficiently as possible. Exploitation of SIMD instructions demands a unit stride in the innermost loop of the matrix operations. As the quantity dimension (index s in the example) is involved in all operations, using an AoS data layout implies that the quantity dimension should be the fastest running index of the tensors.

For highest-possible performance on Intel architectures, the Kernel Generator relies on the library LIBXSMM [9], which provides highly optimized assembly code for the required small dense matrix multiplications (gemm routines).

Fig. 3. Extracting matrix slices from a tensor: for a 3×2×3 tensor A(k,j,i), the slice A(1,:,:) is a contiguous 3×2 matrix selected by an offset, while A(:,1,:) is a 3×3 matrix whose rows are separated by a slice stride, i.e. a matrix with leading-dimension padding.
Fig. 3 illustrates how to extract matrix slices from the array storing a 3-dimensional tensor. To extract a tensor matrix slice along the two fastest dimensions (indexes i and j) it is sufficient to specify an offset, as the matrix slice is a contiguous subarray of the array storing the tensor. Doing so along other dimensions (indexes i and k in the figure) can be achieved by determining a slice stride, if one of the dimensions features a unit stride (here i). The slice stride reflects the distance between the matrix rows that are stored in unit stride. LIBXSMM allows us to define the leading dimension of a matrix with and without padding, e.g. to align matrix rows to cache line boundaries. We utilize these padding hints by interpreting the slice stride as the padded row size of the matrix leading dimension. We thus efficiently restrict matrix operations to tensor matrix slices without requiring extra memory transfers.

Thus, all the tensor operations can be reformulated as a batch of matrix multiplications on subsequent slices in the different dimensions. The zero-padding introduced earlier ensures that each slice is also aligned for optimal performance. We stress again that we rely on the quantity dimension being the leading dimension of the tensors and of the matrix slices. This requires an AoS data layout.

Fig. 4. Available performance reached and percentage of pipeline slots affected by memory stalls on the test benchmark (on Intel Skylake, see Sec. VI) with the generic (green), LoG for AVX-512 (turquoise), and LoG for AVX2 (light turquoise) kernels for order 4 to 11.

LIBXSMM is integrated into the Kernel Generator via a custom matmul Jinja2 template macro as described in [7]. This macro can also generate a generic triple-loop matrix multiplication, if LIBXSMM is not available for an architecture. The use of a template macro facilitates integration of other libraries for small GEMM code, e.g. for other architectures. It also improves the readability of the template code by encapsulating the respective low-level optimizations.
C. Further optimizations
As a benefit of using code generation, compile-time constants and known constant parameters are directly hard-coded into the kernels – as are the dimensionality of the problem and the location and exact name of the user functions. The latter, when combined with interprocedural optimizations (IPO) during compilation, allows inlining of the user functions into the kernels, even though these functions are virtual in the API. In addition, frequently used matrices or combinations of matrices, such as cross products or the inverses of quadrature weights, can be precomputed by the Kernel Generator. They can then be directly hard-coded, thus saving redundant operations.
D. Intermediate performance results
Fig. 4 compares the performance (as percentage of available performance, defined as the ratio of the average GFlops reached to the maximal amount of GFlops possible on this hardware, as described in Sec. VI) reached by the STP kernel in its generic variant vs. the LoG implementation that exploits the optimizations described in Sec. III. Benchmark and architecture (Intel Skylake) are described in detail in Sec. VI. The LoG variant was tested with one setup optimized toward Skylake (AVX-512) and one toward the older Haswell architecture (AVX2) for comparison.

As expected, the performance of the generic setup is quite low and quickly stagnates, since only a fraction of the code can be auto-vectorized by the compiler. In contrast, more than 90% of the floating point operations result from SIMD operations in the LoG AVX-512 setup (measured via Intel VTune Profiler, see Fig. 9 in Sec. VI-A). Neither of the LoG setups shows significant improvement at low order, as the tensors are too small for the loop vectorizations and matrix multiplications to be efficient. For higher order, they quickly improve to 8–11% of the available performance.

However, the achieved performance does not reflect the potential expected from vectorization. In fact, when comparing the LoG setup for Skylake and the setup optimized toward the previous Haswell architecture using AVX2, we obtain similar performance with a speedup of only 23–30% obtained by going from AVX2 to AVX-512. If the benchmarks were compute-bound, then a speedup closer to 100% could be expected. This discrepancy is explained by the significant amount of pipeline slots affected by memory stalls. While these were expected, especially at low order when the arithmetic intensity of the ADER-DG scheme is lower, the percentage of pipeline slots impacted by memory stalls stays above 41% in the LoG AVX-512 test and 34% in the AVX2 setup. In the generic setup, however, the percentage goes down to around 27%.
At order 11 we observe a performance loss compared to orders 9 and 10 for the LoG AVX-512 setup, which is due to the increase of memory stalls.

IV. SPLITCK: DIMENSION-SPLIT CAUCHY-KOWALEWSKY SCHEME
The LoG optimization of the STP kernel, as described in Sec. III, did not lead to the desired performance improvement. In this section we show that this is due to memory stalls, most likely because the required data is not held in the fast cache levels. We therefore introduce a reformulation of the Cauchy-Kowalewsky scheme that reduces the memory footprint of the STP kernel. Similar sum factorization approaches are used in other PDE solver frameworks, such as [13]–[16].
A. Motivation: LoG is bound by the capacity of the L2 cache
Performance benchmarks with the LoG kernel variant using Intel's VTune Amplifier display a high amount of pipeline slots impacted by memory stalls, which plateau toward 40% starting at order 6 instead of steadily decreasing as expected (see Fig. 6). L2 cache overflow is to be expected around this order. The benchmarks were performed on Intel Xeon Platinum 8174 CPUs, which have 1MB of L2 cache available per core. For the generic scheme from Sec. II-B the storage required for temporary arrays is O(N^{d+1} m d), where N is again the polynomial order of the ADER-DG scheme, m the number of quantities in the PDE, and d its dimension. Thus, for a 3D medium-sized problem (e.g. m = 25, d = 3), the 1MB limit will be exceeded as soon as N = 6 and temporary arrays will fall out of the L2 cache.

At high order the previous optimizations cannot be fully exploited, because the code stops being compute-bound and is instead dominated by memory stalls, mostly cache-related ones. Therefore the algorithm described in Sec. II-B, used by both the generic and LoG STP variants, needs to be improved toward cache awareness and a reduced memory footprint, even at the cost of slightly more computations.

p[*] ← q[*]
qavg[*] ← dt * q[*]
for (o = 0; o < N; o++) {
  ptemp[*] ← 0
  for (d = 0; d < 3; d++) {
    for (k = 0; k < N^3; k++) { flux[k] ← computeF(p[k], dim=d) }
    for (k = 0; k < N^3; k++) { ptemp[k] ← ptemp[k] + derive(flux[*], coordinate=k, dim=d) }
    for (k = 0; k < N^3; k++) { gradQ[k] ← derive(p[*], coordinate=k, dim=d) }
    for (k = 0; k < N^3; k++) { ptemp[k] ← ptemp[k] + computeNcp(gradQ[k], dim=d) }
  }
  for (k = 0; k < N^3; k++) { qavg[k] ← qavg[k] + ptemp[k] * (dt^(o+2) / (o+2)!) }
  swap(p, ptemp)
}
/* Compute favg */
favg[*] ← 0
for (d = 0; d < 3; d++) {
  for (k = 0; k < N^3; k++) { flux[k] ← computeF(qavg[k], dim=d) }
  for (k = 0; k < N^3; k++) { favg[d,k] ← favg[d,k] + derive(flux[*], coordinate=k, dim=d) }
}

Fig. 5. Pseudocode for the cache-aware SplitCK scheme for the STP.
B. Reformulation of the Cauchy-Kowalewsky scheme
We reformulate the STP kernel's algorithm along the general paradigm to increase the number of memory accesses in lower-level caches by reusing arrays as soon as possible. Instead of computing and storing the fluxes for all dimensions at once, the tensor basis allows us to consider each dimension separately. To decrease the overall memory footprint we do not keep the whole STP and its fluxes in memory, but directly add the contribution of each loop to the sum of the time-averaged degrees of freedom and do not store the fluxes. In the end we exploit the linearity of the scheme to recompute the time-averaged fluxes from the time-averaged degrees of freedom.

The resulting scheme (outlined in Fig. 5) reduces the memory footprint of the largest tensors by a full time dimension by performing the time integration on-the-fly. A further reduction by a factor 3 is achieved by reusing the same tensors for all three spatial dimensions. The SplitCK memory footprint is thus O(N^d m), compared to O(N^{d+1} m d) previously. Cache locality is also taken into consideration. However, the recomputation of the time-averaged flux outside of the time loop adds the equivalent of almost one iteration to the computation, a cost that becomes increasingly insignificant at higher order.

Fig. 6. Available performance reached and percentage of pipeline slots affected by memory stalls on the test benchmark (see Sec. VI) with the LoG variant (turquoise) and the SplitCK one (blue) for order 4 to 11.

C. Performance evaluation
Fig. 6 shows that the new SplitCK scheme substantially reduces the memory stalls compared to the LoG scheme for all orders, with a steady decrease as the order increases – whereas the memory stalls for the LoG do not fall below 41% and even increase after order 9. The improvement in memory stalls is directly reflected by a performance that keeps growing with increasing order and comes closer to matching the expected speedup compared to the generic kernels.

V. AOSOA DATA LAYOUT
Removing the L2 cache bottleneck via the SplitCK scheme not only leads to direct performance increases compared to the LoG scheme – it enables further improvements by increasing the ratio of vectorized floating point operations. To achieve this, we address the AoS-vs.-SoA data layout conflict by introducing a hybrid data layout that can serve as AoS for the GEMM kernels and also as SoA for the user functions. Similar hybrid data layouts were also explored in the PyFR Navier-Stokes solver [17] and are used by the YATeTo code generator [18] to optimize complex tensor operations.
A. Motivation
The data layout conflict between AoS and SoA and the choice of an AoS data layout mean that vectorization opportunities in the user functions are lost, as they are computed point-wise using scalar instructions. To allow for SIMD instructions, they require inputs in an SoA data layout so that operations can be performed on a vector of individual quantities.

One way to get around this issue is to transpose the tensors on-the-fly, switching the data layout from AoS to SoA and back before and after calling the user functions. This was tested in both linear and non-linear STP kernels for various applications. It proved effective for complex non-linear scenarios, where the cost and complexity of the user functions were high enough that the performance gains of the vectorization more than compensated for the cost of the transpositions. However, the linear PDE systems in the targeted seismic applications have too simple (and inexpensive) user functions for such a solution to be effective, despite achieving the targeted high ratio of vectorized floating point operations.

Taking inspiration from the PyFR framework, where a similar conflict occurs [17], and from the optimization of tensor operations in YATeTo [18] (both are open source software), we implemented a hybrid Array-of-Structure-of-Array (AoSoA) data layout. In this hybrid layout, the quantity dimension is put in between the spatial dimensions: for a tensor A, instead of having the quantity dimension s be the fastest, i.e. A_{k,j,i,s} (AoS layout), or the slowest, A_{s,k,j,i} (SoA layout), our AoSoA layout places it in between the spatial dimensions, resulting in a hybrid A_{k,j,s,i} data layout. While this layout is slightly less intuitive to work with, it allows the kernels to keep working with a pseudo-AoS layout and to trivially extract SoA subarrays for the user functions. To preserve alignment, the fastest dimension is zero-padded as with the AoS data layout.
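To make the layout concrete, the following is a minimal sketch (the helper names and signatures are ours, not ExaHyPE's API) of the flattened index into an AoSoA tensor with n quadrature nodes per spatial dimension, m quantities, and the x dimension zero-padded to a vector stride, together with the extraction of a contiguous SoA subarray for a user function:

```cpp
#include <cstddef>

// Flattened index into an AoSoA tensor A[k][j][s][i]: the quantity index s
// sits between the spatial indices, and the x index i is the fastest,
// padded from n entries to vectStride entries (illustrative helper only).
constexpr std::size_t idxAoSoA(std::size_t k, std::size_t j, std::size_t s,
                               std::size_t i, std::size_t n, std::size_t m,
                               std::size_t vectStride) {
  return ((k * n + j) * m + s) * vectStride + i;
}

// For fixed (k, j, s), the x line A[k][j][s][0..n-1] is contiguous in
// memory: an SoA subarray that a point-wise user function can consume
// with SIMD operations.
inline const double* xLine(const double* A, std::size_t k, std::size_t j,
                           std::size_t s, std::size_t n, std::size_t m,
                           std::size_t vectStride) {
  return A + idxAoSoA(k, j, s, 0, n, m, vectStride);
}
```

Because the padded x dimension is both the fastest index and aligned, each such subarray starts at an aligned address, which is what makes the trivial SoA extraction possible.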
On AVX-512 architectures, order 8 is a sweet spot with no padding required, whereas order 9 suffers from a particularly large padding overhead.

B. Data layout in the kernel
In the kernel, whenever tensor operations are performed on a tensor's matrix slices, one of two cases can occur depending on which dimensions the slices are taken from.

In the first case, the slices are on the x dimension, now the fastest running index. Compared to the previously used AoS data layout, the matrix slices of the AoSoA tensors are now transposed, as the quantity and x dimensions were swapped in the tensors. Thus the matrix multiplications can simply be transposed too, using C^T = (A · B)^T = B^T · A^T. Performing the matrix multiplications in this case only requires precomputing the transpose of the second matrix B (e.g. the discrete derivative operator matrix) and swapping the tensor slice and B^T in the MatMul operation.

In the second case, the slices are on another dimension, i.e. on a slower running index than the quantity dimension. If both tensors have the same dimension ordering and size, it is possible – when reformulating the tensor operations to Loop-over-GEMM – to extract bigger slices by fusing multiple dimensions. The code excerpt in Fig. 7 shows a derivative along the y-dimension (index j) with the quantity and x dimensions fused (indices s and i), with an explicit matrix multiplication instead of a LIBXSMM GEMM. By fusing the quantity and x dimensions of the tensors when extracting the slices, it does not matter in which order they were in the initial tensor. This means that the same kind of matrix multiplications can be used as with the AoS data layout. However, it forces minor API changes to ensure that the dimensions can be fused every time it is required.

// MatMul(Q, dudx, gradQ)
// [j][is] slices, fuse i and s
for (int k = 0; k < 6; k++)
  for (int j = 0; j < 6; j++)
    for (int l = 0; l < 6; l++)
      for (int is = 0; is < 72; is++)
        gradQ[k*432 + j*72 + is] += Q[k*432 + l*72 + is] * dudx[j*8 + l];

Fig. 7. Tensor contraction expressed as Loop-over-GEMM with the quantity and x dimensions fused (N = 6 and m = 12 hard-coded).
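The transposition identity used in the first case can be illustrated with a plain C++ sketch (a toy row-major GEMM of our own, standing in for the LIBXSMM calls used in the kernel): after precomputing B^T once, the product C = A · B is obtained in transposed form by computing C^T = B^T · A^T with the operands swapped.

```cpp
// Naive row-major matrix multiplication C (M x N) = A (M x K) * B (K x N).
// Stand-in for an optimized GEMM; used only to demonstrate the identity.
void matmul(const double* A, const double* B, double* C,
            int M, int K, int N) {
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j) {
      double acc = 0.0;
      for (int l = 0; l < K; ++l) acc += A[i * K + l] * B[l * N + j];
      C[i * N + j] = acc;
    }
}

// Row-major transpose: At (N x M) = transpose of A (M x N).
// In the kernel this is done once, offline, for the operator matrix B.
void transpose(const double* A, double* At, int M, int N) {
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j) At[j * M + i] = A[i * N + j];
}
```

Calling `matmul(Bt, At, Ct, N, K, M)` then yields exactly the transpose of `matmul(A, B, C, M, K, N)`, which is why the AoSoA kernel only has to swap the operands rather than re-layout the tensor slices.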
Vectorization of the STP kernel's operations can thus be preserved with the AoSoA data layout, and LIBXSMM can still be used to optimize the MatMul operations. However, the rest of the engine still expects an AoS data layout; thus the kernel inputs use the AoS data layout and are transposed to the AoSoA layout, and the outputs are transposed back to AoS at the end of the STP kernel. The performance impact of these transpositions is minimal compared to the cost of the kernel itself. In all cases this costs less performance-wise than using on-the-fly transposes between AoS and SoA at each user function call. Finally, it could be avoided altogether by switching the whole engine to an AoSoA data layout. At the time of this paper this has not been done due to API compatibility issues with other parts of the engine.
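A minimal sketch of such a boundary transposition (the helper and its signature are ours; the engine's actual conversion routines may differ) converts an AoS tensor Q[k][j][i][s] into the zero-padded AoSoA layout Q[k][j][s][i]:

```cpp
// Convert from AoS layout in[k][j][i][s] (n nodes per spatial dimension,
// m quantities) to AoSoA layout out[k][j][s][i], zero-padding the x
// dimension from n to vectStride entries. Illustrative sketch only.
void aosToAoSoA(const double* in, double* out, int n, int m, int vectStride) {
  for (int k = 0; k < n; ++k)
    for (int j = 0; j < n; ++j)
      for (int s = 0; s < m; ++s)
        for (int i = 0; i < vectStride; ++i)
          out[((k * n + j) * m + s) * vectStride + i] =
              (i < n) ? in[((k * n + j) * n + i) * m + s] : 0.0;
}
```

The inverse direction simply reads back the first n entries of each padded x line, so the pair of conversions costs O(n^3 m) copies per kernel invocation, once at entry and once at exit, rather than two transposes per user function call.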
C. Vectorization of the user functions
In the AoSoA STP kernel, instead of looping over all spatial dimensions to call a pointwise user function, the x direction is now excluded from the loop and the vectorized user function is expected to be applied to the full x dimension in one call. When calling a user function, the input subarrays of the AoSoA tensors are a full line of quadrature nodes, which corresponds to an SoA data layout within the subarray. Here the x dimension is the fastest running index.

As the user functions work on arrays in an SoA layout, they can easily be vectorized by looping over the fastest running index, replacing scalar operations by SIMD operations. The x dimension of the tensor is zero-padded to the next SIMD vector register size and the tensors are memory aligned; therefore each subarray is also memory aligned for maximal SIMD performance.

The code excerpt in Fig. 8 illustrates that in most applications, the scalar code can be adapted in three steps to enable compiler auto-vectorization:
1) Enclose the code in a for loop running over VECTLENGTH.
2) Add the new dimension to the arrays' indices, knowing that its size is VECTSTRIDE.
3) Use the correct pragma over the loop to enable auto-vectorization; here omp simd is used and an optional alignment specifier is given.

// scalar formulation of flux_x
void flux_x(double* Q, double* F) {
  F[0] = -(Q[0]+Q[3]+Q[4]);
  F[1] = -(Q[1]+Q[3]+Q[5]);
  F[2] = -(Q[2]+Q[4]+Q[5]);
}

// vectorized formulation of flux_x
void flux_x_vect(double* Q, double* F) {
  #pragma omp simd aligned(Q,F:ALIGNMENT)
  for (int i = 0; i < VECTLENGTH; i++) {
    F[0*VECTSTRIDE+i] = -(Q[0*VECTSTRIDE+i]+Q[3*VECTSTRIDE+i]+Q[4*VECTSTRIDE+i]);
    F[1*VECTSTRIDE+i] = -(Q[1*VECTSTRIDE+i]+Q[3*VECTSTRIDE+i]+Q[5*VECTSTRIDE+i]);
    F[2*VECTSTRIDE+i] = -(Q[2*VECTSTRIDE+i]+Q[4*VECTSTRIDE+i]+Q[5*VECTSTRIDE+i]);
  }
}

VECTLENGTH, VECTSTRIDE, and ALIGNMENT are integer constants known at compile time: VECTLENGTH is the size of the x dimension without padding, VECTSTRIDE the size with padding and thus the stride of the SoA data layout, and ALIGNMENT is the memory alignment adapted to the architecture. In most cases, simple compiler-enabled auto-vectorization proved sufficient to achieve optimal performance when compared to user-written code optimized with Intel intrinsics for SIMD operations.

However, to achieve optimal performance in the user functions, special consideration is required regarding the zero-padding used to preserve alignment. The zero-padded vectors can cause issues in the user functions when zero is not a valid input value. This can lead to numerical errors such as division by zero. Thus, in most cases, the user function vectorization should only be performed on the non-padded dimension size. But in doing so, masking or scalar loop spilling may be involved, decreasing performance. Consequently, while vectorizing a user function is trivial, fully optimizing its performance requires careful consideration of potential numerical issues. For this reason these optimizations have been made available only as opt-in features.

VI. RESULTS

We evaluate the performance of all the above-described kernel optimizations when simulating the elastic wave equations in first order formulation on curvilinear boundary-fitted meshes, as described in [8]. The equations are characterized by three quantities for particle velocity and six variables for the stress tensor.
Three material parameters define density and the velocity of P- and S-waves. To incorporate the geometry we store the transformation and its Jacobian in each vertex, adding a further nine parameters. Hence, we store m = 21 quantities at each integration point. As a scenario we run the established LOH1 benchmark [19], with a curvilinear mesh to fit the elements to the material parameter interface.

All tests were performed on SuperMUC-NG at Leibniz Supercomputing Centre. SuperMUC-NG uses two Intel Xeon Platinum 8174 CPUs per node, with 24 cores per socket running at 1.9 GHz when using AVX-512. Each core has two AVX-512 FMA units; thus the available performance per core is 2 · 8 · 2 · 1.9 = 60.8 double precision GFlops/s. Note that the CPU base frequency is reduced by almost 30%, from 2.7 GHz to 1.9 GHz, when using AVX-512 instructions. Thus, while these offer a theoretical speedup of a factor 8 in double precision FLOPs compared to scalar instructions, the actual speedup from vectorizing scalar operations is only a factor ≈ 5.6.

Multi-node parallelism of the ExaHyPE engine is handled by the Peano framework [10], which uses a hybrid MPI+TBB (Intel Threading Building Blocks) approach. Due to NUMA effects, each MPI rank should be restricted to a single socket for optimal performance. Furthermore, Peano's task-based shared-memory parallelization begins to deteriorate beyond ≈ 10 cores, at least for our linear-PDE setup. We conclude that splitting each socket of 24 cores into 3 MPI ranks of 8 cores parallelized with TBB is the optimal layout for large scale runs of the engine on SuperMUC-NG. As we focus only on single node performance in this work, we therefore run all benchmarks on 8 cores parallelized with TBB on a single socket.

Performance is measured using Intel VTune Profiler 2019 on the full application, only excluding the engine initialization step, which is negligible in runtime for large scale runs.
The benchmarks thus measure end-to-end performance, with all kernels and engine overhead included – though performance stays dominated by the STP kernel and its user functions. The setups are named after the STP kernel variant used: generic (Sec. II-B), LoG (Sec. III), SplitCK (Sec. IV) and AoSoA SplitCK (Sec. V).

A. Instruction mix

Fig. 9 displays the SIMD instruction mix for the four optimization variants at increasing orders. "Scalar" corresponds to no vectorization; "512 bits" to SIMD packing with AVX-512 instructions. As compiler auto-vectorization is used, 128- and 256-bit packing is also used following compiler heuristics. (All code was compiled with -O3 -xCORE-AVX512 -restrict -fma -ip -ipo; compute node details: https://doku.lrz.de/display/PUBLIC/Details+of+Compute+Nodes.)

For the generic setup, only a fraction of the FLOPs are packed by compiler auto-vectorization and most are computed using scalar instructions. This changes with the LoG and SplitCK setups, where over 80% of the FLOPs are packed, either in 256- or 512-bit packed instructions. This confirms that prioritizing the vectorization of the kernel was the right choice, as it vectorizes most of the FLOPs. However, close to 10% of the FLOPs, mostly coming from the user functions, are still performed using scalar instructions. Using the AoSoA SplitCK kernel, where the user functions are vectorized too, brings the amount of FLOPs performed with scalar instructions down to 2-4%, close to full vectorization of the code.

Fig. 9. Distribution of the packing size used to perform the floating point operations with the four kernel variants at orders N = 4 to 11.

Fig. 10. Available performance reached and percentage of pipeline slots affected by memory stalls for all four kernel variants at orders N = 4 to 11.

B. Available performance reached and memory stalls

Fig. 10 shows how much of the available performance was reached by each benchmark and plots the evolution of the ratio of pipeline slots affected by memory stalls. As the ADER-DG scheme becomes more arithmetically intensive at higher order, the performance is expected to increase, while the memory stalls should steadily decrease for all kernel variants. On the other hand, improving the compute speed of the STP kernels through vectorization and other optimizations increases the stress on memory and should result in an increased ratio of memory stalls compared to slower setups.

The generic kernels quickly reach a performance plateau around 3.8%, which is consistent with the expected performance of an almost scalar code on Skylake architectures [20], [21]. The LoG setup is constrained by memory stall issues, starting from N = 6, as discussed in Sec. IV-A. Reducing the memory footprint of the STP kernel with the SplitCK scheme proved effective, as the memory stalls ratio not only starts at a slightly lower value than in the LoG setup but also decreases faster (and steadily) with the order, even going below the generic setup for N = 11. Using the AoSoA SplitCK scheme to vectorize the user functions does not impact the memory stalls significantly; slightly more are observed as this STP kernel variant is faster, but the same trend as the default AoS SplitCK setup is preserved. Thus, both SplitCK-based setups increase in performance at higher orders, with the setup using the AoSoA layout reaching 22.5% at order N = 11. This is a speedup of a factor 6 compared to the generic benchmark at the same order. Taking into account the CPU base frequency reduction, this is above the maximal speedup expected from the AVX-512 vectorization of the algorithm alone. This is made possible by the reduced memory footprint of the new scheme.

VII. CONCLUSIONS

To exploit the potential of modern CPU architectures such as Skylake, relying solely on the vectorization of the most critical operations is not sufficient to improve performance. As we have seen in ExaHyPE, not taking memory and caches into account can significantly impact the overall performance of the code. Since the processor–memory performance gap is expected to further increase, this issue will become even more relevant on future architectures. Additionally, the acceleration potential of AVX-512 implies that each missed vectorization opportunity can considerably impact performance. The AoSoA layout provided a considerable speedup despite only reducing the ratio of scalar FLOPs from around 10% to around 3%, and in spite of adding the cost of computing transposes.

However, the modification of the STP kernel to reduce the memory footprint and improve vectorization also impacted the user API and increases the users' programming effort if they want to use the SplitCK and hybrid layout schemes. The introduced hybrid AoSoA data layout is also less intuitive to use than the common AoS or SoA data layouts, which complicates development and maintenance of the code.

Code generation proved to be a very valuable tool to solve these two issues. For ExaHyPE users, the new variants, SplitCK and AoSoA SplitCK, are optional and can easily be introduced later by adapting an application developed for the simpler default user API. For the engine developers, the template macros and high level abstractions available in the Kernel Generator greatly simplify the data layout change.

VIII. ACKNOWLEDGEMENTS AND FUNDING

REFERENCES

[1] A. Reinarz, D. E. Charrier, M. Bader, L. Bovard, M. Dumbser, K. Duru, F. Fambri, A.-A. Gabriel, J.-M. Gallard, S. Köppel, L. Krenz, L. Rannabauer, L. Rezzolla, P. Samfass, M. Tavelli, and T. Weinzierl, "ExaHyPE: an engine for parallel dynamically adaptive simulations of wave problems," Comp. Phys. Comm., accepted, arXiv:1905.07987.
[2] V. A. Titarev and E. F. Toro, "ADER: Arbitrary High Order Godunov Approach," J. Scient. Comput., vol. 17, no. 1, pp. 609–618, Dec 2002.
[3] O. Zanotti, F. Fambri, M. Dumbser, and A. Hidalgo, "Space–time adaptive ADER discontinuous Galerkin finite element schemes with a posteriori sub-cell finite volume limiting," Comp. Fluids, vol. 118, pp. 204–224, 2015.
[4] D. E. Charrier and T. Weinzierl, "Stop talking to me – a communication-avoiding ADER-DG realisation," arXiv e-prints, 2018, arXiv:1801.08682.
[5] M. Dumbser, F. Fambri, M. Tavelli, M. Bader, and T. Weinzierl, "Efficient implementation of ADER Discontinuous Galerkin schemes for a scalable hyperbolic PDE engine," Axioms, vol. 7, no. 3, p. 63, 2018.
[6] D. E. Charrier, B. Hazelwood, E. Tutlyaeva, M. Bader, M. Dumbser, A. Kudryavtsev, A. Moskovskiy, and T. Weinzierl, "Studies on the energy and deep memory behaviour of a cache-oblivious, task-based hyperbolic PDE solver," Int. J. High Perf. Comp. App., vol. 33, no. 5, pp. 973–986, 2019.
[7] J.-M. Gallard, L. Krenz, L. Rannabauer, A. Reinarz, and M. Bader, "Role-oriented code generation in an engine for solving hyperbolic PDE systems," in Workshop on Software Engineering for HPC-Enabled Research (SE-HER 2019), SC19, Denver, 2019. [Online]. Available: https://arxiv.org/abs/1911.06817
[8] K. Duru, L. Rannabauer, O. K. A. Ling, A.-A. Gabriel, H. Igel, and M. Bader, "A stable discontinuous Galerkin method for linear elastodynamics in geometrically complex media using physics based numerical fluxes," arXiv preprint arXiv:1907.02658, 2019.
[9] A. Heinecke, G. Henry, M. Hutchinson, and H. Pabst, "LIBXSMM: accelerating small matrix multiplications by runtime code generation," in SC16: Int. Conf. for HPC, Networking, Storage and Analysis, 2016, pp. 981–991.
[10] T. Weinzierl, "The Peano software—parallel, automaton-based, dynamically adaptive grid traversals," ACM Trans. Math. Softw., vol. 45, no. 2, pp. 14:1–14:41, 2019.
[11] E. Di Napoli, D. Fabregat-Traver, G. Quintana-Ortí, and P. Bientinesi, "Towards an efficient use of the BLAS library for multilinear tensor contractions," Appl. Math. Comput., vol. 235, pp. 454–468, 2014.
[12] Y. Shi, U. N. Niranjan, A. Anandkumar, and C. Cecka, "Tensor contractions with extended BLAS kernels on CPU and GPU," IEEE, 2016, pp. 193–202.
[13] M. Kronbichler and K. Kormann, "A generic interface for parallel cell-based finite element operator application," Comp. Fluids, vol. 63, pp. 135–147, 2012.
[14] M. Homolya, R. C. Kirby, and D. A. Ham, "Exposing and exploiting structure: optimal code generation for high-order finite element methods," arXiv e-prints, 2017, arXiv:1711.02473.
[15] S. Müthing, M. Piatkowski, and P. Bastian, "High-performance implementation of matrix-free high-order discontinuous Galerkin methods," Int. J. High Perf. Comp. App., 2018.
[16] J. Schöberl, A. Arnold, J. Erb, J. M. Melenk, and T. P. Wihler, "C++11 implementation of finite elements in NGSolve," Technical report, 2017.
[17] F. D. Witherden, A. M. Farrington, and P. E. Vincent, "PyFR: an open source framework for solving advection–diffusion type problems on streaming architectures using the flux reconstruction approach," Comp. Phys. Comm., vol. 185, no. 11, pp. 3028–3040, 2014.
[18] C. Uphoff and M. Bader, "Yet another tensor toolbox for discontinuous Galerkin methods and other applications," ACM Trans. Math. Softw., under review. [Online]. Available: http://arxiv.org/abs/1903.11521
[19] S. M. Day and C. R. Bradley, "Memory-efficient simulation of anelastic wave propagation," Bull. Seismol. Soc. Am., vol. 91, no. 3, pp. 520–531, 2001.
[20] S. D. Hammond, C. T. Vaughan, and C. Hughes, "Evaluating the Intel Skylake Xeon processor for HPC workloads," 2018 Int. Conf. on High Perf. Comp. & Sim. (HPCS), 2018.