A compute-bound formulation of Galerkin model reduction for linear time-invariant dynamical systems
Francesco Rizzi, Eric J. Parish, Patrick J. Blonigan, John Tencer
Abstract.
This work aims to advance computational methods for projection-based reduced-order models (ROMs) of linear time-invariant (LTI) dynamical systems. For such systems, current practice relies on ROM formulations expressing the state as a rank-1 tensor (i.e., a vector), leading to computational kernels that are memory bandwidth bound and, therefore, ill-suited for scalable performance on modern many-core and hybrid computing nodes. This weakness can be particularly limiting when tackling many-query studies, where one needs to run a large number of simulations. This work introduces a reformulation, called rank-2 Galerkin, of the Galerkin ROM for LTI dynamical systems which converts the nature of the ROM problem from memory bandwidth to compute bound. We present the details of the formulation and its implementation, and demonstrate its utility through numerical experiments using, as a test case, the simulation of elastic seismic shear waves in an axisymmetric domain. We quantify and analyze performance and scaling results for varying numbers of threads and problem sizes. Finally, we present an end-to-end demonstration of using the rank-2 Galerkin ROM for a Monte Carlo sampling study. We show that the rank-2 Galerkin ROM is one order of magnitude more efficient than the rank-1 Galerkin ROM (the current practice) and about 970X more efficient than the full order model, while maintaining excellent accuracy in both the mean and statistics of the field.
Key words.
Projection-based model reduction, Galerkin, POD, elastic shear waves, memory bandwidth bound, compute bound, BLAS, HPC, generic programming, Kokkos, linear time-invariant dynamical systems, many-core, GPUs, C++.
AMS subject classifications.
1. Introduction.
The “extreme-scale” computing era we are living in—also referred to by many as the dawn of exascale computing—is enabling a shift in the paradigm within which we approach new problems. We no longer target few simulations executed for specific choices of parameters and conditions, but are increasingly interested in exploiting high-fidelity models to generate ensembles of runs to tackle many-query problems, such as uncertainty quantification (UQ). These typically require large sets of runs to adequately characterize the effect of uncertainties in parameters and operating conditions on the system response. Such an approach is key to solving, e.g., design optimization and parameter-space exploration problems.

If the system of interest is computationally expensive to query—for example, very high-fidelity models for which a single run can consume days or weeks on a supercomputer—its use for many-query problems is impractical or even impossible. Consequently, analysts often turn to surrogate models, which replace the high-fidelity model with a lower-cost, lower-fidelity counterpart. To be useful, surrogate models should meet accuracy, speed, and certification requirements. Accuracy ensures that the surrogate produces a sufficiently small error in target quantities of interest.

∗ Submitted to the editors September 5, 2020.
† NexGen Analytics, 30 N Gould St. Ste 5912, Sheridan, WY, 82801, USA ([email protected], [email protected]).
‡ Department of Extreme-Scale Data Science and Analytics, Sandia National Laboratories, Livermore, CA, USA ([email protected], [email protected]).
§ These authors contributed equally to this work.
¶ Department of Thermal Sciences & Engineering, Sandia National Laboratories, Albuquerque, NM, USA ([email protected]).
The maximum acceptable error is typically defined by the user and is problem-dependent. Speed ensures that the surrogate evaluates much more rapidly than the full-order model, and one typically has to consider a tradeoff between speed and accuracy. Certification ensures that the error (and its bounds) introduced by the surrogate can be properly quantified and characterized.

Broadly speaking, surrogate models fall under three categories, namely (a) data fits, which construct an explicit mapping (e.g., using polynomials, Gaussian processes) from the system’s parameters (i.e., inputs) to the system response of interest (i.e., outputs), (b) lower-fidelity models, which simplify the high-fidelity model (e.g., by coarsening the mesh, employing a lower finite-element order, or neglecting physics), and (c) projection-based reduced-order models (ROMs), which reduce the number of degrees of freedom in the high-fidelity model through a projection process. The main advantage of ROMs is that they apply a projection process directly to the equations governing the high-fidelity model, thus enabling stronger guarantees (e.g., of structure preservation, of accuracy via adaptivity) and more accurate a posteriori error analysis (e.g., via a posteriori error bounds or error models). We can identify two main research branches within the field of ROMs, one addressing nonlinear systems and the other focusing on linear systems. In this work, we focus on the latter, more specifically linear time-invariant (LTI) dynamical systems, which are linear in state but have an arbitrary nonlinear parametric dependence.

Projection-based model reduction of LTI systems is a mature field in terms of methodological and algorithmic development [6, 4], but we argue that it lags behind in terms of implementation and computational advancement.
On the one hand, numerous techniques have been developed to construct ROMs accounting for, e.g., observability and controllability [28, 44, 29], H-optimality [13, 45, 17], and non-affine parametric dependence [7, 12], and to certify these ROMs via a posteriori error analysis [33, 35, 36]. As a result, it is possible to construct stable, accurate, and certified ROMs for a wide class of LTI systems. On the other hand, the computational aspects of solving these systems efficiently, especially in the context of many-query problems, have not achieved a comparable level of maturity. This statement is grounded on the fact that the standard formulation of ROMs for LTI dynamical systems expresses the state as a rank-1 tensor (i.e., a vector) [6, 4], which makes the corresponding computational kernels memory bandwidth bound, and therefore not well-suited for modern multi-core processors or accelerators.

This work presents our contribution towards improving the computational efficiency of ROMs for LTI systems. We present and demonstrate a reformulation of the ROM problem for LTI dynamical systems such that we change its nature from memory bandwidth bound to compute bound, making it more suitable for modern multi- and many-core computing nodes. More specifically, we believe this work makes the following contributions:
• a new ROM formulation, referred to as “rank-2 Galerkin”, for LTI dynamical systems that is efficient and scalable for many-query problems;
• comprehensive numerical results demonstrating its performance and scaling;
• a new open-source C++ code, developed from the ground up using the performance-portable Kokkos programming model, to simulate the evolution of seismic shear waves in an axisymmetric domain;
• detailed numerical examples based on the shear wave problem to demonstrate the rank-2 Galerkin ROM (which, to the best of our knowledge, also constitutes the first application of ROMs to seismic shear waves).

The remainder of this paper is organized as follows. In §2, we present the formulation and discuss the advantages and disadvantages of various Galerkin ROM formulations, in §3 we describe the shear wave test case, and in §4 we present the numerical results.
2. Mathematical Formulation.
We focus on problems expressible in semi-discrete form as

(2.1)  dx/dt (t; η, µ) = A(η) x(t; η, µ) + f(t; η, µ),

where x : [0, t_final] × D_η × D_µ → R^N is the state, x(0; η, µ) = x₀ is the initial state, A(η) ∈ R^{N×N} is the discrete system matrix parametrized by η ∈ D_η ⊂ R^{N_η}, f : [0, t_final] × D_η × D_µ → R^N is the time-dependent forcing term, also parametrized by µ ∈ D_µ ⊂ R^{N_µ}, and t_final > 0 is the final time. We consider N to be large, e.g., hundreds of thousands or more, which is a suitable assumption for real applications. Since the state is a rank-1 tensor (i.e., a vector), hereafter we refer to a formulation of the form (2.1) as the rank-1 full-order model (Rank1FOM).

The above formulation is linear in the state, but can have an arbitrary nonlinear parametric dependence. No assumption is made on the original problem leading to (2.1), thus making it applicable to systems obtained from the spatial discretization of a partial differential equation (PDE) or problems that are inherently discrete. We assume the discrete system matrix, A(η), to be sparse, with its sparsity pattern depending on the problem and chosen discretization. We purposefully split the parametric dependence into η and µ to highlight their different roles: µ includes only the parameters that impact the forcing, while η includes all the others. This separation will be helpful for the formulations in the sections below.

A wide range of problems in science and engineering have governing equations of the form (2.1). For example, the dynamics of a deforming structure can often be modeled as linear, but the load distribution on the structure can be parametrized by nonlinear functions. This is typical in the field of aeroelasticity, where the aircraft structure is modeled by a linear PDE and the aerodynamic loads on the aircraft are highly nonlinear.
Similarly, the temperature of a structure may often be modeled with the linear heat equation and boundary conditions set by nonlinear temperature distributions and/or heat loads. Nonlinear boundary conditions commonly arise in a number of applications including electronics cooling, building thermal management, and industrial heat exchanger design. Neutral particle (neutron, photon, etc.) transport, when simulated via deterministic methods, often gives rise to linear systems of this form. Acoustic waves are also modeled with a linear PDE, but can have a number of nonlinear sources, most notably turbulent shear layers that arise in the wakes of cars, aircraft, and other moving objects. Linear circuit models are ubiquitous in electrical engineering and evolve in time according to ODEs of the form (2.1), while commonly being driven by complex nonlinear forcing functions.

Projection-based ROMs generate approximate solutions of the full-order model (2.1) by (a) restricting the state to live in a low-dimensional (affine) trial subspace, and (b) enforcing the residual of the ODE to be orthogonal to a low-dimensional test subspace. In this work, we limit the scope to the Galerkin ROM, which is arguably the most popular methodology. The Galerkin ROM generates approximate solutions x̃(t; η, µ) ≈ x(t; η, µ) in a low-dimensional affine trial subspace of dimension K ≪ N, i.e., x̃(t; η, µ) ∈ V + x_ref, with dim(V) = K and where x_ref ∈ R^N defines the affine offset (also called the reference state). In this work, we take V = Range(Φ), where Φ ≡ [φ₁ ··· φ_K] comprises K orthonormal basis vectors. Such a basis may be obtained by proper orthogonal decomposition, for example.
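To make the last point concrete, the following sketch (ours, not the paper’s C++/Kokkos implementation; function and variable names are hypothetical) builds an orthonormal basis Φ from a snapshot matrix via the thin SVD, which is the standard way to compute POD modes:

```python
import numpy as np

def pod_basis(snapshots, K):
    """Return the K leading POD modes of an N x S snapshot matrix."""
    # Thin SVD: the columns of U are orthonormal left-singular vectors,
    # ordered by decreasing singular value (i.e., captured energy).
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :K]

rng = np.random.default_rng(0)
snapshots = rng.standard_normal((200, 30))   # 30 state snapshots, N = 200
Phi = pod_basis(snapshots, K=5)

# The basis is orthonormal: Phi^T Phi = I, as required by the Galerkin ROM.
assert np.allclose(Phi.T @ Phi, np.eye(5))
```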
For t ∈ [0, t_final], η ∈ D_η, µ ∈ D_µ, the FOM solution is thus approximated as

(2.2)  x(t; η, µ) ≈ x̃(t; η, µ) = Φ x̂(t; η, µ) + x_ref,

where x̂ : [0, t_final] × D_η × D_µ → R^K are referred to as the generalized coordinates. Equipped with (2.2), the Galerkin ROM proceeds by restricting the residual to be orthogonal to the trial space, yielding the following reduced-order model

(2.3)  dx̂/dt (t; η, µ) = Â(η) x̂(t; η, µ) + Φ^T f(t; η, µ) + Φ^T A(η) x_ref,

where Â(η) ≡ Φ^T A(η) Φ ∈ R^{K×K} is the reduced (dense) system matrix, and we used the fact that Φ^T Φ = I. When the basis is not orthogonal, the transpose operations in (2.3) should be replaced with the pseudo-inverse. Note that the affine offset, x_ref, is introduced for generality, but it is not always necessary: when unused or null, it simply drops out of the formulation. Hereafter we refer to a system of the form (2.3) as the rank-1 Galerkin ROM (Rank1Galerkin).

The last term on the right in (2.3), Φ^T A(η) x_ref, can be efficiently evaluated since it is time-independent and only needs to be computed once for a given choice of η. The term Φ^T f(t; η, µ), despite being seemingly more challenging because of its time-dependence, can also be computed efficiently by considering two scenarios, namely one where f ∈ R^N is a dense vector and one where it is sparse, with just a few non-zero elements. In the former case, since K is sufficiently small, the best approach would be to precompute and store the product Φ^T f(t; η, µ) for all the target times over [0, t_final]. For the other scenario, i.e., a sparse forcing, one can exploit its sparsity pattern at each time t by operating only on the corresponding rows of Φ. This allows one to avoid precomputing the forcing at all times while maintaining computational efficiency, since only a few elements must be operated on.
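The precomputations described above can be sketched as follows (an illustrative Python sketch with names of our choosing, not the paper’s C++ code); note in particular how a sparse forcing is projected by touching only the rows of Φ matching its non-zero pattern:

```python
import numpy as np

def assemble_rank1_galerkin(Phi, A, x_ref):
    # Reduced operator and time-independent offset term of Eq. (2.3),
    # both computed once per choice of the parameter eta.
    A_hat = Phi.T @ (A @ Phi)    # K x K dense reduced matrix
    c = Phi.T @ (A @ x_ref)      # Phi^T A x_ref, length K
    return A_hat, c

def project_sparse_forcing(Phi, rows, vals):
    # Equivalent to Phi^T f when f is zero except at indices 'rows'.
    return Phi[rows, :].T @ vals

rng = np.random.default_rng(1)
N, K = 300, 8
Phi, _ = np.linalg.qr(rng.standard_normal((N, K)))   # orthonormal basis
A = rng.standard_normal((N, N))
x_ref = rng.standard_normal(N)
A_hat, c = assemble_rank1_galerkin(Phi, A, x_ref)

# Sparse forcing with only 3 non-zero entries.
rows, vals = np.array([4, 17, 250]), np.array([1.0, -2.0, 0.5])
f = np.zeros(N); f[rows] = vals
assert np.allclose(project_sparse_forcing(Phi, rows, vals), Phi.T @ f)
```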
The Rank1FOM and Rank1Galerkin formulations are suitable when the objective is to run a few individual cases, but become inefficient for many-query scenarios involving large ensembles of runs on modern architectures. The reason is that computing the right-hand side of both Rank1FOM and Rank1Galerkin requires memory bandwidth-bound kernels, which impede an optimal/full utilization of a node’s computing resources, especially on modern multi- and many-core computer architectures [16, 47, 11]. A partial solution would be to run the simulations in parallel, as demonstrated in [46], but the individual runs in this approach would still be of the same nature.

Indeed, the Rank1FOM in (2.1) is characterized by a standard sparse-matrix vector (spmv) product, which is well-known to be memory bandwidth bound due to its low compute intensity regardless of its sparsity pattern, see e.g. [5, 11]. Defining the computational intensity, I, of a kernel as the ratio between flops and memory accesses (bytes), the spmv kernel has approximately I ≈ nnz/(6N + 2 + 10 nnz), where nnz is the number of non-zero elements. For the Rank1Galerkin in (2.3), the defining term is the product Â(η) x̂(t; η, µ) of the reduced system matrix with the vector of generalized coordinates, which requires a dense general matrix-vector (gemv) product. This kernel is also memory bandwidth bound [30], with an approximate computational intensity I ≈ 1/4 when the matrix is square as in (2.3).
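Plugging numbers into these estimates makes the gap explicit (the constants follow the rough models quoted above; exact values depend on precision and storage format):

```python
def spmv_intensity(N, nnz):
    # Rough flops-per-byte estimate for sparse matrix-vector product.
    return nnz / (6 * N + 2 + 10 * nnz)

GEMV_INTENSITY = 1.0 / 4.0   # square dense matrix-vector, rough estimate

# A sparse matrix with ~4 non-zeros per row, as in the shear wave example:
N = 1_000_000
I_spmv = spmv_intensity(N, 4 * N)
print(f"spmv intensity ~ {I_spmv:.3f} flops/byte")
print(f"gemv intensity ~ {GEMV_INTENSITY:.3f} flops/byte")

# Both sit far below the flops/byte ratio a modern CPU needs to be
# compute bound, so both kernels are memory bandwidth bound.
assert I_spmv < 0.1 and GEMV_INTENSITY < 1.0
```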
In light of the previous discussion, we now present an alternative formulation and implementation that is computationally more efficient for many-query problems with respect to the parameter µ on modern many-core architectures. Let X : [0, t_final] × D_η × D_µ^M → R^{N×M} represent a set of M trajectories for the FOM

X(t; η, M) ≡ [x₁(t; η, µ₁) ··· x_M(t; η, µ_M)],

for a given choice of η and where M = {µ₁, ..., µ_M} is the set of parameters defining the M forcing realizations driving the trajectories of interest. Similarly to (2.1), we can express the dynamics of these FOM trajectories as

(2.4)  dX/dt (t; η, M) = A(η) X(t; η, M) + F(t; η, M),

where X(0; η, M) = [x₁(0; η, µ₁) ··· x_M(0; η, µ_M)] represents the initial condition, and the forcing contribution is F(t; η, M) ≡ [f(t; η, µ₁) ··· f(t; η, µ_M)]. We refer to a system of the form (2.4) as the rank-2 FOM (Rank2FOM). The Rank1FOM introduced in (2.1) can be seen as a special case of Eq. (2.4) with M = 1.

To derive the corresponding Galerkin ROM, we approximate the state as

X(t; η, M) ≈ X̃(t; η, M) ≡ Φ X̂(t; η, M) + X_ref,

where X_ref ≡ [x_ref(η, µ₁) ··· x_ref(η, µ_M)] ∈ R^{N×M}, and apply Galerkin projection to obtain

(2.5)  dX̂/dt (t; η, M) = Â(η) X̂(t; η, M) + Φ^T F(t; η, M) + Φ^T A(η) X_ref,

where X̂ : [0, t_final] × D_η × D_µ^M → R^{K×M} is a rank-2 tensor of time-dependent generalized coordinates. Hereafter we refer to (2.5) as the rank-2 Galerkin ROM (Rank2Galerkin). The term Φ^T A(η) X_ref can be efficiently evaluated since it is time-independent and only needs to be computed once for a given choice of η. For the term Φ^T F(t; η, M), involving the forcing, considerations similar to those drawn for the rank-1 Galerkin in Eq. (2.3) can be made.
The term involving the system matrix is discussed below in more detail.

The key question we pose at this point is: what advantage, if any, is gained by employing the rank-2 formulation of the FOM and Galerkin ROM over the rank-1 alternatives? The answer is that, when evaluated in a many-query context, Rank2FOM has minimal advantages over Rank1FOM, whereas Rank2Galerkin has major benefits over Rank1Galerkin. We will demonstrate this numerically in §4. Indeed, the Rank2FOM in (2.4) is characterized by the term A(η) X(t; η, M), requiring a sparse-matrix matrix (spmm) kernel which, despite having a compute intensity higher than spmv, remains memory bandwidth bound, see e.g. [2, 15]. This implies that Rank2FOM yields some but limited improvements over Rank1FOM. On the other hand, the Rank2Galerkin in (2.5) has a major advantage over Rank1Galerkin because it changes the nature of the problem from memory bandwidth to compute bound. This stems from the fact that computing the term Â(η) X̂(t; η, M) on the right-hand side of Rank2Galerkin in (2.5) involves a dense general matrix-matrix (gemm) product, which is one of the most studied kernels in dense linear algebra and is known to be compute bound, see e.g. [30]. The approximate computational intensity for gemm is I ≈ K/16, since the reduced system matrix in Eq. (2.5) is square with size K × K. This makes the rank-2 Galerkin ROM very well-suited for modern multi- and many-core computing nodes, where a high computational intensity is critical for achieving good scaling and efficiency. To the best of our knowledge, this is the first work introducing this perspective and computational approach to ROMs.

In addition to LTI problems of the form (2.1), the rank-2 formulation may yield computational gains in other reduced-order modeling contexts, e.g., linearized systems (sensitivity and stability analysis), gradient-based solvers for nonlinear ROMs, and nonlinear ROMs with polynomial nonlinearities [22]. We believe these are exciting future research directions, but they are outside the scope of the present manuscript.
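The computational point can be seen in a few lines (an illustrative sketch of ours; in the paper the corresponding kernels are C++/Kokkos calls into BLAS): looping gemv over M trajectories and a single gemm produce identical results, but the gemm reuses Â across all M columns, which is exactly what raises the arithmetic intensity.

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 64, 16
A_hat = rng.standard_normal((K, K))   # reduced system matrix
X_hat = rng.standard_normal((K, M))   # M columns of generalized coordinates

# Rank-1 view: M independent matrix-vector products (gemv).
Y_rank1 = np.column_stack([A_hat @ X_hat[:, j] for j in range(M)])

# Rank-2 view: one matrix-matrix product (gemm) over all trajectories.
Y_rank2 = A_hat @ X_hat

assert np.allclose(Y_rank1, Y_rank2)
```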
The rank-2 formulation illustrated above enables the computation of multiple trajectories driven by different realizations of the forcing term given by different choices of µ, but a fixed choice of η. This can be generalized to the case where the state and forcing are rank-3 tensors. Such an approach would enable computing multiple realizations of η and µ simultaneously. In this case, however, a key obstacle to overcome would be the efficient assembly of the reduced system matrices for multiple realizations of η. The feasibility of this strictly depends on the parameterization of the system matrix. If the parameters can be decoupled from the matrix, then the reduced system matrices may be computed efficiently. If this decoupling is not possible, one could leverage interpolation methods applied to the system matrix [3]. This rank-3 formulation would open up interesting opportunities from a computational standpoint, since one can draw ideas from the advances on tensor algebra taking place within the deep learning community, where rank-3 tensors are at the core of formulations. Of particular interest are algorithmic developments for batched matrix multiplication kernels [37, 1, 25, 20] and hardware innovations such as tensor cores [26] and tensor processing units [21]. Since this is outside the scope of this work, we omit a full discussion and reserve it for a future work.
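As a minimal illustration of the batched idea (ours; the rank-3 formulation itself is left to future work above), NumPy’s matmul broadcasts over a leading batch axis, mirroring the batched gemm kernels cited:

```python
import numpy as np

rng = np.random.default_rng(3)
B, K, M = 4, 32, 8                        # B realizations of eta
A_hats = rng.standard_normal((B, K, K))   # one reduced matrix per eta
X_hats = rng.standard_normal((B, K, M))   # rank-3 tensor of coordinates

# Batched matrix multiply: each slice b computes A_hats[b] @ X_hats[b],
# the kind of workload batched gemm kernels and tensor cores target.
Y = A_hats @ X_hats

assert Y.shape == (B, K, M)
assert np.allclose(Y[1], A_hats[1] @ X_hats[1])
```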
3. Test Case Description.
As a test case for our numerical experiments, we choose the propagation of global seismic shear waves in spherical coordinates. Shear waves are also called S-waves (or secondary) because they arrive after P-waves (or primary). The main difference between them is that S-waves are transversal (particles oscillate perpendicularly to the direction of wave propagation), while P-waves are longitudinal (particles oscillate in the same direction as the wave). Both P- and S-waves are body waves, because they travel through the interior of the earth and their trajectories are affected by the material properties, namely density and stiffness.

We specifically focus on the axisymmetric evolution of elastic seismic shear waves. The governing equations in the velocity-stress formulation [18] can be written as:

(3.1)
ρ(r, θ) ∂v/∂t (r, θ, t) = ∂σ_rφ/∂r (r, θ, t) + (1/r) ∂σ_θφ/∂θ (r, θ, t) + (1/r) (3 σ_rφ(r, θ, t) + 2 σ_θφ(r, θ, t) cot θ) + f(r, θ, t)
∂σ_rφ/∂t (r, θ, t) = G(r, θ) ( ∂v/∂r (r, θ, t) − (1/r) v(r, θ, t) )
∂σ_θφ/∂t (r, θ, t) = G(r, θ) ( (1/r) ∂v/∂θ (r, θ, t) − (cot θ / r) v(r, θ, t) ),

where t represents time, r ∈ [0, r_earth] is the radial distance bounded by the radius of the earth, θ ∈ [0, π] is the polar angle, ρ(r, θ) is the density, v(r, θ, t) is the velocity, σ_rφ(r, θ, t) and σ_θφ(r, θ, t) are the two components of the stress tensor remaining after the axisymmetric approximation, f(r, θ, t) is the forcing term, and the shear modulus is G(r, θ) = v_s²(r, θ) ρ(r, θ), with v_s being the shear wave velocity. Note that we assume both the density and shear modulus to only depend on the spatial coordinates.
Such a formulation is referred to as 2.5-dimensional because it involves a 2-dimensional spatial domain (a circular sector of the earth) but models point sources with correct 3-dimensional spreading [18].

This is an application of interest to geological and, in particular, seismic modeling. Previous studies focusing on shear wave modeling can be found, e.g., in [18, 9, 19, 43] and references therein. Since shear waves cannot propagate in liquids, the system of equations in (3.1) is not applicable to the core region of the earth. Therefore, we solve Eq. (3.1) in the region bounded between the core-mantle boundary (CMB) located at r_cmb = 3,480 km, and the earth surface located at r_earth = 6,371 km. We discretize the 2D spatial domain using a staggered grid, as shown in Figure 1 (a). Staggered grids are typical for seismic modeling and wave problems in general, see [19, 41, 42]. We use a second-order centered finite-difference method for the spatial operators in Eq. (3.1), which is sufficient for the purposes of this study.

The Rank2FOM formulation of Eq. (3.1) can be written as

(3.2)
dX_v/dt (t; η, M) = A_v(η) X_σ(t; η, M) + F_v(t; η, M)
dX_σ/dt (t; η, M) = A_σ(η) X_v(t; η, M),

where X_v ∈ R^{N_v×M} is the rank-2 state tensor for the velocity degrees of freedom, X_σ ∈ R^{N_σ×M} is the rank-2 state tensor for the stress degrees of freedom, A_v(η) ∈ R^{N_v×N_σ} is the discrete system matrix for the velocity, A_σ(η) ∈ R^{N_σ×N_v} is the one for the stresses, and F_v(t; η, M) is the rank-2 forcing tensor. In this work, the finite-difference scheme adopted leads to system matrices A_v(η) and A_σ(η) with about four and two non-zero entries per row, respectively. The corresponding Rank1FOM can be easily obtained from (3.2) by setting M = 1.

We use the following boundary conditions, similarly to [19]. Directly at the symmetry axis, i.e., θ = 0 and θ = π, the velocity is set to zero since it is undefined there due to the cotangent term in its governing equation. This implies that at the symmetry axis the stress σ_rφ is also zero. At the core-mantle boundary and earth surface we impose a free surface boundary condition (i.e., waves fully reflect), by setting the zero-stress condition σ_rφ(r_cmb, θ, t) = σ_rφ(r_earth, θ, t) = 0. Note that this condition on the stress implicitly defines the velocity at the core-mantle boundary and earth surface and, therefore, no boundary condition on the velocity itself must be set there.
We remark that, differently from [19], we do not rely on ghost points to impose boundary conditions, because these are already accounted for when assembling the system matrices A_v(η) and A_σ(η).

Fig. 1. (a): schematic of the axisymmetric domain and staggered grid used for simulating the shear wave system in Eq. (3.1), where the red-filled circles represent grid points where the velocity v is defined, the gray-filled squares represent those for the stress σ_rφ, and the gray-filled triangles are those for the stress σ_θφ. (b): profiles of density ρ and shear velocity v_s as a function of the radial distance down to the core-mantle boundary as defined by the PREM model. The shaded gray region represents the simulation domain.

The Rank2Galerkin formulation of the system in Eq. (3.2) can be expressed as

(3.3)
dX̂_v/dt (t; η, M) = Â_v(η) X̂_σ(t; η, M) + Φ_v^T F_v(t; η, M)
dX̂_σ/dt (t; η, M) = Â_σ(η) X̂_v(t; η, M),

where Â_v(η) ≡ Φ_v^T A_v(η) Φ_σ ∈ R^{K_v×K_σ} is the reduced system matrix for the velocity, Â_σ(η) ≡ Φ_σ^T A_σ(η) Φ_v ∈ R^{K_σ×K_v} is the one for the stresses, and Φ_v and Φ_σ are the orthonormal (i.e., Φ_v^T Φ_v = I, and the same for Φ_σ) bases for the velocity and stresses, respectively. This enables, by default, a physics-based separation of the degrees of freedom, which inherently have considerably disparate scales. This benefits the computation of the POD modes for the ROM, because they can be computed independently for velocity and stresses, and no specific scaling is required. Furthermore, it allows one to potentially use different basis sizes to represent the velocity and stress degrees of freedom. Note that we omitted the reference state because we do not use it here.
The corresponding Rank1Galerkin can be obtained by setting M = 1.

For time integration of both the FOM and ROM, we use a leapfrog integrator, where the velocity field is updated first, followed by the stress update. This is a commonly used scheme for classical mechanics because it is time-reversible and symplectic. The initial conditions consist of zero velocity and stresses.

To the best of our knowledge, this work provides the first example of ROMs applied to elastic shear wave simulations. However, it is important to acknowledge that the application of ROMs to wave propagation is not new, and we refer the reader to [32, 31]. These two studies focused on illustrating the feasibility and quantifying the accuracy of ROMs for acoustic wave simulations. Of particular interest is the discussion presented in [31] on various techniques to make ROMs feasible for three-dimensional acoustic simulations, a non-trivial task due to the size of the problem. In a future study, one could explore, for example, the combination of some of the methods presented in [32] with our rank-2 formulation.

We implemented the rank-1 and rank-2 formulations of both the FOM and ROM for the shear wave problem above in C++ using the Kokkos programming model [8]. The repository is publicly available at github.com/fnrizzi/ElasticShearWaves. We rely on Kokkos to enable performance portability.
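The leapfrog time loop described above can be sketched as follows (our illustrative Python sketch of the rank-2 Galerkin system (3.3), not the repository’s C++/Kokkos code; the reduced operators and forcing here are placeholder random data):

```python
import numpy as np

def leapfrog_rank2(Av_hat, As_hat, proj_forcing, Xv, Xs, dt, n_steps):
    # Staggered update mirroring Eq. (3.3): velocity coordinates first,
    # then stress coordinates using the freshly updated velocity.
    for n in range(n_steps):
        Xv = Xv + dt * (Av_hat @ Xs + proj_forcing(n * dt))
        Xs = Xs + dt * (As_hat @ Xv)
    return Xv, Xs

rng = np.random.default_rng(4)
Kv, Ks, M = 20, 24, 8
Av_hat = rng.standard_normal((Kv, Ks))            # placeholder reduced operators
As_hat = rng.standard_normal((Ks, Kv))
forcing = lambda t: 1e-3 * np.ones((Kv, M))       # placeholder projected forcing
Xv, Xs = np.zeros((Kv, M)), np.zeros((Ks, M))     # zero initial conditions

Xv, Xs = leapfrog_rank2(Av_hat, As_hat, forcing, Xv, Xs, dt=1e-3, n_steps=10)
assert Xv.shape == (Kv, M) and Xs.shape == (Ks, M)
```

Note how all M trajectories advance together: each step is two gemm calls on the reduced operators, which is the compute-bound structure the formulation is designed around.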
We remark that these are 1D models in the sense that they only depend on the radial distance. For the single-layer model, one needs to set the radial profile of the density and shear wave velocity. For the bilayer model, one needs to set the depth of the discontinuity separating the layers, and profiles within each layer. The single and bilayer models are useful for testing and exploratory cases, but are obviously not fully representative of the earth structure. The PREM and ak135f are two of the most commonly used models. These models have several discontinuities and complex profiles describing the material properties. Figure 1 (b) shows the profiles of the material properties as defined by the PREM model. We designed the code in a modular fashion such that adding new material models or extending existing ones is straightforward.

The forcing term has two main defining aspects, the form of the excitation signal and the location at which it acts. We support two kinds of signals commonly used for seismic modeling, namely a Ricker wavelet and the first derivative of a Gaussian [34]. The main parameter to set for these signals is the period, T, which is a key property to explore for characterizing seismic events. Similar to the design of the material model, our implementation is easily extended to include other kinds of signals. As for the location, due to the axisymmetric approximation, the seismic source cannot be placed exactly on the symmetry axis [19], i.e., θ = 0 and θ = π. Therefore, for a given depth, it is placed at the first velocity grid point such that θ > 0.
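For reference, one common parameterization of the Ricker wavelet in terms of a dominant period T is sketched below (ours; the paper’s exact convention for the signal may differ):

```python
import math

def ricker(t, T, t0=0.0):
    # Ricker wavelet with dominant period T, centered at time t0.
    # One common parameterization; conventions vary across codes.
    a = (math.pi * (t - t0) / T) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)

# Peak value of 1 at the center time, and symmetric about it.
assert abs(ricker(10.0, T=25.0, t0=10.0) - 1.0) < 1e-12
assert abs(ricker(12.0, T=25.0, t0=10.0) - ricker(8.0, T=25.0, t0=10.0)) < 1e-12
```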
4. Results.
The results presented below are obtained with a machine containing two 18-core Intel(R) Xeon(R) Gold 6154 CPUs @ 3.00GHz, each with a 24.75MB L3 cache, and 125GB total memory. If hyperthreading is enabled, the number of logical threads is 36 per CPU, yielding 72 total threads. We use GCC-8.3.1 and rely on kokkos and kokkos-kernels version 3.1.01. We use Blis-0.7.0 [40, 39] as the kokkos-kernels’ backend for all dense operations. In the present work we only run on CPUs, enabling OpenMP as the default Kokkos execution space. A more detailed study encompassing the role of the compiler and architecture (e.g., running on GPUs) is left for a future work. The results are divided into four parts; scaling and performance results obtained for the FOM are presented first.

http://rses.anu.edu.au/seismology/ak135/ak135f.html
https://seismo.berkeley.edu/wiki_cider/REFERENCE_MODELS

In this section, we present scaling and performance tests obtained for the shear wave simulations using the Rank1FOM and Rank2FOM, see Eq. (3.2). For brevity, we omit the full details of the model and physical parameters used to carry out these FOM numerical experiments and refer the reader to the supplemental material. For this analysis the physical details are not important, because they only come into play during the preprocessing stage and, therefore, have no impact on the performance.

We consider M ∈ {…}, thread counts 2, …, 72, and total degrees of freedom N ∈ {…}. These values of N are the total degrees of freedom originating from choosing grids of 256 × … for the velocity. One should choose M considering its trade-off with the amount of memory required. Indeed, if one needs to save state data very frequently, using large values of M can yield a very large memory utilization for the FOM problem. Here, to make this FOM analysis feasible for M ≤ 48, and since we are not interested in the physical data, we disable all input/output related to saving data.

For the Rank2FOM, we use a row-major ordering for the state matrix. This choice, given the OpenMP execution space chosen for Kokkos, yielded a performance superior to using column-major ordering. For the sake of clarity, we remark that for the Rank1FOM we do not need to choose the memory layout, because a rank-1 state is just a 1D array. For both Rank1FOM and Rank2FOM we use the following OpenMP affinity:
OMP_PLACES=threads and OMP_PROC_BIND=spread.

Figure 2 shows the computational throughput (GigaFLOPS) in panel (a), the memory bandwidth (GB/s) in panel (b), and the average time (milliseconds) per time step in panel (c), for a representative subset of values of thread counts and M. We make the following observations. First, Figures 2 (a,b) show that as the problem size increases and the thread count increases to 36, the computational throughput plateaus around 10 GFlops (which is in line with other typical sparse kernels [24]) and the memory bandwidth around about 68 GB/s, which is close to the max bandwidth (∼
85 GB/s) of the machine in its current configuration. This indicatesa good performance, and confirms that the problem is memory bandwidth bound.Second, panel (b) shows that for the smallest problem size ( N = 0 . e N , and thread count, we observe thatthe performance improves if we use the rank-2 implementation. This is because ofthe higher arithmetic intensity of the spmm kernel in the Rank2FOM compared to spmv in the Rank1FOM. For the sake of the argument, consider the problem size N = 12 e M = 16allows us to simultaneously compute sixteen trajectories for only a seven-fold increasein the iteration time with respect to M = 1. The plots also seem to suggest thatthe benefit of simulating multiple trajectories at the same time, i.e., using M ≥
2, isevident when going from M = 1 to M = 4, but then plateaus, suggesting that thereis a limiting, problem-dependent value of M after which no major gain is obtained.Fourth, panel (c) allows us to assess the strong scaling behavior of the FOM problem.If we fix N and M , we observe a good scaling from 2 to 8 threads, but a degradationas we go from 8 to 36. This is expected and explained as follows: since the problemis memory bandwidth bound, and 8 threads are nearly enough for the computationalkernels to saturate the memory access (see panel (b)), one cannot expect a substantialimprovement in the performance if we increase the thread count. This section presents scaling and per-formance results for the rank-1 and rank-2 Galerkin ROMs introduced in Eq.(3.3).
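The spmv/spmm distinction driving the FOM observations above can be illustrated with a minimal, plain-C++ CSR sketch (illustrative only; the paper's implementation relies on Kokkos and kokkos-kernels). In spmv each nonzero of the matrix is read from memory and used once, while in spmm it is read once and reused M times, which raises the arithmetic intensity:

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR storage; illustrative, not the paper's Kokkos-based data structure.
struct Csr {
  std::vector<std::size_t> rowPtr;  // size nRows+1
  std::vector<std::size_t> colInd;  // size nnz
  std::vector<double> val;          // size nnz
};

// Rank-1 kernel, y = A x (spmv): each nonzero is loaded once and used once.
void spmv(const Csr& A, const std::vector<double>& x, std::vector<double>& y) {
  const std::size_t nRows = A.rowPtr.size() - 1;
  for (std::size_t i = 0; i < nRows; ++i) {
    double sum = 0.0;
    for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
      sum += A.val[k] * x[A.colInd[k]];
    y[i] = sum;
  }
}

// Rank-2 kernel, Y = A X (spmm), with X and Y stored row-major with M columns:
// each nonzero is loaded once but used M times, so flops-per-byte grows with M.
void spmm(const Csr& A, const std::vector<double>& X, std::vector<double>& Y,
          std::size_t M) {
  const std::size_t nRows = A.rowPtr.size() - 1;
  for (std::size_t i = 0; i < nRows; ++i) {
    for (std::size_t j = 0; j < M; ++j) Y[i * M + j] = 0.0;
    for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
      const double a = A.val[k];
      const std::size_t c = A.colInd[k];
      for (std::size_t j = 0; j < M; ++j) Y[i * M + j] += a * X[c * M + j];
    }
  }
}
```

In the rank-2 FOM, the same sparse operator advances all M trajectories at once, which is exactly the spmm access pattern above.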
Fig. 2. Performance results obtained for the full-order problem in Eq. (3.2) showing (a) the computational throughput (GFlops), (b) the memory bandwidth (GB/s), and (c) the average time (milliseconds) per time step, for various thread counts, problem sizes, and forcing sizes M. The limits of the vertical axes are the same as those used in Figure 3 to facilitate a visual comparison.
To carry out this analysis, we make the following simplifications: (a) we use random data to fill the reduced operators in Eq. (3.3) (a demonstration and discussion of the Galerkin ROM accuracy on a real shear wave problem is presented below); (b) we use a single size, K, for the velocity and stresses generalized coordinates, i.e., K_v = K_σ = K. We consider ROM sizes K ∈ {…}, thread counts 1, …, 72, and M ∈ {…}. Recall that the case M = 1 corresponds to the Rank1Galerkin, while M ≥ 2 corresponds to the Rank2Galerkin. These values of M, despite seeming large, are all feasible for the ROM problem thanks to the small state dimensionality. To give an example, let us consider a Galerkin ROM problem of the form Eq. (3.3) with the largest size, i.e., K = 4096, and suppose we collect a total of 500 reduced states. If we use M = 1000, which corresponds to simulating 1000 forcing realizations simultaneously, and considering that we have one equation for the velocity and one for the stresses, it would require only ≈ 65 MB for the generalized coordinates, ≈ 268 MB for the reduced system matrices, and ≈ 33 GB to store all the state snapshots. This is well within the memory capacity of modern computing nodes. Note that we are not accounting for the size of the basis, because the reduced operators can all be precomputed offline.

For both Rank1Galerkin and Rank2Galerkin we use a column-major layout for all the operators; this is suitable for interfacing with the BLAS library used by the Kokkos Kernels to execute all the dense kernels. All runs are completed using the following OpenMP affinity:
OMP_PLACES=cores and OMP_PROC_BIND=true.

Figure 3 shows the computational throughput (GigaFLOPS) in panel (a), the memory bandwidth (GB/s) in panel (b), and the average time (milliseconds) per time step in panel (c), for a representative subset of thread counts and values of M. We make the following observations. First, Figures 3(a,b) clearly show that Rank1Galerkin is a memory bandwidth-bound problem, while Rank2Galerkin is a compute-bound problem. This is evident because the case M = 1 yields a relatively limited throughput, namely O(10) GFlops, and increasing the ROM size K or the number of threads does not yield substantial improvement. On the contrary, when M ≥ 16, using more threads improves the throughput, which reaches a maximum of 1 TFlops when the cores are fully utilized. Compared to the GFlops obtained for M = 1, the throughput for M = 16 is one order of magnitude larger, and increases to two orders of magnitude for larger M. Second, the throughput saturates around M = 512, and no major improvement is obtained using M = 1024. For example, for K = 2048, using M = 512 and 36 threads yields about 1 TFlops, which remains the same for M = 1024. Third, panel (c) allows us to assess the strong scaling behavior. We observe excellent scaling when M ≥ 16 and the ROM size is sufficiently large, while poor scaling is generally obtained when M = 1, which is a direct consequence of the different nature of the kernels.

The results above highlight the excellent performance of Rank2Galerkin and quantify the main differences between it and Rank1Galerkin. A natural question arises: when should an analyst prefer Rank2Galerkin over Rank1Galerkin? Obviously, this question only makes sense in the context of a many-query study where the interest is in simulating an ensemble of trajectories corresponding to multiple realizations of the forcing function, as in a typical forward propagation study in UQ.

Fig. 3. Performance results obtained for the ROM problem in Eq. (3.3) showing (a) the computational throughput (GFlops), (b) the memory bandwidth (GB/s), and (c) the average time (milliseconds) per time step, for various thread counts, forcing sizes M, and numbers of modes K.

Suppose we are given a target ROM size, K, and need to simulate an ensemble of P trajectories (where P is large, P ≫
10) from realizations of the forcing term, while needing to meet these two constraints: (a) we have a limited budget of cores available, e.g., a single node with 36 physical cores; (b) due to, e.g., memory constraints, 1024 is the maximum feasible number of trajectories that can simultaneously live on the node. These constraints are arbitrarily set here for the sake of the argument, but are reasonable values nonetheless. What combination of thread count, n, and number of simultaneous trajectories, M, would be the most efficient to obtain those P samples while satisfying the given constraints? For example, one could launch 36 single-threaded runs in parallel, each using M = 1, and then repeat until all P runs are completed. This implies that, at any time, all 36 cores of the node would be occupied, and 36 realizations of the forcing term would be running simultaneously, which means that both the core budget and the memory constraint are satisfied. A minor variation would be to run 18 two-threaded runs at the same time, each using M = 1. This would still satisfy both the core budget and the memory constraint. The most interesting scenarios arise when we vary M. Generalizing, this is a (discrete) constrained optimization problem, since we need to optimize over the number of threads and M. Solving this in a general context is outside the scope of this work, but we provide the following insights.

Let τ(K, n, M) represent the runtime to complete a single Galerkin ROM simulation of the form Eq. (3.3) with ROM size K_v = K_σ = K, using n threads and a given value of M. It follows that the total runtime to complete trajectories for P forcing realizations with a budget of 36 threads can be expressed as

(4.1)  τ_P(K, n, M) = τ(K, n, M) P / ((36/n) M),

because 36/n is the number of independent runs executing in parallel on the node, with each run responsible for computing M trajectories. We can define the following metric

(4.2)  s(K, n, M) = τ_P(K, 1, 1) / τ_P(K, n, M) = [τ(K, 1, 1) / τ(K, n, M)] (M/n),

where s(K, n, M) > 1 or s(K, n, M) <
1. This metric can be interpreted as a speedup (or slowdown) factor.

Figure 4 shows a heatmap visualization of s(K, n, M) computed for various values of M and n, and ROM sizes K ∈ {256, …, 2048}. To generate these plots, we use the runtimes obtained above. Fixing K = 256, we observe that using Rank2Galerkin with M = 32 and n = 2 yields s(256, 2, 32) = 12.98, which means that for this setup the Rank2Galerkin is about 13 times more efficient than using Rank1Galerkin with n = 1. Depending on the given ROM size, similar conclusions can be made. The second key observation is that, as the ROM size increases, it becomes increasingly more convenient to rely on Rank2Galerkin. For example, if we fix n = 12 and M = 256, we see that for K = 256 we have a 10X speedup with respect to Rank1Galerkin with n = 1 (see Figure 4(a)), but we reach a 26X speedup for K = 2048, see Figure 4(d).

The above results suggest that, if the goal is to compute an ensemble of trajectories for different realizations of the forcing term and the main cost function is the runtime, Rank2Galerkin should always be the preferred choice. We now discuss the accuracy of the Galerkin ROM for the elastic shear wave test case under consideration. We consider the scenario where the period, T, parameterizing the forcing signal varies over a fixed interval, T ∈ [31,
69], and we study how well the ROM can approximate the FOM predictions as T varies over that interval. Note that, since here the focus is on quantifying the accuracy, it does not matter which ROM formulation is used. We use Rank2Galerkin to perform the analysis because it is more efficient, but the same results would be obtained using Rank1Galerkin.

Fig. 4. Heatmap visualization of s(K, n, M) (see Eq. (4.2)) computed for ROM sizes K = 256 (a) through K = 2048 (d), and various values of M and n. On each plot, the solid black line separates the cases where s(K, n, M) > 1 (i.e., where Rank2Galerkin is more convenient) from those where s(K, n, M) < 1 (i.e., where Rank1Galerkin is more convenient). The white region in each plot identifies the non-admissible cases, i.e., combinations violating the constraints listed above.

The complete definition of the problem is as follows. We use the PREM as material model, a total simulation time of t_final = 2000 (seconds), a time step size ∆t = 0.…, and a forcing delay that depends on T. This delay, combined with the total integration time, represents a system that is at rest for a small period of time, and then excites and decays due to the action of the forcing. For the FOM runs, we use the grid with N = 3,…,168 total degrees of freedom, which we recall stems from using a 512 × … grid for the velocity. For T = 31 and the PREM material model, this is the smallest grid (in powers of two) satisfying the numerical dispersion criterion we use in our code (for any given grid, we require its maximum grid spacing, h, to satisfy h ≤ T v_s^min / λ, where λ is the target number of grid points per wavelength, which we set to 10, and v_s^min is the minimum shear velocity in the material model).

To compute the POD modes for the ROM, we proceed as follows: we run the FOM at T = 35, 65, save the velocity and stress states every 2 seconds, and then merge the data from both training points into a single snapshot matrix, which is used to compute the POD modes via SVD. We deliberately choose only two training points for two main reasons: (a) it is the smallest training set one can choose to sample a one-dimensional parameter space; and (b) it mimics a realistic situation where limited data are available. This occurs, for example, when a computational model is so expensive to run that an analyst can only afford a few runs, or when experimental data are collected at just a few parameter points and more experiments are not affordable. For such cases, in fact, relying on data-driven models requiring large training datasets is unfeasible. We thus believe it is interesting to assess whether one can construct a sufficiently accurate ROM given limited data, which, if positive, would highlight the full power of ROMs versus purely data-driven models.

Our goal is to characterize both a reproductive and a predictive scenario. A reproductive scenario occurs when the ROM accuracy is quantified and evaluated at the same (training) points in the parameter space used to generate the basis functions of the ROM. In this case, this means evaluating the ROM accuracy for T ∈ {35, 65}. A predictive scenario is one where the ROM accuracy is evaluated at parameter instances (test points) that differ from those used to generate the basis functions of the ROM. Here, the predictive analysis is run at 10 random samples over the parameter range (35, 65), as well as at T = 31 and T = 69, outside the training range.
We remark that this is a key choice because it allows us to assess the extrapolation capabilities of the ROM, which generally is one of the main weaknesses of purely data-driven methods. To quantify the accuracy of the ROM, we rely on the following relative error

(4.3)  E = ‖x(t_final; η, µ) − x̃(t_final; η, µ)‖ / ‖x(t_final; η, µ)‖,

where x(t_final; η, µ) is the FOM state (for either velocity or stresses) computed at the final time t_final for the given parameters η, µ, and x̃(t_final; η, µ) is the state field reconstructed from the ROM results.

Figure 5(a) shows the relative error (4.3) computed for the velocity degrees of freedom as a function of the period T. The results are shown for (K_v, K_σ) ∈ {(311, 276), (369, 343), (415, 394), (436, 417)}, corresponding to increasing fractions (99.…%) of the snapshot energy. To have small errors within the training interval, one would need to use at least (K_v, K_σ) = (369, 343); the largest errors occur at the extrapolation point T = 31. To have small errors both within the training interval and in the extrapolation regime, one would need at least (K_v, K_σ) = (415, 394), for all values of T. For this application, the convective nature of the problem can force the need for a large number of modes to accurately predict the dynamics. However, this is not a weakness but can be an advantage in this case because, as discussed earlier, the performance of Rank2Galerkin improves as K increases (up to a value which is typically large, O(10000)).

For the sake of argument, Figure 5(b) shows the same relative error metric computed between the FOM and the state obtained via nearest-neighbor and linear interpolation, using the FOM states at T = 35 and T = 65 to compute the approximation. A linear interpolation is the best we can do with two data points. Figure 5(b)
[Figure 5, panels (a) and (b): relative error E for velocity vs. forcing period (sec); panel (a) legend: train points and (K_v, K_σ) = (311, 276), (369, 343), (415, 394), (436, 417); panel (b) legend: nearest neighbors, linear.]
shows that both nearest-neighbor and linear interpolation have quite poor accuracy overall, with errors reaching up to 90% inside the training range and consistently larger than 15% at the extrapolation points. This indicates that, for this test case, the ROM is much more robust and accurate than interpolation.

Panel (c) of Figure 5 shows a contour plot of the wave field at the final time t_final = 2000 (seconds) for the source signal period T = 69 (recall that this is an extrapolation point) computed with the FOM (left) and the ROM with (K_v, K_σ) = (436, 417) (middle), as well as their error field (right), for π/… < θ < π/…. Several questions remain open, for example: how many training points would one need to use for a polynomial approximation to match the ROM accuracy? What is the best performing surrogate when the number of training points is larger than two? What is the trade-off between the cost of computing the surrogate and its accuracy? How would the results change if one considered a higher-dimensional parameter space?

Fig. 5. Panel (a) shows the relative error, see Eq. (4.3), computed for the velocity degrees of freedom between the FOM solution and the corresponding ROM-based approximation, as a function of the period T and for various ROM sizes. Panel (b) shows the same relative error metric computed between the FOM and the state approximation obtained via nearest-neighbor and linear interpolation using the FOM states at T = 35 and T = 65 as training points. Panel (c) shows the wave field computed with the FOM (left), the ROM with (436, 417) modes (middle), as well as their error field (right) at t_final = 2000 (seconds) for the extrapolation point T = 69.
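For concreteness, the relative error in Eq. (4.3) reduces to a ratio of Euclidean norms between the FOM state and its ROM-based reconstruction; a minimal plain-C++ sketch (illustrative, not the code used to produce the paper's results):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Relative error of Eq. (4.3): ||x - xTilde|| / ||x|| in the Euclidean norm,
// where x is the FOM state at the final time and xTilde its ROM-based
// reconstruction.
double relativeError(const std::vector<double>& x,
                     const std::vector<double>& xTilde) {
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double d = x[i] - xTilde[i];
    num += d * d;      // squared norm of the discrepancy
    den += x[i] * x[i];  // squared norm of the reference (FOM) state
  }
  return std::sqrt(num) / std::sqrt(den);
}
```

The same function applies to either the velocity or the stress degrees of freedom, since the metric is evaluated separately for each field.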
We conclude the results by demonstrating the application of Rank2Galerkin to a many-query analysis. Using the same problem definition described above, we assume the period, T, parameterizing the forcing signal is uncertain and follows a uniform distribution U(31, 69). We generate P = 512 realizations of T by randomly sampling U(31, 69). We use a ROM with K_v = 436 and K_σ = 417 (since these values proved to be accurate, see Figure 5(a)), and select the most efficient combination of threads and M by leveraging the performance results shown earlier (noting that K = 512 is a good approximation of the K_v and K_σ used here): we choose M = 64 and n = 4 threads to run each Rank2Galerkin simulation. For the FOM, we use n = 4 threads and M = 4 for each run; this choice is made by using the FOM results in § … and in § SM2 of the supplemental material.

Figure 6 shows the seismograms—and representative statistics computed using the ensemble of 512 runs—collected at locations on the earth surface, r = 6371 km, with polar angles θ = 10° and 60°. The left column of Figure 6 shows the mean and the 5-95th percentile bounds over the full time domain computed with the ROM using K_v = 436 and K_σ = 417. The right column of Figure 6 shows magnified views of specific time windows, displaying curves for different percentiles as well as the mean and error bars computed with the FOM. The predictions are characterized by substantial variations, stemming from the complex pattern of reflections, refractions, and interference of the shear waves through the domain, and the ROM accurately approximates the FOM results over the full time domain in both its mean trend and statistics. Note that in this case the inherently complex dynamics of the waves prevents one from characterizing the wavefield variability just by evaluating the extreme values of the forcing period T.
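As a minimal illustration of the ensemble statistics reported in Figure 6, the mean and percentile bounds can be computed independently at each time instant from the sampled trajectories. The sketch below uses the nearest-rank percentile convention, which is an assumption (the paper does not specify the estimator used):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean of an ensemble of samples (e.g., seismogram values at one time instant
// across the forcing realizations).
double ensembleMean(const std::vector<double>& samples) {
  double s = 0.0;
  for (double v : samples) s += v;
  return s / static_cast<double>(samples.size());
}

// Empirical q-th percentile (0 <= q <= 100) via the nearest-rank rule;
// a simple convention, assumed here for illustration.
double ensemblePercentile(std::vector<double> samples, double q) {
  std::sort(samples.begin(), samples.end());
  const std::size_t n = samples.size();
  std::size_t rank = static_cast<std::size_t>(std::ceil(q / 100.0 * n));
  if (rank > 0) --rank;  // convert 1-based nearest rank to a 0-based index
  return samples[rank];
}
```

Applying these two functions at every saved time instant, with q = 5 and q = 95, yields the mean curve and the percentile bounds of the kind shown in the left column of Figure 6.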
It is therefore critical to sample the full range of T to obtain accurate statistics, which are fundamental for assessing risk and the probabilities of extreme events.

To quantify the computational gain, we can reason as follows.

Rank2Galerkin. Since we rely on a computing node with 36 physical cores, and use Rank2Galerkin with four threads and M = 64, we can launch 8 such simulations in parallel to compute 8 × 64 = 512 trajectories. These 8 runs execute in parallel, and therefore finish in approximately the same time as a single run with four threads and M = 64, which is about 7.5 seconds.

Rank2FOM. For the FOM, we use individual runs with four threads and M = 4, implying that we can evaluate 8 runs in parallel, and complete the full ensemble of P trajectories with 16 sets of runs. Each run takes about 455 seconds, so the total runtime is approximately 455 × 16 = 7280 seconds. For the current problem and setup, Rank2Galerkin is thus about 970 times faster than the FOM.
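The bookkeeping behind these runtime estimates follows Eq. (4.1): the ensemble is processed in waves of simultaneous runs. A minimal sketch, using the timings quoted above (about 455 s per FOM run and about 7.5 s per rank-2 Galerkin run):

```cpp
// Total wall time to compute P trajectories when `runsInParallel` independent
// simulations execute at once and each one advances M trajectories, with
// `tauSingle` the runtime (seconds) of one such simulation; cf. Eq. (4.1).
double totalRuntime(double tauSingle, int P, int runsInParallel, int M) {
  const int perWave = runsInParallel * M;         // trajectories per wave of runs
  const int waves = (P + perWave - 1) / perWave;  // ceil(P / perWave)
  return tauSingle * static_cast<double>(waves);
}
```

With 8 runs in parallel, totalRuntime(455.0, 512, 8, 4) gives 7280 seconds for the FOM (16 waves), while totalRuntime(7.5, 512, 8, 64) gives 7.5 seconds for the ROM (a single wave), a ratio of about 970.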
Remark. Had we used the Rank1Galerkin, we could have launched 36 single-threaded runs in parallel, each using M = 1, thus needing about 14 sets of such runs to complete the full ensemble of P trajectories. The approximate runtime of one such Rank1Galerkin run is about 8 seconds, so the total runtime is approximately 8 × 14 = 112 seconds. Rank2Galerkin is thus about 15 times faster than Rank1Galerkin.

Fig. 6. Seismograms with representative statistics obtained at receivers on the earth surface, r = 6371 km, with polar angles θ = 10° (first row) and 60° (second row), computed from the UQ analysis described in § ….
Remark. Computing the SVD of a tall and skinny matrix (i.e., one with m_r ≫ m_c, where m_r and m_c are the number of rows and columns, respectively) can be done efficiently by leveraging one of the algorithms scaling with m_c; see, e.g., chapter 45 in [14].

We can summarize the following benefits of using the Galerkin ROM for this seismic application: first, if needed, we can query the system very efficiently for either a single or several values of T to explore different statistics; second, the dimensionality of the ROM is small enough that one can store the full time evolution of the generalized coordinates and use them later to efficiently reconstruct the full wave fields at target times; third, computing the seismograms from the ROM data can be done very efficiently, because we can use individual rows of the POD basis to reconstruct the field at target points on the earth surface, thus avoiding the need to load the full basis matrix in memory.
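The third point can be sketched as follows: to obtain a seismogram at one receiver, only the corresponding row of the POD basis is needed, applied to the stored generalized coordinates at every saved time step. This is an illustrative sketch; the function name and storage layout are assumptions, not the paper's implementation:

```cpp
#include <cstddef>
#include <vector>

// Reconstruct the field at a single degree of freedom (e.g., a receiver on the
// earth surface) for every saved time step: trace[t] = basisRow . gHat(:, t).
// gHat holds the generalized coordinates, stored contiguously per step with K
// entries each; only one row of the POD basis is needed, not the full matrix.
std::vector<double> seismogramAtDof(const std::vector<double>& basisRow,  // K entries
                                    const std::vector<double>& gHat,      // K*numSteps
                                    std::size_t K, std::size_t numSteps) {
  std::vector<double> trace(numSteps, 0.0);
  for (std::size_t t = 0; t < numSteps; ++t)
    for (std::size_t k = 0; k < K; ++k)
      trace[t] += basisRow[k] * gHat[t * K + k];
  return trace;
}
```

Each trace thus costs only K multiply-adds per saved time step, independent of the FOM dimension N.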
5. Conclusions.
We presented a reformulation, called rank-2 Galerkin, of the Galerkin ROM for LTI dynamical systems to convert them from memory bandwidth-bound to compute-bound problems. We described the details of the formulation and highlighted its benefits compared to the formulation adopted so far in the community.
Rank-2 Galerkin is suitable for solving many-query problems driven by the forcing term in a scalable and efficient way on modern computing node architectures. We also discussed its natural extension, called rank-3 Galerkin, which would allow one to also explore the parameters in the system matrix, and introduced the idea of exploiting the technological advancement of computing kernels used in deep learning, where tensors are key data structures.

We demonstrated rank-2 Galerkin using, as a test case, the simulation of elastic seismic shear waves in an axisymmetric domain. We showed the excellent scalability of rank-2 Galerkin with respect to the number of cores, which is a result of the nature of the underlying computational kernels. We quantified the accuracy of the ROM for this application as the period of the forcing varies, and showed that for this problem a Galerkin ROM has excellent accuracy in both the reproductive and predictive scenarios. Using a Monte Carlo sampling exercise, we demonstrated the quantification of the uncertainty in the wavefields stemming from an uncertain forcing period, and showed that rank-2 Galerkin is about 970 times faster than the FOM while being extremely accurate in both the mean and statistics of the field.

As mentioned in the paper, several interesting future directions arise. One can compare different types of surrogates for the seismic problem considered here, extend to 3D seismic modeling, explore efficient implementations of the rank-3 formulation, assess the feasibility of applying a reformulation similar to the one presented here to ROMs of nonlinear systems, and investigate the potential gains of applying the rank-2 or rank-3 Galerkin ROMs to inverse problems in geophysics. Another critical aspect is the practical challenge of implementing rank-2 Galerkin in a performance-portable way as a library to be used by other applications.
The latter topic is of particular interest to us: we are currently working on implementing the rank-2 Galerkin formulation in our ROM library Pressio (https://github.com/Pressio).

Acknowledgments.
The authors thank Irina Tezaur for her helpful feedback. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. This work was funded by the Advanced Simulation and Computing program and the Laboratory Directed Research and Development program at Sandia National Laboratories, a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525.
REFERENCES

[1] A. Abdelfattah, A. Haidar, S. Tomov, and J. Dongarra, Performance, design, and auto-tuning of batched gemm for gpus, in International Conference on High Performance Computing, Springer, 2016, pp. 21–38.
[2] K. Ahmad, A. Venkat, and M. Hall, Optimizing lobpcg: Sparse matrix loop and data transformations in action, in Languages and Compilers for Parallel Computing, C. Ding, J. Criswell, and P. Wu, eds., Cham, 2017, Springer International Publishing, pp. 218–232.
[3] D. Amsallem, R. Tezaur, and C. Farhat, Real-time solution of linear computational problems using databases of parametric reduced-order models with arbitrary underlying meshes, Journal of Computational Physics, 326 (2016), pp. 373–397.
[4] U. Baur, P. Benner, and L. Feng, Model order reduction for linear and nonlinear systems: A system-theoretic perspective, Archives of Computational Methods in Engineering, 21 (2014), pp. 331–358, https://doi.org/10.1007/s11831-014-9111-2.
[5] N. Bell and M. Garland, Efficient sparse matrix-vector multiplication on CUDA, NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec. 2008.
[6] P. Benner, S. Gugercin, and K. Willcox, A survey of projection-based model reduction methods for parametric dynamical systems, SIAM Review, 57 (2015), pp. 483–531, https://doi.org/10.1137/130932715.
[7] T. Bui-Thanh, K. Willcox, and O. Ghattas, Parametric reduced-order models for probabilistic analysis of unsteady aerodynamic applications, AIAA Journal, 46 (2008), pp. 2520–2529, https://doi.org/10.2514/1.35850.
[8] H. Carter Edwards, C. R. Trott, and D. Sunderland, Kokkos: Enabling manycore performance portability through polymorphic memory access patterns, J. Parallel Distrib. Comput., 74 (2014), pp. 3202–3216, https://doi.org/10.1016/j.jpdc.2014.07.003.
[9] E. Chaljub and A. Tarantola, Sensitivity of SS precursors to topography on the upper-mantle 660-km discontinuity, Geophysical Research Letters, 24 (1997), pp. 2613–2616, https://doi.org/10.1029/97GL52693.
[10] A. M. Dziewonski and D. L. Anderson, Preliminary reference earth model, Physics of the Earth and Planetary Interiors, 25 (1981), pp. 297–356, https://doi.org/10.1016/0031-9201(81)90046-7.
[11] A. Elafrou, G. Goumas, and N. Koziris, Performance analysis and optimization of sparse matrix-vector multiplication on Intel Xeon Phi, in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017, pp. 1389–1398.
[12] M. A. Grepl and A. T. Patera, A posteriori error bounds for reduced-basis approximations of parametrized parabolic partial differential equations, ESAIM: M2AN, 39 (2005), pp. 157–181, https://doi.org/10.1051/m2an:2005006.
[13] S. Gugercin, A. Antoulas, and C. Beattie, H_2 model reduction for large-scale linear dynamical systems, SIAM Journal on Matrix Analysis and Applications, 30 (2008), pp. 609–638, https://doi.org/10.1137/060666123.
[14] L. Hogben, ed., Handbook of Linear Algebra, CRC Press, Boca Raton, FL, USA, 2006.
[15] C. Hong, A. Sukumaran-Rajam, B. Bandyopadhyay, J. Kim, S. E. Kurt, I. Nisa, S. Sabhlok, U. V. Çatalyürek, S. Parthasarathy, and P. Sadayappan, Efficient sparse-matrix multi-vector product on GPUs, in Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '18, New York, NY, USA, 2018, Association for Computing Machinery, pp. 66–79, https://doi.org/10.1145/3208040.3208062.
[16] A. Hutcheson and V. Natoli, Memory bound vs. compute bound: A quantitative study of cache and memory bandwidth in high performance applications, 2011.
[17] D. Hyland and D. Bernstein, The optimal projection equations for model reduction and the relationships among the methods of Wilson, Skelton, and Moore, IEEE Transactions on Automatic Control, 30 (1985), pp. 1201–1211.
[18] H. Igel and M. Weber, SH-wave propagation in the whole mantle using high-order finite differences, Geophysical Research Letters, 22 (1995), pp. 731–734.
[19] G. Jahnke, M. S. Thorne, A. Cochard, and H. Igel, Global SH-wave propagation using a parallel axisymmetric spherical finite-difference scheme: Application to whole mantle scattering, Geophysical Journal International, 173 (2008), pp. 815–826.
[20] L. Jiang, C. Yang, and W. Ma, Enabling highly efficient batched matrix multiplications on sw26010 many-core processor, ACM Trans. Archit. Code Optim., 17 (2020), https://doi.org/10.1145/3378176.
[21] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., In-datacenter performance analysis of a tensor processing unit, in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12.
[22] I. Kalashnikova, S. Arunajatesan, M. F. Barone, B. G. van Bloemen Waanders, and J. A. Fike, Reduced order modeling for prediction and control of large-scale systems, https://doi.org/10.2172/1177206.
[23] B. L. N. Kennett, E. R. Engdahl, and R. Buland, Constraints on seismic velocities in the Earth from traveltimes, Geophysical Journal International, 122 (1995), pp. 108–124, https://doi.org/10.1111/j.1365-246X.1995.tb03540.x.
[24] A. Li, R. S. Hammad Mazhar, and D. Negrut, Comparison of SpMV performance on matrices with different matrix format using CUSP, cuSPARSE and ViennaCL, (2015).
[25] X. Li, Y. Liang, S. Yan, L. Jia, and Y. Li, A coordinated tiling and batching framework for efficient gemm on gpus, in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 2019, pp. 229–241.
[26] S. Markidis, S. W. Der Chien, E. Laure, I. B. Peng, and J. S. Vetter, Nvidia tensor core programmability, performance & precision, in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2018, pp. 522–531.
[27] J.-P. Montagner and B. L. N. Kennett, How to reconcile body-wave and normal-mode reference earth models, Geophysical Journal International, 125 (1996), pp. 229–248, https://doi.org/10.1111/j.1365-246X.1996.tb06548.x.
[28] B. Moore, Principal component analysis in linear systems: Controllability, observability, and model reduction, IEEE Transactions on Automatic Control, 26 (1981), pp. 17–32, https://doi.org/10.1109/TAC.1981.1102568.
[29] C. T. Mullis and R. A. Roberts, Synthesis of minimum roundoff noise fixed point digital filters, IEEE Transactions on Circuits and Systems, 23 (1976), pp. 551–562.
[30] E. Peise, Performance modeling and prediction for dense linear algebra, 2017, https://arxiv.org/abs/1706.01341.
[31] V. Pereyra, Model order reduction with oblique projections for large scale wave propagation, Journal of Computational and Applied Mathematics, 295 (2016), pp. 103–114, https://doi.org/10.1016/j.cam.2015.01.029. VIII Pan-American Workshop in Applied and Computational Mathematics.
[32] V. Pereyra and A. Kaelin, Fast wave propagation by model order reduction, Electronic Transactions on Numerical Analysis, 30 (2008), pp. 406–419.
[33] C. Prud'Homme, D. Rovas, K. Veroy, L. Machiels, Y. Maday, A. Patera, and G. Turinici, Reliable real-time solution of parametrized partial differential equations: Reduced-basis output bound methods, J. of Fluids Eng., 124 (2001), pp. 70–80.
[34] E. V. Rabinovich, N. Y. Filipenko, and G. S. Shefel, Generalized model of seismic pulse, Journal of Physics: Conference Series, 1015 (2018), p. 052025, https://doi.org/10.1088/1742-6596/1015/5/052025.
[35] D. Rovas, L. Machiels, and Y. Maday, Reduced-basis output bound methods for parabolic problems, IMA Journal of Numerical Analysis, 26 (2006).
[36] D. V. Rovas, Reduced-basis output bound methods for parametrized partial differential equations, PhD thesis, Massachusetts Institute of Technology, 2003.
[37] Y. Shi, U. N. Niranjan, A. Anandkumar, and C. Cecka, Tensor contractions with extended blas kernels on cpu and gpu, in 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), 2016, pp. 193–202.
[38] R. Swischuk, B. Kramer, C. Huang, and K. Willcox, Learning physics-based reduced-order models for a single-injector combustion process, AIAA Journal, 58 (2020), pp. 2658–2672, https://doi.org/10.2514/1.J058943.
[39] F. G. Van Zee, T. Smith, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. Gunnels, T. M. Low, B. Marker, L. Killough, and R. A. van de Geijn, The BLIS framework: Experiments in portability, ACM Transactions on Mathematical Software, 42 (2016), pp. 12:1–12:19, http://doi.acm.org/10.1145/2755561.
[40] F. G. Van Zee and R. A. van de Geijn, BLIS: A framework for rapidly instantiating BLAS functionality, ACM Transactions on Mathematical Software, 41 (2015), pp. 14:1–14:33, http://doi.acm.org/10.1145/2764454.
[41] J. Virieux, SH-wave propagation in heterogeneous media: Velocity-stress finite-difference method, Exploration Geophysics, 15 (1984), pp. 265–265, https://doi.org/10.1071/EG984265a.
[42] J. Virieux, P-SV wave propagation in heterogeneous media: Velocity-stress finite-difference method, Geophysics, 51 (1986), pp. 889–901, https://doi.org/10.1190/1.1442147.
[43] Y. Wang, H. Takenaka, X. Jiang, and J. Lei, Modelling two-dimensional global seismic wave propagation in a laterally heterogeneous whole-Moon model, Geophysical Journal International, 192 (2012), pp. 1271–1287, https://doi.org/10.1093/gji/ggs094.
[44] K. Willcox and J. Peraire, Balanced model reduction via the proper orthogonal decomposition, AIAA Journal, 40 (2002), pp. 2323–2330.
[45] D. A. Wilson, Optimum solution of model-reduction problem, Proceedings of the Institution of Electrical Engineers, 117 (1970), pp. 1161–1165.
[46] H. Yang and M. Gunzburger, Algorithms and analyses for stochastic optimization for turbofan noise reduction using parallel reduced-order modeling, Computer Methods in Applied Mechanics and Engineering, 319 (2017), pp. 217–239, https://doi.org/10.1016/j.cma.2017.02.030.
[47] D. Yuen, J. Wang, L. Johnsson, C.-H. Chi, and Y. Shi, GPU Solutions to Multi-Scale Problems in Science and Engineering, Springer Publishing Company, Inc., 1st ed., 2011.
SUPPLEMENTARY MATERIALS: A COMPUTE-BOUND FORMULATION OF GALERKIN MODEL REDUCTION FOR LINEAR TIME-INVARIANT DYNAMICAL SYSTEMS∗

FRANCESCO RIZZI†, ERIC J. PARISH‡, PATRICK J. BLONIGAN‡§, AND JOHN TENCER§¶

SM1. Problem setup for the FOM scaling results.
To run the FOM scaling tests presented in §, we use ρ = 2500 (g/cm³) and v_s = 5000 (m/s). The source signal acts at 640 (km) of depth and takes the form of the first derivative of a Gaussian. For the Rank1FOM, the source has a period T = 60 (seconds) and a delay of 180 (seconds), while for the Rank2FOM each forcing term has the same Gaussian form but with a period randomly sampled over the interval (60, 80). The numerical dispersion condition is satisfied for all problem sizes. The time step size is ∆t = 0.05 such that numerical stability is ensured for all problem sizes considered, and each run completes 1000 time steps, which allows us to collect meaningful timing statistics of the computational kernels.
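A source signal of this form can be sketched as follows. This is an illustrative sketch only: the amplitude normalization and the mapping from the period T to the Gaussian decay rate are assumptions, not taken from the paper.

```python
import math

def gaussian_derivative_pulse(t, period, delay):
    """First derivative of a Gaussian, used as a generic seismic
    source signal. The relation between 'period' and the Gaussian
    decay rate below is an illustrative assumption."""
    f0 = 1.0 / period            # assumed dominant frequency
    a = (math.pi * f0) ** 2      # assumed Gaussian decay rate
    s = t - delay
    return -2.0 * a * s * math.exp(-a * s * s)

# Sample the Rank1FOM source (T = 60 s, delay = 180 s) on the
# time grid dt = 0.05 s used for the scaling runs (1000 steps).
dt = 0.05
signal = [gaussian_derivative_pulse(k * dt, 60.0, 180.0) for k in range(1000)]
```

By construction the pulse vanishes at t = delay and is antisymmetric about it, as expected for the derivative of a Gaussian.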
SM2. When should an analyst prefer the Rank2FOM?
In this section, we present for the FOM an analysis similar to the one done for the ROM in §. Let τ(N, n, M) represent the runtime to complete a single FOM simulation of the form Eq. (3.2) with N total degrees of freedom using n threads and a given value M. It follows that the total runtime to complete trajectories for P forcing realizations with a budget of 36 threads can be expressed as

(SM2.1)    τ_P(N, n, M) = τ(N, n, M) P / ((36/n) M),

because 36/n is the number of independent runs executing in parallel on the node, with each run responsible for computing M trajectories. We can define the following metric

(SM2.2)    s(N, n, M) = τ_P(N, 2, 1) / τ_P(N, n, M) = (τ(N, 2, 1) / τ(N, n, M)) (2M/n),

∗Submitted to the editors September 5, 2020.
†NexGen Analytics, 30 N Gould St. Ste 5912, Sheridan, WY 82801, USA ([email protected], [email protected]).
‡Department of Extreme-Scale Data Science and Analytics, Sandia National Laboratories, Livermore, CA, USA ([email protected], [email protected]).
§These authors contributed equally to this work.
¶Department of Thermal Sciences & Engineering, Sandia National Laboratories, Albuquerque, NM, USA ([email protected]).
Fig. SM1. Heatmap visualization of s(N, n, M) (see Eq. (SM2.2)) computed for N = 785152 (panel (a)) and for the three other problem sizes considered (panels (b), (c), and (d)), and various values of M and n. On each plot, the solid black line separates the cases where s(N, n, M) > 1 (i.e., where the Rank2FOM is more convenient) from those where s(N, n, M) < 1 (i.e., where the Rank1FOM is more convenient). The white region in each plot identifies the cases violating the constraints listed in § SM2. Note that we set the colorbar with the same limits used in §, highlighting where s(N, n, M) > 1 and where s(N, n, M) < 1.

We remark that the reference case used here for this metric is the Rank1FOM using n = 2 threads, since for the FOM we never use fewer than 2 threads, see §. Figure SM1 shows s(N, n, M) computed for various values of M and n, and N ∈ { }. To generate these plots, we used the FOM runtimes obtained in §. For the 3M problem size, using M = 4 and n = 4 is 39% more efficient than the case M = 1, n = 2, and becomes 61% more efficient if we consider the 50M problem size. For M = 16 and n = 12, the gain is about 30% for the 3M case and becomes 44% for the 50M one. This trend suggests that the gain increases with the problem size. It is interesting to note that the gains are overall quite small, never exceeding 61%. This is because, while the Rank2FOM has a higher compute intensity than the Rank1FOM, both are memory-bandwidth-bound problems and are therefore both limited by memory bandwidth. This is in contrast to what we observed for the ROM in §.
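The cost model above can be written as a short script. The 36-thread budget and the (n = 2, M = 1) Rank1FOM reference follow the text; the per-simulation runtimes τ are placeholder inputs that would come from measurements.

```python
def total_runtime(tau, P, n, M, budget=36):
    """Eq. (SM2.1): time to complete P forcing realizations when
    budget/n independent runs execute in parallel on the node,
    each run computing M trajectories per simulation."""
    runs_in_parallel = budget / n
    return tau * P / (runs_in_parallel * M)

def s_metric(tau_ref, tau, n, M, budget=36):
    """Eq. (SM2.2): Rank1FOM reference cost (n = 2, M = 1,
    runtime tau_ref) divided by the cost of the (n, M) case;
    the number of realizations P cancels out of the ratio."""
    return total_runtime(tau_ref, 1, 2, 1, budget) / total_runtime(tau, 1, n, M, budget)

# With equal per-simulation runtimes, the ratio reduces to 2*M/n:
print(s_metric(1.0, 1.0, n=4, M=4))  # -> 2.0
```

In practice the per-simulation runtime τ(N, n, M) grows with M and n, so the measured gains reported above (30%–61%) are well below this idealized bound.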
SM3. Repositories.
The code developed for this work and the data used in the paper are publicly available at:
• Code repository: https://github.com/fnrizzi/ElasticShearWaves
• Full dataset: https://github.com/fnrizzi/RomLTIData