Alternating direction implicit time integrations for finite difference acoustic wave propagation: Parallelization and convergence
B. Otero (a), O. Rojas (b,c), F. Moya (d), J. Castillo (e)

(a) Dpto. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain
(b) Barcelona Supercomputing Center, Barcelona, Spain
(c) Escuela de Computación, Facultad de Ciencias, Universidad Central de Venezuela, Caracas, Venezuela
(d) Escola Tècnica Superior d'Enginyeria de Telecomunicació de Barcelona, Universitat Politècnica de Catalunya, Barcelona, Spain
(e) Computational Science Research Center, San Diego State University, San Diego CA, USA

Email addresses: [email protected] (B. Otero), [email protected] (O. Rojas), [email protected] (F. Moya), [email protected] (J. Castillo)
Abstract
This work studies the parallelization and empirical convergence of two finite difference acoustic wave propagation methods on 2-D rectangular grids that use the same alternating direction implicit (ADI) time integration. This ADI integration is based on a second-order implicit Crank-Nicolson temporal discretization that is factored out by a Peaceman-Rachford decomposition of the time and space equation terms. In space, these methods highly diverge and apply different fourth-order accurate differentiation techniques. The first method uses compact finite differences (CFD) on nodal meshes and requires solving tridiagonal linear systems along each grid line, while the second one employs staggered-grid mimetic finite differences (MFD). For each method, we implement three parallel versions: (i) a multithreaded code in Octave, (ii) a C++ code that exploits OpenMP loop parallelization, and (iii) a CUDA kernel for an NVIDIA GTX 960 Maxwell card. In these implementations, the main source of parallelism is the simultaneous ADI updating of each wave field matrix, either column-wise or row-wise, according to the differentiation direction. In our numerical applications, the highest performances are displayed by the CFD and MFD CUDA codes, which achieve speedups of 7.21x and 15.81x, respectively, relative to their C++ sequential counterparts with optimal compilation flags. Our test cases also allow us to assess the numerical convergence and accuracy of both methods. In a problem with an exact harmonic solution, both methods exhibit convergence rates close to 4 and the MFD accuracy is slightly higher. Alternatively, both convergences decay to second order on smooth problems with severe gradients at the boundaries, and the MFD rates degrade on highly resolved grids, leading to larger inaccuracies. This transition of empirical convergences agrees with the nominal truncation errors in space and time.

Keywords:
CUDA and OpenMP programming, ADI, Compact Finite Differences, Mimetic Operators.
1. Introduction
Finite difference (FD) modeling methods have been actively applied to seismic wave propagation problems over the last fifty years. In this area, advanced methods use staggered grids (SG) in combination with fourth- and higher-order FD formulas to solve the wave equation in first-order formulation, and then solve for all dependent variables [2], [17], [19], [26], [35], [38], [41], [42]. In particular, the velocity-pressure formulation of the acoustic wave equation allows stating free surface and rigid wall boundary conditions as simple Dirichlet conditions, and is also well suited for the implementation of absorbing boundaries. In a SG, discrete wave fields may be defined on separate nodal meshes, which are displaced by half of the grid spacing h in one or more coordinate directions. On these grids, an unknown wave field is placed at the center of those it depends on, and thus SG differentiation yields more accuracy in comparison to a traditional nodal FD stencil with an error dependent on the full h.
The SG methods mentioned above, among many other wave propagation schemes, employ explicit second-order Leapfrog time integrations that demand minimal storage for wave field histories, but strongly limit time stepping to enforce stability. Multi-step explicit time integrations with third- and higher-order accuracy have been progressively explored in search of wider Courant-Friedrichs-Lewy (CFL) stability ranges, while reducing numerical dispersion and other anomalies [3], [4], [43], [51], [52]. On the other hand, alternating direction implicit (ADI) methods apply a semi-implicit integration to the wave equation that allows for large time stepping, and only involves previously computed wave fields. Thus, ADI methods are one-step and demand low memory. These methods were developed by Peaceman and Rachford [34] to solve parabolic equations, and then used in hyperbolic problems [8], [21], [30]. Since then, several ADI schemes have focused on solving the acoustic wave equation by using compact finite differences (CFD) to discretize the spatial derivatives. Recent applications are [1], [13], [14], [22], [24], and [28]. A CFD formula linearly couples discrete evaluations of a function and its derivative of a certain order at subsequent grid points, and requires solving a banded system of linear equations (SEL) for the latter values along a whole grid line. Given this formulation, a CFD stencil presents a shorter length compared to a traditional FD formula of similar accuracy, and this property shortens the linear system bandwidth. For instance, a set of fourth-order CFD formulas can be applied through the solution of a tridiagonal system via a fast solver like Thomas' algorithm (TA). The literature on CFD is broad, but the seminal paper by Lele [25] presents Taylor-based constructions of central and lateral formulas of several orders of accuracy, and their optimization for maximum resolution and low dispersion on one-way advection problems. More recent contributions on CFD can be found in the works cited above. Notice that ADI CFD methods have shown a potential for fast and accurate wave simulations, combining the large ADI CFL condition with the low-dispersive propagation modeling of CFD.

Time-explicit FD wave propagation methods based on local spatial stencils are well suited for domain decomposition and code parallelization. The main reasons are their formulation on structured meshes and their low data dependency among grid sub-domains. Particularly, implementations on Graphics Processing Units (GPU) have become useful and powerful wave simulation tools in 2-D and 3-D, given that realistic applications may become computationally intensive even in two spatial dimensions [31], [33], [39], [47], [49], [54]. Conventional ADI methods and CFD schemes aiming for high computational efficiency must rely on fast SEL solvers. In the case of parallel implementations, SEL solvers must be amenable to parallelism and efficiently deal with data dependencies and scalability issues. On GPUs, most ADI methods have been developed for parabolic equations, as those found in [15], [46], [50], and [53]. Alternatively, CFD implementations on GPU have been mostly assembled for computational fluid dynamics based on solving the Navier-Stokes equations, as given by [16], [29], and [45].
All these implementations with fourth-order spatial accuracy have adopted the Cyclic Reduction (CR) tridiagonal solver, and apply hardware-dependent specializations and optimizations for better performance. However, hybrid CR-TA solvers have also been proposed in [18] and [23]. In addition, solution methods based on approximate matrix inversion have been alternatively used by [48], and the authors in [11] and [12] developed a problem-based solution approach. Thus, the last word on GPU tridiagonal solvers has not yet been said, and this motivates the recent reviews and exploratory studies in [20] and [27].

This work builds on the contribution of Córdova et al. [10]. Grid types, FD spatial discretizations, and the CN-ADI time integration used by the former methods in [10] are the same that support our current numerical schemes. However, the latter schemes differ in two main numerical aspects. First, our new CFD nodal method employs exclusively fourth-order compact stencils at all grid points, as given by Lele in [25] and summarized in Appendix A, whereas the former method reduces accuracy to third order at boundary closures. On wave problems exhibiting boundary layers, or any source of strong gradients near boundaries, the fully fourth-order method may perform more precisely. Second, the former mimetic SG method, although globally fourth-order accurate, presents a formulation in terms of compact-like operators that apply differentiation in two steps. This formulation feature has no advantage at the computation level, so we here reformulate this SG method in terms of standard mimetic operators, originally presented in [5]-[6] and detailed in Appendix B. From the implementation point of view, this work undertakes the parallelization of both new schemes on multi-core (CPU) and many-core (GPU) architectures, and evaluates their performance to measure limiting speedups. It is worth mentioning that the literature on parallel ADI FD methods for wave propagation in acoustic or more general media is rather scarce, and we have not found schemes based on CFD or MFD other than the pioneering implementations by Moya in [32]. This deficiency highly motivates this work.

The remainder of this paper is organized in the following way. Section 2 briefly presents the formulation of the CFD and mimetic numerical methods for acoustic wave motion, along with the CPU complexity analysis of the associated pseudo-codes. In Section 3, we describe the progressive code optimization and parallelization of our methods to improve computational performance in the available CPU and GPU scenarios. Section 4 states the numerical tests used to assess accuracy and convergence of these schemes, and to compare the speedup of execution times achieved by all their parallel implementations in Section 5. Our main conclusions are finally given in Section 6.
2. Numerical methods
In a medium Ω with density ρ and adiabatic compression κ, the propagation of acoustic waves obeys the following system of non-homogeneous differential equations

    \frac{\partial u(\vec{x},t)}{\partial t} = -\kappa \, \nabla \cdot \vec{v}(\vec{x},t) + f(\vec{x},t), \qquad
    \rho \, \frac{\partial \vec{v}(\vec{x},t)}{\partial t} = -\nabla u(\vec{x},t), \qquad \vec{x} \in \Omega, \; t \ge 0.    (1)

Above, the dependent variables are the scalar pressure u and the particle velocity vector \vec{v} = (v, w), while f may correspond to an additional forcing term. Appropriate initial and boundary conditions must be considered in order to find a particular solution to (1), and this velocity-pressure formulation is well suited to state boundary conditions of relevance in seismic applications. For instance, a free surface can be modeled by a homogeneous Dirichlet condition on u, and most absorbing conditions require the computation of u along the domain boundaries. However, we present the following formulation of our numerical methods on the square domain Ω = [0,1] × [0,1], where the pressure u(x, y, t) is prescribed by Dirichlet data at (x, y) ∈ ∂Ω, and the velocities at those edges must be computed during the simulation. Brief comments on the easy reformulation that also computes boundary values of u are appropriately given nonetheless. Initial distributions of all three dependent variables are given at t = 0.

2.1. The nodal CFD method

To simplify notation, we use a square N × N lattice to discretize Ω for a given integer N, and then cell sides have a common length h = 1/N. Grid values of the continuous fields u, v, and w are held by the matrices U, V, W, respectively. The results of FD differentiation of the discrete pressures are collocated in matrices U_x and U_y, according to the direction of differentiation, and similar structures are used in the case of the particle velocities. Using the CFD operators P and Q presented in Appendix A, pressure differentiation at all grid nodes is computed by the matrix operations

    U_x P^T = U Q^T, \qquad P U_y = Q U.

However, pressure is known along the boundary edges because of the Dirichlet conditions, thus this formulation also makes use of the reduced matrix \bar{U} that only holds interior grid values. Time updating of these inner discrete pressures is based on the discretization of momentum conservation in (1), and requires approximations to v_x and w_y. In space, this discretization is carried out by the reduced operators \bar{P} and \bar{Q}, also introduced in Appendix A, applied to the discrete velocities

    \bar{V}_x \bar{P}^T = \bar{V} \bar{Q}^T, \qquad \bar{P} \bar{W}_y = \bar{Q} \bar{W}.    (2)

The reduced matrices \bar{V} and \bar{W} are obtained from the whole matrices after removing the first and last rows of V, and the first and last columns of W, respectively. By doing so, we avoid unnecessary velocity differentiation along the boundaries.

The application of the alternating direction implicit (ADI) formulation to the acoustic model (1) starts by defining the discrete vector operators

    A_{1h} [\bar{U}, V, W]^T = [\, -\kappa \, \bar{V} \bar{Q}^T (\bar{P}^T)^{-1}, \; -\rho^{-1} U Q^T (P^T)^{-1}, \; 0 \,]^T,
    \qquad
    A_{2h} [\bar{U}, V, W]^T = [\, -\kappa \, \bar{P}^{-1} \bar{Q} \bar{W}, \; 0, \; -\rho^{-1} P^{-1} Q U \,]^T.    (3)

Next, the Crank-Nicolson discretization of the time derivatives leads to the fully implicit method

    \left( I - \frac{\Delta t}{2} A_{1h} - \frac{\Delta t}{2} A_{2h} \right) U_{CN}^{m+1}
    = \left( I + \frac{\Delta t}{2} A_{1h} + \frac{\Delta t}{2} A_{2h} \right) U_{CN}^{m}
    + \frac{\Delta t}{2} \left( A_F^{m} + A_F^{m+1} \right),    (4)

where Δt is the time step, U_{CN} represents the discrete solution [\bar{U}, V, W]^T at a given discrete time, and A_F = [F, 0, 0]^T collects the source contribution f at the grid interior.
Coupled equations of this sort are efficiently solved after the application of an ADI splitting, which only requires the solution of banded SEL for CFD differentiation along a single coordinate direction. The Peaceman-Rachford ADI decomposition of (4), as used in [10], states a calculation in two stages

    \left( I - \frac{\Delta t}{2} A_{1h} \right) U_{CN}^{*} = \left( I + \frac{\Delta t}{2} A_{2h} \right) U_{CN}^{m} + \frac{\Delta t}{2} A_F^{m},    (5)

    \left( I - \frac{\Delta t}{2} A_{2h} \right) U_{CN}^{m+1} = \left( I + \frac{\Delta t}{2} A_{1h} \right) U_{CN}^{*} + \frac{\Delta t}{2} A_F^{m+1},    (6)

each of which alternates the application of the operators A_{1h} and A_{2h}. The intermediate approximation U_{CN}^{*} is common in these ADI formulations, but only the approximations at the time levels t = m Δt and t = (m+1) Δt are consistent solutions to (1). According to the definition of A_{2h}, the first stage allows an explicit calculation of the vertical velocity W^{*} in terms of its previous values at time t = m Δt, and the differentiation of the pressure U^{m} along the y direction. However, this stage also couples the calculation (row by row) of the intermediate \bar{U}^{*} and V^{*} in the following two sets of linear systems

    \bar{U}^{*} \bar{P}^T + \frac{\Delta t}{2} \kappa \, \bar{V}^{*} \bar{Q}^T
    = \bar{U}^{m} \bar{P}^T - \frac{\Delta t}{2} \left( \kappa \, \bar{P}^{-1} \bar{Q} \bar{W}^{m} \right) \bar{P}^T + \frac{\Delta t}{2} F^{m} \bar{P}^T,
    \qquad
    V^{*} P^T + \frac{\Delta t}{2} \rho^{-1} U^{*} Q^T = V^{m} P^T.    (7)

We denote the matrices on the right-hand side of the above equations by A and B, respectively, which can be computed from the known solutions at t = m Δt. That is, A = \bar{U}^{m} \bar{P}^T - \frac{\Delta t}{2} ( \kappa \, \bar{P}^{-1} \bar{Q} \bar{W}^{m} - F^{m} ) \bar{P}^T and B = V^{m} P^T. Next, the rows of the intermediate pressures and horizontal velocities can be computed by forward and backward substitutions from the LU decomposition of the tridiagonal P and \bar{P}, in the following uncoupled approximation

    \bar{U}^{*}_{k+1} \bar{P}^T = A - \frac{\Delta t}{2} \kappa \, \bar{V}^{*}_{k} \bar{Q}^T,
    \qquad
    V^{*}_{k+1} P^T = B - \frac{\Delta t}{2} \rho^{-1} U^{*}_{k+1} Q^T,
    \qquad V^{*}_{0} = V^{m}.    (8)

This fixed-point iteration stops when ||\bar{U}^{*}_{k+1} - \bar{U}^{*}_{k}|| < ε and ||V^{*}_{k+1} - V^{*}_{k}|| < ε, for a given tolerance ε > 0. The second ADI stage (6) advances the solution to t = (m+1) Δt through a similar procedure. The horizontal velocity V^{m+1} is computed explicitly using its intermediate value V^{*} and the approximation U^{*}_x, according to the definition of A_{1h}. The columns of the pressure U^{m+1} and velocity W^{m+1} are successively computed by the fixed-point ADI iteration

    \bar{P} U^{m+1}_{k+1} = C - \frac{\Delta t}{2} \kappa \, \bar{Q} \bar{W}^{m+1}_{k},
    \qquad
    P W^{m+1}_{k+1} = D - \frac{\Delta t}{2} \rho^{-1} Q U^{m+1}_{k+1},
    \qquad W^{m+1}_{0} = W^{*},    (9)

where the matrices C = \bar{P} \bar{U}^{*} - \frac{\Delta t}{2} \bar{P} \left( \kappa \, \bar{V}^{*} \bar{Q}^T (\bar{P}^T)^{-1} - F^{m+1} \right) and D = P W^{*} are obtained from the intermediate variables, as one can see. In the above formulation, matrix inversion is only used for notational convenience in the auxiliary matrices A and C, but LU is actually employed to solve the 2N linear systems involved in their calculations. As mentioned earlier, the additional 2N linear systems embedded into each fixed-point iteration in (8) and (9) are also solved by forward/backward substitutions, using the previous LU factorization of P and \bar{P}.

The above CFD formulation is amenable to including absorbing damping layers that require the computation of discrete wave fields along the boundaries. By simply replacing the reduced operators \bar{P} and \bar{Q}, and the reduced matrices \bar{U}, \bar{V}, and \bar{W}, by their full matrix versions in the previous formulation steps as convenient, the boundary wave fields can also be obtained. These values can be further damped by the absorbing technique proposed in [7].
Even more, the versatile strategy developed in [44] for the damped wave equation can also be adopted after a straightforward reformulation of our CFD method.

A high-level implementation of the above nodal CFD method is sketched in Algorithm 1. This algorithm presents two sequential inner loops in lines 11 and 15, corresponding to the two ADI stages that finally calculate (\bar{U}, V, W)^{m+1} from the discrete solutions at t = m Δt. For clarity, each of these stages is broken down into a separate coding procedure given in Appendix C. At each of these procedures, there is no data dependency among the N linear systems solved, either row-wise or column-wise. Thus, each ADI set of N tridiagonal systems can be solved simultaneously by our parallel methods.
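As an illustration of the tridiagonal solves just described, the following sketch differentiates a single grid line with the fourth-order compact operators of Appendix A, using the Thomas algorithm mentioned in the Introduction. It is a minimal, self-contained example and not the authors' production code; the boundary rows assume the Lele closures reproduced in Appendix A, and all function names are ours.

```cpp
#include <vector>
#include <cstddef>

// Fourth-order compact differentiation of one grid line (n = N+1 >= 4 nodes):
// solve P v_x = Q v with tridiagonal P, using the Thomas algorithm (no pivoting).
// Coefficients follow the operators P and Q of Appendix A.
std::vector<double> compact_derivative(const std::vector<double>& v, double h) {
    const std::size_t n = v.size();
    std::vector<double> lo(n, 0.0), di(n, 0.0), up(n, 0.0), rhs(n, 0.0);

    // Boundary rows: v'_0 + 3 v'_1 = (-17 v_0 + 9 v_1 + 9 v_2 - v_3) / (6h), mirrored at the other end.
    di[0] = 1.0; up[0] = 3.0;
    rhs[0] = (-17.0 * v[0] + 9.0 * v[1] + 9.0 * v[2] - v[3]) / (6.0 * h);
    lo[n - 1] = 3.0; di[n - 1] = 1.0;
    rhs[n - 1] = (v[n - 4] - 9.0 * v[n - 3] - 9.0 * v[n - 2] + 17.0 * v[n - 1]) / (6.0 * h);

    // Interior rows: v'_{i-1} + 4 v'_i + v'_{i+1} = 3 (v_{i+1} - v_{i-1}) / h.
    for (std::size_t i = 1; i + 1 < n; ++i) {
        lo[i] = 1.0; di[i] = 4.0; up[i] = 1.0;
        rhs[i] = 3.0 * (v[i + 1] - v[i - 1]) / h;
    }

    // Thomas algorithm: forward elimination followed by back substitution, O(n).
    for (std::size_t i = 1; i < n; ++i) {
        double m = lo[i] / di[i - 1];
        di[i]  -= m * up[i - 1];
        rhs[i] -= m * rhs[i - 1];
    }
    std::vector<double> vx(n);
    vx[n - 1] = rhs[n - 1] / di[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )
        vx[i] = (rhs[i] - up[i] * vx[i + 1]) / di[i];
    return vx;
}
```

In the ADI stages, one such solve is performed independently for every row (or column) of the wave field matrices, which is exactly the parallelism exploited in Section 3.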
Algorithm 1: The nodal CFD method
Require: U_o, V_o, W_o ∈ R^{(N+1)×(N+1)}
Ensure: U, V, W ∈ R^{(N+1)×(N+1)}
 1: procedure Nodal(U_o, V_o, W_o, cfl, T_sim)
 2:   Δt ← cfl/(N c_max),  I_Δt ← T_sim/Δt
 3:   U ← U_o                                                      ▷ Initialization
 4:   V ← V_o
 5:   W ← W_o
 6:   (P, Q, P̄, Q̄) ← NodalCFDOperators(N)
 7:   for m ∈ 1 ... I_Δt do
 8:     A ← (U^m(:̄,:̄) − (Δt/2)(K .* (P̄\(Q̄ W^m(:,:̄))) − F^m)) P̄^T    ▷ A computation
 9:     B ← V^m(:̄,:) P^T
10:     W^*(:,:̄) ← W^m(:,:̄) − (Δt/2) R .* (P\(Q̄ U^m(:̄,:̄)))
11:     (Ū, V̄)^* ← ADI-rows(U^m, V^m, A, B)
12:     C ← P̄(U^*(:̄,:̄) − (Δt/2)(K .* ((V^*(:̄,:) Q̄^T)/P̄^T) − F^{m+1}))  ▷ C computation
13:     D ← P̄ W^*(:,:̄)
14:     V^{m+1}(:̄,:) ← V^*(:̄,:) − (Δt/2) R .* (U^*(:̄,:̄) Q^T / P^T)
15:     (Ū^{m+1}, W̄^{m+1}) ← ADI-columns(U^*, W^*, C, D)
16:     t ← t + Δt
17:   end for
18:   return U, V, W
19: end procedure

Algorithm 1 employs the following parameters and matrices:

• U_o, V_o, W_o: Initial conditions for pressure and velocities on the square N × N lattice that discretizes the region [0,1] × [0,1].
• K, R: Matrices of grid values of the material properties κ and ρ^{-1}, respectively.
• cfl, c_max: CFL stability parameter and the maximum value of the wave speed c = sqrt(κ/ρ).
• T_sim: Physical time of the global simulation.

Moreover, we use some special index notations in this pseudo-code. Matrix X̄ represents the reduced form of matrix X according to the formulations given in Sections 2.1 and 2.2. Matrix indexing by : refers to every element along the corresponding dimension, while :̄ only denotes the interior elements 2, 3, ..., N − 1.
Finally, .* represents the element-wise vector/matrix multiplication, as used in Octave and MATLAB.

As expected, the computational cost of this method is driven by the solution of the ADI-embedded SEL. LU has traditionally been employed for solving tridiagonal linear systems due to its computational speed, in addition to a high numerical stability for diagonally dominant matrices (i.e., the Gaussian elimination can proceed with no pivoting). The algorithm only performs 2N steps to solve a given system, which results in an algorithmic complexity of O(N). In this work, we also omit any pivoting strategy, even though the CFD matrices P and P̄ are not strictly diagonally dominant. Numerical results do not reflect any instability signs arising from this restraint.

The operation complexity depends on the grid density N, which sets the dimension of the vectors and matrices used by the nodal CFD method. In the case of the reduced matrices P̄, Q̄, Ū, and others, N − 1 ≈ N, for N ≫ 1.
The total number of time iterations simply results in I_Δt = N c_max T_sim / cfl, where T_sim is the physical simulation time. In this analysis, let us consider the matrix operations in Table 1.

Table 1: Computational order of the matrix operations carried out by both the nodal CFD and the mimetic SG methods
Operation               Data type                  Computational cost   Order
Assignment              Dense matrix               T_=                  N^2
Addition/Subtraction    Dense matrix               T_+                  N^2
Product                 Scalar x Dense matrix      T_λ                  N^2
Product                 Sparse x Sparse matrices   T_SS                 N
Product                 Dense x Sparse matrices    T_XS                 N^2
SEL solutions by LU     N tridiagonal systems      T_SEL                N^2
Frobenius norm          Dense matrix               T_N                  N^2

Table 2 lists the number of matrix operations performed by the nodal CFD method during the I_Δt time iterations. From there, one can easily see that the time complexity of this scheme is of polynomial order, and highly dominated by the row-wise and column-wise ADI cycles (see statements 11 and 15 in Algorithm 1). In particular, the dominant complexity term is T_CFD ≈ I_Δt · k_max · N^2 = O(N^3), where k_max is the maximum number of iterations spent by each ADI stage to converge. In all numerical experiments carried out in this work, k_max varies from 6 to 8, and thus we expect k_max to be of O(1) in similar wave propagation applications.
Table 2: Total amount of matrix operations carried out by the nodal CFD method

Section                      Maximum iterations
Initialization               1
A computation                I_Δt
W^* computation              I_Δt
ADI-rows iteration (8)       I_Δt · k_max
C computation                I_Δt
V^{m+1} computation          I_Δt
ADI-columns iteration (9)    I_Δt · k_max

Along our optimization and parallelization process, we implement and test slightly different versions of the ADI-embedded iterations (8) and (9), in search of a higher level of parallelism. In particular, the matrix V^*_{k+1} might be computed from the pressure U^*_k in the second step of (8), so this calculation can proceed simultaneously with the first step that updates U^*_{k+1}. A similar strategy can be applied in (9) for a simultaneous computation of U^{m+1}_{k+1} and W^{m+1}_{k+1}, making this ADI formulation more suitable for parallel platforms. This simple idea is similar to the one behind the iterative Jacobi linear solver, which is fully parallelizable in comparison to its Gauss-Seidel counterpart. However, this Jacobi-like ADI actually requires a few more iterations for convergence than the above implementation, which delays the V^*_{k+1} computation until U^*_{k+1} is available. In addition, this ADI choice also leads to a reduction of the CFL stability limit of this nodal method of around 90%, and this deficiency is even higher in the case of the mimetic SG method. Thus, we discard this implementation choice in the current developments.
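The trade-off between the two update orderings can be illustrated on a toy coupled fixed point. The sketch below is only an analogy, not the wave-propagation iteration itself: it solves a small diagonally dominant 2x2 system with both orderings and reports the iteration counts. The Gauss-Seidel-like variant (reusing the freshly updated value, as our delayed ADI update does) typically converges in fewer sweeps than the Jacobi-like variant, whose two updates are mutually independent and hence parallelizable.

```cpp
#include <cmath>
#include <cstdio>

// Solve  4x + y = 6,  x + 3y = 7  (solution x = 1, y = 2) by two fixed-point orderings.
// Jacobi-like: both updates use old values (independent, parallelizable).
// Gauss-Seidel-like: the second update reuses the fresh x (sequential, fewer sweeps).
int main() {
    const double eps = 1e-10;
    for (int gauss_seidel = 0; gauss_seidel <= 1; ++gauss_seidel) {
        double x = 0.0, y = 0.0;
        int k = 0;
        while (true) {
            double x_new = (6.0 - y) / 4.0;
            double y_new = (7.0 - (gauss_seidel ? x_new : x)) / 3.0;
            double diff = std::fabs(x_new - x) + std::fabs(y_new - y);
            x = x_new; y = y_new; ++k;
            if (diff < eps) break;
        }
        std::printf("%s ordering: %d iterations, x = %.6f, y = %.6f\n",
                    gauss_seidel ? "Gauss-Seidel-like" : "Jacobi-like", k, x, y);
    }
    return 0;
}
```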
2.2. The staggered MFD method

In this formulation, the grid distribution of the discrete wavefields takes place on a rectangular staggered grid. In analogy with the nodal mesh used by the previous CFD method, let us consider the 1-D grid of N cells of length h along the x direction, with nodes x_i = i·h, and cell centers x_{i+1/2} = (x_i + x_{i+1})/2 for i = 0, ..., N − 1. These grid nodes are collected in the vector X_n, while X_cb is another vector comprising the cell centers in addition to the grid edges, that is X_cb = (0, x_{1/2}, ..., x_{N−1/2}, 1). Using the analogous vectors Y_n and Y_cb along the y direction, the staggered distribution of the discrete pressures corresponds to the mesh locations given by X_cb ⊗ Y_cb, where ⊗ represents the standard cartesian product operator used in set theory. Alternatively, the grid sites of the discrete horizontal and vertical velocities are given by X_n ⊗ Y_cb and X_cb ⊗ Y_n, respectively. According to these staggered distributions, the matrix dimensions of the discrete wavefields are U ∈ R^{(N+2)×(N+2)}, V ∈ R^{(N+2)×(N+1)} and W ∈ R^{(N+1)×(N+2)}, in contrast to the previous nodal formulation where all these matrices are (N+1)×(N+1) square.
Due to the boundary conditions, we again make use of the reduced matrices V̄ and W̄, obtained from the original V and W after removing rows from the former and columns from the latter, as before.

The staggered FD differentiation employs the mimetic fourth-order operators G and D given in Appendix B, for an approximation of the partial derivatives of all three wavefields along both directions

    U_x = U G^T, \qquad U_y = G U,    (10)

    \bar{V}_x = \bar{V} D^T, \qquad \bar{W}_y = D \bar{W}.    (11)

Next, we replace the nodal CFD definition of the discrete vector operators A_{1h} and A_{2h} by the following staggered approach

    A_{1h} [\bar{U}, V, W]^T = [\, -\kappa \, \bar{V} D^T, \; -\rho^{-1} U G^T, \; 0 \,]^T,
    \qquad
    A_{2h} [\bar{U}, V, W]^T = [\, -\kappa \, D \bar{W}, \; 0, \; -\rho^{-1} G U \,]^T.    (12)

With the above definitions at hand, the formulation of this staggered MFD method proceeds by replicating the time discretization and the Peaceman-Rachford two-stage ADI decomposition applied in the previous section. A pseudo-code of this method is given by Algorithm 2. For completeness, the two ADI stages are broken down into separate procedures, given by Algorithm 4 in Appendix C. Algorithm 2 employs the same parameters used by the nodal Algorithm 1, as well as similar matrix and indexing notation.

The complexity analysis of this method considers the same matrix operations defined in Table 1. The new Table 3 lists the total amount of these operations performed by the staggered method during the I_Δt time iterations. Similar to the nodal scheme, the operational cost is an N-dependent polynomial, with a leading term given by the solution of the ADI stages (referred to in statements 11 and 15 of Algorithm 2), which we denote as T_MFD. Specifically, T_MFD ≈ I_Δt · k_max · N^2 = O(N^3). Thus, the total amount of operations is less than those employed by the nodal method. However, the tridiagonal structure of the embedded linear systems of the former method, and our LU-based solution strategy, make the operational complexity of both methods equivalent.
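Unlike the compact scheme, the mimetic differentiation in (10)-(11) is fully explicit. The sketch below applies the interior fourth-order staggered stencil shared by the Castillo-Grone operators D and G away from the boundaries; it is a simplified illustration under our own naming that assumes uniformly spaced cell-center data, and it omits the one-sided mimetic boundary closures of Appendix B that are required near the grid edges.

```cpp
#include <vector>
#include <cstddef>

// Fourth-order staggered differentiation, interior stencil only.
// u[k] holds the field at the cell centers x_{k+1/2}, k = 0..N-1 (uniform spacing h).
// Returns df/dx at the nodes x_j, j = 2..N-2, where the centered four-point
// stencil (1, -27, 27, -1)/(24h) fits; the mimetic boundary rows of Appendix B,
// which also involve the edge values, are omitted in this sketch.
std::vector<double> staggered_dx_interior(const std::vector<double>& u, double h) {
    const std::size_t N = u.size();
    std::vector<double> d;
    if (N < 4) return d;
    d.reserve(N - 3);
    for (std::size_t j = 2; j + 1 < N; ++j)
        d.push_back((u[j - 2] - 27.0 * u[j - 1] + 27.0 * u[j] - u[j + 1]) / (24.0 * h));
    return d;
}
```

Because every output value only reads four inputs, these products with D and G map directly onto element-level parallel updates, which is what the GPU implementation of Section 3 exploits.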
3. Computational optimization
The implementation of the methods presented in Section 2 takes place on an Intel(R) Core(TM) i7-4770 CPU with 4 cores running at 3.4 GHz, under an Ubuntu LTS operating system.
Original codes of the proposed methods were developed in Octave and used the LU decomposition of the CFD matrices P and P̄ to solve the ADI-embedded SEL [10]. To better use the capabilities of our CPU architecture, we translate these codes to C++ and use a 4.x gcc compiler in this process.

Algorithm 2: The staggered MFD method
Require: U_o ∈ R^{(N+2)×(N+2)}, V_o ∈ R^{(N+2)×(N+1)}, W_o ∈ R^{(N+1)×(N+2)}
Ensure: U ∈ R^{(N+2)×(N+2)}, V ∈ R^{(N+2)×(N+1)}, W ∈ R^{(N+1)×(N+2)}
 1: procedure Staggered(U_o, V_o, W_o, cfl, T_sim)
 2:   Δt ← cfl/(N c_max),  I_Δt ← T_sim/Δt
 3:   U ← U_o                                                  ▷ Initialization
 4:   V ← V_o
 5:   W ← W_o
 6:   (D, G) ← StaggeredMFDOperators(N)
 7:   for m ∈ 1 ... I_Δt do
 8:     A ← U^m(:̄,:̄) − (Δt/2)(K .* (D W^m(:,:̄)) − F^m)          ▷ A computation
 9:     B ← V^m(:̄,:)
10:     W^*(:,:̄) ← W^m(:,:̄) − (Δt/2) R .* (G U^m(:̄,:̄))
11:     (Ū, V̄)^* ← ADI-rows(U^m, V^m, A, B)
12:     C ← U^*(:̄,:̄) − (Δt/2)(K .* (V^*(:̄,:) D^T) − F^{m+1})      ▷ C computation
13:     D ← W^*(:,:̄)
14:     V^{m+1}(:̄,:) ← V^*(:̄,:) − (Δt/2) R .* (U^*(:̄,:̄) G^T)
15:     (Ū^{m+1}, W̄^{m+1}) ← ADI-columns(U^*, W^*, C, D)
16:     t ← t + Δt
17:   end for
18:   return U, V, W
19: end procedure

Table 3: Total amount of matrix operations carried out by the staggered MFD method
Section                      Maximum iterations
Initialization               1
A computation                I_Δt
W^* computation              I_Δt
ADI-rows iteration           I_Δt · k_max
C computation                I_Δt
V^{m+1} computation          I_Δt
ADI-columns iteration        I_Δt · k_max

The first optimization step is based on using different compilation flags, among which the −O optimization levels lead to an immediate performance improvement, with similar effects on the new CFD code. The original code in [10] solved the inherent SELs in a vector-oriented fashion, i.e., matrix updates and system solutions are performed one row/column at a time. To take better performance advantage of the algebra libraries, a matrix-oriented approach is taken for the C++ implementation. That is, updates and system solutions proceed simultaneously on the full matrix. It can be seen that most of the vector operations in both the CFD and MFD algorithms decouple, so they can be performed as a single matrix operation. In particular, the CFD SELs with multiple right-hand sides and the same matrix can be solved simultaneously.

3.1.2. Parallel versions

For multiprocessing, we use the Octave parallel package to distribute the computation along rows/columns among several processes. At first, one process is used for each row/column, but this solution performs extremely badly. Little work is assigned to each process, resulting in most of the computing power being wasted in managing these processes and their communication. A second attempt consists of using a small number of processes (up to 8 in an eight-core CPU), and then lists of columns/rows are assigned to each of these processes. Performance results are slightly better; however, the single-threaded implementation is still faster for all tested numerical meshes, with the grid of biggest size (N = 1024) as the only exception, due to the cost of communication/synchronization among processes in this parallel package.

Due to the results discussed above, we return to the original vector orientation of this problem, as it parallelizes naturally thanks to the decoupled operations at the vector level. The partition proceeds at the row/column level, distributing each vector in a thread pool that performs the required operations. Our C++ implementations follow and exploit this data orientation. The next optimization step applied to these vector-oriented versions is loop parallelization using OpenMP. OpenMP allows us to take full advantage of Intel multi-core CPU architectures. In our OpenMP loop parallelization, we have taken into account two intrinsic aspects of our numerical discretization: (i) data dependencies, and (ii) stencil dependencies. Each parallel loop collects a common set of updating instructions applied on a particular grid zone (boundary, near-boundary, or interior nodes). Scheduling of OpenMP threads is established by the flag auto, which invokes as many threads as the number of available CPU cores. We observe in our experiments that this option provides the best performance for a wide range of scenarios. Increasing the number of threads over the number of cores usually entails an overhead due to communication conflicts, whereas using fewer threads than available cores might lead to an underuse of a multi-core platform.
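The structure of this loop parallelization can be sketched as follows. Every row of the right-hand-side matrix is an independent tridiagonal solve, so a single OpenMP directive distributes the rows among the available cores. This is an illustrative fragment under our own naming (compile with -fopenmp), not the exact layout of our production loops, and the boundary rows of the compact operator are simplified to the interior stencil.

```cpp
#include <vector>
#include <cstddef>

// Solve one tridiagonal system  x_{j-1} + 4 x_j + x_{j+1} = rhs_j  in place
// (interior rows of the compact operator P; boundary rows simplified here).
static void thomas_141(std::vector<double>& rhs) {
    const std::size_t n = rhs.size();
    std::vector<double> d(n, 4.0), u(n, 1.0);
    for (std::size_t j = 1; j < n; ++j) {           // forward elimination
        double m = 1.0 / d[j - 1];
        d[j]   -= m * u[j - 1];
        rhs[j] -= m * rhs[j - 1];
    }
    rhs[n - 1] /= d[n - 1];                          // back substitution
    for (std::size_t j = n - 1; j-- > 0; )
        rhs[j] = (rhs[j] - u[j] * rhs[j + 1]) / d[j];
}

// Row-wise ADI update: the N systems are independent, so OpenMP distributes them.
void solve_rows(std::vector<std::vector<double>>& rhs_rows) {
    #pragma omp parallel for schedule(auto)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(rhs_rows.size()); ++i)
        thomas_141(rhs_rows[i]);
}
```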
Once CPU parallelization has been appropriately exploited, we address a new GPU implementation of our numerical schemes. On the available architecture, the programming language is CUDA C based on the nvcc compiler. In this migration process of our original sequential codes, we also use the CUDA Toolkit 7 and functions of the cuBLAS library, and only develop specific kernels for particular functionalities not supported by cuBLAS.

We next comment on crucial implementation details. Due to the relatively small sizes of the matrices involved in our test cases (up to 1024 × 1024), instead of parallelizing at the row/column level, we focus on optimizing the matrix operations themselves at the element level. This allows achieving a much higher level of parallelism, given that parallelization is performed throughout the full extent of the algorithm, and not only at the row/column solvers.

As already mentioned, we implement some missing banded matrix functionalities on top of cuBLAS. For instance, the Level 3 (L3) product of a banded matrix times a dense matrix is not provided, while the Level 2 (L2) multiplication of a banded matrix times a vector is indeed available. We build the L3 operation over the L2 one by decomposing a matrix-matrix multiplication into independent matrix-vector products and accumulation operations. Our first approach, using streams to overlap the matrix-vector products, achieves a very low occupancy due to the cost of kernel launching relative to the workload of each thread. We found that the best results are obtained when the whole matrix is computed in a single launch, as the work per thread increases. Here, a higher number of threads are effectively launched, increasing the opportunities to obtain better occupancy. Similar conclusions hold for the stream-based distance and matrix norm computations.
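A host-side sketch of the first (stream-free, column-by-column) way to assemble the missing banded-times-dense product from the available L2 routine is given below. It assumes the banded operator is already on the device in the BLAS band-storage layout expected by cublasDgbmv, and it only illustrates the decomposition into matrix-vector products; as discussed above, fusing the whole product into a single launch performed better in our experiments.

```cpp
#include <cublas_v2.h>
#include <cstddef>

// C (m x p) = A_banded (m x n, kl sub- and ku super-diagonals) * B (n x p).
// A is on the device in BLAS band storage (leading dimension lda >= kl + ku + 1);
// B and C are dense, column-major, on the device. One cublasDgbmv call per column of B.
cublasStatus_t banded_times_dense(cublasHandle_t handle,
                                  int m, int n, int p, int kl, int ku,
                                  const double* dA, int lda,
                                  const double* dB, int ldb,
                                  double* dC, int ldc) {
    const double one = 1.0, zero = 0.0;
    for (int j = 0; j < p; ++j) {
        cublasStatus_t st = cublasDgbmv(handle, CUBLAS_OP_N, m, n, kl, ku,
                                        &one, dA, lda,
                                        dB + static_cast<std::size_t>(j) * ldb, 1,
                                        &zero, dC + static_cast<std::size_t>(j) * ldc, 1);
        if (st != CUBLAS_STATUS_SUCCESS) return st;
    }
    return CUBLAS_STATUS_SUCCESS;
}
```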
On the other hand, our GPU implementations need to apply reduction operations, like, for instance, those involved in the matrix norm computations. In this case, the parallelization of the reduction process is performed in a hierarchical fashion. Each thread adds a pair of elements, and then half of the threads add the partial results computed. This process is repeated for log2(N) iterations, until the whole list is reduced to a single value. Several practical considerations must be taken into account to achieve optimal performance: minimal divergence, array indexing to optimize data locality, unrolling of some operations, etc. (In order to preserve the generality of our implementations, we do not unroll the computation chain; this could yield much higher speedups at the expense of problem specificity.) Bigger reductions are computed in two steps. First, one reduces in parallel all the columns of the matrix to a vector in shared memory. Then, this vector is reduced again to a single value. Here, we have used this reduction technique to calculate the matrix Frobenius norm in our new kernels, as required to test the ADI convergence according to the algorithms in Appendix C.

Finally, we want to notice that the stopping criterion of the ADI iterations in Algorithms 3 and 4 of Appendix C imposes an important constraint on GPU performance. The number of steps required until convergence is actually limited by the constant parameter k_max, which varies from 6 to 8 in our experimental set. However, testing convergence on the Frobenius norm of the matrices U, V, and W, updated by GPU processing, requires that a floating point value be transferred from device to host to compare against the ε tolerance. This is wasteful, as the PCIe bus is slow and has a high latency. To circumvent this problem, we study the average number of steps until convergence. In our cases, this number is roughly 6, so we perform this minimum number of iterations before convergence starts being tested, saving most of the unnecessary device-to-host transfers.

Concerning our new CUDA schemes, we want to emphasize that checking for the inner ADI convergence forces the synchronization of kernel threads, slowing down the global computation and leading to a bottleneck that hampers performance. As said above, convergence is almost always achieved in a constant number of iterations for each test case, so we opt for executing a minimum number of iterations before any checking takes place.
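The deferred convergence test can be organized as in the host-side fragment below: the fixed-point loop runs a minimum number of sweeps before the squared Frobenius norm of the update (already reduced on the device) is copied back and compared against the tolerance. The kernel and buffer names are placeholders for illustration; only cudaMemcpy is an actual CUDA runtime call.

```cpp
#include <cuda_runtime.h>
#include <cmath>

// Placeholder for one ADI fixed-point sweep that also reduces the squared
// Frobenius norm of the update (X_{k+1} - X_k) into the device scalar d_diff2.
void adi_sweep_and_reduce(double* d_fields, double* d_diff2);   // hypothetical kernel launcher

// Run at least k_min sweeps before paying for a device-to-host transfer of the
// convergence measure; stop once the update norm falls below eps or k_max is reached.
int adi_fixed_point(double* d_fields, double* d_diff2,
                    int k_min, int k_max, double eps) {
    int k = 0;
    double diff2 = 0.0;
    do {
        adi_sweep_and_reduce(d_fields, d_diff2);
        ++k;
        if (k < k_min) continue;                 // skip the costly PCIe round trip
        cudaMemcpy(&diff2, d_diff2, sizeof(double), cudaMemcpyDeviceToHost);
    } while (k < k_max && (k < k_min || std::sqrt(diff2) >= eps));
    return k;                                    // number of sweeps performed
}
```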
4. Numerical tests
We next present the family of numerical tests that allow assessing the accuracy and convergence properties of both the CFD and MFD schemes. The same tests are used to evaluate their computation speedup gained through the optimization stages described previously, but this discussion is given in the next section. Under appropriate initial and Dirichlet boundary conditions (BCs), the propagation model (1) has the following exact solution

    u(x, y, t) = \left[ \Gamma \left( x^{k} - (1 - x)^{k} + y^{k} - (1 - y)^{k} \right)
    + \sin\!\left( \frac{2 \pi x}{\lambda} \right) \sin\!\left( \frac{2 \pi y}{\lambda} \right) \right]
    \cos\!\left( \frac{2 \pi t}{T} \right).    (13)

This solution u is time harmonic with period T and presents a spatial behaviour controlled by the polynomial coefficient Γ, with k a positive integer, and by the spatial period λ of the sinusoidal component. These four parameters allow adjusting the difficulty of the particular problem numerically solved. In particular, the test solution can switch from a purely sinusoidal function for Γ = 0, to one with significant gradients in the boundary neighbourhood in the case of Γ and k with large magnitudes.
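For reference, the exact solution (13) and the Frobenius-norm error against a computed pressure field can be evaluated as in the sketch below. The 2π factors follow our reading of (13), where λ and T are the spatial and temporal periods, and the function names are ours.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Exact pressure of Eq. (13): polynomial part of strength Gamma and exponent k,
// plus a sinusoidal part of spatial period lambda, harmonic in time with period T.
double u_exact(double x, double y, double t,
               double Gamma, int k, double lambda, double T) {
    const double pi = std::acos(-1.0);
    double poly = Gamma * (std::pow(x, k) - std::pow(1.0 - x, k)
                         + std::pow(y, k) - std::pow(1.0 - y, k));
    double sine = std::sin(2.0 * pi * x / lambda) * std::sin(2.0 * pi * y / lambda);
    return (poly + sine) * std::cos(2.0 * pi * t / T);
}

// Frobenius-norm error of a numerical pressure field U (row-major, (N+1)x(N+1)
// nodal values with spacing h = 1/N) at time t.
double frobenius_error(const std::vector<double>& U, int N, double t,
                       double Gamma, int k, double lambda, double T) {
    double h = 1.0 / N, acc = 0.0;
    for (int j = 0; j <= N; ++j)
        for (int i = 0; i <= N; ++i) {
            double d = U[static_cast<std::size_t>(j) * (N + 1) + i]
                     - u_exact(i * h, j * h, t, Gamma, k, lambda, T);
            acc += d * d;
        }
    return std::sqrt(acc);
}
```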
We first solve the case of Γ = 0 under homogeneous Dirichlet BCs in a homogeneous medium with unitary properties (c = κ = ρ = 1), and take values of λ and T for which (13) is a standing-wave solution (T = λ/√2 when c = 1). We perform ten numerical simulations of each scheme for an increasing number of grid cells N = 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024, with h = N^{-1}. In each case, the global simulation time corresponds to five time periods, namely T_SIM = 5T, and the time step Δt is taken as the maximum permitting stable calculations, Δt = h c^{-1} cfl_max. The term cfl_max is the well-known CFL bound, which varies between our two FD ADI schemes and was found in [10]. Specifically, we employ the cfl_max values of 0.91 and 0.81 in our CFD and MFD simulations, respectively. Next, a second set of harder tests with non-homogeneous BCs is also numerically solved, where Γ ≠ 0. In this case, we only inspect the representative cases of k = 2 and k = 9, and also assume Γ = k to simplify this current parametrization.
Figure 1 shows the absolute errors in Frobenius norm found in the numerical tests performed under homogeneous and heterogeneous BCs. In general, these errors steadily decay with h, validating our new parallel implementations. On both tests with mild difficulty, i.e., Γ = 0 and Γ = k = 2, the MFD method is slightly more accurate on most of the explored grids, a benefit of its staggered differentiation as compared to the nodal CFD scheme. However, when solving the hardest test with Γ = k = 9, the MFD errors on the most refined simulations suffer a clear stagnation, making useless any further grid refinement. Thus, we limit this experimental study, and the subsequent speedup analysis, to meshes where N ≤ 1024. We next estimate the convergence rates of both methods as h reduces, and list them in Tables 4 and 5, respectively. These estimations are based on the standard error weighting from two consecutive simulations. For each test, we also average the resulting rates after omitting the highest and lowest values, to quantify the overall convergence behaviour avoiding main outliers.
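The two-grid rate estimate used here can be computed as in the short helper below, assuming the usual formula rate = log(e1/e2)/log(h1/h2) for errors e1, e2 obtained on consecutive grid spacings h1 > h2.

```cpp
#include <cmath>

// Empirical convergence rate from two consecutive simulations:
// errors e1, e2 measured on grids of spacing h1 > h2 (e.g., h = 1/N).
double convergence_rate(double e1, double h1, double e2, double h2) {
    return std::log(e1 / e2) / std::log(h1 / h2);
}
```

For instance, halving h while the error drops by a factor of 16 yields a rate of 4, the nominal spatial order of both schemes.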
Γ = k = Γ = k = c f l MAX = N Γ = Γ = k = Γ = k = Average 4.02 2.07 2.02 the overall convergence behaviour avoiding main outliers. For both schemes, convergence degrades as the problembecomes harder, being almost quartic on the homogeneous test (limited by the spatial discretization), and reducing toquadratic on the hardest heterogeneous test (more constrained by the order of CN time integration). We would like toremark that accuracy and convergence results reported in [10], clearly show a MDF superiority on similar tests underhomogeneous BCs. This disadvantage of the former CFD method arises from a reduction of nominal accuracy fromfourth to third order at grid boundaries. Here, we opt for a fully fourth order set of CFD stencils at all grid pointswhen implementing our current method.
5. Comparison of performance results
Figures 2a and 2b show the speedups in log2 scale, relative to the original sequential Octave codes, of each implementation delivered by the optimization process discussed in Section 3. These results were obtained after solving the numerical test under heterogeneous BCs for k = 9. Speedup values observed on the alternative heterogeneous or homogeneous tests are very similar to the ones shown in these figures, confirming that the evaluation of the forcing function f, and the related operations at the boundaries, only marginally affect computation times.

The new CFD and MFD implementations using parallel Octave report a poor performance in most of our tests, being the former the better one, which finally improves to a 1.19x ratio at N = 1024. The sequential C++ implementations yield a global improvement that reaches 5.39x (CFD) and 2.6x (MFD) at the finest grid. This improvement is mainly due to the software optimizations performed by the compiler. Now, the C++ parallel codes that use OpenMP achieve a higher improvement at all grids, and limiting speedups reach 11.1x (CFD) and 6.74x (MFD) at N = 1024. The CUDA CFD code displays speedups that steadily grow with N until 38.85x at the best resolved grid. This is a clear indication of the high level of parallelism achieved by our GPU implementation, on the range of grid sizes tackled in this work. On the other hand, the speedup of the CUDA MFD method seems to converge steadily on grids with N ≥ 96 to a limit of 40x, with a minor improvement at the limiting mesh, and this behavior reveals an early high occupancy of the GPU.

Figure 2: Speedup values in log2 scale of the C++ and parallel codes, relative to the sequential Octave implementations, of both methods, the nodal CFD (a) and the staggered MFD (b), when solving the heterogeneous test for k = 9.

As a complementary set of speedup results, we list in Table 6 only the ratios achieved by the parallel CFD and MFD methods, developed either in C++ using 8 threads or in CUDA, relative to their sequential C++ implementations. In most of our grids, these speedups improve with the mesh size. In particular, the improvements accomplished by the GPU implementations grow steadily, given the high device occupancy by our bigger matrices. It is worth reminding that the grid size is limited in our tests by the numerical precision achieved by our methods, and higher refinements would not lead to more accurate results, at least for the MFD scheme. Thus, we do not test our parallel implementations with bigger matrices. Again, using the sequential C++ computations as reference, the OpenMP implementations only reach limiting improvements of 2.06x (CFD) and 2.59x (MFD) at N = 1024, and on the smallest grids the CUDA versions are not effective, given that launching times were too high with respect to the kernel working times.

Table 6: Speedups achieved by the parallel CFD and MFD methods with respect to their C++ implementations in the numerical test under heterogeneous BCs (k = 9)

Nodal CFD          16     24     32     48     64     96     128    256    512    1024
C++ (8 threads)    2.75   2.75   2.76   2.75   2.64   2.55   2.52   2.27   2.07   2.06
CUDA               0.13   0.18   0.23   0.41   0.79   1.22   1.56   2.54   4.82   7.21

Staggered MFD
C++ (8 threads)    2.29   2.12   1.96   1.78   1.98   2.04   2.03   2.15   2.31   2.59
CUDA               0.13   0.29   0.45   0.85   1.17   2.73   3.08   5.94   8.08   15.81

The testing process of the ADI convergence (inner loops of Algorithms 3 and 4 in Appendix C) may become a critical restricting factor to performance, as it demands several small device-host DMA transfers. As the number of loop iterations until convergence is roughly constant, we minimize this bottleneck by delaying checking the stopping criterion until a fixed number of iterations is performed.
6. Conclusions
This paper first focuses on the computer parallelization of two fourth-order FD methods to model 2-D acoustic waves on available multi-core (CPU) and many-core (GPU) architectures. The first scheme uses CFD operators on nodal grids and requires solving a tridiagonal system of linear equations (SEL) to compute coupled derivative values along each coordinate grid line. Alternatively, the second scheme employs MFD stencils on SG for local differentiation free of SEL solution. Both methods share a Crank-Nicolson (CN) time integration combined with a Peaceman-Rachford splitting that leads to an efficient ADI updating scheme. These FD ADI methods have shown better stability and convergence properties than an explicit Leapfrog integration in a previous work [10], but the computing expenses of the ADI-embedded iterations demand parallelization to be competitive on practical propagation problems. Along the implementation process of each method, we present a sequential C++ code with optimal compilation flags, and develop three different parallel versions. Precisely, an Octave and a C++ multithreading code, both fully exploiting array-oriented storage for matrix and vector operations, and a CUDA implementation on an NVIDIA GTX 960 Maxwell.

Performance speedups of the parallel implementations were assessed relative to the computing times of the sequential C++ code, but we additionally report results with respect to the original Octave version. These results can be insightful for someone replicating our optimization procedures for similar numerical methods. The results of the GPU methods show a significant speedup gain over their parallel counterparts. The main reasons are a high occupancy of the GPU on the explored grid densities, better realized by the CFD method, and a minimization of CPU-GPU transfers by delaying the convergence testing of the ADI-embedded iterations until an expected number of cycles is actually performed. Relative to Octave, the GPU CFD method achieves a speedup of 38.85x at the finest grid, while the GPU MFD reaches a higher 41x. The same performances, but measured with respect to the optimal sequential C++ versions, are 7.21x for the former and 15.81x for the latter. Keeping matrices in GPU memory and minimizing host-to-device transfers are essential in obtaining such performances.

Secondly, the numerical experiments carried out in this study also enlighten the convergence properties of these methods on a test suite of increasing difficulty. On a problem with an exact harmonic solution, the MFD accuracy was slightly better, and both methods exhibit convergence rates close to 4, matching the differentiation order used in the spatial discretization. However, both convergences decay to second order on smooth problems with significant gradients at the boundaries, which may reveal a stronger dependence on the time integration errors. In any case, these second-order experimental rates are consistent with the theoretical global convergence of 2, limited by the time discretization.

Acknowledgments
We are thankful to the CAF editors and reviewers for insightful comments and revisions to this manuscript. The first author was partially supported by the Generalitat de Catalunya under agreement 2017-SGR-962 and the RIS3CAT DRAC project (001-P-001723). The research leading to these results has received funding from the European Union's Horizon 2020 Programme, grant agreement No. 828947, and from the Mexican Department of Energy, CONACYT-SENER Hidrocarburos grant agreement No. B-S-69926. This project has also received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 777778 (MATHROCKS). O. Rojas also thanks the European Union's Horizon 2020 Programme under the ChEESE Project, grant agreement No. 823844.
References

[1] Y. Abdulkadir. Comparison of finite difference schemes for the wave equation based on dispersion. Journal of Applied Mathematics and Physics, 3:1544-1562, 2015.
[2] L. Di Bartolo, C. Dors, and W. J. Mansur. A new family of finite-difference schemes to solve the heterogeneous acoustic wave equation. Geophysics, 77(5):T187-T199, 2012.
[3] J. O. Blanch and J. O. A. Robertsson. A modified Lax-Wendroff correction for wave propagation in media described by Zener elements. Geophysical Journal International, 131(2):381-386, 1997.
[4] T. Bohlen and F. Wittkamp. Three-dimensional viscoelastic time-domain finite-difference seismic modelling using the staggered Adams-Bashforth time integrator. Geophysical Journal International, 204(3):1781-1788, 2016.
[5] J. E. Castillo and R. D. Grone. A matrix analysis approach to higher-order approximations for divergence and gradients satisfying a global conservation law. SIAM Journal on Matrix Analysis and Applications, 25(1):128-142, 2003.
[6] J. E. Castillo, J. M. Hyman, M. Shashkov, and S. Steinberg. Fourth- and sixth-order conservative finite difference approximations of the divergence and gradient. Applied Numerical Mathematics: Transactions of IMACS, 37(1-2):171-187, 2001.
[7] C. Cerjan, D. Kosloff, R. Kosloff, and M. Reshef. A nonreflecting boundary condition for discrete acoustic and elastic wave equations. Geophysics, 50(4):705-708, 1985.
[8] M. Ciment and S. H. Leventhal. Higher order compact implicit schemes for the wave equation. Mathematics of Computation, 29(132):985-994, 1975.
[9] L. Córdova. Diferencias finitas compactas nodales y centro distribuidas aplicadas a la simulación de ondas acústicas. PhD thesis, Universidad Central de Venezuela, 10 2017.
[10] L. Córdova, O. Rojas, B. Otero, and J. E. Castillo. Compact finite difference modeling of 2-D acoustic wave propagation. Journal of Computational and Applied Mathematics, 295:83-91, 2016.
[11] D. M. Dang, C. C. Christara, and K. R. Jackson. A parallel implementation on GPUs of ADI finite difference methods for parabolic PDEs with applications in finance. Canadian Applied Mathematics Quarterly, 17(4):627-660, 2009.
[12] D. M. Dang, C. C. Christara, and K. R. Jackson. Graphics processing unit pricing of exotic cross-currency interest rate derivatives with a foreign exchange volatility skew model. Concurrency and Computation: Practice and Experience, 26(9), 2010.
[13] D. Deng and C. Zhang. Application of a fourth-order compact ADI method to solve a two-dimensional linear hyperbolic equation. International Journal of Computer Mathematics, 90(2):273-291, 2013.
[14] H. Dongdong. An unconditionally stable spatial sixth-order CCD-ADI method for the two-dimensional linear telegraph equation. Numerical Algorithms, 72(4):1103-1117, 2016.
[15] D. Egloff. GPU in financial computing part III: ADI solvers on GPUs with application to stochastic volatility. WILMOTT Magazine, pages 51-53, 2011.
[16] V. Esfahanian, B. Baghapour, M. Torabzadeh, and H. Chizari. An efficient GPU implementation of cyclic reduction solver for high-order compressible viscous flow simulations. 92, 2013.
[17] J. T. Etgen and M. J. O'Brien. Computational methods for large-scale 3D acoustic finite-difference modeling: a tutorial. Geophysics, 72(5), 2007.
[18] M. Giles, E. László, I. Reguly, J. Appleyard, and J. Demouth. GPU implementation of finite difference solvers. In Proceedings of the 7th Workshop on High Performance Computational Finance, WHPCF '14, pages 1-8, Piscataway, NJ, USA, 2014. IEEE Press.
[19] R. W. Graves. Simulating seismic wave propagation in 3D elastic media using staggered-grid finite differences. Bulletin of the Seismological Society of America, 86(4):1091-1106, 1996.
[20] W. M. W. Hwu, editor. GPU Computing Gems Jade Edition. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2011.
[21] S. R. Iyengar and R. C. Mittal. High order difference schemes for the wave equation. International Journal for Numerical Methods in Engineering, 12(10):1623-1628, 1978.
[22] Q. Jinggang. The new alternating direction implicit difference methods for the wave equations. Journal of Computational and Applied Mathematics, 230:213-223, 2009.
[23] H. S. Kim, S. Wu, L. W. Chang, and W. W. Hwu. A scalable tridiagonal solver for GPUs. Pages 444-453, Sept 2011.
[24] S. Kim and H. Lim. High-order schemes for acoustic waveform simulation. Applied Numerical Mathematics, 57(4):402-414, 2007.
[25] S. K. Lele. Compact finite difference schemes with spectral-like resolution. Journal of Computational Physics, 103:16-42, 1992.
[26] A. Levander. Fourth-order finite-difference P-SV seismograms. Geophysics, 53:1425-1436, 1988.
[27] C. Li-Wen and W. W. Hwu. A guide for implementing tridiagonal solvers on GPUs. Pages 29-44, 2014.
[28] W. Liao. On the dispersion, stability and accuracy of a compact higher-order finite difference scheme for 3D acoustic wave equation. Journal of Computational and Applied Mathematics, 270:571-583, 2014.
[29] V. G. Mandikas and E. N. Mathioudakis. A parallel multigrid solver for incompressible flows on computing architectures with accelerators. The Journal of Supercomputing, 73(11):4931-4956, 2017.
[30] S. McKee. High accuracy ADI methods for hyperbolic equations with variable coefficients. IMA Journal of Applied Mathematics, 11(1):105-109, 1973.
[31] D. Michéa and D. Komatitsch. Accelerating a three-dimensional finite-difference wave propagation code using GPU graphics cards. Geophysical Journal International, 182(1):389-402, 2010.
[32] F. Moya. Parallelization of finite difference methods: nodal and mimetic solutions of the wave equation. Master's thesis, 2016.
[33] B. Otero, J. Francés, R. Rodríguez, O. Rojas, F. Solano, and J. Guevara-Jordan. A performance analysis of a mimetic finite difference scheme for acoustic wave propagation on GPU platforms. Concurrency and Computation: Practice and Experience, 29(4):e3880, 2017.
[34] D. W. Peaceman and H. H. Rachford. The numerical solution of parabolic and elliptic differential equations. Journal of the Society for Industrial and Applied Mathematics, 3(1):28-41, 1955.
[35] J. Qian, S. Wu, and R. Cui. Accuracy of the staggered-grid finite-difference method of the acoustic wave equation for marine seismic reflection modeling. Chinese Journal of Oceanology and Limnology, 31(1):169-177, Jan 2013.
[36] O. Rojas. Modeling of rupture propagation under different friction laws using high-order mimetic operators. PhD thesis, Claremont Graduate University joint to San Diego State University, California, USA, 2009.
[37] O. Rojas, S. M. Day, J. E. Castillo, and L. A. Dalguer. Modelling of rupture propagation using high-order mimetic finite differences. Geophysical Journal International, 172(2):631-650, 2008.
[38] O. Rojas, B. Otero, J. E. Castillo, and S. M. Day. Low dispersive modeling of Rayleigh waves on partly staggered grids. Computational Geosciences, 18(1):29-43, 2014.
[39] F. Rubio, M. Hanzich, A. Farrés, J. De la Puente, and J. M. Cela. Finite-difference staggered grids in GPUs for anisotropic elastic wave propagation simulation. Computers & Geosciences, 70:181-189, 2014.
[40] B. Runyan. A novel higher order finite difference time domain method based on the Castillo-Grone mimetic curl operator with applications concerning the time-dependent Maxwell equations. PhD thesis, San Diego State University, San Diego, USA, 2011.
[41] E. H. Saenger and T. Bohlen. Finite-difference modeling of viscoelastic and anisotropic wave propagation using the rotated staggered grid. Geophysics, 69:583-591, 2004.
[42] E. H. Saenger, N. Gold, and S. A. Shapiro. Modeling the propagation of elastic waves using a modified finite-difference grid. Wave Motion, 31:77-92, 2000.
[43] A. Sei and W. Symes. Dispersion analysis of numerical wave propagation and its computational consequences. Journal of Scientific Computing, 10:1-27, 1995.
[44] J. Sochacki, R. Kubichek, J. George, W. R. Fletcher, and S. Smithson. Absorbing boundary conditions and surface waves. Geophysics, 52(1):60-71, 1987.
[45] A. T. Srinath. A novel approach to evaluating compact finite differences and similar tridiagonal schemes on GPU-accelerated clusters. Master's thesis, Clemson University, 2015.
[46] T. P. Stefanski and T. D. Drysdale. Acceleration of the 3D ADI-FDTD method using graphics processor units. 2009.
[47] S. Sudarmaji, S. Sismanto, Waluyo, and B. Soedijono. Numerical modeling of 2D seismic wave propagation in fluid saturated porous media using graphics processing unit (GPU): study case of realistic simple structural hydrocarbon trap. AIP Conference Proceedings, 1755(1):100001, 2016.
[48] B. Tutkun and F. O. Edis. A GPU application for high-order compact finite difference scheme. Computers & Fluids, 55:29-35, 2012.
[49] Z. Wang, S. Peng, and T. Liu. GPU accelerated 2-D staggered-grid finite difference seismic modelling. Journal of Software, 6(8):1554-1561, 2011.
[50] Z. Wei, B. Jang, Y. Zhang, and Y. Jia. Parallelizing alternating direction implicit solver on GPUs. Procedia Computer Science, 18:389-398, 2013.
[51] L. J. Wicker and W. C. Skamarock. Time-splitting methods for elastic models using forward time schemes. Monthly Weather Review, 130(8):2088-2097, 2002.
[52] W. Zhang, Z. Zhang, and X. Chen. Three-dimensional elastic wave numerical modelling in the presence of surface topography by a collocated-grid finite-difference method on curvilinear grids. Geophysical Journal International, 190(1):358-378, 2012.
[53] Y. Zhang and Y. Jia. Parallelization of implicit CCHE2D model using CUDA programming techniques. In World Environmental and Water Resources Congress, Cincinnati, Ohio, 2013.
[54] J. Zhou, Y. Cui, E. Poyraz, D. J. Choi, and C. C. Guest. Multi-GPU implementation of a 3D finite difference time domain earthquake code on heterogeneous supercomputers. Procedia Computer Science, 18:1255-1264, 2013.
Appendices
Appendix A: Compact finite difference matrices

Lele in [25] presents a Taylor-based construction of compact finite difference (CFD) formulas of second, fourth and higher order of accuracy on either nodal or staggered grids. In this work, we use the fourth-order accurate CFD matrix operators P and Q that allow the differentiation of a smooth function v on [0,1], whose evaluations at the nodes x_i = i·h, for h = 1/N and i = 0, ..., N, are the components of the vector V. Nodal approximations to dv/dx result from solving the banded linear system P V_x = Q V, for the following (N+1) × (N+1) matrices P and Q

    P_4 = \begin{pmatrix}
    1 & 3 &   &        &   &   \\
    1 & 4 & 1 &        &   &   \\
      & 1 & 4 & 1      &   &   \\
      &   &   & \ddots &   &   \\
      &   & 1 & 4      & 1 &   \\
      &   &   &        & 3 & 1
    \end{pmatrix},    (14)

    Q_4 = \frac{1}{6h} \begin{pmatrix}
    -17 &   9 &  9 & -1 &        &     &    &    \\
    -18 &   0 & 18 &    &        &     &    &    \\
        & -18 &  0 & 18 &        &     &    &    \\
        &     &    &    & \ddots &     &    &    \\
        &     &    &    & -18    &  0  & 18 &    \\
        &     &    &    &   1    & -9  & -9 & 17
    \end{pmatrix}.    (15)

Above, matrix sub-indexes reflect the accuracy choice among the alternative CFD operators given in [25]. This formulation is quite general, yielding a three-parametric family of CFD operators with at least fourth-order accuracy, and those given above result from reducing the matrix bandwidth to ensure that P is tridiagonal.

An alternative pair of CFD operators can be derived from each P and Q set by omitting the approximations to dv/dx at the grid edges. In this case, the reduced vector \bar{V}_x ∈ R^{N-1} only collects values of dv/dx at the interior grid nodes, and the new CFD approximation to dv/dx at the point x_1 results from a linear combination of the original formulas at the nodes x_0 and x_1. The same formula is then used at x_{N-1}, after switching the signs of the CFD coefficients to account for backward differentiation. This simple process leads to the reduced matrix operators \bar{P} and \bar{Q} of dimensions (N-1) × (N-1) and (N-1) × (N+1), respectively, which take the following form in the case of the operators (14)-(15) given above

    \bar{P}_4 = \begin{pmatrix}
    1 & 1 &   &        &   &   \\
    1 & 4 & 1 &        &   &   \\
      & 1 & 4 & 1      &   &   \\
      &   &   & \ddots &   &   \\
      &   & 1 & 4      & 1 &   \\
      &   &   &        & 1 & 1
    \end{pmatrix},    (16)

    \bar{Q}_4 = \frac{1}{6h} \begin{pmatrix}
    -1  &  -9 &  9 &  1 &        &     &    &    \\
    -18 &   0 & 18 &    &        &     &    &    \\
        & -18 &  0 & 18 &        &     &    &    \\
        &     &    &    & \ddots &     &    &    \\
        &     &    &    & -18    &  0  & 18 &    \\
        &     &    &    &  -1    & -9  &  9 &  1
    \end{pmatrix}.    (17)

Note that \bar{P} is also tridiagonal, which permits using the same algorithm in the calculation of both V_x and \bar{V}_x.

Appendix B: Staggered finite differences

For fourth-order finite differentiation on staggered grids (SG), we employ the 1-D mimetic operators developed in [5] and [6], which have been applied to wave propagation by [10], [36], [37], [38] and [40]. On the interval [0,1], consider N cells of length h = 1/N, where the nodes x_i = i·h and the cell centers x_{i+1/2} = (x_i + x_{i+1})/2 correspond to the evaluation points of two given scalar functions v and u. Specifically, nodal v evaluations are collected in the vector V ∈ R^{N+1}, while evaluations of u at cell centers and grid edges conform the vector U ∈ R^{N+2}. Staggered differentiation is carried out by two separate FD matrix operators D and G, where the components of DV approximate dv/dx at every cell center, while GU yields approximations to du/dx at all nodes. For clarity, the figure referred to above illustrates the spatial distributions of the components of V, U, DV and GU on a 1-D SG.

By construction, each D or G operator with fourth-order accuracy comprises an individual set of SG FD stencils dependent on three free parameters α, β and γ. An optimal parametrization has been recently proposed in [9] to preserve the self-adjointness between the mimetic D and G, an important property satisfied by their continuous differential counterparts, the divergence and gradient operators.
However, we here adopt the same parameter values α = β = 0 and the γ value used in [10], [37], and [38], which lead to D and G with minimum bandwidth, to favor computational efficiency. With this choice, both D_4 and G_4 reduce, away from the boundaries, to the standard fourth-order staggered stencil whose interior rows read

    \frac{1}{24h} \begin{pmatrix} \cdots & 1 & -27 & 27 & -1 & \cdots \end{pmatrix},

while their few first and last rows apply the one-sided fourth-order mimetic boundary stencils constructed in [5] and [6]. Above, matrix sub-indexes are used to reinforce the accuracy order of the discretization employed in this work, because Castillo and co-workers also present operators D and G with alternative accuracies.

Appendix C: ADI solvers

Algorithm 3: ADI iterations in the nodal CFD scheme
Require: U ∈ R^{(N+1)×(N+1)}, V ∈ R^{(N+1)×(N+1)}, A ∈ R^{(N-1)×(N-1)}, B, k_max > 0, ε > 0
Ensure: U^* ∈ R^{(N+1)×(N+1)}, V^* ∈ R^{(N+1)×(N+1)}
 1: procedure ADI-rows(U, V, A, B)
 2:   U_0 ← U(:̄,:̄),  V_0 ← V(:̄,:),  k ← 0
 3:   repeat
 4:     U_{k+1} ← (A − (Δt/2) K .* (V_k Q̄^T)) / P̄^T
 5:     V_{k+1} ← (B − (Δt/2) R .* (U_{k+1} Q^T)) / P^T
 6:     test ← ||U_{k+1} − U_k|| + ||V_{k+1} − V_k||
 7:     U_k ← U_{k+1},  V_k ← V_{k+1},  k ← k + 1
 8:   until (test ≤ ε) or (k ≥ k_max)
 9:   return U^* ← U_k, V^* ← V_k
10: end procedure

Require: U ∈ R^{(N+1)×(N+1)}, W ∈ R^{(N+1)×(N+1)}, C ∈ R^{(N-1)×(N-1)}, D, k_max > 0, ε > 0
Ensure: U^{m+1} ∈ R^{(N+1)×(N+1)}, W^{m+1} ∈ R^{(N+1)×(N+1)}
 1: procedure ADI-columns(U, W, C, D)
 2:   U_0 ← U(:̄,:̄),  W_0 ← W(:,:̄),  k ← 0
 3:   repeat
 4:     U_{k+1} ← P̄ \ (C − (Δt/2) K .* (Q̄ W_k))
 5:     W_{k+1} ← P \ (D − (Δt/2) R .* (Q U_{k+1}))
 6:     test ← ||U_{k+1} − U_k|| + ||W_{k+1} − W_k||
 7:     U_k ← U_{k+1},  W_k ← W_{k+1},  k ← k + 1
 8:   until (test ≤ ε) or (k ≥ k_max)
 9:   return U^{m+1} ← U_k, W^{m+1} ← W_k
10: end procedure

Algorithm 4: ADI iterations in the staggered MFD scheme
Require: U ∈ R^{(N+2)×(N+2)}, V ∈ R^{(N+2)×(N+1)}, A, B, D ∈ R^{N×(N+2)}, G ∈ R^{(N+1)×(N+2)}, k_max > 0, ε > 0
Ensure: U^* ∈ R^{(N+2)×(N+2)}, V^* ∈ R^{(N+2)×(N+1)}
 1: procedure ADI-rows(U, V, A, B)
 2:   U_0 ← U(:̄,:̄),  V_0 ← V(:̄,:),  k ← 0
 3:   repeat
 4:     U_{k+1} ← A − (Δt/2) K .* (V_k D^T)
 5:     V_{k+1} ← B − (Δt/2) R .* (U_{k+1} G^T)
 6:     test ← ||U_{k+1} − U_k|| + ||V_{k+1} − V_k||
 7:     U_k ← U_{k+1},  V_k ← V_{k+1},  k ← k + 1
 8:   until (test ≤ ε) or (k ≥ k_max)
 9:   return U^* ← U_k, V^* ← V_k
10: end procedure

Require: U ∈ R^{(N+2)×(N+2)}, W ∈ R^{(N+1)×(N+2)}, C, D, k_max > 0, ε > 0
Ensure: U^{m+1} ∈ R^{(N+2)×(N+2)}, W^{m+1} ∈ R^{(N+1)×(N+2)}
 1: procedure ADI-columns(U, W, C, D)
 2:   U_0 ← U(:̄,:̄),  W_0 ← W(:,:̄),  k ← 0
 3:   repeat
 4:     U_{k+1} ← C − (Δt/2) K .* (D W_k)
 5:     W_{k+1} ← D − (Δt/2) R .* (G U_{k+1})
 6:     test ← ||U_{k+1} − U_k|| + ||W_{k+1} − W_k||
 7:     U_k ← U_{k+1},  W_k ← W_{k+1},  k ← k + 1
 8:   until (test ≤ ε) or (k ≥ k_max)
 9:   return U^{m+1} ← U_k, W^{m+1} ← W_k
10: end procedure