A parallel hybrid implementation of the 2D acoustic wave equation
ARSHYN ALTYBAY, MICHAEL RUZHANSKY, AND NIYAZ TOKMAGAMBETOV
Abstract.
In this paper, we propose a hybrid parallel programming approach for the numerical solution of a two-dimensional acoustic wave equation on a single computer, using an implicit finite difference scheme. First, we transform the differential equation into an implicit finite-difference equation, and then, using the ADI method, we split the equation into two sub-equations. Using the cyclic reduction algorithm, we calculate an approximate solution. Finally, we adapt this algorithm to run in parallel on GPU, GPU+OpenMP, and hybrid (GPU+OpenMP+MPI) computing platforms. The special focus is on improving the performance of the parallel algorithms and measuring the speedup based on execution time. We show that the code running on the hybrid approach gives the expected results by comparing our results to those obtained by running the same simulation on a classical processor core and on the CUDA and CUDA+OpenMP implementations.

1. Introduction
The reduction of computational time for long-term simulation of physical processes is a challenge and an important issue in the field of modern scientific computing. Supercomputers, CPU clusters, and hybrid clusters with a large number of GPUs are very expensive and consume a lot of energy, which makes them inaccessible and impractical for many small laboratories and individuals. Nowadays, new generation computers are multi-core, have hybrid architectures, and their computational power is quite high. For example, the Intel Xeon E5-2697 v2 (2S-E5) processor theoretically has a computing power of about 19.56 GFLOPS, while the computational power of the NVIDIA TITAN Xp video card is up to about 379.7 GFLOPS. If we use the computing power of the CPU and the GPU together, we can obtain good results. The goal of this work is to develop a parallel hybrid implementation of the finite-difference method for solving the two-dimensional wave equation using CUDA, CUDA+OpenMP, and CUDA+OpenMP+MPI technologies, and to study the parallelization efficiency by comparing the time needed to solve this problem with each of the above approaches.
Mathematics Subject Classification.
Key words and phrases.
Parallel computing, GPU, CUDA, MPI, OpenMP, acoustic equation, wave equation.

The authors were supported by the FWO Odysseus 1 grant G.0H94.18N: Analysis and Partial Differential Equations. MR was supported in parts by the EPSRC Grant EP/R003025/1 and by the Leverhulme Research Grant RPG-2017-151. AA was supported by the MESRK Grants AP08052028 and AP08053051 of the Ministry of Education and Science of the Republic of Kazakhstan.
For several years already, GPUs have been used to accelerate well-parallelizable computations, but only with the advent of a new generation of GPUs with multicore architecture did this direction begin to give palpable results. For multidimensional problems, the efficiency of an implicit compact difference scheme depends on the computational efficiency of the corresponding matrix solvers. From this point of view, the ADI method [1] is promising because it decomposes a multidimensional problem into a series of one-dimensional problems. It has been shown that the schemes thus obtained are unconditionally stable. For the proper treatment of large modeling domains, two- or three-dimensional computational grids with a sufficient number of points are used. Calculations on such grids require more CPU time and more computer memory. To accelerate the computation, GPU, OpenMP, and MPI technologies are used in this paper, which allows the program to operate on larger grids. With the GPU becoming a viable alternative to the CPU for parallel computing, the aforementioned parallel tridiagonal solvers and other hybrid methods have been implemented on GPUs [4]–[11]. In this paper, we propose three different parallel programming approaches using hybrid CUDA, OpenMP, and MPI programming for personal computers. There are many examples in the literature of successfully using hybrid approaches for different simulations [12]–[17]. Here we study some issues in the numerical simulation of problems in the propagation of acoustic waves on high performance computing systems.

2. Problem Statement and Numerical Scheme
We consider the 2D acoustic wave equation with a positive "speed" function c and a source term f:

(2.1) ∂²u/∂t² − c²(t) (∂²u/∂x² + ∂²u/∂y²) = f(t, x, y), (t, x, y) ∈ [0, T] × [0, l] × [0, l],

subject to the initial conditions

(2.2) u(0, x, y) = φ(x, y), x, y ∈ [0, l],
(2.3) ∂u(0, x, y)/∂t = ψ(x, y), x, y ∈ [0, l],

and the boundary conditions

(2.4) u(t, x, 0) = 0, u(t, x, l) = 0, t ∈ [0, T], x ∈ [0, l],
(2.5) u(t, 0, y) = 0, u(t, l, y) = 0, t ∈ [0, T], y ∈ [0, l].

In what follows, we take all data, namely the coefficient c, the source function f, and the initial functions φ and ψ, to be smooth enough. Due to the notion of "very weak solutions" and the approach developed by Garetto and Ruzhansky in [18], we can deal with the 2D acoustic wave equation with singular (δ-like) data by approximating them by smooth functions. For more details on these techniques and applications, we refer to the papers [18, 19, 20, 21, 22, 23, 24].

For numerical simulations, we introduce a space-time grid with steps h_1, h_2, τ, respectively, in the variables x, y, t:

(2.6) ω_{τ,h_1,h_2} = { t_k = kτ, k = 1, …, M−1; x_i = i h_1, i = 1, …, N_1−1; y_j = j h_2, j = 1, …, N_2−1 },

and

(2.7) Ω_{τ,h_1,h_2} = { t_k = kτ, k = 0, …, M; x_i = i h_1, i = 0, …, N_1; y_j = j h_2, j = 0, …, N_2 },

where h_1 = l/N_1, h_2 = l/N_2, and τ = T/M. On this grid we approximate the problem (2.1)–(2.5) using the finite difference method. For simplicity, we put N := N_1 = N_2 and denote h := h_1 = h_2. Consider the Crank-Nicolson scheme for the problem (2.1)–(2.5):

(2.8) (u^{k+1}_{i,j} − 2u^k_{i,j} + u^{k−1}_{i,j})/τ² − (c²_{k+1}/(2h²)) (u^{k+1}_{i+1,j} − 2u^{k+1}_{i,j} + u^{k+1}_{i−1,j} + u^{k+1}_{i,j+1} − 2u^{k+1}_{i,j} + u^{k+1}_{i,j−1}) − (c²_{k−1}/(2h²)) (u^{k−1}_{i+1,j} − 2u^{k−1}_{i,j} + u^{k−1}_{i−1,j} + u^{k−1}_{i,j+1} − 2u^{k−1}_{i,j} + u^{k−1}_{i,j−1}) = f^k_{i,j},

for (k, i, j) ∈ ω_{τ,h_1,h_2}, with the initial conditions

(2.9) u^0_{i,j} = φ_{i,j}, u^1_{i,j} − u^0_{i,j} = τ ψ_{i,j},

for (i, j) ∈ {0, …, N} × {0, …, N}, and with the boundary conditions

(2.10) u^k_{0,j} = 0, u^k_{N,j} = 0, u^k_{i,0} = 0, u^k_{i,N} = 0,

for (j, k) ∈ {0, …, N} × {0, …, M} and (i, k) ∈ {0, …, N} × {0, …, M}, respectively. It is well known that the implicit scheme is unconditionally stable and has accuracy of order O(τ² + |h|²); see, for example, [25].
We solve the difference equation (2.8) by the alternating direction implicit (ADI) method, namely, splitting it into the two sub-problems

(2.11) (u^{k+1/2}_{i,j} − 2u^k_{i,j} + u^{k−1/2}_{i,j})/τ² − (c²_{k+1/2}/(2h²)) (u^{k+1/2}_{i+1,j} − 2u^{k+1/2}_{i,j} + u^{k+1/2}_{i−1,j}) − (c²_{k−1/2}/(2h²)) (u^{k−1/2}_{i+1,j} − 2u^{k−1/2}_{i,j} + u^{k−1/2}_{i−1,j}) = f^k_{i,j},

and

(2.12) (u^{k+1}_{i,j} − 2u^{k+1/2}_{i,j} + u^k_{i,j})/τ² − (c²_{k+1}/(2h²)) (u^{k+1}_{i,j+1} − 2u^{k+1}_{i,j} + u^{k+1}_{i,j−1}) − (c²_k/(2h²)) (u^k_{i,j+1} − 2u^k_{i,j} + u^k_{i,j−1}) = f^{k+1/2}_{i,j}.

3. Hybrid parallel computing model
High-performance computing uses parallel computing to achieve high levels of performance. In parallel computing, the program is divided into many subroutines, which are all executed in parallel to calculate the required values. In this section, we propose a hybrid parallel approach for numerically solving a two-dimensional wave equation; for this, we use CUDA, MPI, and OpenMP technologies.
3.1. CUDA approach.
The graphics processing unit (GPU) is a highly parallel, multi-threaded, multi-core processor with enormous processing power. Its low cost, high-bandwidth floating point operations, and memory access bandwidth are attracting more and more high performance computing researchers [32]. In addition, compared to cluster systems consisting of several processors, computing on a GPU is inexpensive and requires low power consumption at equivalent performance. In many disciplines of science and technology, users have been able to increase productivity by several orders of magnitude using graphics processors [2], [3]. In 2007, with the appearance of the CUDA programming language, programming GPUs on NVIDIA graphics cards became significantly simpler because its syntax is similar to C [26]. It is designed so that its constructions allow a natural expression of concurrency at the data level. A CUDA program consists of two parts: a sequential part running on the CPU, and a parallel part running on the GPU [3], [31]. The parallel part is called the kernel. A CUDA program automatically uses more parallelism on GPUs that have more processor cores. A C program using CUDA extensions distributes a large number of copies of the kernel onto the available multiprocessors to be executed simultaneously. The CUDA code consists of three computational steps: transferring data to the global GPU memory, running the CUDA kernel, and transferring the results from the GPU back to the CPU memory. We have designed a CUDA program based on the cyclic reduction method, whose full CR function code is available at [29]. The algorithm for solving the problem (2.1)–(2.5) is shown in Algorithm 1.
Algorithm 1
Implementation of the 2D wave equation:

  compute initial function matrix U
  u = U
  while (t < t_end) do
    for j = 0, …, n do
      for i = 0, …, n do
        calculate tridiagonal system elements a_i, b_i, c_i, f_i
      call function CR(a_i, b_i, c_i, f_i, y_i, n)
      calculate matrix U_x
    for i = 0, …, n do
      for j = 0, …, n do
        calculate tridiagonal system elements a_j, b_j, c_j, f_j
      call function CR(a_j, b_j, c_j, f_j, y_j, n)
      calculate matrix U_y
    swap(u, U_x)
    swap(U, U_y)
    t ← t + Δt

Here, u, U, U_x, U_y denote u^{k−1/2}_{i,j}, u^k_{i,j}, u^{k+1/2}_{i,j}, u^{k+1}_{i,j}, respectively. The CR() function includes three device functions, namely CRM_forward(), cr_div(), and CRM_backward(), and one host function calc_dim(). First we have to calculate the block size according to the size of the matrix and the step numbers of the forward and backward sub-steps. For this, we use one cycle:

for (i = 0; i < log2(n + 1) - 1; i++) {
    stepNum = (n - pow(2., i + 1)) / pow(2., i + 1) + 1;
    calc_dim(stepNum, &dimBlock, &dimGrid);
    CRM_forward<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, d_f, n, stepNum, i);
}

Here stepNum is the number of equations processed at reduction level i, and it determines the block size; for this we need the function calc_dim(), which identifies the block sizes. The function CRM_forward() thus runs log2(n + 1) − 1 times. After that, cr_div() is called; this function calculates the two remaining unknowns. Then we use one cycle for the backward substitution:

for (i = log2(n + 1) - 2; i >= 0; i--) {
    stepNum = (n - pow(2., i + 1)) / pow(2., i + 1) + 1;
    calc_dim(stepNum, &dimBlock, &dimGrid);
    CRM_backward<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, d_f, d_x, n, stepNum, i);
}

Here the backward substitution runs log2(n + 1) − 1 times, with the block sizes again computed by calc_dim(). Thus, we calculate the d_x array. After that we copy the calculated data d_x from the device to the host using cudaMemcpy(y, d_x, sizeof(double) * n, cudaMemcpyDeviceToHost).

3.2. OpenMP+CUDA approach.
OpenMP (Open Multi-Processing) was introduced to provide the means to implement shared memory concurrency in FORTRAN and C/C++ programs. In particular, OpenMP defines a set of environment variables, compiler directives, and library procedures to be used for parallelization with shared memory. OpenMP was specifically designed to exploit certain characteristics of shared memory architectures, such as the ability to directly access memory throughout the system with low latency and very fast shared memory locking [27]. We can easily parallelize loops by using OpenMP thread libraries and OpenMP-aware compilers. In this way, the threads can obtain new tasks, namely the unprocessed loop iterations, directly from local shared memory. The basic idea behind OpenMP is data-shared parallel execution. It consists of a set of compiler directives, callable runtime library routines, and environment variables that extend FORTRAN, C, and C++ programs. The working unit of OpenMP is a thread. It works well when accessing shared data costs almost nothing: every thread can access a variable in a shared cache or RAM. In this paper, we use OpenMP for the parts of the initial boundary value problem where two nested loops compute one matrix. Moreover, the OpenMP parallel computing model is very convenient to implement, and it has low latency and high bandwidth.

3.3.
Hybrid approach.
The message passing interface (MPI) is a standardized and portable programming interface for exchanging messages between multiple processes executing a parallel program on distributed memory.
MPI works well on a wide variety of distributed memory architectures and is suitable both for individual computers and for clusters. However, MPI depends on explicit communication between parallel processes, which requires mesh decomposition in advance due to the data decomposition. Therefore, MPI can cause load balancing problems and consume extra time. Since MPICH2 is freely accessible, we use it in our implementations. In [8] the authors used a compact implementation of the MPI standard for message passing in distributed-memory applications. MPICH is free software and is available for most types of UNIX and Microsoft Windows systems. MPI is standardized on many levels, which provides many advantages for the users. One of them is the guarantee that MPI code can be executed by any MPI implementation running on your architecture, since the syntax has been standardized. Since the functional behavior of the MPI calls is also standardized, the calls behave in the same way regardless of the implementation, which ensures the portability of parallel programs. We use MPI technology to calculate the elements of the tridiagonal matrix system, i.e. a_i, b_i, c_i, f_i, because these values can be calculated independently, so we can successfully apply MPI technology here. Listing 1 shows the program code.
Listing 1.
Calculation of a_i, b_i, c_i, f_i:

i1 = (n * rank) / size;
i2 = (n * (rank + 1)) / size;
for (i = i1; i < i2; i++) {
    /* calculate a[i], b[i], c[i], f[i] for this rank's rows */
}

These parallel technologies, CUDA, OpenMP, and MPI, can be combined to form a multi-layered hybrid structure; the premise is that the system has several CPU cores and at least one graphics processor. Under this hybrid structure (Figure 1), we can make better use of the advantages of each programming model.

Figure 1. Flowchart of the hybrid approach.

4. Experimental Results

In this section we show the results obtained on a desktop computer with the following configuration: a GeForce RTX 2080 Ti NVIDIA GPU with 4352 cores, together with an Intel Core(TM) i7-9800X CPU, 3.80 GHz, and 64 GB RAM. The simulation parameters are configured as follows. The mesh is uniform in both directions with ∆x = ∆y = 0.01, the coefficient is c = 1, the numerical time step ∆t is 0.02, and the simulation time is T = 1.0; therefore the total number of time steps is 50. Using the implicit sub-scheme (2.11), the cyclic reduction [30] method is performed in the x direction, with the result that we get the grid function u^{k+1/2}_{i,j}. In the second fractional time step, using the sub-scheme (2.12), the cyclic reduction method is performed in the direction of the y axis, as a result of which we get the grid function u^{k+1}_{i,j}. To present more realistic data, we test four cases with large domain sizes, the smallest being 1024 × 1024.

Table 1. Execution Time (Seconds). Columns: N (mesh size); CPU; GPU; GPU/OpenMP with 2, 4, and 8 CPU cores; GPU/OpenMP/MPI. [Table entries not recoverable.]

5. Conclusion and Future Work

In this paper, we proposed a numerical solution of the 2D acoustic wave equation based on an implicit finite difference scheme using the cyclic reduction method.
Moreover, we constructed a heterogeneous hybrid programming environment for a single PC by combining the message passing interface MPI, OpenMP, and CUDA programming, and we implemented a parallelization of the cyclic reduction method on the graphics processing unit. Finally, we showed how we accelerated the cyclic reduction method on the NVIDIA GPU. From the test results of Table 1 it can be seen that the acceleration method proposed by us gives a good result. Our hybrid CUDA/OpenMP/MPI implementation obtained results up to 2.75 times faster than the CUDA implementation. In future work, we plan to improve and adapt our hybrid approach for a GPU cluster and to test it there.

References

[1] D. W. Peaceman, H. H. Rachford. The numerical solution of parabolic and elliptic differential equations, Journal of the Society for Industrial and Applied Mathematics, 3(1):28–41, 1955.
[2] NVIDIA Technical Report, 2008.
[3] E. Elsen, P. LeGresley, E. Darve. Large calculation of the flow over a hypersonic vehicle using a GPU, J. Comput. Phys., 227:10148–10161, 2008.
[4] Y. Zhang, J. Cohen, J. Owens. Fast tridiagonal solvers on the GPU, ACM Sigplan Notices, 45(5):127–136, 2010.
[5] Y. Zhang, J. Cohen, A. Davidson, J. Owens. A hybrid method for solving tridiagonal systems on the GPU, GPU Computing Gems Jade Edition, 117, 2011.
[6] A. Davidson, J. Owens. Register packing for cyclic reduction: a case study, Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, ACM, 4, 2011.
[7] A. Davidson, Y. Zhang, J. Owens. An auto-tuned method for solving large tridiagonal systems on the GPU, Parallel and Distributed Processing Symposium (IPDPS), IEEE International, 956–965, 2011.
[8] D. Goddeke, R. Strzodka. Cyclic reduction tridiagonal solvers on GPUs applied to mixed-precision multigrid, IEEE Transactions on Parallel and Distributed Systems, 22(1):22–32, 2011.
[9] H. Kim, S. Wu, L. Chang, W. Hwu.
A scalable tridiagonal solver for GPUs, 2011 International Conference on Parallel Processing (ICPP), IEEE, 444–453, 2011.
[10] N. Sakharnykh. Tridiagonal solvers on the GPU and applications to fluid simulation, GPU Technology Conference, 2009.
[11] Z. Wei, B. Jang, Y. Zhang, Y. Jia. Parallelizing alternating direction implicit solver on GPUs, International Conference on Computational Science, ICCS, Procedia Computer Science.
[12] F. Bodin, S. Bihan. Heterogeneous multicore parallel programming for graphics processing units, J. Sci. Programming.
[13] Computer Physics Communications.
[14] International Journal of Grid and Distributed Computing, 9(9):67–82, 2016. http://dx.doi.org/10.14257/ijgdc.2016.9.9.07
[15] A. L. D, J. E. Roman. MPI-CUDA parallel linear solvers for block-tridiagonal matrices in the context of SLEPc's eigensolvers, Parallel Computing.
[16] Earthq Sci.
[17] J. Supercomputing, in press, doi:10.1007/s11227-009-0360-z, SpringerLink Online Date: Nov. 18, 2009.
[18] C. Garetto and M. Ruzhansky. Hyperbolic second order equations with non-regular time dependent coefficients. Arch. Rational Mech. Anal., 217(1):113–154, 2015.
[19] M. Ruzhansky and N. Tokmagambetov. Wave equation for operators with discrete spectrum and irregular propagation speed. Arch. Ration. Mech. Anal., 226(3):1161–1207, 2017.
[20] M. Ruzhansky, N. Tokmagambetov. Very weak solutions of wave equation for Landau Hamiltonian with irregular electromagnetic field. Lett. Math. Phys., 107:591–618, 2017.
[21] M. Ruzhansky and N. Tokmagambetov. On a very weak solution of the wave equation for a Hamiltonian in a singular electromagnetic field. Math. Notes, 103(5–6):856–858, 2018.
[22] J. C. Munoz, M. Ruzhansky and N. Tokmagambetov. Wave propagation with irregular dissipation and applications to acoustic problems and shallow waters. J. Math. Pures Appl., 123:127–147, 2019.
[23] J. C. Munoz, M. Ruzhansky and N.
Tokmagambetov. Acoustic and shallow water wave propagation with irregular dissipation. Funct. Anal. Appl., 53(2):153–156, 2019.
[24] M. Ruzhansky and N. Tokmagambetov. Wave equation for 2D Landau Hamiltonian. Appl. Comput. Math., 18(1):69–78, 2019.
[25] A. A. Samarskii. The Theory of Difference Schemes. CRC Press Cambridge University Press, PA-P/CDR edition, 17–30, 2003.
[28] D. Goddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, S. Buijssen, M. Grajewski, S. Turek. Exploring weak scalability for FEM calculations on a GPU-enhanced cluster, Parallel Comput., 33:685–699, 2007.
[29] 2D wave GPU implementation, https://github.com/Arshynbek/2Dwave-GPU-implementation
[30] R. W. Hockney. A fast direct solution of Poisson's equation using Fourier analysis. Journal of the ACM, 12(1):95–113, 1965.
[31] J. Nickolls, I. Buck, M. Garland, K. Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, 2008.
[32] J. Comput. Phys., 228(21):7863–7882, 2009.

Arshyn Altybay:
Al-Farabi Kazakh National University, 71 Al-Farabi avenue, 050040 Almaty, Kazakhstan
and
Department of Mathematics: Analysis, Logic and Discrete Mathematics, Ghent University, Belgium
and
Institute of Mathematics and Mathematical Modeling, 125 Pushkin str., Almaty, 050010, Kazakhstan
E-mail address: [email protected]

Michael Ruzhansky:
Department of Mathematics: Analysis, Logic and Discrete Mathematics, Ghent University, Belgium
and
School of Mathematical Sciences, Queen Mary University of London, United Kingdom
E-mail address: [email protected]

Niyaz Tokmagambetov:
Department of Mathematics: Analysis, Logic and Discrete Mathematics, Ghent University, Belgium
and
al-Farabi Kazakh National University, 71 al-Farabi ave., Almaty, 050040, Kazakhstan