Development of a Flow Solver with Complex Kinetics on the Graphic Processing Units
Hai P. Le ∗ Department of Mechanical and Aerospace Engineering, UCLA, Los Angeles, CA
Jean-Luc Cambier † Air Force Research Laboratory, Spacecraft Branch, Edwards AFB, CA
August 29, 2018
Abstract
The current paper reports on the implementation of a numerical solver on Graphics Processing Units (GPUs) to model reactive gas mixtures with detailed chemical kinetics. The solver incorporates high-order finite-volume methods for solving the fluid dynamical equations coupled with stiff source terms. The chemical kinetics are solved implicitly via an operator-splitting method. We explored different approaches to implementing a fast kinetics solver on the GPU; the details of the implementation are discussed in the paper. The solver is tested with two high-order shock-capturing schemes: MP5 [11] and ADERWENO [12]. Considering only the fluid dynamics calculation, the speed-up factors obtained are 30 for the MP5 scheme and 55 for the ADERWENO scheme. For the fully coupled solver, the performance gain depends on the size of the reaction mechanism. Two different examples of chemistry were explored. The first mechanism consisted of 9 species and 38 reactions, resulting in a speed-up factor of up to 35. The second, larger mechanism consisted of 36 species and 308 reactions, resulting in a speed-up factor of up to 40.
During the last seven years, the Graphics Processing Unit (GPU) has been introduced as a promising alternative to high-cost HPC platforms. Within this period, the GPU has evolved into a highly capable and low-cost computing solution for scientific research. Figure 1 illustrates the superiority of the GPU over the traditional Central Processing Unit (CPU) in terms of floating-point throughput. This is due to the fact that the GPU is designed for graphics rendering, which is a highly parallel process. Starting from 2008, GPUs began to support double-precision calculation, which is desired for scientific computing. The newest generation of NVIDIA GPUs, called "Fermi", has been designed to enhance double-precision performance over the previous generation. The most popular programming environment for general-purpose GPU computing, namely CUDA [10], has undergone several development phases and reached a level of maturity which is essential for the design of numerical solvers.

Distribution A: Approved for public release; distribution unlimited.
∗ Research Assistant ([email protected])
† Senior Scientist

Figure 1: Single and double precision floating point operation capability of GPU and CPU from 2003-2010 (adapted from NVIDIA [10]).

Several attempts have been made at writing scientific codes on the GPU, and promising results were obtained both in terms of performance and flexibility [1, 5, 9, 16]. In this work, we attempt to follow a similar path in the design of numerical solvers on GPUs. However, our focus differs from previous attempts in that we pay more attention to the kinetics solver than to the fluid dynamics. This is due to the fact that for the simulation of high-speed reactive flow, the computation is dominated by solving the kinetics. While the current implementation is only for chemical kinetics, we aim to extend it to a more complex and computationally intensive kinetics model for plasma.
The set of Euler equations for a reactive gas mixture can be written as
$$\frac{\partial Q}{\partial t} + \nabla \cdot \bar{F} = \dot{\Omega} \tag{1}$$
where $Q$ and $F$ are the vectors of conservative variables and fluxes. We assume that there is no species diffusion and that the gas is in thermal equilibrium (i.e., all species have the same velocity and all the internal energy modes are at equilibrium). The right-hand side (RHS) of equation (1) denotes the vector of source terms $\dot{\Omega}$, which is composed here of exchange terms due to chemical reactions. We solve the system (1) in a finite-volume formulation, by applying Gauss's law to the divergence of the fluxes:
$$\frac{\partial Q}{\partial t} + \frac{1}{V}\oint_S F_n \, dS = \dot{\Omega} \tag{2}$$
where $Q$, $F_n$ now denote volume-averaged quantities, which can be written as:
$$Q = \begin{pmatrix} \rho_s \\ \vdots \\ \rho u_x \\ \rho u_y \\ \rho u_z \\ E \end{pmatrix}; \qquad F_n = \begin{pmatrix} \rho_s u_n \\ \vdots \\ P n_x + \rho u_n u_x \\ P n_y + \rho u_n u_y \\ P n_z + \rho u_n u_z \\ u_n (P + E) \end{pmatrix}$$
where $u_n$ is the velocity normal to the interface and $(n_x, n_y, n_z)$ is the corresponding unit vector. The total energy is the sum of the internal energies of each species and the total kinetic energy:
$$E = \sum_s \rho_s e_{is} + \tfrac{1}{2}\rho \vec{u}^2 \tag{3}$$
Since the species formation energies are not included in that definition, we must account for their change in the source term $\dot{\Omega}$. The hyperbolic terms and the source terms are solved independently of each other by making use of an operator-splitting technique [2]:
$$\frac{\partial Q}{\partial t} = \left(\frac{\partial Q}{\partial t}\right)_{\mathrm{conv}} + \left(\frac{\partial Q}{\partial t}\right)_{\mathrm{chem}} = -\frac{1}{V}\oint_S F_n \, dS + \dot{\Omega} \tag{4}$$
The equation of state (EOS) is that of an ideal gas, i.e. Dalton's law of partial pressures:
$$P = N R T = \sum_s \rho_s (R/M_s) T \tag{5}$$
where $R$ is the universal gas constant (in J/mol·K) and $M_s$ the species molar mass. The pressure can also be determined from the conserved variables $\{\rho_s, \vec{m}, E\}$, where $\vec{m} = \rho\vec{u}$, by:
$$P = (\gamma - 1)\left(E - \frac{\vec{m}^2}{2\rho}\right) \tag{6}$$
This formulation allows us to compute the pressure derivatives with respect to the conservative variables, needed for the flux Jacobian.
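As an illustration, the two pressure expressions (5) and (6) must agree for a consistent state. The sketch below checks this for a hypothetical two-species mixture of calorically perfect diatomic gases; the species data and flow state are illustrative assumptions, not values from the paper.

```python
# Consistency check of Eqs. (5) and (6) for a hypothetical two-species
# mixture of calorically perfect diatomic gases (illustrative data only).
R = 8.314  # universal gas constant, J/(mol K)

def pressure_dalton(rho_s, M_s, T):
    """Eq. (5): P = sum_s rho_s (R / M_s) T."""
    return sum(r * (R / M) * T for r, M in zip(rho_s, M_s))

def pressure_conserved(rho_s, m, E, gamma):
    """Eq. (6): P = (gamma - 1) (E - |m|^2 / (2 rho))."""
    rho = sum(rho_s)
    ke = sum(mi * mi for mi in m) / (2.0 * rho)
    return (gamma - 1.0) * (E - ke)

# Illustrative mixture state: N2/O2-like species at 300 K, moving at 100 m/s.
rho_s = [0.9, 0.3]            # partial densities, kg/m^3
M_s = [28.0e-3, 32.0e-3]      # molar masses, kg/mol
T, u = 300.0, 100.0
gam_s = 1.4                   # both species diatomic, constant gamma
e_is = [R * T / ((gam_s - 1.0) * M) for M in M_s]  # internal energies, J/kg
rho = sum(rho_s)
E = sum(r * e for r, e in zip(rho_s, e_is)) + 0.5 * rho * u * u  # Eq. (3)
m = [rho * u, 0.0, 0.0]
P5 = pressure_dalton(rho_s, M_s, T)
P6 = pressure_conserved(rho_s, m, E, gam_s)  # effective gamma = 1.4 here
```

Because both species share the same ratio of specific heats, the effective $\gamma$ of the next section reduces to 1.4 and the two expressions coincide.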
Comparing (5) and (6), we find the expression for the effective ratio of specific heats $\gamma$:
$$\gamma = 1 + \frac{RT \sum_s \rho_s / M_s}{\rho \bar{e}_i} \quad \text{with} \quad \rho \bar{e}_i = \sum_s \rho_s e_{is} \tag{7}$$
Using
$$\left(\frac{\partial \gamma}{\partial \rho_s}\right)_{E,m} = \frac{RT}{\rho \bar{e}_i}\left(\frac{1}{M_s} - \frac{e_{is}}{\bar{M} \bar{e}_i}\right) \tag{8}$$
with $\bar{M} = \rho/N$, we find (using the notation $P_{q_a} = \partial P / \partial q_a$):
$$P_{\rho_s} = (\gamma - 1)\frac{\vec{u}^2}{2} + \frac{RT}{M_s} - \frac{RT}{\bar{M}}\frac{e_{is}}{\bar{e}_i} \tag{9}$$
Note that $\sum_s \rho_s P_{\rho_s} \equiv (\gamma - 1)\rho\vec{u}^2/2$. The other derivatives are:
$$P_{m_\alpha} = -(\gamma - 1)\, u_\alpha, \qquad \alpha = x, y, z \tag{10}$$
and
$$P_E = \gamma - 1 \tag{11}$$

Figure 2: Schematic of computational stencil with left and right states.
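The pressure derivatives and the identity $\sum_s \rho_s P_{\rho_s} = (\gamma - 1)\rho\vec{u}^2/2$ can be verified numerically. The sketch below assumes calorically perfect diatomic species with illustrative data (not values from the paper):

```python
# Sketch of the pressure derivatives of Eqs. (8)-(11); species data are
# illustrative assumptions (two diatomic species with gamma_s = 1.4).
R = 8.314  # universal gas constant, J/(mol K)

def pressure_derivatives(rho_s, e_is, u, T, M_s):
    """Return (P_rho_s, P_m, P_E) per Eqs. (9)-(11)."""
    rho = sum(rho_s)
    rho_ei = sum(r * e for r, e in zip(rho_s, e_is))
    N = sum(r / M for r, M in zip(rho_s, M_s))   # molar concentration
    Mbar = rho / N                                # mean molar mass
    gamma = 1.0 + R * T * N / rho_ei              # Eq. (7)
    ebar = rho_ei / rho
    u2 = sum(ui * ui for ui in u)
    P_rho_s = [(gamma - 1.0) * u2 / 2.0 + R * T / M
               - (R * T / Mbar) * (e / ebar)
               for M, e in zip(M_s, e_is)]        # Eq. (9)
    P_m = [-(gamma - 1.0) * ui for ui in u]       # Eq. (10)
    P_E = gamma - 1.0                             # Eq. (11)
    return P_rho_s, P_m, P_E

# Illustrative state (same hypothetical N2/O2-like mixture as before).
rho_s = [0.9, 0.3]
M_s = [28.0e-3, 32.0e-3]
T = 300.0
e_is = [R * T / (0.4 * M) for M in M_s]  # diatomic: gamma_s = 1.4
u = [100.0, 0.0, 0.0]
P_rho_s, P_m, P_E = pressure_derivatives(rho_s, e_is, u, T, M_s)
```

Summing $\rho_s P_{\rho_s}$ over the species, the thermal terms cancel exactly, leaving only the kinetic-energy contribution, as stated in the text.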
The speed of sound is defined as
$$c^2 = \sum_s \hat{c}_s P_{\rho_s} + (h - \vec{u}^2) P_E = \gamma \frac{P}{\rho} \tag{12}$$
where $\hat{c}_s = \rho_s/\rho$ is the species mass fraction and
$$h = \frac{H}{\rho} = \frac{E + P}{\rho} \tag{13}$$
is the specific enthalpy.

A dimensional splitting technique [13] is utilized for solving the convective part of the governing equations. In order to achieve high order both in space and time, we employed a fifth-order Monotonicity-Preserving scheme [11] (MP5) for the reconstruction, and a third-order Runge-Kutta (RK3) scheme for time integration. For the MP5 scheme, the reconstructed values of the left and right states of interface $j + 1/2$ are given as (see Fig. 2):
$$u^L_{j+1/2} = \tfrac{1}{60}\left(2 u_{j-2} - 13 u_{j-1} + 47 u_j + 27 u_{j+1} - 3 u_{j+2}\right) \tag{14a}$$
$$u^R_{j+1/2} = \tfrac{1}{60}\left(2 u_{j+3} - 13 u_{j+2} + 47 u_{j+1} + 27 u_j - 3 u_{j-1}\right) \tag{14b}$$
The reconstructed values are then limited to avoid instability:
$$u^L_{j+1/2} \leftarrow \mathrm{median}\left(u^L_{j+1/2},\; u_j,\; u^{MP}\right) \tag{15}$$
where
$$u^{MP} = u_j + \mathrm{minmod}\left[u_{j+1} - u_j,\; \alpha (u_j - u_{j-1})\right] \tag{16}$$
with $\alpha = 2$.

In addition to the MP5 scheme, we also considered the Arbitrary Derivative Riemann solver with the Weighted Essentially Non-Oscillatory reconstruction procedure, the so-called ADERWENO scheme [12]. At each interface of a one-dimensional stencil, we seek the solution of the generalized Riemann problem (GRP)
$$\partial_t Q + \partial_x F(Q) = 0 \tag{17}$$
with the following initial conditions:
$$Q^{(k)}(x, 0) = \begin{cases} q^{(k)}_L(x) & \text{if } x < 0 \\ q^{(k)}_R(x) & \text{if } x > 0 \end{cases}$$
The solution of equation (17) can be expanded using a Taylor series in time:
$$Q(x_{j+1/2}, t + h) = Q(x_{j+1/2}, t) + \sum_{k=1}^{r-1} \frac{h^k}{k!} \frac{\partial^k}{\partial t^k} Q(x_{j+1/2}, t) \tag{18}$$
where all the temporal derivatives can be determined using the Cauchy-Kowalewski procedure [12]. The solution obtained in this form is high-order both in space and time, so a one-step time integration approach such as the explicit Euler method is adequate. This provides certain advantages over the MP5 scheme, since the overhead due to RK integration can be avoided. One disadvantage of the ADERWENO scheme, however, is that it is not guaranteed to be total variation diminishing (TVD), which might be an issue in regions of strong compression or expansion waves.

The interface fluxes are computed with the HLLE Riemann solver [4], which is given as
$$F^{HLLE}_{j+1/2} = \frac{b^+ F_L - b^- F_R}{b^+ - b^-} + \frac{b^+ b^-}{b^+ - b^-} \Delta Q_{j+1/2} \tag{19}$$
where
$$b^+ = \max(0,\; \hat{u}_n + \hat{c},\; u_{nR} + c_R) \tag{20}$$
$$b^- = \min(0,\; \hat{u}_n - \hat{c},\; u_{nL} - c_L) \tag{21}$$

An elementary chemical reaction takes the form
$$\sum_s \nu'_{rs} [X_s] \Leftrightarrow \sum_s \nu''_{rs} [X_s] \tag{22}$$
where $\nu'_{rs}$ and $\nu''_{rs}$ are the molar stoichiometric coefficients of the reactants and products of each reaction. The forward rate can be determined from
$$K_{fr} = A_{fr} T^{\beta_r} \exp\left(-\frac{E_r}{RT}\right) \tag{23}$$
The backward reaction rate is calculated from the equilibrium constant, which is given as
$$K_e = \frac{K_{fr}}{K_{br}} = \left(\frac{P_a}{RT}\right)^{\sum_s \nu_s} \exp\left(-\frac{\Delta G}{RT}\right) \tag{24}$$
For each reaction, the progression rate can be written as
$$Q_r = \left(\sum_s \alpha_{rs} [X_s]\right) \left[K_{fr} \prod_s [X_s]^{\nu'_{rs}} - K_{br} \prod_s [X_s]^{\nu''_{rs}}\right] \tag{25}$$
The species net production rate can then be determined from
$$\dot{\omega}_s = \sum_r M_s \nu_{rs} Q_r \tag{26}$$
By conservation of mass, the sum of all the species production rates should be equal to zero, which yields the following expression:
$$\sum_s \dot{\omega}_s = 0 \tag{27}$$
In order to solve for the change in the species concentrations through the production and loss rates, one needs to know all the changes in the thermodynamics for each reaction as well as their rates. In practice, the backward rate can also be computed using a curve-fitting technique with the temperature as an input, but to be more rigorous, it is recomputed using the equilibrium constant. All these quantities are read from separate data files which contain all the species information used for the computation along with the elementary reactions.

The chemical kinetics are solved using a point-implicit solver to ensure stability. The formulation can be obtained by using a Taylor series expansion in time of the RHS:
$$\frac{dQ}{dt} = \dot{\Omega} + \Delta t \frac{\partial \dot{\Omega}}{\partial t} \tag{28}$$
By applying the chain rule to the time derivative on the RHS, one obtains
$$\left(I - \Delta t \frac{\partial \dot{\Omega}}{\partial Q}\right) \frac{dQ}{dt} = \dot{\Omega} \tag{29}$$
Equation (29) is now a system of algebraic equations, which can be solved using a variety of methods. In the current work, a direct Gaussian elimination procedure is carried out to solve the linear system of the chemical kinetics. It must be pointed out that the Gaussian elimination procedure scales as $O(N_s^3)$, where $N_s$ is the number of species. Solving the system at every cell is clearly a computationally intensive task.

The GPU processes data in a Single-Instruction-Multiple-Thread (SIMT) fashion. The set of instructions executed on the GPU is called a kernel, which is invoked from the host (CPU). The CUDA programming model consists of the grid and the thread block. A grid consists of multiple thread blocks, and each thread block contains a number of threads. When a kernel is called, the scheduler unit on the device will automatically assign a group of thread blocks to the available streaming multiprocessors (SM, or GPU cores) on the device. Once an SM has completed its calculation, it will be assigned another block.
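The point-implicit update of Eqs. (28)-(29) can be sketched on a toy reversible reaction A ⇌ B; the rate constants here are illustrative, not from any mechanism in the paper, and the Gaussian elimination mirrors the paper's choice of direct solver:

```python
# Sketch of one point-implicit step, (I - dt J) dQ/dt = Omega, for a toy
# reversible reaction A <-> B with illustrative rate constants kf, kb.
def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def point_implicit_step(Q, dt, kf, kb):
    """Advance [A, B] by one step of Eq. (29), then Q <- Q + dt dQ/dt."""
    a, b = Q
    omega = [-kf * a + kb * b, kf * a - kb * b]   # source term
    J = [[-kf, kb], [kf, -kb]]                    # Jacobian d(omega)/dQ
    A = [[(1.0 if i == j else 0.0) - dt * J[i][j] for j in range(2)]
         for i in range(2)]                       # I - dt J
    dQdt = gauss_solve(A, omega)
    return [q + dt * dq for q, dq in zip(Q, dQdt)]
```

Because the columns of $J$ sum to zero, the columns of $I - \Delta t J$ sum to one, so the update conserves total mass exactly, in keeping with Eq. (27).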
Since there is no communication between the thread blocks, the execution order is automatically optimized, so a GPU with more cores will perform the calculation faster. This is shown in Figure 3. The data parallelism is also inherent at the thread level. An instruction given to a thread block is handled by an SM, which contains a number of streaming processors (SP). All the threads within each block are organized into groups of 32 threads called warps, which are executed in a SIMT manner. The difference in data parallelism between the grid and the thread block is that there is a synchronization mechanism for all the threads in a same block, but not for all the blocks in the grid. It is therefore important to ensure that there is no data dependency between thread blocks.

The parallelization is done by directly mapping the computational domain to the CUDA grid. The face values can be mapped the same way with a larger grid, since the number of faces in each direction is always one greater than the number of cells in that direction. Each CUDA thread can be associated with one cell/face inside the computational domain. In order to maximize memory access efficiency, the data for the whole domain are stored as a one-dimensional array. The index of this array can be calculated from the dimensional indices and the variable index, and vice versa. This is similar to the approach taken in [16]. Since all the data storage is one-dimensional, the domain can be decomposed into one-dimensional stencils of cell/face values, where each stencil can be assigned to a CUDA block. Since the computational domain can be up to three-dimensional, one can split the stencils in different ways. However, it is desirable to split the stencils so that all the components of a stencil are located in contiguous memory. For example, if i is the fastest varying index of a three-dimensional data array A(i, j, k), the stencils are created by splitting the domain along the i direction.
The directional indices (i, j, k) of a three-dimensional domain with lengths IDIM, JDIM and KDIM can be calculated from the thread index as follows:

    k  = thread_Idx / (IDIM*JDIM)
    ij = thread_Idx mod (IDIM*JDIM)
    j  = ij / IDIM
    i  = ij mod IDIM
Each stencil can now be fitted into a block of threads, and each component of the stencil is associated with a thread. Since all the threads within a block access consecutive memory addresses, the access pattern is coalesced, resulting in high memory bandwidth. The calculation inside the kernel requires a certain number of registers, especially for high-order schemes, so the size of the stencil is constrained only by the number of registers available to each warp. However, within that constraint, the size of the block can have an impact on the performance of the kernel.

It is usually recommended to maximize the block occupancy to make up for memory latency. Since the occupancy factor is proportional to the block size, a large block size results in high occupancy. However, as studied by Volkov [15], in cases where there are multiple independent instructions in the kernel, it is more advantageous to make the block size smaller and utilize more registers to cover the memory latency. One example in the case of the fluid solver is the construction of the eigensystem. Since each entry of the eigensystem can be constructed independently of the others, the kernel actually performs faster with a smaller block size. This is referred to as Instruction-Level Parallelism (ILP). More information on how to optimize the performance of a kernel using ILP can be found in Ref. [15].
Figure 4: Chemistry size limit (matrix size in KB versus number of species, for 32 cells per CUDA block and 1 cell per CUDA block, compared against the shared memory limit).
The parallelization of the kinetics solver can greatly benefit from GPU acceleration. The problem can be described as a simple linear algebra problem $Ax = b$, where $A$ is the Jacobian matrix of equation (29). The solution of the system contains the changes in molar concentration of all the species due to chemical reactions. The system is solved using a Gaussian elimination algorithm, which is basically a sequential method. Since the kinetics in each computational cell are independent of the other cells, one can parallelize the system on a thread-per-cell basis. The Gaussian elimination process requires a large amount of memory access, with read and write instructions to use and modify the values of the Jacobian. We investigated different approaches to maximize the performance of the kinetics solver.

The first approach is to store everything in global memory. Although global memory is the slowest type of memory on the device, coalesced memory access can result in memory bandwidth close to the theoretical limit. The advantage of this approach is that it is simple and straightforward. The global memory being the largest on the device, the restriction on the number of species can be relaxed. It must be noted that if the number of species is large, the Gaussian method may suffer from accumulated round-off error, and double precision rapidly becomes a necessity.

It is usually recommended [10] to utilize shared memory whenever possible to reduce global memory traffic. In this case, solving the chemical system requires inverting an $N_s$-by-$N_s$ matrix, and the system needs to be solved at every computational cell. Storing the whole Jacobian and the RHS vector in shared memory is not ideal here. Figure 4 shows the memory requirement for storing the Jacobian and RHS in the typical shared memory (48 KB for a Tesla C2050/2070). If we associate an entire thread block with the chemical system in a cell, the number of species is limited to 75.
Storing more than one system per block makes this limit go even lower, as can be seen for 1 thread per cell (32-thread block size).
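That limit can be checked with a quick sizing estimate. The sketch below assumes double precision (8 bytes) and storage of the full $N_s \times N_s$ Jacobian plus the $N_s$-entry RHS vector per system; the exact per-block bookkeeping in the solver may differ slightly, which would account for the small gap with the quoted limit of 75 species:

```python
# Estimate the largest chemical system whose Jacobian + RHS fit in the
# 48 KB of shared memory on a Tesla C2050/2070, assuming 8-byte doubles.
SHARED_MEM = 48 * 1024  # bytes of shared memory per thread block

def system_bytes(n_species, systems_per_block=1):
    """Bytes for (N_s x N_s Jacobian + N_s RHS) per stored system."""
    return systems_per_block * (n_species * n_species + n_species) * 8

def max_species(systems_per_block=1):
    """Largest N_s whose systems still fit in shared memory."""
    n = 1
    while system_bytes(n + 1, systems_per_block) <= SHARED_MEM:
        n += 1
    return n
```

Under these assumptions one system per block fits up to roughly 77 species, close to the 75 quoted above, while storing 32 systems per block (one thread per cell) drops the limit to about 13 species.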
Figure 5: Chemistry size limit (maximum number of species versus number of elements for the reduced shared memory, full shared memory, and global memory storage patterns).
In order to overcome the shared memory limit, we considered storing only two rows of the Jacobian in shared memory, since the elimination proceeds row by row. For each row elimination, one needs to store the values of that row and of the pivot row. This is shown in Figure 5, where the species limit is now much higher than for the full storage pattern. The drawback of this approach, however, is that there are multiple memory transfers between global and shared memory, since we are required to copy back the values of each row after it has been eliminated. The parallelization is only effective when the calculation is dominant, so that it makes up for the memory transfers. It will be shown later that this algorithm is only fast for a large number of species.
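The two-row pattern can be sketched in serial form as follows; a small buffer stands in for shared memory (it holds only the pivot row and the row being eliminated), and each row is written back to the "global" array after elimination. Partial pivoting is omitted here for brevity:

```python
# Serial sketch of the two-row elimination pattern: only the pivot row
# and the working row are staged in a small buffer (standing in for
# shared memory); each eliminated row is copied back to "global" storage.
def eliminate_two_row(J, b):
    """Solve J x = b in place, staging two rows at a time."""
    n = len(b)
    for p in range(n):                      # pivot row index
        pivot = J[p][:]                     # stage pivot row in "shared"
        for r in range(p + 1, n):
            work = J[r][:]                  # stage the working row
            f = work[p] / pivot[p]
            for c in range(p, n):
                work[c] -= f * pivot[c]
            b[r] -= f * b[p]
            J[r] = work                     # copy the row back to "global"
    x = [0.0] * n                           # back substitution
    for r in range(n - 1, -1, -1):
        s = sum(J[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (b[r] - s) / J[r][r]
    return x
```

The repeated stage/copy-back traffic is exactly the overhead discussed above: it only pays off when the arithmetic per row (which grows with $N_s$) dominates the transfers.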
The first objective is to verify that the solver is correctly implemented using the CUDA kernels. For this purpose, we can compare the results with a pure-CPU version, but also compute a set of standard test cases. The first of those is a Mach 3 wind tunnel problem (a.k.a. the forward step problem) using the MP5 scheme, whose solution is shown in Figure 6. This problem was utilized by Woodward and Colella [17] to test a variety of numerical schemes. The whole domain is initialized with Mach-3 flow, and reflective boundary conditions are enforced on the step and the upper part of the domain. The left and right boundary conditions are set as inflow and outflow, respectively. Special attention is usually required at the corner of the step, since this is a singular point of the flow which can create numerical instabilities. Woodward and Colella treated this by assuming the flow near the corner is nearly steady. However, this artificial fix was not used in this simulation, since we want to test the robustness of the solver in the case of strong shocks and how it handles the singularity in wall curvature, responsible for a very strong expansion.

The second test involves a similar problem of the diffraction of a shock wave ($M = 2.4$) down a step [14]. The strong rarefaction at the corner of the step can cause a problem of negative density when performing the reconstruction. The problem is modeled here using 27,000 cells, and the numerical simulation is shown alongside the experimental images in Figure 7. The solver was able to reproduce the correct flow features with excellent accuracy.

We also modeled the Rayleigh-Taylor instability problem [6]. The problem is described as the acceleration of a heavy fluid into a light fluid driven by gravity. For a rectangular domain, the initial conditions are:
$$\rho = 2, \quad u = 0, \quad v = -0.025\, c \cos(8\pi x), \quad P = 2y + 1 \qquad \text{for } 0 \le y \le \tfrac{1}{2}$$
$$\rho = 1, \quad u = 0, \quad v = -0.025\, c \cos(8\pi x), \quad P = y + \tfrac{3}{2} \qquad \text{for } \tfrac{1}{2} \le y \le 1$$
where $c$ is the speed of sound. The top and bottom boundaries are set as reflecting, and the left and right boundaries are periodic. As the flow progresses, the shear layer starts to develop and the Kelvin-Helmholtz instabilities become more evident. The gravity effect is taken into account by adding a source term vector which modifies the momentum and the energy of the flow based on the gravitational force. The source term in this case is relatively simple and contributes very little to the overall computational time. The performance of the fluid dynamics calculation is discussed in the next section of this report.

We now turn our attention to the modeling of a reactive flow field. We simulated a spark-ignited detonation wave in both one and two dimensions to demonstrate the capability of the solver. At a well-resolved scale, the detonation wave can be described as a strong shock wave supported by the heat release from a high-temperature flame behind an induction zone. Interesting features have been observed in both the 1-D and 2-D simulations, characterized by the coupling of the fluid dynamics and chemical kinetics. The study of flame-shock coupling is an on-going research topic and can certainly be aided by GPU computing when the evolution of the detonation wave needs to be resolved at a very fine spatial scale.

The evolution of the pressure and temperature of the two-dimensional flow field is shown in Figure 9. The flow is modeled using a 9-species gas mixture with 38 reactions. The mechanism used for the simulation is shown in Appendix A. The computational domain is rectangular with a length of 7.5 cm. The grid spacing in both directions is 50 µm. The detonation cells, between the shock and the multiple triple points in transverse motion, are clearly seen. This well-known cellular structure has been observed in both experiments and numerical simulations.
We will show in the next section how the superior performance of the GPU can enhance our ability to model reactive flows.

Figure 10 shows the performance of the solver for simulations considering only the fluid dynamics, using the MP5 and ADERWENO schemes. The speed-ups obtained in both cases are very promising. Since the ADERWENO scheme only requires single-stage time integration, it is faster than the MP5 scheme. For the ADERWENO scheme, we obtain almost 60 times speed-up for a large grid, which is about twice as fast as the MP5 scheme. All the comparisons are made between a Tesla C2070 and an Intel Xeon X5650 (single thread) using double-precision calculation.

The performance of the kinetics solver depends strongly on the memory access efficiency. Since the Gaussian elimination algorithm requires issuing a large number of memory instructions (both read and write) to modify all the entries of the Jacobian, it is important to achieve high memory bandwidth while maintaining sufficient independent arithmetic operations to hide memory latency. This factor is crucial in the case where the whole Jacobian is stored in global memory (DRAM), since the DRAM latency is much higher than that of shared memory. Figure 11 illustrates the memory access efficiency of the GPU kernel performing the Gaussian elimination procedure. It clearly shows that coalesced memory access results in much higher memory bandwidth compared to the non-coalesced pattern for the same number of operations. The memory bandwidth recorded for the kernel is up to 80% of the theoretical peak of the device (144 GB/s for a Tesla C2050). The test was performed for up to 200 species. The result shown in Figure 11 indicates that the global memory access pattern implemented in the first approach is very efficient. In the second approach, we utilized shared memory to compensate for DRAM latency. However, due to the memory-intensive nature of the chemical kinetics problem and the limited shared memory storage, there is a substantial amount of DRAM access which cannot be avoided. This memory bottleneck introduces additional memory latency which can affect the performance of the kernel.

Figure 12 shows the performance of the kinetics solver only (i.e., no convection). Although the construction of the Jacobian and chemical source terms can also be a very time-consuming process, we are interested here in the most computationally intensive part, i.e. the Gaussian elimination and its scaling. Hence, the performance is measured by solving a number of linear systems $A \cdot x = b$ with different system sizes ($N_s$) and grid sizes ($N_{cell}$).

We described two different implementation approaches in our earlier discussion. The first approach is to store everything in global memory and try to achieve high memory bandwidth by coalesced memory reads.
Figure 10: Performance (speed-up versus number of elements) of the fluid dynamics simulation only, for the MP5 and ADERWENO schemes.
Figure 11: Memory bandwidth (GB/s) of the kinetics solver versus number of species, measured on a Tesla C2050; coalesced and non-coalesced access patterns are compared against the theoretical peak.
Figure 12: Comparison of the speed-up factors obtained from the kinetics solver using both global and shared memory, for several grid sizes.

The second approach is to transfer memory to shared memory for each row elimination. Figure 12 clearly shows that the global version outperforms the shared memory version in all cases. Since the shared memory approach requires additional memory transfers between each row elimination, it is only effective when $N_s$ is large. Memory usage constrains the grid sizes that can be tested: a 512 × 512 grid of cells could be used for up to 25 species, then a 128 × 128 grid was possible for up to 200 species, etc. The intense memory usage comes from the need to store an $N_s \times N_s$ Jacobian for each computational cell (e.g., this variable alone adds up to more than 5 GB of global memory for 200 species on the 128 × 128 grid). This can be alleviated by storing the Jacobian for only a segment of the domain; for example, the kernel for a 512 × 128 grid can be called 4 times to solve the kinetics of a 512 × 512 grid. The constraint for this approach is that the segment itself must be sufficiently large, i.e. of the order of the number of streaming processors in the GPU, and that the overhead for each kernel call is small. This is shown in Figure 13, where the calculation is performed for a grid with a fixed size of 128 × 128 with 100 species and a varying number of segments. In this test case, the difference in execution time is negligible up to 4 segments. When the domain is further divided into 8 segments, the size of each segment gets smaller (2048 cells) and this approach is no longer effective.

Figure 13: Kinetics calculation of a 128 × 128 grid with 100 species. The calculation is done by breaking the computational domain into a number of segments.

Figure 14 shows the performance of the flow solver coupled with the chemical kinetics. The result shown in the plot is for a 9-species gas (hydrogen-air) with a mechanism consisting of 38 reactions (both forward and backward). The MP5 scheme is utilized in all cases. Although the ADERWENO scheme has been shown to be faster than MP5, we do not expect to see a significant difference in performance, because the computation time is dominated by the chemical kinetics. The overall performance should be largely dependent on the performance of the kinetics solver. It is clear that the global memory version outperforms the shared memory version. This is consistent with the results obtained earlier for the fluid dynamics and the kinetics separately. As the grid size increases, the speed-up factor obtained for the shared memory version of the kinetics solver does not change rapidly. In contrast, the global memory version results in a 40 times speed-up for a large grid.

We extended the simulation to model a larger mechanism with 36 species and 308 reactions (reduced hydrocarbon). The performance is compared with the previous hydrogen-air result. As illustrated in Figure 15, the speed-up for the reduced hydrocarbon mechanism is higher than for hydrogen-air, presumably due to better performance in the construction of the chemical kinetics Jacobian.
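The domain-segmentation strategy described above can be sketched as follows; `solve_kinetics_segment` is a stand-in for the actual GPU kernel launch, and the function names are ours, not from the paper's code:

```python
# Sketch of the domain-segmentation strategy: the kinetics "kernel" is
# invoked once per segment, so only one segment's Jacobians need to be
# resident in device memory at a time.
def solve_kinetics_segment(cells):
    """Stand-in for the per-segment GPU kernel launch."""
    # placeholder update: the real kernel performs a point-implicit
    # solve for every cell of the segment
    return [c + 1.0 for c in cells]

def solve_kinetics_segmented(cells, n_segments):
    """Apply the kinetics solve segment by segment over the domain."""
    n = len(cells)
    assert n % n_segments == 0, "segments must tile the domain evenly"
    seg = n // n_segments
    out = []
    for s in range(n_segments):
        out.extend(solve_kinetics_segment(cells[s * seg:(s + 1) * seg]))
    return out
```

The result is independent of the segment count, since cells are decoupled in the kinetics step; as noted in the text, the only trade-off is that small segments under-occupy the streaming multiprocessors and pay repeated kernel-launch overhead.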
In the current paper, we described the implementation of a numerical solver for simulating chemically reacting flow on the GPU. The fluid dynamics is modeled using high-order shock-capturing schemes, and the chemical kinetics is solved using an implicit solver. Results for both the fluid dynamics and the chemical kinetics are shown. Considering only the fluid dynamics, we obtained speed-ups of 30 and 55 times compared to the CPU version for the MP5 and ADERWENO schemes, respectively. For the chemical kinetics, we presented two different approaches to implementing the Gaussian elimination algorithm on the GPU. The best performance obtained in solving the kinetics problem ranges from 30 to 40 depending on the size of the reaction mechanism. When the fluid dynamics is coupled with the kinetics, we obtained a speed-up factor of 40 times for a 9-species gas mixture with 38 reactions. The solver was also tested with a larger mechanism (36 species, 308 reactions), and the speed-up obtained is higher than for the small mechanism.

The current work can be extended in different ways. First, since the framework performs well on a shared-memory architecture, it is possible to extend it to a distributed-memory architecture utilizing the Message Passing Interface (MPI). This extension permits the use of multiple GPUs, which is attractive for performing large-scale simulations. On the other hand, although the current simulations are done for chemically reacting flow, it is desirable to extend the solver to simulate ionized gas (i.e., plasma), which requires modeling additional physical processes (collisional-radiative) to characterize the different excitation levels of the charged species. In addition, the governing equations also need to be extended to characterize the thermal non-equilibrium environment of the plasma. Given that the physics has been well established [7, 8, 3], the extension is certainly achievable.

Figure 14: Performance of the reactive flow solver for a 9-species gas.

Figure 15: Performance of the reactive flow solver for two different chemistry mechanisms.

References

[1] Brandvik, T., and Pullan, G.
Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware. AIAA Paper 08-607, 2008.
[2] Cambier, J.-L., Adelman, H. G., and Menees, G. P. Numerical Simulations of Oblique Detonations in Supersonic Combustion Chamber. J. of Prop. and Power 5, 4 (1989), 481-491.
[3] Cambier, J.-L., and Moreau, S. Simulations of a Molecular Plasma in Collisional-Radiative Nonequilibrium. AIAA Paper 93-3196.
[4] Einfeldt, B., Munz, C. D., Roe, P. L., and Sjögreen, B. On Godunov-Type Methods Near Low Densities. J. Comp. Phys. 92, 2 (1991), 273-295.
[5] Elsen, E., LeGresley, P., and Darve, E. Large Calculation of the Flow Over a Hypersonic Vehicle Using a GPU. J. Comp. Phys. 227, 24 (2008), 10148-10161.
[6] Gardner, C. L., Glimm, J., McBryan, O., Menikoff, R., Sharp, D. H., and Zhang, Q. The Dynamics of Bubble Growth for Rayleigh-Taylor Unstable Interfaces. Physics of Fluids 31, 3 (1988), 447-465.
[7] Kapper, M. G., and Cambier, J.-L. Ionizing Shocks in Argon. Part I: Collisional-Radiative Model and Steady-State Structure. Journal of Applied Physics 109, 11 (2011), 113308.
[8] Kapper, M. G., and Cambier, J.-L. Ionizing Shocks in Argon. Part II: Transient and Multi-Dimensional Effects. Journal of Applied Physics 109, 11 (2011), 113309.
[9] Klockner, A., Warburton, T., Bridge, J., and Hesthaven, J. Nodal Discontinuous Galerkin Methods on Graphics Processors. J. Comp. Phys. 228, 21 (2009), 7863-7882.
[10] NVIDIA Corporation. Compute Unified Device Architecture Programming Guide, version 4.0, 2011.
[11] Suresh, A., and Huynh, H. T. Accurate Monotonicity-Preserving Schemes with Runge-Kutta Time Stepping. J. Comp. Phys. 136 (1997), 83-99.
[12] Titarev, V. A., and Toro, E. F. ADER Schemes for Three-Dimensional Nonlinear Hyperbolic Systems. J. Comp. Phys. 204, 2 (2005), 715-736.
[13] Toro, E. F. Riemann Solvers and Numerical Methods for Fluid Dynamics: A Practical Introduction, 2nd ed. Springer, 1999.
[14] Van Dyke, M. An Album of Fluid Motion. Parabolic Press, Inc., 1989.
[15] Volkov, V. Better Performance at Lower Occupancy. GPU Technology Conference, 2010.
[16] Wong, H.-C., Wong, U.-H., Feng, X., and Tang, Z. Efficient Magnetohydrodynamic Simulations on Graphics Processing Units with CUDA. Computer Physics Communications 182, 10 (2011), 2132-2160.
[17] Woodward, P., and Colella, P. The Numerical Simulation of Two-Dimensional Fluid Flow with Strong Shocks. J. Comp. Phys. 54, 1 (1984), 115-173.

Appendix A: Reaction Mechanism for Hydrogen-Air Detonation

1. H + O2 ↔ O + OH
2. O + H2 ↔ H + OH
3. H2 + OH ↔ H + H2O
4. OH + OH ↔ H2O + O
5. H + OH + M ↔ H2O + M
6. H + H + M ↔ H2 + M
7. H + O + M ↔ OH + M
8. O + O + M ↔ O2 + M
9. H2 + O2 ↔ HO2 + H
10. H + O2 + M ↔ HO2 + M
11. H + HO2 ↔ OH + OH
12. H + HO2 ↔ O + H2O
13. O + HO2 ↔ O2 + OH
14. OH + HO2 ↔ O2 + H2O
15. H2O2 + M ↔ OH + OH + M
16. HO2 + HO2 ↔ H2O2 + O2
17. H + H2O2 ↔ H2 + HO2
18. O + H2O2 ↔ OH + HO2
19. OH + H2O2 ↔ H2O + HO2