A scalable multi-GPU method for semi-implicit fractional-step integration of incompressible Navier-Stokes equations
Sanghyun Ha, Junshin Park and Donghyun You
Department of Mechanical Engineering, Pohang University of Science and Technology, 77 Cheongam-ro, Nam-gu, Pohang, Gyeongbuk 37673, Republic of Korea
Abstract
A new flow solver scalable on multiple Graphics Processing Units (GPUs) for direct numerical simulation of wall-bounded incompressible flow is presented. This solver builds on a previously reported work [4] which proposes a semi-implicit fractional-step method on a single GPU. Extension of this work to accommodate multiple GPUs becomes inefficient when global transpose is used in the Alternating Direction Implicit (ADI) and Fourier-transform-based direct methods. A new strategy for designing an efficient multi-GPU solver is described to completely remove global transpose and achieve high scalability. Parallel Diagonal Dominant (PDD) and Parallel Partition (PPT) methods are implemented for GPUs to obtain good scaling and preserve accuracy. An overall efficiency of 0.89 is obtained on 4 GPUs. A turbulent flat-plate boundary layer is simulated on 607M grid points using 4 Tesla P100 GPUs.
Keywords:
GPU, Boundary layer, Direct numerical simulation, Semi-implicit fractional-step method, Parallel Diagonal Dominant method

Corresponding author. E-mail: [email protected]; Phone: +82-54-279-2191; Fax: +82-54-279-3199

Preprint submitted to Journal of Computational Physics, December 5, 2018
1. Introduction
Turbulent and transitional boundary layers are comprised of a variety of scales. Such broadband scales can be captured by direct numerical simulation (DNS), which provides high-resolution data. A major challenge in DNS of such wall-bounded flows is the heavy requirement in domain length and grid size, which necessitates a significant amount of computational resources. Therefore the choice of an efficient numerical scheme and algorithms for its parallelization is critical in the study of boundary layers using DNS.

A commonly used method for spanwisely periodic wall-bounded flows is the semi-implicit fractional-step method with a second-order spatial discretization. Among many variations of this method, a classic version solves the momentum equation using the Alternating Direction Implicit (ADI) method, followed by the Poisson equation which is solved directly using Fourier transforms [6]. Since there are no iterations involved, this is one of the most efficient methods for solving incompressible flow. However, in parallel computing, these algorithms do not easily scale on multiple processors due to their inherently serial nature.

In a recent work [4], we have proposed a parallel implementation of the semi-implicit fractional-step method for Graphics Processing Units (GPUs), which represent hardware based on a massively parallel architecture. Major difficulties coming from the serial nature of the fractional-step method have been analyzed and overcome, achieving up to a 48× speedup on 134M grid cells using a single Tesla P100 GPU. Yet, this work was limited to a single GPU, which could only afford low to moderate Reynolds numbers.

To use this method at high Reynolds numbers, the use of multiple GPUs is inevitable. Unlike single-GPU programming, which focuses on fine-grained parallelism, multi-GPU programming focuses on domain decomposition and data distribution at a coarser level. When distributing jobs to multiple GPUs, we are interested in selecting a domain decomposition method that not only minimizes communication between GPUs but also allows coalesced access of the global memory. Note that the ADI and Fourier-transform-based direct methods make frequent use of matrix transpose to efficiently access data in each direction. The problem is that matrix transpose becomes a global all-to-all operation when data is distributed on multiple GPUs. Many studies, such as [1], [2], and [7], have used global transpose in simulations of wall-bounded flows. They have obtained a weakly linear scaling on thousands of CPU cores, but reported that global transpose takes up a majority of the total computation time. Similar characteristics are reported in a recently developed GPU code [13] whose performance depends mainly on the global transpose of the pressure solver. It will be shown later that the cost of global transpose becomes even worse when applied to the present semi-implicit fractional-step method.

The present study aims to extend the classic fractional-step method to multiple GPUs. The goal is to completely remove global transpose from the GPU algorithm and achieve high scalability. To do so, the present study employs divide-and-conquer algorithms called the Parallel Diagonal Dominant (PDD) and Parallel Partition (PPT) methods, which are newly implemented to suit GPUs. The paper is organized as follows: in Section 2, numerical methods used to discretize the governing equations are described. In Section 3, strategies for GPU implementation are explained.
In Section 4, results from numerical experiments of the flow solver are provided with performance analyses. Concluding remarks follow in Section 5.
2. Numerical methods
Numerical methods are identical to those used in the previous work [4]. Here, we offer a brief explanation of the parts most relevant to the present study.
The non-dimensionalized incompressible Navier-Stokes equations are written as

$$\frac{\partial u_i}{\partial x_i} = 0, \qquad (1)$$

$$\frac{\partial u_i}{\partial t} + \frac{\partial}{\partial x_j} u_i u_j = -\frac{\partial p}{\partial x_i} + \frac{1}{Re}\frac{\partial}{\partial x_j}\frac{\partial u_i}{\partial x_j}, \qquad (2)$$

where $Re$ is the Reynolds number based on a characteristic length scale. Non-dimensional variables $u_i$ and $p$ represent velocity in the $i$-direction and pressure, respectively. A three-dimensional staggered structured grid topology is used, in which all velocity components are stored at cell faces and the pressure values at the center of each cell. Flow over a flat plate is modeled on a rectangular box (Fig. 1). Uniform grid spacings are employed in the streamwise $x$- and spanwise $z$-directions, respectively, while the grid is clustered near the wall in the wall-normal $y$-direction. A no-slip condition is imposed at the bottom wall at $y = 0$, and a stress-free condition at the top boundary. A convective boundary condition is applied at the outlet, while turbulent inflow is created using a recycling method [8].

The above equations are solved by a semi-implicit fractional-step method in which the convection terms of the momentum equation are integrated explicitly in time using a low-storage third-order Runge-Kutta scheme, while the viscous terms are integrated implicitly using the Crank-Nicolson scheme [5]. Spatial discretization is performed using second-order central differences. The momentum equation is approximated using the ADI method, which produces three tridiagonal matrices for each velocity component. The Poisson equation is solved directly using a half-range cosine transform in the streamwise $x$-direction and a Fourier transform in the spanwise $z$-direction. Complex-numbered tridiagonal matrices in the wall-normal $y$-direction are inverted, after which the pseudo-pressure $\phi$ is obtained via inverse transforms.
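For reference, a minimal sketch of one Runge-Kutta substep of this scheme is given below, assuming the standard pseudo-pressure projection form of [5, 6]; here $H_i$ denotes the discretized convection terms, $L_i$ the discretized viscous terms, and $\gamma_k$, $\zeta_k$, $\alpha_k$ the substep weights, whose exact values and arrangement follow [4, 5] rather than this sketch:

$$\frac{\hat{u}_i - u_i^{k-1}}{\Delta t} = \gamma_k H_i^{k-1} + \zeta_k H_i^{k-2} + \alpha_k \left[ L_i(\hat{u}) + L_i(u^{k-1}) \right],$$

$$\frac{\partial^2 \phi^k}{\partial x_j \partial x_j} = \frac{1}{2 \alpha_k \Delta t} \frac{\partial \hat{u}_j}{\partial x_j}, \qquad u_i^k = \hat{u}_i - 2 \alpha_k \Delta t \, \frac{\partial \phi^k}{\partial x_i}.$$

The implicit treatment of $L_i$ requires inverting $(1 - \alpha_k \Delta t\, L)$ for the intermediate velocity $\hat{u}_i$; the ADI method approximately factorizes this operator into three one-dimensional operators, which is the origin of the tridiagonal matrices mentioned above, while the transformed Poisson equation for $\phi$ yields the complex-numbered tridiagonal systems in $y$.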
3. GPU implementation
In the present fractional-step method, equations (both the momentum and Poisson) are solved in one direction at a time. This means that the data orientation should also be changed whenever the solution direction changes. Transpose operations play a critical role in communication between GPUs and in the coalesced access of the global memory within each GPU. In this section, the cost of global transpose is first investigated in relation to the overall performance. Then, a different domain decomposition suitable for the present method is proposed. The present study uses the Message Passing Interface (MPI) such that a one-to-one mapping between a GPU and an MPI rank is established.
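As a sketch of this one-to-one mapping (the production solver is written in CUDA Fortran and compiled with the PGI compiler, see Section 4; mpi4py and CuPy are used here only as stand-ins), each rank simply selects the device matching its index on the node:

```python
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ndev = cp.cuda.runtime.getDeviceCount()

# One GPU per MPI rank: with four GPUs per node, node-local ranks 0-3
# are bound to devices 0-3.
cp.cuda.Device(rank % ndev).use()
```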
For the purpose of testing global transpose, a one-dimensional domain decomposition in the spanwise direction is implemented. Matrices are transposed three times in the ADI method and four times in solving the Poisson equation (Fig. 2), as listed below. Among these, four need to be transposed in an all-to-all manner; they are marked as 'ALL-TO-ALL' in Fig. 2 and '(global)' in the list below. Detailed implementation of local/global transpose on GPUs is based on the algorithms in [10].

• Momentum equation
  1. Transpose x-orientation to z-orientation for z-directional ADI (global)
  2. Transpose z-orientation to y-orientation for y-directional ADI (global)
  3. Transpose y-orientation to x-orientation for x-directional ADI

• Poisson equation
  1. Transpose x-orientation to z-orientation for complex-to-complex Fourier transform (global)
  2. Transpose z-orientation to y-orientation for inversion of complex-numbered tridiagonal matrices
  3. Transpose y-orientation to z-orientation for complex-to-real inverse Fourier transform
  4. Transpose z-orientation to x-orientation for complex-to-complex inverse half-range cosine transform (global)

The cost of global transpose is shown in Fig. 3 along with other major parts of the momentum and Poisson equations. Computation time is measured on 675M grid points using 4 Tesla P100 GPUs. In both the momentum and Poisson equations, the time taken to perform all-to-all communication far exceeds the main computation time such as tridiagonal matrix (TDMA) inversion or fast Fourier transform (FFT). As a result, global transpose takes up about 46% of the entire computation time at each time-step, which makes it impractical.

It should also be noted that all-to-all communication is more expensive in the ADI method than in the Poisson equation. In many studies using semi-implicit fractional-step methods, only the wall-normal diffusion term of the momentum equation is integrated implicitly. Then, by orienting the decomposed sub-domains in the wall-normal direction, tridiagonal matrices can be solved without all-to-all communication. This ensures that global transpose occurs only for the FFT in the Poisson equation. Although this method has less communication overhead, global transpose is still the main source of reduced scalability.
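To make the all-to-all nature of these operations concrete, the sketch below (a schematic of the communication pattern only; the actual GPU implementation follows [10], and the array shapes and names are assumptions of this sketch) transposes a field from a z-slab to a y-slab decomposition using MPI:

```python
import numpy as np
from mpi4py import MPI

def transpose_z_to_y(a, comm):
    """All-to-all transpose from a z-slab to a y-slab decomposition.
    Input a has shape (nx, ny, nz_loc) on each rank; the result has
    shape (nx, ny_loc, nz), where ny_loc = ny // p and nz = nz_loc * p."""
    p = comm.Get_size()
    nx, ny, nz_loc = a.shape
    ny_loc = ny // p
    # Split the local z-slab into p pieces along y, one destined for each rank.
    sendbuf = np.ascontiguousarray(
        a.reshape(nx, p, ny_loc, nz_loc).transpose(1, 0, 2, 3))
    recvbuf = np.empty_like(sendbuf)
    comm.Alltoall(sendbuf, recvbuf)        # the expensive global step
    # Piece s now holds the z-slab owned by rank s; stack the pieces along z.
    return np.concatenate(list(recvbuf), axis=-1)

if __name__ == "__main__":
    comm = MPI.COMM_WORLD                  # run e.g. with: mpirun -np 4 python transpose.py
    nx, ny, nz = 8, 8, 8
    a = np.full((nx, ny, nz // comm.Get_size()), float(comm.Get_rank()))
    b = transpose_z_to_y(a, comm)          # shape (nx, ny // p, nz)
    print(comm.Get_rank(), b.shape)
```

Every rank exchanges a block with every other rank, so the whole field crosses the interconnect at each transpose; this is the cost that dominates Fig. 3.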
Consider a one-dimensional domain decomposition in the wall-normal $y$-direction. In this type of decomposition, the FFT in the $x$- and $z$-directions can be computed without global transpose, since all required data resides locally in each GPU (Fig. 4). However, data required for the $y$-directional TDMAs is now distributed across different GPUs. Rather than using global transpose to collect them, two methods are employed to directly solve the tridiagonal systems in parallel: the Parallel Diagonal Dominant (PDD) and Parallel Partition (PPT) methods.

The PDD and PPT methods were first proposed by Sun et al. [11] to solve TDMAs distributed across multiple processors. They are suited for coarse-grained parallel machines for which the number of processors is usually much smaller than the dimension $n$ of the matrix. Here, the basic idea of the algorithm is described with a specific example where $n = 12$ and the number of GPUs $p = 4$. For a more general and detailed derivation of this method, refer to [11].

A tridiagonal matrix $A = [a_j, b_j, c_j]$ ($j = 1, 2, \cdots, n$) can be decomposed into a block-tridiagonal matrix $\tilde{A}$ and the remaining corner elements $\Delta A$:

$$A = \tilde{A} + \Delta A.$$

For this example ($n = 12$, $p = 4$, blocks of size $m = n/p = 3$), $\Delta A$ contains only the six elements that couple neighboring blocks,

$$\Delta A_{3,4} = c_3, \quad \Delta A_{4,3} = a_4, \quad \Delta A_{6,7} = c_6, \quad \Delta A_{7,6} = a_7, \quad \Delta A_{9,10} = c_9, \quad \Delta A_{10,9} = a_{10},$$

with all other entries zero. By rewriting $\Delta A$ as $\Delta A = V E^T$, where $V \in \mathbb{R}^{n \times 2(p-1)}$ carries one corner element per column and $E \in \mathbb{R}^{n \times 2(p-1)}$ is the corresponding Boolean selection matrix, the original matrix can be written as $A = \tilde{A} + V E^T$.

We are interested in finding the solution $x$ of the system $Ax = d$. This can be computed using the inverse of $A = \tilde{A} + V E^T$, which is given by the Sherman-Morrison matrix identity in Eq. (3):

$$\left(\tilde{A} + V E^T\right)^{-1} = \tilde{A}^{-1} - \tilde{A}^{-1} V \left(I + E^T \tilde{A}^{-1} V\right)^{-1} E^T \tilde{A}^{-1}, \qquad (3)$$

$$x = A^{-1} d = \tilde{A}^{-1} d - \tilde{A}^{-1} V \left(I + E^T \tilde{A}^{-1} V\right)^{-1} E^T \tilde{A}^{-1} d. \qquad (4)$$

Since $\tilde{A}$ is block-tridiagonal, each block can be stored in each GPU. Thus $\tilde{A}^{-1} d$ and $\tilde{A}^{-1} V$ can be computed by solving the following equations locally on independent GPUs:

$$\tilde{A} \tilde{x} = d, \qquad (5)$$

$$\tilde{A} Y = V, \qquad (6)$$

where each column of $Y$ is nonzero only within a single block: block $i$ contributes a vector $v^{(i)}$ (generated by its sub-diagonal corner element $a$) and a vector $w^{(i)}$ (generated by its super-diagonal corner element $c$), the first block having only $w^{(0)}$ and the last only $v^{(p-1)}$. Here, $m = n/p = 3$, and the superscript denotes the MPI rank or GPU index ranging from $0$ to $p - 1$.

From Eq. (4), let $Z = I + E^T \tilde{A}^{-1} V$, which is a five-banded $2(p-1) \times 2(p-1)$ matrix whose entries are built from the first and last elements of the vectors $v^{(i)}$ and $w^{(i)}$. By solving the following system for $y$,

$$Z y = E^T \tilde{x}, \qquad (7)$$

we finally obtain the solution $x$:

$$x = \tilde{x} - Y y. \qquad (8)$$

If we instead use a permutation matrix $P$ (with $P = P^{-1}$), Eq. (4) becomes

$$x = \tilde{A}^{-1} d - \tilde{A}^{-1} V P \left(P + E^T \tilde{A}^{-1} V P\right)^{-1} E^T \tilde{A}^{-1} d.$$

Note that such a permutation produces a tridiagonal $2(p-1) \times 2(p-1)$ matrix $Z = P + E^T \tilde{A}^{-1} V P$. By solving the following equations,

$$\tilde{A} Y = V P, \qquad (9)$$

$$Z y = E^T \tilde{x}, \qquad (10)$$

we finally obtain the solution $x$:

$$x = \tilde{x} - Y y. \qquad (11)$$

For a strictly diagonally dominant TDMA whose diagonal elements at the $j$-th row satisfy

$$|b_j| > |a_j| + |c_j|,$$

the off-diagonal elements of the $Z$ matrix, $v^{(i)}_m$ and $w^{(i)}_1$, converge to zero when $n \gg p$. Then the matrix $Z$ can be approximated as a block-diagonal matrix with a $2 \times 2$ block on each GPU.
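The algebra in Eqs. (3)-(8) can be verified directly. The following sketch (serial, dense NumPy, purely for illustration; the function and variable names are chosen here and do not come from the solver, and the column ordering of V and E may differ from the display in [11]) builds the split matrices for a strictly diagonally dominant system and reproduces the exact solution of Ax = d:

```python
import numpy as np

def partition_solve(a, b, c, d, p):
    """Solve the tridiagonal system [a_j, b_j, c_j] x = d through the
    splitting A = A_tilde + V E^T and Eqs. (3)-(8); dense and serial."""
    n = len(b)
    m = n // p
    A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
    # V E^T carries the corner elements dropped at every block interface.
    V = np.zeros((n, 2 * (p - 1)))
    E = np.zeros((n, 2 * (p - 1)))
    for i in range(p - 1):
        j = (i + 1) * m                  # first row of block i+1 (0-based)
        V[j - 1, 2 * i] = c[j - 1]       # super-diagonal corner of block i
        E[j, 2 * i] = 1.0
        V[j, 2 * i + 1] = a[j]           # sub-diagonal corner of block i+1
        E[j - 1, 2 * i + 1] = 1.0
    A_t = A - V @ E.T                    # block-tridiagonal part, local per GPU
    x_t = np.linalg.solve(A_t, d)        # Eq. (5), independent per block
    Y = np.linalg.solve(A_t, V)          # Eq. (6), independent per block
    Z = np.eye(2 * (p - 1)) + E.T @ Y    # reduced 2(p-1) x 2(p-1) system
    y = np.linalg.solve(Z, E.T @ x_t)    # Eq. (7)
    return x_t - Y @ y                   # Eq. (8)

rng = np.random.default_rng(0)
n, p = 12, 4                             # the example above: n = 12, p = 4
a, c = -rng.random(n), -rng.random(n)
b = np.abs(a) + np.abs(c) + 1.0          # strictly diagonally dominant
d = rng.random(n)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
print(np.allclose(partition_solve(a, b, c, d, p), np.linalg.solve(A, d)))  # True
```

The PDD approximation goes one step further and drops the couplings between different interfaces in Z, keeping only a 2 × 2 block per interface; this removes the need for global communication and, as discussed next, is accurate only when the system is strictly diagonally dominant and n ≫ p.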
PDD method for the momentum equation

For the present study, the PDD method is used to solve the momentum equation along the wall-normal $y$-direction. This is possible because the tridiagonal matrices resulting from the ADI method have a strictly diagonally dominant property such that

$$|b_j| = |a_j| + |c_j| + 1.$$

This ensures that the solution of the momentum equation from the PDD method matches the exact solution within machine accuracy.

Note that fine-grained parallelism is essential when using this method on GPUs. The PDD method establishes a scalable domain decomposition at the coarse level, but its performance depends on how the tridiagonal systems of Eqs. (5) and (6) are solved. To do so, we utilize the 4-level parallelism used in [4] and extend it up to 5 levels by batching Eq. (6). A hybrid Cyclic Reduction (CR) + Parallel Cyclic Reduction (PCR) algorithm [12] is used, which is provided in the cuSPARSE library as cusparseDgtsv_nopivot [9]. Details are described in Algorithm 1.
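To see where this property comes from, consider as a sketch the interior rows of the implicit $y$-sweep of the ADI factorization, $\left(1 - \frac{\alpha_k \Delta t}{Re}\, \delta_y^2\right)\hat{u} = \mathrm{rhs}$, written here for a uniform spacing $\Delta y$ (the wall-normal clustering in the solver only changes the metric coefficients, not the conclusion):

$$a_j = c_j = -\frac{\alpha_k \Delta t}{Re\, \Delta y^2}, \qquad b_j = 1 + \frac{2\, \alpha_k \Delta t}{Re\, \Delta y^2},$$

so that $|b_j| = |a_j| + |c_j| + 1$ independently of the time-step size, which is exactly the strict diagonal dominance required by the PDD method.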
PPT method for the Poisson equation

For the present study, the PPT method is used to solve the $y$-directional TDMAs of the Poisson equation. Their major diagonal $b$ is made of the off-diagonals $a + c$ plus the modified wavenumbers coming from the half-range cosine transform in the $x$-direction followed by the Fourier transform in the $z$-direction. Thus the matrices may have only a slight diagonal dominance of

$$|b_j| = |a_j| + |c_j| + \epsilon$$

with a small $\epsilon$ depending on the size of the modified wavenumbers. The authors have found that $\epsilon$ may easily fall down to $O(10^{-})\sim O(10^{-})$, for which the PDD method has given inaccurate results.

Similar to the PDD method, it is important to use fine-grained parallelism when solving Eqs. (5) and (9). The methods used to solve the Poisson equation in [4] are employed, in which a parallel tridiagonal solver with diagonal pivoting is used [3]. MPI_ALLGATHER is used to collect the data for configuring the $Z$ matrix in each GPU.
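The sensitivity to the dominance margin can be illustrated with the same partition algebra: keeping only the 2 × 2 interface blocks of Z (the PDD approximation, in the column ordering of the earlier sketch) is exact to machine precision when ε = 1 but degrades as ε shrinks. The sketch below is illustrative only, using constant Poisson-like coefficients in serial NumPy; none of the names come from the solver:

```python
import numpy as np

def pdd_relative_error(eps, n=512, p=4):
    """Relative error of the PDD approximation (block-diagonal Z) for a
    constant-coefficient tridiagonal system with |b| = |a| + |c| + eps."""
    m = n // p
    a = c = -np.ones(n)
    b = np.abs(a) + np.abs(c) + eps
    d = np.sin(np.arange(n))                     # arbitrary right-hand side
    A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
    V = np.zeros((n, 2 * (p - 1)))
    E = np.zeros((n, 2 * (p - 1)))
    for i in range(p - 1):
        j = (i + 1) * m
        V[j - 1, 2 * i], E[j, 2 * i] = c[j - 1], 1.0
        V[j, 2 * i + 1], E[j - 1, 2 * i + 1] = a[j], 1.0
    A_t = A - V @ E.T
    x_t = np.linalg.solve(A_t, d)
    Y = np.linalg.solve(A_t, V)
    Z = np.eye(2 * (p - 1)) + E.T @ Y
    Z_pdd = np.zeros_like(Z)                     # PDD: drop inter-interface coupling
    for i in range(p - 1):
        Z_pdd[2*i:2*i+2, 2*i:2*i+2] = Z[2*i:2*i+2, 2*i:2*i+2]
    x_pdd = x_t - Y @ np.linalg.solve(Z_pdd, E.T @ x_t)
    x_ref = np.linalg.solve(A, d)
    return np.linalg.norm(x_pdd - x_ref) / np.linalg.norm(x_ref)

for eps in (1.0, 1e-2, 1e-6):
    print(eps, pdd_relative_error(eps))
# The error sits at machine precision for eps = 1 and grows as eps decreases.
```

This is why the PPT method, which solves the reduced Z system exactly at the cost of an all-gather, is used for the Poisson equation, while the cheaper PDD method is reserved for the momentum equation.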
4. Performance results
Numerical experiments are conducted to evaluate the scalability of the present multi-GPU solver. The GPU code runs on an IBM Power System S822LC for High Performance Computing. This server has two octa-core POWER8 CPUs and four Tesla P100 GPUs with NVLink interconnect. The code is compiled with the -O2 optimization of the PGI Fortran Compiler version 18.4. Performance is tested in simulations of a flat-plate boundary layer whose boundary conditions are given in Section 2.

Scaling of the main components of the semi-implicit fractional-step method is shown for 4 GPUs in Fig. 5. Speedup has been measured on 4096 × … × 128 = 134M cells. In the $y$-directional domain decomposition, the right-hand side of the momentum equation (RHS), the ADI method in the $x$-direction (ADI-X) and the $z$-direction (ADI-Z), and the FFT in the $x$ and $z$ directions are computed independently on each GPU without communication. Thus, strong scaling has been obtained as expected. The more interesting part is the performance of the ADI method in the $y$-direction (ADI-Y) and the inversion of the complex-numbered TDMA of the Poisson equation (TDMA-C), for which the PDD and PPT methods are applied, respectively. Thanks to the small communication cost of the PDD method, ADI-Y scales very well on multiple GPUs. Given that the ADI method is the main bottleneck of the present fractional-step method, the PDD method has drastically increased the overall scalability of the solver. On the contrary, the TDMAs of the Poisson equation show weak scaling. This is attributed to the all-gather communication of the PPT method, which is shown to take up more than half of the total time taken to invert the TDMAs (Fig. 5(b)). However, note that this communication cost represents 10% of the total time, which is much less than the cost required for global transpose, which usually amounts to 30% or more.

Wall-clock times for various grid sizes using four GPUs are shown in Fig. 6, with specific values given in Table 1. A turbulent flat-plate boundary layer at inlet $Re_\theta = 1000$ has been simulated (Fig. 7). For a fixed CFL = 1.0, the average time-step size was 0.022, and it took roughly 2 days to advance a flow-through time.
5. Conclusion
A multi-GPU solver using the semi-implicit fractional-step method is developed for DNS of wall-bounded incompressible flow. Global transpose required for extending the ADI and Fourier-transform-based direct methods to multiple GPUs is found to be impractical. A one-dimensional domain decomposition in the wall-normal $y$-direction is proposed, which allows us to compute the FFT and the ADI method in the $x$ and $z$ directions locally on each GPU without communication. Systems of $y$-directional TDMAs distributed across multiple GPUs are solved by implementing the PDD and PPT methods in a GPU-friendly way. An algorithm for maximizing the GPU workload is provided, which combines the coarse-grained parallelism of the PDD method and the fine-grained parallelism of individual TDMAs. The momentum equation with the PDD method shows strong scaling while the Poisson equation with the PPT method shows weak scaling. An overall efficiency of 0.89 is obtained for 4 GPUs. A turbulent flat-plate boundary layer has been simulated on 607M grid points using only 4 P100 GPUs of a single node, which shows a promising potential for large-scale DNS on GPU clusters.
Acknowledgements
This research was supported by the Samsung Research Funding Center of Samsung Electronics (SRFC-TB1703-01) and the National Research Foundation of Korea grant funded by the Korea government (NRF-2017R1E1A1A03070514).

References

[1] Abide, S., Viazzo, S., Raspo, I., Randriamampianina, A., 2018. Higher-order compact scheme for high-performance computing of stratified rotating flows. Computers & Fluids 174, 300–310.

[2] Borrell, G., Sillero, J. A., Jiménez, J., 2013. A code for direct numerical simulation of turbulent boundary layers at high Reynolds numbers in BG/P supercomputers. Computers & Fluids 80, 37–43.

[3] Chang, L.-W., Stratton, J. A., Kim, H.-S., Hwu, W.-M. W., 2012. A scalable, numerically stable, high-performance tridiagonal solver using GPUs. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, p. 27.

[4] Ha, S., Park, J., You, D., 2018. A GPU-accelerated semi-implicit fractional-step method for numerical solutions of incompressible Navier-Stokes equations. Journal of Computational Physics 352, 246–264.

[5] Hahn, S., Je, J., Choi, H., 2002. Direct numerical simulation of turbulent channel flow with permeable walls. Journal of Fluid Mechanics 450, 259–285.

[6] Kim, J., Moin, P., 1985. Application of a fractional-step method to incompressible Navier-Stokes equations. Journal of Computational Physics 59 (2), 308–323.

[7] Lee, M., Malaya, N., Moser, R. D., 2013. Petascale direct numerical simulation of turbulent channel flow on up to 786k cores. In: High Performance Computing, Networking, Storage and Analysis (SC), 2013 International Conference for. IEEE, pp. 1–11.

[8] Lund, T. S., Wu, X., Squires, K. D., 1998. Generation of turbulent inflow data for spatially-developing boundary layer simulations. Journal of Computational Physics 140 (2), 233–258.

[9] NVIDIA Corporation, 2007–2018. CUDA Toolkit Documentation: cuSPARSE. http://docs.nvidia.com/cuda/cusparse.

[10] Ruetsch, G., Fatica, M., 2013. CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming, 2nd Edition. Elsevier.

[11] Sun, X.-H., Sun, H. Z., Ni, L. M., 1989. Parallel algorithms for solution of tridiagonal systems on multicomputers. In: Proceedings of the 3rd International Conference on Supercomputing. ACM, pp. 303–312.

[12] Zhang, Y., Cohen, J., Owens, J. D., May 2010. Fast tridiagonal solvers on the GPU. In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM Press, pp. 127–136.

[13] Zhu, X., Phillips, E., Spandan, V., Donners, J., Ruetsch, G., Romero, J., Ostilla-Mónico, R., Yang, Y., Lohse, D., Verzicco, R., et al., 2018. AFiD-GPU: A versatile Navier–Stokes solver for wall-bounded turbulent flows on GPU clusters. Computer Physics Communications 229, 199–210.
Algorithm 1: The y-directional ADI using the PDD method

Batch size: κ from [4]
LHS diagonals: a, b, c
RHS diagonals: d1, d2, d3   ! from the u-, v-, w-momentum equations
Grid cell size: Nx, Ny, Nz

r = rank;  m = Ny/p;  M = m · Nx · κ
allocate( R(9M) )
R(1 : M) = d1;  R(1+3M : 4M) = d2;  R(1+6M : 7M) = d3
kfin = Nz/κ
for k = 1 to kfin do
    ! Step 1. Configure the matrix V of Eq. (6) in R.
    R(1+M : 2M), R(1+4M : 5M), R(1+7M : 8M) ← configureAcorners(a)
    R(1+2M : 3M), R(1+5M : 6M), R(1+8M : 9M) ← configureCcorners(c)
    ! Step 2. Set MPI boundaries to zero.
    foreach i = 1 : Nx and k = 1 : κ do
        a(j = 1, i, k) = 0;  c(j = m, i, k) = 0
    end
    ! Step 3. Solve for x̃ and Y of Eqs. (5), (6).
    call cusparseDgtsv_nopivot( handle, mNx, κ, a, b, c, R(1 : 3M), mNx )
    call cusparseDgtsv_nopivot( handle, mNx, κ, a, b, c, R(1+3M : 6M), mNx )
    call cusparseDgtsv_nopivot( handle, mNx, κ, a, b, c, R(1+6M : 9M), mNx )
    ! Step 4. Send x̃_1^(r), v_1^(r) of the r-th GPU to the (r−1)-th GPU.
    foreach i = 1 : Nx and k = 1 : κ and u = 1 : 3 do
        sbuf_x1(i,k,u) ← pack elements of R(1 : M), R(1+3M : 4M), R(1+6M : 7M) at (j = 1, i, k)
        sbuf_v1(i,k,u) ← pack elements of R(1+M : 2M), R(1+4M : 5M), R(1+7M : 8M) at (j = 1, i, k)
    end
    cudaStreamSynchronize
    call MPI_SENDRECV( sbuf_x1, 3Nxκ, ... )
    call MPI_SENDRECV( sbuf_v1, 3Nxκ, ... )
    ! Step 5. Compute y, except for the last GPU (r = p−1).
    if r ≠ (p−1) then
        compute y = (y_{r+1}, y_{r+2})^T of Eq. (7) using the formula for the inverse of a 2 × 2 matrix
    end
    ! Step 6. Send y computed in Step 5 from the r-th GPU to the (r+1)-th GPU, similar to Step 4 above.
    ! Step 7. Compute Eq. (8):
    !   x = x̃ − ( v^(r)  w^(r) ) (y_{r−1}, y_{r+1})^T, where y_{−1} = 0 and y_p = 0
end
d1 = R(1 : M);  d2 = R(1+3M : 4M);  d3 = R(1+6M : 7M)
deallocate( R )
Final solution: d1, d2, d3
Table 1: Wall-clock time (sec) measured using four Tesla P100 GPUs. For each grid size, the average computation time is measured for one time-step (three sub-steps). The grid dimension is given as the number of cells in the x, y and z directions. Total numbers of grid points are written in millions.

Grid cell dimension (x × y × z)    Total grid points (M)    Wall-clock time (sec)
256 × … × 256                      16                       0.67
512 × … × 256                      33                       0.98
512 × … × 512                      67                       1.40
768 × … × 512                      101                      2.21
1024 × … × 512                     135                      2.60
1536 × … × 512                     202                      3.52
1024 × … × 512                     303                      4.88
1536 × … × 768                     607                      8.82
Figure 1: Flow configuration of a flat-plate boundary layer [4].

Figure 2: Local and global transposes used to change the data orientation. (a) Change in data orientation when using the ADI method. (b) Change in data orientation when solving the Poisson equation.
Figure 3: Relative cost of global transpose using four GPUs when compared to other parts of the flow solver. Computation time has been measured on 675M grid points.

Figure 4: One-dimensional domain decomposition in the wall-normal y-direction. Each colored block designates a GPU. Computation in the x and z directions can be carried out locally on each GPU as illustrated in the left figure. However, data in the y-direction are scattered across different GPUs as shown in the right figure.
Figure 5: Multi-GPU performance of the present code. (a) Scaling of each component of the Navier-Stokes equations. Speedup results have been measured on 135M grid points. The gold marker shows the scaling of the entire code. (b) Relative importance of each component based on the wall-clock time. Note that TDMA-C, which shows the worst scaling, takes up 18% of the entire solver, and it spends more than half of its time on all-gather communication.
Figure 6: Wall-clock time of a time-step on various grid sizes using four GPUs. Specific values are given in Table 1.

Figure 7: Turbulent boundary layer over a flat plate at inlet $Re_\theta = 1000$. 607M grid points have been computed using four P100 GPUs. Q-criterion is used for visualization.