Multi-GPU Performance Optimization of a CFD Code using OpenACC on Different Platforms
Weicheng Xue∗, Christopher J. Roy
Kevin T. Crofton Department of Aerospace and Ocean Engineering, Virginia Tech, 215 Randolph Hall, Blacksburg, VA 24061, US
Abstract
This paper investigates the multi-GPU performance of a 3D buoyancy driven cavity solver using MPI and OpenACC directives on different platforms. The paper shows that decomposing the total problem in different dimensions affects the strong scaling performance significantly for the GPU. Without proper performance optimizations, 1D domain decomposition is shown to scale poorly on multiple GPUs due to noncontiguous memory access. The performance of all the decompositions benefits from the series of optimizations presented in the paper. Since the buoyancy driven cavity code is latency-bound on the clusters examined, a series of optimizations, both agnostic and tailored to the platforms, are designed to reduce the latency cost and improve memory throughput between hosts and devices efficiently. First, the parallel message packing/unpacking strategy developed for noncontiguous data movement between hosts and devices improves the overall performance by about a factor of 2. Second, transferring different amounts of data based on the stencil sizes of different variables further reduces the communication overhead. These two optimizations are general enough to benefit stencil computations with ghost exchanges on all of the clusters tested. Third, GPUDirect is used to improve communication on clusters which have the hardware and software support for direct communication between GPUs without staging through the CPU's memory. Finally, overlapping communication and computation is shown to be inefficient on multiple GPUs when using only MPI or MPI+OpenACC. Although we believe our implementation exposes enough overlap, the actual runs do not exploit the overlap well due to a lack of asynchronous progression.

∗ Corresponding author.
Email addresses: [email protected] (Weicheng Xue), [email protected] (Christopher J. Roy). Weicheng Xue is a Graduate Assistant and AIAA Student Member; Christopher J. Roy is an AIAA Associate Fellow.

Keywords:
Multi-GPU, OpenACC, MPI, Domain Decomposition, Performance Optimization, GPUDirect

1. Introduction
Computational Fluid Dynamics (CFD) is a method used to solve physical problems in the field of fluids, and it usually requires a large amount of computation. In order to obtain more accurate numerical solutions for challenging problems, researchers use very fine meshes or high-order schemes, which require better resources such as larger memory and faster processors. However, memory cannot be infinitely large and a processor cannot hold an infinite number of transistors. These limitations may require codes to be written in a data parallel way, i.e., decomposing a big problem into small pieces and distributing these small problems to multi/many-core processors or even accelerators such as GPUs. Applying high performance computing (HPC) in the CFD area [1] is necessary and has aroused CFD researchers' interest.

On multicore/manycore CPU systems or their equivalents, there are three common paradigms for HPC: OpenMP, MPI, and hybrid MPI+OpenMP. OpenMP [2] is designed for shared memory systems so that data can be shared among all threads, but this comes with the risk of race conditions if multiple threads modify the same data. Also, scaling across multiple nodes or sockets may be poor on distributed memory systems (which are more commonly used than shared memory systems). MPI [3] is a message passing standard designed for various platforms, including shared and distributed memory architectures. Data is moved between processors through sending and receiving messages. However, programming with MPI is more complicated than with OpenMP, as it requires extra care to decompose the problem well and implement efficient communication. Communication may become a significant bottleneck when a code is scaled up to many processors. Hybrid MPI+OpenMP methods [4] have therefore been proposed to combine the advantages of MPI and OpenMP. The hybrid model is a good match for modern multicore/manycore architectures, as it can be programmed efficiently using two levels of parallelism: MPI for the inter-node/socket communication and OpenMP for the intra-node/socket computation and communication. However, hybrid MPI+OpenMP cannot be easily used on GPUs directly, due to a lack of full compiler support for OpenMP 4.0 (or later).

In addition to CPU computing, GPU computing has been gaining a lot of interest [5]. This attention is because of the GPU's high compute capability and memory bandwidth as well as its low power consumption. A single GPU has thousands of threads, so numerous threads can execute the same instruction simultaneously on multiple data points, known as single instruction multiple threads (SIMT). When executing a program on the GPU, the highly compute-expensive portion of the program is offloaded to the GPU, and numerous GPU threads then execute the code simultaneously to achieve a high speedup. File I/O, branch control, printout, etc. remain on the CPU, since the CPU has the flexibility to perform these tasks while GPUs are not as efficient for them (if they are even possible). One important thing to mention here is that although GPUs provide higher memory bandwidth than CPUs, different memory access patterns may significantly affect the actual memory throughput [6], which should be considered carefully.

Three language extensions/libraries are widely applied to port codes to GPUs [7]: OpenCL, CUDA and OpenACC. OpenCL and CUDA are C/C++ with some extensions, while OpenACC is a compiler directive-based interface, similar to OpenMP.
OpenCL and CUDA are low level programming models, so they require users to have some background in computer architecture. Also, programming with OpenCL or CUDA is laborious, as users need to rewrite and tune their codes for each new GPU. CUDA has strong support from NVIDIA but is not compatible with other GPUs such as those from AMD. OpenCL is open source, but we found that it has not been commonly used in real-world GPU-accelerated CFD codes, possibly due to the high complexity of programming and the lack of a good ecosystem. Different from OpenCL and CUDA, OpenACC is a high level programming model that enables users to accelerate their CFD codes more readily on various GPUs without completely rewriting their legacy codes. Programmers using OpenACC [8] can be somewhat agnostic about the GPU architecture compared to OpenCL and CUDA, because compilers such as those developed by PGI (acquired by NVIDIA) can hide many details and decide how to optimize the code (although the result may not be optimal). Also, because of its directive-based design and good support for portability, OpenACC can be much easier to use on various platforms than CUDA. We will also show this benefit, as little code modification is needed across platforms. However, to gain good performance across different platforms, the features of the architecture and some low level optimizations should be taken into account. Apart from the options mentioned, OpenMP may become a viable choice for the GPU once compiler development catches up. Because OpenACC provides an easy way of programming [9], good portability across platforms [10] and good parallel performance if optimized sufficiently [11], OpenACC was applied to port our CFD code to the GPU.

OpenACC has already been used for various GPU-accelerated CFD codes and related applications. Gong et al. [12] presented an optimized OpenACC version of NekBone, a core kernel of the incompressible Navier-Stokes solver Nek5000, based on their group's prior work. They ported the optimized code to multiple GPU systems and obtained a parallel efficiency of 79.9% on 1024 GPUs. However, the code they worked on is just a kernel, not a complete CFD code. Hoshino et al. [11] found that although OpenACC is 50% slower than CUDA for a naive implementation, the gap can be decreased to only 2% after careful manual optimizations. They also pointed out some intrinsic deficiencies of OpenACC, such as a lack of an interface for shared memory and inter-thread communication. Searles et al. [13] studied a wavefront-based mini-application for a production code for nuclear reactor modeling. It is interesting and rare to see, as in their work, that the OpenACC implementation is even slightly faster than CUDA. Their work mainly focused on exploring complex parallel patterns in their code and exploring the scalability using MPI across different platforms. In summary, OpenACC is easy to use and also good for portability across different platforms; however, to obtain good performance, careful optimizations tailored to the application may need to be designed.

To assess the performance of a code accelerated by the CPU or GPU, weak scaling and strong scaling performance are often measured. The major difference between the two scalings is whether one keeps the total problem size fixed (strong scaling) or the problem size per processor fixed (weak scaling) while adding more processors.
Obviously, maintaining strong scaling is more challenging. Commonly, both scalings are investigated to cover different situations, such as solving a fixed problem as fast as possible (strong scaling) or solving as big a problem as possible (weak scaling). In both situations we want to make full use of all compute nodes or resources to gain the maximum speedup. In the CFD area, we are more interested in the weak scaling performance, as we hope to solve larger and more complex problems faster when more compute resources are available. However, there are also many occasions in which small problems need to be solved, if the requirement for numerical accuracy is not high. Therefore, both the weak scaling and strong scaling performance are measured in this paper.

Prior to the work presented in this paper, Pickering et al. [14] examined the process of applying OpenACC to a 2D CFD code using both single precision and double precision. They also applied OpenMP's fork/join execution model to scale the performance up to 4 NVIDIA C2070 GPUs with a strong scaling efficiency of 95%. Instead of using the OpenMP+OpenACC model, Baghapour et al. [15] switched to the MPI+OpenACC model and scaled a 3D CFD code well up to both 32 CPUs and 32 GPUs on a distributed cluster. In their work, they used 1D domain decomposition to distribute the load to different processors and increased the grid size in only one dimension for their weak scaling tests. Xue et al. [16] compared multi-CPU/GPU performance using 3D, 2D and 1D decompositions and gave a preliminary analysis of their differences. They also designed two performance optimizations, including a pack/unpack method for data exchange between hosts and devices, which was shown to improve the performance using 3D decomposition greatly on a platform with old GPUs (NVIDIA C2070). In the current paper, the effects of multiple factors such as the platform architecture, decomposition method, pack/unpack, strong vs. weak scaling, and GPUDirect will be investigated. Also, the limitations of overlapping communication and computation when applying MPI or MPI+OpenACC are presented.
2. CFD Code: Buoyancy Driven Cavity Solver
The 3D Buoyancy Driven Cavity (BDC) problem has a cubic domain in which one vertical wall and its opposing wall are held at different temperatures, while the horizontal walls are adiabatic. A gravitational force acts on the air in the cavity. The heat flux caused by the temperature difference leads to small density changes in the fluid (Boussinesq approximation), and the buoyancy effect (density change) causes the fluid to convect in the cavity.

The CFD code, written in Fortran 2003/2008, solves the classic 3D BDC problem [15], which is governed by the 3D incompressible Navier-Stokes equations. An artificial compressibility method developed by Chorin [17] is used; the artificial compressibility makes the system of equations hyperbolic, so that the steady state solution can be obtained through time marching. The CFD code uses a first order Euler explicit scheme for temporal discretization, and a second-order central-difference scheme with an artificial dissipation term for spatial discretization. The artificial dissipation term is applied to the continuity equation to alleviate odd-even decoupling. The numerical damping term is based on the fourth derivative of pressure and is discretized using a second-order central finite difference scheme. The system of governing equations can be written as

\frac{1}{\beta}\frac{\partial p}{\partial t} + \rho\frac{\partial V_i}{\partial x_i} = \epsilon_i \frac{\partial^2 p}{\partial x_i^2}    (1)

\frac{\partial V_j}{\partial t} + V_i\frac{\partial V_j}{\partial x_i} = -\frac{1}{\rho}\frac{\partial p}{\partial x_j} + \nu\frac{\partial^2 V_j}{\partial x_i^2} + \sigma (T - T_\infty) g_j    (2)

\frac{\partial T}{\partial t} + V_i\frac{\partial T}{\partial x_i} = \alpha\frac{\partial^2 T}{\partial x_i^2}    (3)

where \beta is an artificial compressibility parameter calculated using the local velocity magnitude along with a user-defined reference velocity, \epsilon_i are numerical dissipation coefficients controlling stability, \nu is the kinematic viscosity of the fluid, and \alpha is the thermal dissipation rate.

In this paper, the size of the cavity is 0.05 m in all three dimensions. Pressure is extrapolated to the ghost cells at adiabatic walls using a one-sided second order scheme. Temperature is similarly extrapolated to the horizontal wall ghost cells using a second order scheme. Pressure is rescaled at the center point of the cavity in every iteration. The meshes used for the BDC code are uniform, and their sizes range from 32^3 to 1024^3. The Rayleigh number is set to 100,000 for the convection problem. Most of the constant settings affecting the flow do not affect the parallel performance.
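To make the time marching concrete, the sketch below advances the temperature equation, Eq. (3), by one explicit Euler step with second-order central differences on a uniform grid, dropping the convection term for brevity. It is a minimal illustration with assumed names (T, Tnew, n, dx, dt, alpha), not the BDC solver's actual update routine, which advances all of Eqs. (1)-(3) including the dissipation terms.

! Sketch: one explicit Euler step of dT/dt = alpha * laplacian(T)
! on a uniform n^3 grid (boundary values are left untouched).
subroutine euler_step(T, Tnew, n, dx, dt, alpha)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: T(n,n,n), dx, dt, alpha
  real(8), intent(inout) :: Tnew(n,n,n)
  integer :: i, j, k
  do k = 2, n-1
    do j = 2, n-1
      do i = 2, n-1   ! i innermost: unit-stride access in Fortran
        Tnew(i,j,k) = T(i,j,k) + dt*alpha/(dx*dx) * &
          ( T(i+1,j,k) + T(i-1,j,k) + T(i,j+1,k) + T(i,j-1,k) &
          + T(i,j,k+1) + T(i,j,k-1) - 6.0d0*T(i,j,k) )
      end do
    end do
  end do
end subroutine euler_step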
3. Implementation
In the BDC code studied here, since we use the fourth-derivative pressure dissipation term and a second-order central-difference scheme for all spatial derivatives, the numerical stencil size for the pressure is 5 and for the other primitive variables is 3. This can be utilized to design an optimization that reduces the data exchanged across processors (Optimized V2, which will be introduced later). Fig. 1 shows the stencils for a node in the computational domain. The iterative residual calculation in the explicit CFD code is intrinsically a kind of stencil computation. For each node, the data in its two stencils are needed to compute and fill in the residual array, which is later used to update the primitive variables. There is a nested loop over all nodes in the spatial domain, and a time step loop containing all stencil computations to iterate in the pseudo time domain. When programming, data locality should be considered to use the cache more efficiently. For a 3D array A(i,j,k) in Fortran, i is set to be incremented fastest (the innermost loop) and k the slowest (the outermost loop). This layout is also good for the GPU, as the GPU prefers a coalesced memory access pattern in which contiguous threads in a thread block operate on consecutive memory locations. It should be noted that the index directions (i, j and k) are aligned with the spatial directions (x, y and z) in this problem, so the use of (i,j,k) is interchangeable with the use of (x,y,z) in this paper.

Figure 1: Stencil (black+red: velocity and temperature stencil; blue+black+red: pressure stencil)
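A minimal sketch of such a stencil loop with this layout follows; the identifiers (soln, residual, i_nodes, j_nodes, k_nodes) are illustrative rather than the code's actual names.

! Sketch: residual-style stencil loop with i innermost (unit stride in
! Fortran), which also yields coalesced accesses on the GPU.
!$acc parallel loop collapse(3) present(soln, residual)
do k = 2, k_nodes - 1
  do j = 2, j_nodes - 1
    do i = 2, i_nodes - 1
      residual(i,j,k) = soln(i-1,j,k) + soln(i+1,j,k) &
                      + soln(i,j-1,k) + soln(i,j+1,k) &
                      + soln(i,j,k-1) + soln(i,j,k+1) &
                      - 6.0d0*soln(i,j,k)
    end do
  end do
end do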
Many methods can be used to decompose a computational domain, such as structured partitioning [18] and graph partitioning [19]. For a CFD problem with a single-block structured grid, such as the BDC problem running on pure CPUs or pure GPUs, there are three structured options: 1D decomposition, 2D decomposition and 3D decomposition. Which decomposition performs best depends greatly on the application and computer architecture. On one hand, the surface-to-volume ratio determines the total size of data transferred between processors and the total number of ghost cells, and 3D decomposition has the lowest ratio (fewest ghost cells). On the other hand, the frequency of data transfers between processors can greatly affect the performance, especially when memory bandwidth or latency becomes important, and 1D decomposition has the fewest data transfers. Besides, for stencil computations like ours, 1D or even 2D decomposition may generate slices too thin for the spatial discretization scheme to remain valid when scaling up to a large number of processors (strong scaling). Thus, it is worthwhile to investigate the effect of various domain decompositions on different platforms. Xue et al. [16] showed that 2D decomposition scales better than 1D and 3D decomposition up to 32 CPUs on one platform, and that 3D decomposition can outperform 1D decomposition if optimizations are applied. However, they did not show a comparison between 3D decomposition and 2D decomposition. Also, they only tested the code on a single platform with old GPUs. An example of the 3D decomposition adopted in this paper is shown in Fig. 2. Each processor is given a decomposed block, with each cutting face contacting a neighbour block. Ghost nodes are used to store decomposed boundary information transferred from neighbours. 1D and 2D decompositions are similar to 3D decomposition but have fewer decomposed directions and fewer decomposed boundaries.
Figure 2: 3D domain decomposition
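The tradeoff between surface-to-volume ratio and neighbor count can be made concrete with a little arithmetic. The sketch below estimates, for an N^3 grid split evenly over np processors, the per-block neighbor count and ghost-node count for the three decompositions. It is an illustrative estimate only (interior blocks, exact integer factorizations, one ghost layer per face), not code from the BDC solver.

! Sketch: per-block neighbor and ghost-node estimates for 1D/2D/3D
! decompositions of an N^3 grid over np processors.
program ghost_estimate
  implicit none
  integer :: N, np, p2d, p3d
  N   = 512
  np  = 64
  p2d = nint(sqrt(real(np)))          ! per-dimension factor, 2D split
  p3d = nint(real(np)**(1.0/3.0))     ! per-dimension factor, 3D split
  ! 1D: 2 neighbors, each shared face is a full N x N plane
  print *, '1D: 2 neighbors, ghost nodes/block =', 2*N*N
  ! 2D: 4 neighbors, faces of size N x (N/p2d)
  print *, '2D: 4 neighbors, ghost nodes/block =', 4*N*(N/p2d)
  ! 3D: 6 neighbors, faces of size (N/p3d) x (N/p3d)
  print *, '3D: 6 neighbors, ghost nodes/block =', 6*(N/p3d)*(N/p3d)
end program ghost_estimate

For N = 512 and np = 64 this gives 524288, 131072 and 98304 ghost nodes per block for 1D, 2D and 3D, respectively: 3D moves the least data but pays for the most messages.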
For a given number of processors, there may be many decomposition combinations for either 1D, 2D or 3D decomposition. In general, we try to decompose the domain evenly in all the available dimensions, but decompose more in the slowest-stride index direction to preserve more contiguous data after decomposition. This method is designed to 1) divide the domain among the available dimensions as evenly as possible, 2) utilize enough processors, and 3) decompose with a priority along the k dimension first, then j, and finally i, as Fortran is a column-major language.

3.3. Hardware Configuration

HokieSpeed. Although decommissioned in June 2017, HokieSpeed [20] was a cluster at Virginia Tech that previously appeared on the Green500 list. HokieSpeed had 204 nodes using a quad data rate InfiniBand interconnect. Each node was outfitted with 24GB memory, two six-core Xeon E5645 CPUs and two NVIDIA M2050/C2050 GPUs. Every GPU had 14 multiprocessors (MPs) and 3GB memory. The peak bandwidth to the 3GB global memory was 148.4GB/s. Every MP had 32 CUDA cores, 48KB shared memory and 16KB L1 cache. All access to the global memory went through the L2 cache of size 512KB. The peak double precision performance was 513 GFLOPS. The compilers used on HokieSpeed were PGI 15.7 and Open MPI 1.10.0. A compiler optimization level of -O4 was used.
NewRiver.
NewRiver [21] is a cluster at Virginia Tech (VT). It has 39 GPU nodes shared by the whole VT community. On NewRiver, each of these nodes is equipped with two Intel Xeon E5-2680v4 (Broadwell) 2.4GHz CPUs (28 cores/node in all), 512 GB memory, and two NVIDIA P100 GPUs. Each NVIDIA P100 GPU is capable of up to a theoretical 4.7 TeraFLOPS of double-precision performance. The NVIDIA P100 GPU offers much higher GFLOPS and memory bandwidth than the NVIDIA C2050 GPU on HokieSpeed. The modules used on NewRiver are PGI 17.5, CUDA 8.0.61 and Open MPI 2.0.0 or MVAPICH2-GDR 2.3b. It should be mentioned that MVAPICH2-GDR 2.3b is a CUDA-aware MPI wrapper compiler which supports GPUDirect (if this feature is turned on). A compiler optimization level of -O4 is used. The maximum number of nodes which can be used is 12, but only 8 are used in this paper. Thus, the maximum number of GPUs used on the NewRiver cluster is 16 (with either Open MPI 2.0.0 or MVAPICH2-GDR 2.3b). If not specified, Open MPI 2.0.0 is used.
Cascades.
Cascades [22] is another cluster at VT. It has 40 GPU nodes shared by the whole VT community. On Cascades, each of these nodes is equipped with two Intel Skylake Xeon Gold 3 GHz CPUs (24 cores/node in all), 768 GB memory, and two NVIDIA V100 GPUs. Each NVIDIA V100 GPU is capable of up to 7.8 TeraFLOPS of double precision performance, which is 66% higher than the P100 GPU on the NewRiver cluster. The NVIDIA V100 GPU offers the highest GFLOPS and memory bandwidth among the GPUs we used. The modules used on Cascades are PGI 18.1, CUDA 8.0.61 and Open MPI 3.0.0 or MVAPICH2-GDR 2.3b. A compiler optimization level of -O4 is used. Similarly, the maximum number of GPUs used on the Cascades cluster is 16 when using Open MPI 3.0.0. However, when switching to MVAPICH2-GDR 2.3b, since Cascades uses "srun" to run MPI programs instead of "mpirun_rsh" (as Slurm is used on Cascades), the maximum number of GPUs which can be used is only 8 (using more would cause the efficiency to drop to about 1%, which is not reasonable and is caused by some unknown issue related to Slurm). If not specified, Open MPI 3.0.0 is used.
4. Results
Before presenting any performance results, the first indispensable step for any parallelization or optimization is to quantify the solution difference between the serial solution and the parallel solution. An iteratively converged double precision solution using 32 GPUs on HokieSpeed for a 256^3 mesh is given in Fig. 3. As can be seen, the solution is smooth everywhere, and the relative error compared to the serial solution is at the level of round-off (see also Section V of Ref. [16]). The small difference is due to round-off error accumulation on the GPUs. For all the optimizations presented in this paper, parallel solution correctness is always verified.

Figure 3: 3D BDC solution. (a) 3D pressure contour; (b) 2D temperature contour at y = 0.025 m

Two basic metrics used in this paper are parallel speedup and efficiency. Speedup denotes how much faster the parallel version is compared to the serial version of the code, while efficiency represents how efficiently the processors are used. They are defined as follows,

speedup = \frac{t_{serial}}{t_{parallel}}    (4)

efficiency = \frac{speedup}{np}    (5)

where np is the number of processors (CPUs or GPUs).

In order for the performance of the code to be measured and compared well on different platforms and for different problem sizes, the wall clock time per iteration step is converted to a ssspnt (scaled size steps per np time) value, defined in Eq. (6). This metric has several advantages. First, GFLOPS requires knowing the number of operations, while ssspnt does not; in most codes, counting the operations is difficult. Second, an efficiency comparison across different platforms is not intuitive, as how fast the program runs remains unknown, while ssspnt directly conveys the absolute speed. The same benefit applies to comparing strong and weak scaling performance: a linearly scaling problem has a constant ssspnt value, though the values can differ between strong and weak scaling. Using ssspnt, different problems, platforms, and strong and weak scaling performance can be compared directly. In this paper, when gathering performance results, unnecessary I/O (writing out solutions to files and writing residual norms to the screen) is turned off, which is common practice when testing performance in the literature.

ssspnt = \frac{s \cdot size \times steps}{np \times time}    (6)

where s is a scaling factor which scales the smallest platform ssspnt into the range [0,1] (the same value of s is used for all test cases), size is the problem size, steps is the number of iteration steps, and time is the program wall clock time for those steps.

In the weak scaling analysis, the problem size needs to increase accordingly as the number of processors increases. However, the way the problem scales can vary. Since the BDC problem is 3D, the problem can grow in either 1D, 2D or 3D; therefore, we investigate the effect of how the problem size grows on the weak scaling performance. Two types of grid growth for the weak scaling are applied, shown in Table 1 and Table 2. Table 1 grows the problem size in exactly the same way as the number of processors increases, no matter whether the code uses 3D, 2D or 1D decomposition, while Table 2 scales the problem size in accordance with the way the number of processors grows. For example, with 8 processors under growth type 2, the problem size for 3D decomposition is 512 × 512 × 512 on a (2,2,2) processor grid, while for 1D decomposition it is 256 × 256 × 2048 or 2048 × 256 × 256 (as the processor dims can be either (1,1,8) or (8,1,1)).

Table 1: Grid growth type 1

Problem size       3D decomposition   2D decomposition   1D decomposition
(256,256,256)      (1,1,1)            (1,1,1)            (1,1,1)
(256,256,512)      (1,1,2)            (1,1,2)            (1,1,2) or (2,1,1)
(256,512,512)      (1,2,2)            (1,2,2)            (1,1,4) or (4,1,1)
(512,512,512)      (2,2,2)            (1,2,4)            (1,1,8) or (8,1,1)
(512,512,1024)     (2,2,4)            (1,4,4)            (1,1,16) or (16,1,1)
(512,1024,1024)    (2,4,4)            (1,4,8)            (1,1,32) or (32,1,1)

Table 2: Grid growth type 2
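Before turning to the scaling results, note that Eq. (6) amounts to the small routine below; a sketch with illustrative names, in which s is the user-chosen scaling factor described above.

! Sketch: compute ssspnt (Eq. 6) from a timed run.
real(8) function ssspnt(s, size, steps, np, time)
  implicit none
  real(8), intent(in)    :: s, time
  integer(8), intent(in) :: size, steps
  integer, intent(in)    :: np
  ssspnt = s * real(size,8) * real(steps,8) / (real(np,8) * time)
end function ssspnt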
A systematic multi-CPU scaling performance test was performed on all three clusters mentioned earlier. The CPU strong scaling and weak scaling performance using 1D, 2D and 3D decompositions is shown in Fig. 4. Here CPUs are added by sockets, i.e., a certain number of sockets is added each time (using only one CPU in each socket, similar to adding GPUs, as every socket only has one GPU), or equivalently, the processors per node (ppn) is set to 2. Fig. 4 highlights a tradeoff that exists between the different domain decomposition techniques. Choosing a domain decomposition scheme is a balance between maintaining a small surface-to-volume ratio of the subdomains and minimizing the number of neighbors of each subdomain. From Fig. 4, 1D decomposition generally performs the worst for both the strong and weak scaling on NewRiver and Cascades. 1D decomposition has only 2 neighbors, but it has the highest surface-to-volume ratio, meaning that there is a lot of data to transfer between the blocks. 2D and 3D decomposition perform the best for the strong scaling and weak scaling, respectively. 3D decomposition has the smallest surface-to-volume ratio but has 6 neighbors, meaning it has to perform 6 communications, each with its own overhead. The weak scaling curves are flatter than the strong scaling curves, which is reasonable as each CPU keeps more work to do. It is also found that decomposing in the x dimension should be avoided for the CPU in strong scaling, as the performance deteriorates faster compared to other decompositions, which is especially obvious on the Cascades cluster.

(a) CPU strong scaling (b) CPU weak scaling
Figure 4: Multi-CPU scaling using different decompositions

Also, Fig. 4 shows that super-linear scaling occurs on the NewRiver cluster. To investigate the super-linear phenomenon, the ppn was varied; Fig. 5 shows the effect of ppn on the performance. For both the strong and weak scaling performance shown in Fig. 4, since the ppn is 2, the number of CPUs is increased along with other resources such as memory and memory bandwidth (as the number of sockets increases), which may reduce the communication overhead and latency cost. To test this hypothesis, we also set ppn higher than 2 (using fewer nodes) and did not observe the super-linear scaling. Also, no matter what the ppn is, the Cascades cluster does not show super-linear scaling. It can be concluded that the super-linear phenomenon depends highly on the platform communication system and the implementation.

(a) CPU strong scaling (b) CPU weak scaling
Figure 5: The effect of ppn (3D decomposition)
Although the multi-CPU implementation is not the focus of this paper, we are interested in the CPU performance comparison between different platforms, which is shown in Fig. 6. All the results here use 3D domain decomposition. It can be seen that the platform affects the performance the most, not the number of CPUs or whether the scaling is strong or weak. The performance on Cascades is about 1.245 times that on NewRiver, which is very close to the clock rate ratio of 1.25 (3GHz/2.4GHz). For the CPU scaling, 3D domain decomposition maintains the efficiency very well, so it is recommended.

Figure 6: Multi-CPU performance comparison across platforms (3D domain decomposition)
The focus of this paper is the multi-GPU implementation and its performance optimizations. There are multiple versions of the GPU-accelerated CFD code in this subsection: the first version is a baseline code, which can be regarded as a naive GPU implementation based on the CPU code, and the other versions are incremental optimizations, each building on the previous version. By applying these optimizations, the achieved memory bandwidth can be improved greatly and the communication overhead can be reduced. When showing results later, "1D", "2D" and "3D" in the legends denote 1D, 2D and 3D decompositions, respectively. For 1D decomposition, the letter in the parentheses after "1D" denotes which dimension is decomposed. More details about the different GPU-accelerated versions are given as follows.
Baseline.
The multi-CPU code is directly ported to GPUs by inserting OpenACC directives into the parallel CPU code. This baseline GPU code does not use GPUDirect. Therefore, data on devices need to be updated to/from hosts using "!$acc update" clauses. Asynchronous (async) clauses are used to reduce some synchronization overhead between hosts and devices.
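A minimal sketch of the baseline exchange for one face follows (illustrative names; the right face and the j/k faces are handled analogously). The strided i-slices being updated are noncontiguous in Fortran memory, which is precisely what makes this version slow whenever the decomposition cuts the i direction.

! Sketch: baseline exchange of one i-face with the left neighbor.
! Blocking MPI_Sendrecv is used so the copy-in/copy-out of the strided
! array sections is safe.
subroutine exchange_i_face(soln, j_nodes, k_nodes, left, comm)
  use mpi
  implicit none
  integer, intent(in) :: j_nodes, k_nodes, left, comm
  real(8), intent(inout) :: soln(:,:,:,:)        ! (i, j, k, var)
  integer :: ierr, stat(MPI_STATUS_SIZE), n
  n = 2 * j_nodes * k_nodes * 5                  ! two layers, five variables
  !$acc update host(soln(3:4,:,:,:))             ! strided device-to-host copy
  call MPI_Sendrecv(soln(3:4,:,:,:), n, MPI_DOUBLE_PRECISION, left, 0, &
                    soln(1:2,:,:,:), n, MPI_DOUBLE_PRECISION, left, 0, &
                    comm, stat, ierr)
  !$acc update device(soln(1:2,:,:,:))           ! strided host-to-device copy
end subroutine exchange_i_face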
Optimized V1.
This version improves the achieved memory bandwidth and reduces the latency cost for 3D decomposition, as we found that the achieved bandwidth between the host and the device is very small because of noncontiguous data transfers. As Fortran is a column-major language, the first index i of an array A(i,j,k) changes fastest. If any decomposition exists in the i index direction (3D decomposition, or 1D decomposition in i), it generates chunks of data (j-k planes) which are highly noncontiguous. This optimization therefore targets the issue by gathering the noncontiguous data into a temporary contiguous array in parallel and updating this temporary array between hosts and devices using OpenACC update clauses. Temporary arrays are created only if a decomposition exists in the fastest-changing direction (the i index direction), as decomposition in the other index directions still generates chunks of contiguous data. A pseudo code of how the buffers are used is given in Listing 1. The procedure can be summarized as follows:

1. Allocate send/recv buffers on devices and hosts, only for boundary cells on i planes and only if decomposition happens in the i dimension, as the noncontiguous data on i planes makes data transfer very slow.

2. Pack the noncontiguous block boundary data into a buffer on the sender's device side. This process can be parallelized using "!$acc loop" clauses and has little overhead. The host buffer is then updated using OpenACC update host clauses.

3. Have the hosts transfer the data through MPI_Isend/MPI_Irecv calls (which are non-blocking), then complete the transfers with MPI_WAITALL.

4. Update the recv buffer on devices using OpenACC update device clauses, and finally unpack the contiguous data stored in the recv buffer back to the noncontiguous memory locations on devices, which can also be parallelized using "!$acc loop" clauses.

Listing 1: A pseudo code of the optimization for noncontiguous data transfer between hosts and devices

!$acc data present(send_buffer(:,:), soln(:,:,:,:))

! Pack send_buffer(:,1) with back boundary data
!$acc parallel loop collapse(4) async(1)
do var = 1, 5
  do k = 1, k_nodes
    do j = 1, j_nodes
      ! Pack up two layers of cells
      do i = 1, 2
        indx = (var-1)*k_nodes*j_nodes*2 + (k-1)*j_nodes*2 + (j-1)*2 + i
        ! Interior cell index starts at 3
        send_buffer(indx,1) = soln(i+2,j,k,var)
      end do
    end do
  end do
end do

! Similar routine to pack send_buffer(:,2) with front boundary data, async(2)

! Update send_buffer on hosts
!$acc update host(send_buffer)
!$acc wait

! Send/Recv between hosts
MPI_IRECV(recv_buffer)
MPI_ISEND(send_buffer)
MPI_WAITALL

! Update recv_buffer on devices
!$acc update device(recv_buffer)
!$acc wait

!$acc data present(recv_buffer(:,:), soln(:,:,:,:))
! Unpack the data in recv_buffer to soln ghost locations
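The device-side unpack in step 4 mirrors the pack loop; a sketch consistent with Listing 1's indexing (not verbatim from the paper), in which the ghost cells received from the neighboring block occupy i = 1, 2:

! Unpack the received contiguous buffer back into noncontiguous ghosts
!$acc parallel loop collapse(4) async(1)
do var = 1, 5
  do k = 1, k_nodes
    do j = 1, j_nodes
      do i = 1, 2
        indx = (var-1)*k_nodes*j_nodes*2 + (k-1)*j_nodes*2 + (j-1)*2 + i
        soln(i,j,k,var) = recv_buffer(indx,1)
      end do
    end do
  end do
end do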
Optimized V2. The use of only one stencil in Optimized V1 makes the communication pattern and implementation simpler, but it may not be efficient.
Optimized V2 is designed to reduce the amount of data exchanged. Since the pressure requires a larger stencil while the other primitive variables do not, we can transfer less data for each variable based on its own stencil size compared to Optimized V1. What is more, more overlap of asynchronous communication can be achieved, as one big loop is split into two asynchronous loops. A pseudo code of this optimization is given in Listing 2. It should be noted that only the changes relative to the previous optimization are emphasized in the new listing.
Listing 2: A pseudo code of stencil-based communication optimization

!$acc data present(send_buffer(:,:), soln(:,:,:,:))

! Pack send_buffer(:,1) with back boundary data
! Move pressure (two layers) into send_buffer(:,1)
!$acc parallel loop collapse(3) async(1)
do k = 1, k_nodes
  do j = 1, j_nodes
    do i = 1, 2
      var = 1   ! pressure only
      indx = (k-1)*j_nodes*2 + (j-1)*2 + i
      send_buffer(indx,1) = soln(i+2,j,k,var)
    end do
  end do
end do

! Update the starting index in send_buffer(:,1)
indx_p = k_nodes*j_nodes*2

! Move velocities & temperature (one layer) into send_buffer(:,1)
!$acc parallel loop collapse(3) async(2)
do var = 2, 5
  do k = 1, k_nodes
    do j = 1, j_nodes
      indx = (var-2)*k_nodes*j_nodes + (k-1)*j_nodes + j
      send_buffer(indx_p+indx,1) = soln(3,j,k,var)
    end do
  end do
end do

! Pack send_buffer(:,2) with front boundary data similarly,
! using async(3) and async(4)

! Update send_buffer on hosts
...
! Send/Recv between hosts
...
! Update recv_buffer on devices
...
!$acc data present(recv_buffer(:,:), soln(:,:,:,:))
! Unpack the data in recv_buffer to soln ghost locations, async(1:4)
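The saving can be quantified with a little arithmetic from the layer counts in the listings: Listing 1 packs 2 layers of each of the 5 variables per face (10 node layers), while Listing 2 packs 2 layers of pressure plus 1 layer of each of the remaining 4 variables (6 node layers), a 40% reduction in face message volume.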
Optimized V3.
In Optimized V1 and Optimized V2, contiguous buffer arrays are created only for the i index direction. However, if a decomposition exists in the j or k index direction, buffer arrays in the j and k directions may also be needed. It should be noted that although real cell data on k boundary faces do not need to be packed into buffers, using buffers on k faces may still be helpful, considering that the ghost cells contained in k boundary planes break the contiguity. We found that on the HokieSpeed cluster, using such arrays improves the performance very little, but on the NewRiver and Cascades clusters the performance can be improved significantly. The procedure for creating the arrays and parallelizing the pack/unpack process in the j and k directions is very similar to that in the i direction, so no pseudo code is shown here; readers can refer to Listing 1.

We will first show the benefit of applying Optimized V1 on the HokieSpeed cluster, i.e., creating the contiguous buffers and parallelizing the pack/unpack process. Fig. 7 shows the weak scaling efficiency of the different GPU code versions introduced earlier.
The Baseline version using 3D decomposition performs poorly; it is severely bandwidth limited due to the noncontiguous data transfers incurred when 3D decomposition is used. Both Baseline 2D and Baseline 1D perform much better than Baseline 3D. Optimized V2 performs better than Optimized V1 because it transfers less data and overlaps asynchronous data transfers better. Similar performance improvements can also be seen on the NewRiver and Cascades clusters, but the baseline scaling performance is not shown in this paper, as the baseline version runs too slowly (it can be regarded as an improper implementation) and its performance assessment is not a focus of this paper.
Figure 7: Multi-GPU scaling performance on the HokieSpeed cluster
From Fig. 7, we know that Optimized V2 performs the best on the HokieSpeed cluster. It should be mentioned that Optimized V2 and Optimized V3 perform equivalently on HokieSpeed. However, NewRiver and Cascades show different preferences. The performance comparison of Optimized V2 and Optimized V3 on NewRiver and Cascades will be shown in Sec. 4.6, as the performance evaluation of different MPI compilers and GPUDirect there includes such a comparison. To avoid redundancy, only the Optimized V3 performance on NewRiver and Cascades is shown here. The GPU strong scaling and weak scaling performance using 1D, 2D and 3D decompositions is shown in Fig. 8. From Fig. 8, 3D decomposition performs the best for the strong scaling, with 2D decomposition following. 1D decomposition in the x or z dimension makes the performance drop quickly, especially on Cascades. Similar to the CPU weak scaling, the weak scaling curves are flatter than the strong scaling curves. Recall that this weak scaling applies grid growth type 2 (Table 2). Since the problem size increases in accordance with the way np grows, 1D decomposition performs the best, as every decomposed block has the fewest neighbours compared to 2D and 3D decomposition.

(a) GPU strong scaling (b) GPU weak scaling
Figure 8: Multi-GPU scaling using different decompositions (Optimized V3)

Fig. 9 shows the performance comparison across platforms. Different from Fig. 6, multiple factors can affect the multi-GPU performance significantly, including the number of processors, the platform, and whether the scaling is strong or weak. When the number of GPUs increases, the efficiency drops significantly for both the strong and weak scaling, but the weak scaling efficiency holds a relatively higher value than the strong scaling. Cascades shows about a 2 times speedup compared to NewRiver.

Figure 9: Multi-GPU performance comparison across platforms (Optimized V3)
To investigate the effect of different grid growth methods on the weak scaling performance, some cases are tested on NewRiver, and Fig. 10 shows the results. Since the two grid growth types are the same when applying 3D decomposition, there is only one curve for 3D decomposition. It can be seen that the performance is better for growth type 2. If growth type 1 is applied, then 3D decomposition performs the best, which could mislead readers into believing that 3D decomposition is always the best for GPU weak scaling, which is not this paper's intention. It should be emphasized again that which decomposition is best for the weak scaling depends on how the grid grows.
Figure 10: Weak scaling performance applying different grid growth methods (Optimized V3)

4.6. CUDA-aware MPI and GPUDirect

The optimizations introduced in Section 4.5 have improved the efficiency significantly on different platforms. However, we are still interested in further improving the scaling performance on NewRiver and Cascades, since they have modern GPUs. Thus, it became important to determine ways of reducing the communication cost by using CUDA-aware MPI and GPUDirect [23]. Later, performance comparisons will be made between using Open MPI, MVAPICH2, and MVAPICH2-GDR with GPUDirect.

HokieSpeed does not support CUDA-aware MPI; thus, all inter-node GPU communications on HokieSpeed had to go through host memory. This staging deteriorates the performance greatly. Using CUDA-aware MPI, we only need to send GPU buffers instead of CPU buffers. CUDA-aware MPI has two performance benefits [23]. First, operations which require message transfer can be pipelined, which improves the memory throughput. Second, acceleration techniques such as GPUDirect can be utilized by the MPI library transparently to the user.

GPUDirect is an umbrella term for several GPU communication acceleration technologies. It provides high bandwidth and low latency communication between NVIDIA GPUs. There are three levels of GPUDirect [24]. The first level is GPUDirect Shared Access, introduced with CUDA 3.1. This feature avoids an unnecessary memory copy within host memory between the intermediate pinned buffers of the CUDA driver and the network fabric buffer. The second level is GPUDirect Peer-to-Peer transfer (P2P transfer) and Peer-to-Peer memory access (P2P memory access), introduced with CUDA 4.0. P2P memory access allows buffers to be copied directly between two GPUs on the same node. The last level is GPUDirect RDMA (Remote Direct Memory Access), with which buffers can be sent from the GPU memory to a network adapter without staging through host memory. This last feature is not supported on NewRiver and Cascades, as it requires specific versions of the drivers (both from NVIDIA for the GPU and Mellanox for the InfiniBand) which are not installed (other dependencies exist on NewRiver and Cascades, particularly parallel filesystems). Although GPUDirect RDMA is not available, the other aspects of GPUDirect can be tested to determine their effect on the scaling performance using MVAPICH2-GDR.
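With CUDA-aware MPI, the host staging in the exchange routines can be removed by handing device addresses directly to MPI. A sketch of the standard OpenACC host_data idiom follows (illustrative names and declarations; not code quoted from the BDC solver):

! Sketch: CUDA-aware MPI exchange of packed, contiguous device buffers.
! host_data exposes the device addresses of the buffers to MPI, so no
! "!$acc update" staging through host memory is needed.
!$acc host_data use_device(send_buffer, recv_buffer)
call MPI_Irecv(recv_buffer, n, MPI_DOUBLE_PRECISION, neighbor, 0, comm, reqs(1), ierr)
call MPI_Isend(send_buffer, n, MPI_DOUBLE_PRECISION, neighbor, 0, comm, reqs(2), ierr)
call MPI_Waitall(2, reqs, stats, ierr)
!$acc end host_data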
In this subsection, we first show the benefits of applying GPUDirect within a node. Table 3 shows the strong scaling performance comparison of different GPU code versions using 2 GPUs (intra-node performance) on NewRiver. The problem size is 256^3. The versions defined here are similar to the versions introduced in Section 4.5: Baseline is the non-optimized GPU version, Optimized V3 uses the pack/unpack in all the available dimensions together with the stencil-based communication method, and GPUDirect denotes the P2P transfer technology applied to either of these versions of the code. Within a node, GPUDirect P2P transfer operates between the memories of two GPUs on the same system/PCIe bus.
Table 3: Strong scaling comparison of different GPU code versions using 2 GPUs (NewRiver)
GPU code versions            Decomposition   ssspnt   Efficiency
Single GPU                   (1,1,1)         93.8     100%
Baseline                     (1,1,2)         145.9    77.8%
Baseline                     (2,1,1)         22.0     11.7%
Optimized V3                 (1,1,2)         167.2    89.2%
Optimized V3                 (2,1,1)         169.8    90.5%
Baseline + GPUDirect         (1,1,2)         155.8    83.1%
Baseline + GPUDirect         (2,1,1)         154.7    82.5%
Optimized V3 + GPUDirect     (1,1,2)         177.9    94.9%
Optimized V3 + GPUDirect     (2,1,1)         179.4    95.7%

Using MVAPICH2, the baseline code decomposed in the i direction performs poorly, reaching only a fraction of the ssspnt of the code decomposed in the k direction (22.0 vs. 145.9). After the series of optimizations, the ssspnt value rises from 22.0 to 169.8, indicating again the importance of coalesced memory access when doing host-device transfers. GPUDirect P2P transfer applied to the baseline code also avoids the cost of host-device transfers and maintains an efficiency of about 83% even though the data is not contiguous. Combining the performance optimizations with the use of GPUDirect improves the efficiency to approximately 95% on 2 GPUs.

Table 4 shows the weak scaling performance comparison of the different GPU code versions using 2 GPUs in the intra-node mode. The results also show that either the optimizations proposed in this paper or GPUDirect (or both, if applicable) should be used when noncontiguous data transfers occur. It is also reasonable to see that the weak scaling generally performs better than the strong scaling.

Table 4: Weak scaling comparison of different GPU code versions using 2 GPUs (NewRiver)
GPU code versions            Decomposition   ssspnt   Efficiency
Single GPU                   (1,1,1)         93.8     100%
Baseline                     (1,1,2)         161.6    86.1%
Baseline                     (2,1,1)         21.4     11.4%
Optimized V3                 (1,1,2)         175.3    93.5%
Optimized V3                 (2,1,1)         176.3    94.0%
Baseline + GPUDirect         (1,1,2)         167.9    89.5%
Baseline + GPUDirect         (2,1,1)         154.9    82.6%
Optimized V3 + GPUDirect     (1,1,2)         180.0    96.0%
Optimized V3 + GPUDirect     (2,1,1)         180.9    96.5%
Since there are three different MPI options (Open MPI, MVAPICH2, and MVAPICH2-GDR with GPUDirect turned on) on NewRiver and Cascades, scaling performance results using the three options are given. Fig. 11 shows the strong scaling performance using the different MPI options. Considering that 3D grid growth is much more common in CFD, such as when applying systematic mesh refinement, the 3D decomposition is of more interest. As mentioned earlier, when applying MVAPICH2 or MVAPICH2-GDR on Cascades, the efficiency drops to about 1% when using 16 GPUs, so the maximum number of GPUs used in that case is 8. Since the results are for the strong scaling, we cannot expect a very high efficiency when scaling up to a large number of GPUs. Using MVAPICH2-GDR generally achieves the best performance, especially when combined with Optimized V3, but it shows a significant performance drop when using 4 GPUs, which is more obvious on Cascades. The performance curves using Open MPI on both platforms are much smoother than those using MVAPICH2 and MVAPICH2-GDR.

Also, a comparison can be made between Optimized V2 and Optimized V3 to address the importance of using more buffers for multi-GPU computing on modern GPUs. Since Optimized V3 generally uses more transfer buffers than Optimized V2, the performance can be much better if decomposition exists in more dimensions.

(a) Open MPI (b) MVAPICH2 (c) MVAPICH2-GDR
Figure 11: Strong scaling performance across platforms (3D decomposition)
Weak Scaling Performance Results.
When measuring the weak scaling performance, grid growth type 2 is applied. Fig. 12 shows the weak scaling performance using the different MPI options. For each MPI option, results using different decompositions are given. MVAPICH2 and Open MPI perform roughly equivalently, but there are still some differences: MVAPICH2 performs better than Open MPI for Optimized V3, and generally worse for Optimized V2. This is reasonable, as MVAPICH2 is designed to reduce communication overhead for complicated communication patterns, but there is some overhead associated with this optimization. It can also be seen that GPUDirect (with MVAPICH2-GDR) brings some performance benefit and performs the best for both Optimized V2 and Optimized V3.

(a) Open MPI (b) MVAPICH2 (c) MVAPICH2-GDR
Figure 12: Weak scaling performance across platforms (3D decomposition)

When overlapping communication and computation, every decomposed block is further separated into two components: an internal and an external domain. For large enough problems, the internal domain has significantly more grid points than the external domain. The internal points do not need data from other blocks, so they can compute their updates while the communication is occurring for the external portion of the block. After communication is finished, the external domain finishes the remaining computation. Overlapping does not reduce latency, but it can hide the latency caused by inter-block communication. In this paper, overlapping communication and computation was applied to both CPUs and GPUs. Communication is always done on CPUs, while computation can be performed on CPUs or GPUs.

Case studies comparing the overlap and non-overlap versions have been made on different platforms, using different decompositions and code versions, for both the strong and weak scaling performance. The overlap version performs more slowly than the non-overlap version. Fig. 13 shows the strong and weak scaling performance for both the CPU and GPU on the NewRiver cluster, with a performance comparison between the overlap version (extended from Optimized V3) and the non-overlap version (Optimized V3). For both the CPU and GPU, overlap performs about 20% to 30% slower than non-overlap for up to 16 processors, which was outside our initial expectation. The main reason is that asynchronous progression is not supported well, potentially caused by the MPI implementation and the communication system used. To figure out whether the asynchronous progression engine was activated or not, we used the NVIDIA Visual Profiler [25] to trace the kernel executions on GPUs and found that the MPI used does not trigger communication until the code reaches an MPI_WAITALL call, although communication is launched as early as possible. Since there is no actual overlap, and the non-overlap version only needs to set up the residual calculation kernel once while the overlap version has to do the setup multiple times (as it contains the internal domain and external domains), this overhead makes the overlap version slower than the non-overlap version.

(a) CPU scaling (b) GPU scaling
Figure 13: Overlap of communication and computation on NewRiver (3D decomposition)
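The intended overlap structure can be sketched as follows (a simplified view of the internal/external split described above; the helper routines and names are assumptions, not the solver's actual interfaces):

! Sketch: intended overlap of communication and computation.
call pack_faces_on_device(soln, send_buffer)     ! assumed helper
!$acc update host(send_buffer)
call MPI_Irecv(recv_buffer, n, MPI_DOUBLE_PRECISION, neighbor, 0, comm, reqs(1), ierr)
call MPI_Isend(send_buffer, n, MPI_DOUBLE_PRECISION, neighbor, 0, comm, reqs(2), ierr)

! The internal domain needs no ghost data, so it can proceed now;
! without asynchronous progression the messages may not actually move yet.
call compute_internal_domain(soln, residual)     ! assumed helper

call MPI_Waitall(2, reqs, stats, ierr)           ! transfers often progress only here
!$acc update device(recv_buffer)
call unpack_faces_on_device(soln, recv_buffer)   ! assumed helper
call compute_external_domain(soln, residual)     ! needs the fresh ghost data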
In fact, the MPI standard [26] does not guarantee an actual overlap, which means it may or may not be possible for communication to make progress after control has returned to the application, depending on the communication software and the underlying hardware. In Ref. [27], it is also concluded that the degree of actual overlap for an application depends on the overlap potential of both the application and the underlying communication subsystem. In our case, we tested the overlap on different Virginia Tech supercomputing platforms using different MPI options and different decompositions, and none of them improved the performance using multiple GPUs. It should be mentioned that the MPI standard allows non-blocking operations to be progressed to completion only when a proper test/wait call is made. Thus, we tried adding many MPI_TESTALL calls (which dip into the MPI progression engine repeatedly) or similar calls in the GPU code right after communication initiation. This makes the overlap slightly better (observed through tracing). However, some overhead is produced by adding these test calls, and the degree of overlap is still not complete. The performance benefits are negligible compared with the overhead for our BDC code.

Our conclusion is that overlapping communication and computation is not a universal performance improvement for all applications and platforms, including the BDC problem using only MPI+OpenACC. Only if both the MPI implementation and the architecture support asynchronous progression can overlap perform well and be used to hide some latency, which is difficult. An alternative way of improving the overlap is using MPI+OpenACC+OpenMP, in which OpenMP is used to generate multiple threads. These threads can work on different tasks, such as computation and communication, so that the actual degree of overlap can be increased [28, 29, 30, 31, 32]. In fact, most of the literature discussing how to improve overlap performance uses multiple threads. Therefore, developers interested in an overlap version of their own codes may need to do some simple tests first and should not depend on overlap alone to obtain high performance. Since multi-threading is not a focus of this paper, no in-depth investigation of it is presented.
5. Conclusions
It is shown in this paper that OpenACC directives offer a convenient way to accelerate a CFD code on multiple platforms. All the platforms can generally use the same code with little code intrusion, which is a big advantage over CUDA and OpenCL. Some general optimizations are examined to improve the multi-GPU code performance, such as the pack/unpack method and the stencil-based communication method. The optimizations introduced are shown to be very effective for both strong scaling and weak scaling, greatly reducing the communication overhead and latency cost on GPUs. Further optimizations such as the overlap of communication and computation, asynchronous progression, and the use of CUDA-aware MPI and GPUDirect are also implemented and discussed. Overlapping communication and computation using only MPI+OpenACC is shown not to be an efficient way to improve the multi-GPU performance. GPUDirect is shown to be effective in a CFD application like the BDC code in this paper, as GPUDirect enables GPUs to communicate with each other directly and also increases the bandwidth between host and device. This avoids overhead between host and device and is important for communication-bound problems. Also, a combination of GPUDirect and the optimizations proposed in this paper can improve both the strong and weak scaling performance substantially. 3D domain decomposition generally performs the best for the strong scaling on different platforms. For weak scaling, which decomposition performs best depends on how the grid grows.
Acknowledgements
The authors would like to thank Andrew J. McCall and Behzad Baghapour for creating the original BDC code as well as giving advice, and to thank Charles W. Jackson for reviewing the paper and participating in various helpful discussions.
References

[1] S. Jamshed, Using HPC for Computational Fluid Dynamics: A Guide to High Performance Computing for CFD Engineers, Academic Press, 2015.

[2] B. Barney, OpenMP, 2018. URL: https://computing.llnl.gov/tutorials/openMP/.

[3] B. Barney, Message Passing Interface (MPI), 2019. URL: https://computing.llnl.gov/tutorials/mpi/.

[4] R. Rabenseifner, G. Hager, G. Jost, Hybrid MPI and OpenMP Parallel Programming, 2013. URL: https://openmp.org/wp-content/uploads/HybridPP_Slides.pdf.

[5] GPUs for Scientific Computing, 2009. URL: https://people.maths.ox.ac.uk/gilesm/talks/bristol_hpc.pdf.

[6] J. Luitjens, Global Memory Usage and Strategy, NVIDIA Corporation, 2011. URL: https://developer.download.nvidia.com/CUDA/training/cuda_webinars_GlobalMemory.pdf.

[7] S. Memeti, L. Li, S. Pllana, J. Kołodziej, C. Kessler, Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption, in: Proceedings of the 2017 Workshop on Adaptive Resource Management and Scheduling for Cloud Computing, ACM, 2017, pp. 1–6.

[8] OpenACC-Standard.org, The OpenACC Application Programming Interface, 2018.

[9] S. Wienke, P. Springer, C. Terboven, D. an Mey, OpenACC — first experiences with real-world applications, in: European Conference on Parallel Processing, Springer, 2012, pp. 859–870.

[10] A. Sabne, P. Sakdhnagool, S. Lee, J. S. Vetter, Evaluating performance portability of OpenACC, in: International Workshop on Languages and Compilers for Parallel Computing, Springer, 2014, pp. 51–66.

[11] T. Hoshino, N. Maruyama, S. Matsuoka, R. Takaki, CUDA vs OpenACC: Performance case studies with kernel benchmarks and a memory-bound CFD application, in: Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, IEEE, 2013, pp. 136–143.

[12] J. Gong, S. Markidis, E. Laure, M. Otten, P. Fischer, M. Min, NekBone performance on GPUs with OpenACC and CUDA Fortran implementations, The Journal of Supercomputing 72 (2016) 4160–4180.

[13] R. Searles, S. Chandrasekaran, W. Joubert, O. Hernandez, MPI + OpenACC: Accelerating radiation transport mini-application, Minisweep, on heterogeneous systems, Computer Physics Communications (2018).

[14] B. P. Pickering, C. W. Jackson, T. R. Scogland, W.-C. Feng, C. J. Roy, Directive-based GPU programming for computational fluid dynamics, Computers & Fluids 114 (2015) 242–253.

[15] B. Baghapour, A. J. McCall, C. J. Roy, Multilevel parallelism for CFD codes on heterogeneous platforms, in: 46th AIAA Fluid Dynamics Conference, 2016, p. 3329.

[16] W. Xue, C. W. Jackson, C. J. Roy, Multi-CPU/GPU parallelization, optimization and machine learning based autotuning of structured grid CFD codes, in: 2018 AIAA Aerospace Sciences Meeting, 2018, p. 0362.

[17] A. J. Chorin, A numerical method for solving incompressible viscous flow problems, Journal of Computational Physics 135 (1997) 118–125.

[18] A. Ytterström, A tool for partitioning structured multiblock meshes for parallel computational mechanics, The International Journal of Supercomputer Applications and High Performance Computing 11 (1997) 336–343.

[19] J. Rantakokko, Partitioning strategies for structured multiblock grids, Parallel Computing 26 (2000) 1661–1680.

[20] HokieSpeed, 2017.

[21] NewRiver, 2019.

[22] Cascades, 2020. URL: https://arc.vt.edu/computing/cascades/.

[23] J. Kraus, An Introduction to CUDA-Aware MPI, 2013. URL: https://devblogs.nvidia.com/introduction-cuda-aware-mpi/.

[24] NVIDIA, NVIDIA GPUDirect, 2019. URL: https://developer.nvidia.com/gpudirect.

[25] Profiler User's Guide, NVIDIA Corporation, 2019. URL: https://docs.nvidia.com/cuda/profiler-users-guide/index.html.

[26] MPI: A message-passing interface standard, 2015. URL: