GPU Optimization for High-Quality Kinetic Fluid Simulation
Yixin Chen, Wei Li, Rui Fan, and Xiaopei Liu
Abstract—Fluid simulations are often performed using the incompressible Navier-Stokes equations (INSE), leading to sparse linear systems which are difficult to solve efficiently in parallel. Recently, kinetic methods based on the adaptive-central-moment multiple-relaxation-time (ACM-MRT) model [1], [2] have demonstrated impressive capabilities to simulate both laminar and turbulent flows, with quality matching or surpassing that of state-of-the-art INSE solvers. Furthermore, due to its local formulation, this method presents the opportunity for highly scalable implementations on parallel systems such as GPUs. However, an efficient ACM-MRT-based kinetic solver needs to overcome a number of computational challenges, especially when dealing with complex solids inside the fluid domain. In this paper, we present multiple novel GPU optimization techniques to efficiently implement high-quality ACM-MRT-based kinetic fluid simulations in domains containing complex solids. Our techniques include a new communication-efficient data layout, a load-balanced immersed-boundary method, a multi-kernel launch method using a simplified formulation of ACM-MRT calculations to enable greater parallelism, and the integration of these techniques into a parametric cost model to enable automated parameter search to achieve optimal execution performance. We also extended our method to multi-GPU systems to enable large-scale simulations. To demonstrate the state-of-the-art performance and high visual quality of our solver, we present extensive experimental results and comparisons to other solvers.
Index Terms—GPU optimization, parallel computing, fluid simulation, lattice Boltzmann method, immersed boundary method

• Yixin Chen, Wei Li, Rui Fan and Xiaopei Liu are all with the School of Information Science and Technology (Shanghai Engineering Research Center of Intelligent Vision and Imaging), ShanghaiTech University, China. E-mails: {chenyx2, liwei, fanrui, liuxp}@shanghaitech.edu.cn.
1 INTRODUCTION
Fluid simulations in computer graphics mostly rely on solving the incompressible Navier-Stokes equations (INSE) using certain types of discretization schemes [3], [4], [5], [6], [7]. While there are a vast number of available solvers for INSE, including some which achieve relatively high accuracy for turbulent flows (e.g. [7], [8]), most methods require solving large global sparse linear systems, such as the pressure equation or equations resulting from implicit formulations for stability, making these solvers difficult to efficiently parallelize, especially on GPUs or for high-resolution simulations. For this reason, local but accurate solvers involving no global equations are desirable.

Over the years, kinetic solvers based on the lattice Boltzmann equations (LBE) [9] have been considered an appealing alternative for high-performance fluid simulations. Unlike many INSE solvers, kinetic solvers only rely on local schemes, and thus by nature are well suited to parallelization. These methods are also conservative by construction, making them theoretically ideal for turbulent flow simulations. However, despite previous attempts [10], [11], [12], [13], kinetic solvers have seldom been used in computer graphics applications due to their poor accuracy and stability. One reason is that most previous works relied on the BGK [14] or raw-moment multiple-relaxation-time (RM-MRT) [15], [16] models, which are known to exhibit noticeable visual artifacts. The recent development of the adaptive-central-moment multiple-relaxation-time (ACM-MRT) model [1], [2] for kinetic solvers has helped overcome these limitations, leading to an accurate and stable solver demonstrating equal and sometimes superior visual quality compared to state-of-the-art INSE solvers (see Fig. 1), while also offering the potential for high computational efficiency. In addition, ACM-MRT-based kinetic solvers can be easily combined with the immersed boundary (IB) method [17] to enable flexible handling of boundaries with complex geometries for both static and moving solids immersed in the fluid domain [2].

Nevertheless, despite the advancements offered by the ACM-MRT model, a straightforward GPU implementation of it does not produce high performance. There have been multiple attempts in the literature to improve the parallelism of GPU-based kinetic solvers, mostly relying on optimizations to data layout [18], [19] or the order of computations [20]. However, these works targeted the traditional BGK [21] and RM-MRT models [15], which typically cannot attain the visual quality needed for graphical fluid simulations. Furthermore, to the best of our knowledge, there is no work on GPU optimizations for enhancing the performance of IB-based kinetic solvers to support the immersion of complex dynamic solids in fluids.

In this paper, we aim to increase the performance of kinetic solvers using the ACM-MRT model and IB method on GPUs by addressing several key issues. First, when using the IB method in a kinetic solver, the data layout should be designed to maximize the correspondence between spatial locality in the simulation domain and locality in GPU memory. For this purpose, we propose a new parametric data layout model which seeks to maximize data access contiguity.
Second, a straightforward parallel implementation of the IB method may introduce substantial load imbalance due to large variations in the number of solid samples around different fluid nodes. We propose a new, more efficient method to implement IB which parallelizes over solid samples instead of fluid nodes to significantly improve load balance. Finally, the ACM-MRT model performs lengthy calculations which require a large number of registers for each GPU thread, leading to reduced occupancy and execution efficiency. We perform multiple kernel launches and simplify the ACM-MRT mathematical expressions to mitigate these issues. These multifaceted optimizations are integrated into a single parametric cost model, allowing us to perform an automated search in the parameter space to obtain the best execution performance. We also extend our techniques to multi-GPU systems and show that our solver can easily and efficiently support high-resolution large-scale simulations.

Fig. 1. Simulation of turbulent smoke passing through complex architecture. With our GPU-optimized kinetic solver on a multi-GPU system, we can efficiently simulate this high-resolution scenario (grid resolution 1200 × … × 840; cf. Table 3).

In summary, we make the following key technical contributions to high-performance fluid simulations using IB-based kinetic solvers with the ACM-MRT model:

• A new parametric data layout to improve memory performance for IB-based kinetic solvers, improving access contiguity for both fluid and solid data.

• An optimized parallel IB implementation with separate domain boundary treatments to improve load balancing and reduce warp divergence.

• Tuning the number of kernel launches and simplifying the mathematical formulation of the ACM-MRT model to improve GPU occupancy and execution efficiency.

• An integrated cost model with automated parameter search to produce the best settings for each simulation instance.

With the above optimizations, our new implementation of a kinetic solver for the ACM-MRT model can run 5 to 10 times faster than a baseline GPU implementation. It also significantly outperforms state-of-the-art GPU-based INSE solvers, being about an order of magnitude faster at the same grid resolution with similar visual quality. Due to its faster rate of convergence, our solver can even produce visual quality similar to INSE solvers while using a significantly lower grid resolution, which leads to a two-order-of-magnitude improvement in performance. High-resolution fluid simulations on multi-GPU systems can also be efficiently implemented. Fig. 1 shows an example of a large-scale multi-GPU simulation at a resolution of 1200 × … × 840 (cf. Table 3). We also present a number of performance comparisons, analyses and visual animations to demonstrate the capabilities of our solver to produce high-quality, robust fluid simulations at practically useful speeds.

2 RELATED WORK
Our work is related to high-performance fluid simulations on the GPU. We first review different fluid simulation methods for both INSE and kinetic solvers, and then review different optimization and parallelization techniques.
A vast number of INSE solvers have been developed in the computer graphics community. Early works stemmed from Stam [22] and used stable semi-Lagrangian advection, but suffered from excessive numerical diffusion. This issue was progressively improved using higher-order schemes [7], [8], [23], [24], [25], [26]. Another way to counteract numerical diffusion is to use vortex-based methods [27], [28], [29], [30], [31], [32], [33], [34], where the velocity field is corrected by vorticity distributions. Noise-based methods [35], [36] are a cheap way to preserve flow details, but the resulting simulation may be non-physical and appear unrealistic. These approaches have recently been improved using neural network-based methods [37], [38], [39], [40]. While the above works employed only uniform grids, non-uniform grids [4], [41], [42] or even tetrahedral meshes [3], [43], [44] have been used to perform adaptive computations. In addition to grids, INSE can be solved using particle methods [5], [45], [46], [47], [48], with more recent techniques allowing for volume conservation [49], improved incompressibility [50] and better boundary treatments [51], [52]. To retain the benefits of both grids and particles for better advection and incompressibility, hybrid methods combining particles and grids have been developed [6], [33], [53], [54], [55]. Note that particle-based methods without an auxiliary grid may not need global sparse linear system solvers, but they usually have difficulty simulating turbulent flows unless a very large number of particles is used.
Instead of solving INSE directly, kinetic methods solve the more general Boltzmann equation, which approximates INSE with different degrees of accuracy depending on how relaxation in collisions is modeled. Early kinetic solvers used the lattice BGK (or single-relaxation-time) model [14]. To improve accuracy and stability, the raw-moment multiple-relaxation-time (RM-MRT) model [15] was developed, but it had problems with turbulent flows due to violation of Galilean invariance, and leads to strong dispersion errors at high Reynolds numbers. Recent improvements in relaxation modeling include the cascaded relaxation [56], [57] and central-moment multiple-relaxation-time (CM-MRT) models [58], [59], especially the non-orthogonal version [60], which much better respects Galilean invariance. However, these works still failed to address how to determine high-order relaxation parameters, which has been found important for turbulent flow simulations. Based on observations and analyses of the non-orthogonal CM-MRT model, Li et al. [1], [2] proposed adaptive schemes for relaxation parameters based on local flow characteristics. This model, called ACM-MRT, allows for much more accurate simulations of both laminar and turbulent flows.

When solids are immersed inside the fluid domain, boundary conditions need to be satisfied. Bounce-back schemes have been used to enforce no-slip [61] or slipping [62] conditions. To improve accuracy, curved boundary conditions were developed [63], which can also be extended to support dynamic solids [64]. However, these methods may encounter serious problems for turbulent flows when the grid resolution around the solid boundary is not sufficiently high. A more flexible technique for boundary treatment is based on the immersed boundary method [17], [65], [66], which can be used to handle both static and dynamic solid boundaries using penalty forces.
As mentioned earlier, most INSE solvers require solving large global equations and are therefore difficult to parallelize. Parallel INSE solvers for multicore systems were developed using OpenMP [67] and extended to multi-node cluster systems using MPI [68]. In computer graphics, Mashayekhi et al. [69] proposed a system called "Nimbus", which automatically distributes grid-based and hybrid simulations across cloud computing nodes for faster execution at higher grid or particle resolutions.

INSE solvers can also be parallelized on the GPU. Early works tried to map the entire computation into texture memory and use fragment programs to solve INSE [70], [71], [72]. With the advent of CUDA [73], optimized parallel sparse linear solvers became available [74] on single and multi-GPU systems, making GPU implementations of many INSE solvers much simpler [75], [76], [77], [78].

Recently, GPU optimizations for the parallel material-point method (MPM) were proposed to accelerate a hybrid fluid solver [79]. In addition, an efficient large-scale fluid simulator on the GPU using the fluid-implicit particle (FLIP) method [80] and a high-level, data-oriented programming language for sparse data structures (Taichi) [81] has been proposed. To utilize the computational advantages of both CPUs and GPUs, some works proposed parallel INSE solvers on hybrid heterogeneous systems [82], [83], [84].
While INSE solvers are inherently difficult to parallelize, kinetic solvers use a local formulation and thus are naturally scalable and well suited for implementation on highly parallel GPUs. Similar to earlier parallelization methods for INSE solvers on the GPU, texture and render buffers were initially used to accelerate kinetic solvers [85], [86]. Using CUDA, GPU implementations of kinetic solvers became much simpler. However, a straightforward implementation of a kinetic solver does not achieve very good performance. One issue is memory coalescing, which is crucial for high GPU performance. The array-of-structures (AoS) data layout, which performs well on CPUs [87], was replaced by a structure-of-arrays (SoA) data layout on the GPU [20]. The collected SoA (CSoA) data layout, which combines AoS and SoA, was recently proposed [19] to further enhance performance and allow better caching when solid boundaries are complex or domain boundaries are irregular [18].

To reduce non-coalesced memory accesses, the use of a single kernel launch with swapping for both the streaming and collision stages of a simulation has been proposed [88]. Since non-coalesced reads require less time than non-coalesced writes, the pull-in scheme was argued to be better than the push-out scheme [89]. With the above optimizations, kinetic solvers using different lattice models (D2Q9 [90], D3Q13 [20], D3Q19 [91], [92]) were proposed on the GPU, based mostly on the BGK or RM-MRT collision models. In addition, to simulate larger scenarios, multi-GPU kinetic solvers were proposed using CUDA [93], [94] and OpenMP [95]. To our knowledge, there has been no work on GPU optimizations for kinetic solvers with the D3Q27 lattice for either the BGK or RM-MRT collision models.
Our contributions.
Most of the GPU optimizations for kinetic solvers described above were for fluids only. When complex solids are present in fluids and boundary conditions are enforced using the immersed boundary method, the previous techniques may lead to poor performance due to sub-optimal data layout, load imbalance and warp divergence. Furthermore, when the ACM-MRT model is used to achieve high visual quality, a straightforward implementation may result in low GPU occupancy and further depress performance. This paper addresses these issues and presents a simulator which is substantially faster than state-of-the-art GPU-based INSE solvers, while at the same time producing equal or higher visual quality.
3 COMPUTATIONAL FRAMEWORK
Before introducing our GPU optimizations in detail, we first briefly review the computational framework employed for fluid simulations in this paper. Fluid flows are usually modeled using the Navier-Stokes (NS) equations, but can alternatively be described by the Boltzmann equation (BE) [96], which is more general and can recover the NS equations under certain constraints [9]. BE can usually be solved by a set of lattice Boltzmann equations (LBE) [9], which are local and conservative with respect to density and momentum:

f_i(x_k + c_i, t + 1) − f_i(x_k, t) = Ω_i(ρ, u) + G_i(g).   (1)
Fig. 2.
Lattice structure.
The D3Q27 lattice structure, where the c_i's are discretized microscopic velocities associated with the corresponding f_i.

Here, k is the position index of each grid node; t ∈ {0, 1, ...} is the time step index, and i ∈ {0, 1, ..., 26} is the index of the discrete particle velocities c_i used in the D3Q27 lattice structure of the ACM-MRT model.¹ Each c_i is associated with a corresponding discretized distribution function f_i, and x_k + c_i is exactly located at a nearby node. Finally, Ω_i and G_i are the discretized collision and external force terms, and the macroscopic density ρ and velocity u can be obtained by taking the following moments w.r.t. c_i:

ρ = Σ_i f_i,   u = (1/ρ) Σ_i c_i f_i.   (2)

Due to its simple algebraic formulation, the solution procedure for LBE can usually be split into the following three steps, starting with streaming:

f*_i(x, t) = f_i(x − c_i, t).   (3)

Then, macroscopic quantities (ρ and u) are computed (Eq. 2) and boundary treatment is applied. Finally, we perform the collision step with an external force term G_i:

f_i(x, t + 1) = f*_i(x, t) + Ω_i(ρ*, u*) + G_i,   (4)

where ρ* and u* are computed from the streamed distribution functions f*_i after applying boundary treatments.

Key to ensuring accurate and stable fluid simulations for kinetic solvers is the formulation of Ω_i. Here, we employ the ACM-MRT model [1], [2], which is constructed by a relaxation process with multiple rates:

Ω = −M⁻¹DM(f − f^eq) = −M⁻¹D(m − m^eq),   (5)

where for each grid node, f, f^eq, m, m^eq and Ω are vectors containing the assembly of distribution functions f_i, their equilibria f_i^eq, their corresponding distributions in moment space m_i and m_i^eq, and the collision results Ω_i, respectively.
M is a 27 × 27 moment-space projection matrix, and D is a diagonal matrix with the same dimensions containing different relaxation rates. Low-order relaxation rates are determined by physical parameters (e.g., the kinematic viscosity ν), while high-order relaxation rates should be determined adaptively, balancing local dissipation and dispersion errors (see [1] and the supplementary document of [2]).
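As a concrete illustration of Eq. (2), the following CUDA sketch computes ρ and u at each fluid node. The SoA indexing, array names, and the ordering of the c_i are illustrative assumptions, not the paper's actual implementation (the specific c_i values are given in the supplementary document).

// Minimal sketch of the moment computation in Eq. (2), assuming an SoA
// layout where f[i * N + k] holds f_i of fluid node k. The velocity table
// c[27][3] is assumed to be filled with the model's D3Q27 velocity set.
__constant__ int c[27][3];   // discrete velocities c_i, uploaded at startup

__global__ void computeMoments(const float* __restrict__ f,
                               float* rho, float3* u, int N)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= N) return;

    float r = 0.f;
    float3 m = make_float3(0.f, 0.f, 0.f);
    for (int i = 0; i < 27; ++i) {
        float fi = f[i * N + k];   // coalesced over k for a fixed i
        r   += fi;                 // rho   = sum_i f_i
        m.x += c[i][0] * fi;       // rho*u = sum_i c_i f_i
        m.y += c[i][1] * fi;
        m.z += c[i][2] * fi;
    }
    rho[k] = r;
    u[k]   = make_float3(m.x / r, m.y / r, m.z / r);
}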
1. See Fig. 2 for an illustration, and the supplementary document for the specific values of c_i.

Fig. 3.
Surface sampling used for the immersed boundary method.
We employ Poisson-disk sampling to generate uniformly distributed solid samples over the surface, which are used to enforce boundary conditions at the solid boundary. Note that the samples here are for illustration only; in practice the sampling should be much denser.
When solids are immersed inside fluids, boundary conditions need to be satisfied. The traditional method for boundary treatment in kinetic solvers is to use bounce-back schemes [12], [97]. However, when dealing with curved or moving solids, these schemes are either inaccurate or inflexible. An alternative method which balances flexibility and accuracy for boundary treatment is the immersed boundary (IB) method [2], [17], [65], [66] we employ in this paper, which usually follows three main steps (a sketch of the force computation follows the list):

• Solid surface sampling. First, we uniformly sample the solid surface using Poisson-disk surface sampling [98], see Fig. 3. The sampling should be dense enough such that each fluid cell around the solid boundary contains 10 to 100 samples (depending on the local geometry of the solid) to ensure sufficient accuracy.

• Force computation at solid samples. A solid sample x_s may not be located exactly at a fluid grid node (see Fig. 4 for a 2D illustration). Thus we first interpolate fluid velocities u(x_s) from nearby fluid nodes (cf. the green box in Fig. 4) using a kernel function K(·):

u(x_s) = Σ_{i∈N(x_s)} K(‖x_s − x_f^i‖) u(x_f^i).   (6)

u(x_s) may differ from the desired boundary velocity u_b(x_s), so a penalty force should be applied to the fluid at the solid sample location x_s:

g_{s→f}(x_s) = ρ [u_b(x_s) − u(x_s)].   (7)

• Force spreading to fluid nodes. Since g_{s→f}(x_s) is applied at the solid sample, we need to spread it to the nearby fluid nodes x_f using the same kernel:

g_{s→f}(x_f) = Σ_{i∈N(x_f)} K(‖x_s^i − x_f‖) g_{s→f}(x_s^i).   (8)

We use the smallest 2 × 2 (2 × 2 × 2 in 3D) neighborhood of fluid nodes as the support of the kernel K.
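To make Eqs. (6) and (7) concrete, the CUDA sketch below computes the penalty force for one solid sample. It assumes a separable (tensor-product) form of the kernel K over the 2 × 2 × 2 support and interpolates the density the same way; kernelK and all array names are hypothetical stand-ins for the solver's actual data structures.

// 1D hat kernel with support [-1, 1]; a common choice for a 2x2x2 stencil.
__device__ float kernelK(float r) { return fmaxf(0.f, 1.f - fabsf(r)); }

__global__ void samplePenaltyForces(const float3* __restrict__ uGrid,
                                    const float*  __restrict__ rhoGrid,
                                    const float3* __restrict__ samplePos,
                                    const float3* __restrict__ sampleVelB, // u_b
                                    float3* sampleForce, int numSamples,
                                    int nx, int ny)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSamples) return;

    float3 xs = samplePos[s];
    int x0 = (int)floorf(xs.x), y0 = (int)floorf(xs.y), z0 = (int)floorf(xs.z);

    float3 us  = make_float3(0.f, 0.f, 0.f);   // interpolated u(x_s), Eq. (6)
    float  rho = 0.f;                          // density interpolated likewise
    for (int dz = 0; dz <= 1; ++dz)
    for (int dy = 0; dy <= 1; ++dy)
    for (int dx = 0; dx <= 1; ++dx) {
        int k   = ((z0 + dz) * ny + (y0 + dy)) * nx + (x0 + dx);
        float w = kernelK(xs.x - (x0 + dx)) * kernelK(xs.y - (y0 + dy))
                * kernelK(xs.z - (z0 + dz));
        us.x += w * uGrid[k].x; us.y += w * uGrid[k].y; us.z += w * uGrid[k].z;
        rho  += w * rhoGrid[k];
    }
    // Penalty force, Eq. (7): g = rho * (u_b(x_s) - u(x_s)).
    float3 ub = sampleVelB[s];
    sampleForce[s] = make_float3(rho * (ub.x - us.x),
                                 rho * (ub.y - us.y),
                                 rho * (ub.z - us.z));
}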
Fig. 4.
Implementing the immersed boundary method.
We illustrate two ways to compute the penalty force at each fluid node bordering the solid boundary. The fluid velocity is first interpolated from the neighboring fluid nodes (green box). Then, to compute the penalty force at each fluid node: 1) boundary samples are first found in a region around the fluid node, possibly using a grid-accelerated data structure, then the forces are summed with a spreading kernel to form the force at the fluid node (blue box); or 2) starting from boundary samples, their penalty forces are added directly to the nearby fluid nodes (red box).
With the procedures described above, we can produce a non-optimized baseline GPU implementation of an IB-based kinetic solver with the ACM-MRT model as follows:

• Initialization.
Before starting the simulation, we allocate global memory on the GPU for the f_i values at times t and t + 1, and for the density ρ, velocity u and external force g at time t, for each fluid node in the simulation. Vector fields (e.g., f and u) are stored as linear arrays of vectors.

• Streaming.
At each simulation time step t, we parallelize over the fluid nodes, and advect f_i from the neighbors of each fluid node (cf. Eq. (3)) by reading and writing to GPU global memory.

• Calculating macroscopic fields.
After streaming, we again parallelize over the fluid nodes and compute the density ρ and velocity u at each node (cf. Eq. (2)).

• Boundary treatment.
We employ the IB method for boundary treatment. When computing penalty forces at solid samples (cf. Eqs. (6) and (7)), we parallelize over the sample points and access the nearby fluid nodes (cf. the green box in Fig. 4). However, when spreading forces (cf. Eq. (8)), we parallelize over the fluid nodes in the bounding box(es) of the solid object(s) (cf. the blue box in Fig. 4) and access the nearby solid samples. Note that this parallelization procedure directly follows the order of the equations defining the IB method.

• Collision.
After boundary treatment, we perform collisions using Ω_i and the force term G_i (cf. Eq. (4)), which is done by parallelizing over the fluid nodes. We perform all collision computations in a single kernel, in keeping with previous GPU-based lattice Boltzmann implementations.

• Update.
Finally, with Ω_i and G_i computed, we update f_i and move to time step t + 1.

Lastly, note that the matrix-based computations during the collision step (cf. Eq. 5) are typically implemented by fully expanding the computation into arithmetic operations between scalar values, rather than through calls to matrix multiplication subroutines.

4 HIGH-PERFORMANCE GPU OPTIMIZATIONS
The baseline GPU implementation described in Sec. 3.2 is expected to be highly scalable due to the local formulation of the kinetic solver. However, as we discussed in Sec. 1, a number of optimizations can be performed to further improve its speed by more than an order of magnitude in certain simulation scenarios. In this section, we present a number of optimization techniques, and also show how to integrate these techniques into an automated optimal parameter search framework to enable fast fluid simulations.
Before introducing our GPU optimizations in detail, we first describe how threads are usually assigned in GPU-based kinetic fluid simulations containing immersed solids. Depending on which part of the simulation is being performed, a GPU thread can be assigned to perform the work at either a fluid node or a solid sample. As fluid nodes are typically laid out in a 3D grid, we use 3D thread blocks to index threads assigned to the fluid nodes. We ensure that the size of the first dimension of the thread block is a multiple of the GPU warp size (typically 32), while the sizes of the other dimensions are user-tunable and depend on the type of GPU being used. Solid samples, while used to describe a 3D surface, are typically specified using a 1D index. Therefore, we also use 1D indices, possibly with a permuted ordering, for threads assigned to solid samples.
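A minimal sketch of this thread assignment is shown below; the kernel names and the non-warp block dimensions (4 and 2 here) are illustrative choices, not fixed values from the paper.

// 3D blocks for fluid nodes: the first dimension matches the warp size so
// that consecutive threads of a warp map to consecutive nodes in x.
dim3 fluidBlock(32, 4, 2);
dim3 fluidGrid((nx + fluidBlock.x - 1) / fluidBlock.x,
               (ny + fluidBlock.y - 1) / fluidBlock.y,
               (nz + fluidBlock.z - 1) / fluidBlock.z);
streamingKernel<<<fluidGrid, fluidBlock>>>(/* fluid-node arrays */);

// 1D blocks for solid samples, whose ordering may be permuted (see below).
int sampleBlock = 256;
int sampleGrid  = (numSamples + sampleBlock - 1) / sampleBlock;
ibKernel<<<sampleGrid, sampleBlock>>>(/* solid-sample arrays */);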
A number of previous works on GPU-optimized kinetic solvers [18], [100], [101] have pointed out the importance of data layout on simulation performance. GPU main memory is laid out in segments of consecutive addresses, typically 128 bytes in size. GPU threads execute in units called warps, and a warp of threads accessing data from a single memory segment achieves much higher performance than a warp accessing data from multiple segments. Thus, an important principle in GPU memory optimization is memory coalescing, i.e., ensuring a warp of threads accesses contiguous, segment-aligned memory locations.
We first look at how data for fluid nodes should be laid out in GPU memory. Fluid data can be described using a number of vector fields, e.g., the different f_i's, u and g. Collections of vectors are usually stored in one of two ways: as an array-of-structures (AoS) [87], where each element in an array is a structure storing all vector components of a fluid node (cf. Fig. 5 (top) for an AoS layout of the f_i vectors), or as a structure-of-arrays (SoA), where each vector component of the fluid nodes over the entire domain is first grouped together and then stored consecutively in an array (cf. Fig. 5 (middle)). Recall that we assign one thread to process each fluid node. During the streaming and collision steps of the simulation, threads need to access f_i values from neighboring nodes. To coalesce these memory accesses, many works adopt the SoA layout for fluid data [20], [93]. Recent works have proposed the CSoA data layout [18], [19] to further improve coalescing and caching effects. Let α denote the number of fluid nodes that we collect into a group (cf. Fig. 5 (bottom)) and β the number of components that each fluid node vector contains (e.g., β = 27 for f in the D3Q27 model). Then the i'th component of the vector for the k'th fluid node is stored at the following memory location:

βα⌊k/α⌋ + αi + (k mod α).   (9)
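Eq. (9) translates directly into an index helper; the sketch below is a straightforward transcription (the function name is ours).

// Address (in elements) of component i of fluid node k under the CSoA
// layout: beta*alpha*floor(k/alpha) selects the group, alpha*i skips the
// preceding components within the group, and k % alpha locates the node.
__host__ __device__ inline size_t csoaIndex(size_t k, size_t i,
                                            size_t alpha, size_t beta)
{
    return beta * alpha * (k / alpha) + alpha * i + (k % alpha);
}

// Example: alpha = 4, beta = 27; the f_2 value of node k = 5 lives at
// 27*4*1 + 4*2 + 1 = 117.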
Fig. 5.
Different data layouts of fluid nodes in memory.
AoS: all f_i values from a node are stored consecutively, followed by the f_i values of subsequent nodes. SoA: for each i ∈ {0, . . . , 26}, the f_i values for all the nodes are stored consecutively. CSoA: for each i ∈ {0, . . . , 26}, the f_i values of α nodes, for some choice of α, are stored consecutively.

Fig. 6. Bounce-back scheme for a no-slip boundary.
During streaming at each fluid node, some nearby grid nodes may be located inside the solid region (e.g., the purple nodes in (a) and (b)), and the corresponding distribution functions (gray arrows) are unknown and need to be reconstructed. The bounce-back scheme uses the distribution functions in the opposite directions (arrows with the same color in (b)) to fill in these unknown distribution functions, ensuring the no-slip condition. Since the solid boundary location around a fluid node may be arbitrary, conditional statements are needed in the kernel code to identify which set of nearby grid nodes are inside solid regions and require bounce-back treatment. Since different kernel threads may follow different conditional branches, warp divergence may occur and slow down the simulation.

When solid objects are immersed inside the fluid domain, boundary conditions need to be satisfied at the solid surface. As stated in Sec. 1, we use the IB method to enforce boundary conditions instead of the traditional bounce-back scheme. This is because the bounce-back scheme (as shown in Fig. 6 and Alg. 1) requires each grid location to be labeled as either solid or fluid, leading to conditional identification of the distribution functions and warp divergence, which slows down the simulation. The IB method, on the other hand, only uses penalty forces around solid boundaries, which allows overlap of solid and fluid regions and does not require conditional identification of solid nodes, leading to better parallel performance.

The main problem we encountered when dealing with solids is that solid samples may initially be stored in an arbitrary order in GPU memory, depending on how the solid surface was triangulated or how samples within each triangle were ordered.
Algorithm 1
Bounce-back scheme in kinetic solver

if the grid node at x is a fluid node then
    for each direction c_i of the node do
        if the grid node at x − c_i is a solid node then
            set f*_i(x, t) = f*_{i′}(x, t);   ▷ f*_{i′} is opposite to f*_i
        end if
    end for
end if
▷ Bounce-back scheme on f*_i after streaming.

Fig. 7.
2D illustration of the data layout for fluid nodes and solid samples in memory.
To store solid samples, we (a) first find the bounding box of the solid, and then (b) divide the box into blocks of size ℓ × ℓ (in 3D we use ℓ × ℓ × ℓ). These blocks are stored in the same order as the fluid nodes, as shown by the zig-zag black arrows. Within each block, solid samples are first stored in the order they are sampled based on the original triangulation. To further improve memory access performance, these solid samples are Morton-coded and stored in Z-order, where the shading of each sample in (c) represents the ordering of the samples in memory (lighter colors indicate smaller indices). The fluid nodes are stored in the CSoA format, with the interpolation parameter α specifying the number of repeating f_i's, i.e., the number of fluid nodes in each group.

Recall from Sec. 3 that velocity is interpolated at each solid sample from nearby fluid nodes and penalty forces are spread to fluid nodes from solid samples. Computing these quantities requires solid samples to access nearby fluid data, and hence the order of the solid samples affects the memory access pattern and performance. Note that this behavior has not been previously discussed in the literature.

To alleviate the above problem, we observe that by processing nearby solid samples using consecutive threads, it may be possible to improve cache hit rates for fluid data accesses and improve memory performance. Based on this idea, we define a parameter ℓ and partition the fluid grid around the solid into blocks of size ℓ × ℓ × ℓ, ordering these blocks by the natural x-y-z order. We then order the solid samples based on the block they lie in, so that samples from an earlier block are stored before samples from a later block. Within each block, we order the samples using Z-ordering, by computing the Morton code [102] for each sample and sorting the codes within the block (cf. Fig. 7(c)). We determine the optimal ℓ using our parametric cost model, as described later in Sec. 4.5. Note that this is a new data layout for improving coalescing, and has not been proposed in previous IB-based implementations of the kinetic method.
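The ordering step can be realized with a standard 3D Morton encoder; the host-side sketch below sorts sample indices by (block, Z-order within block). The 10-bit bit-interleaving routine is a common textbook implementation, and the data types and function names are our own assumptions.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Sample { float x, y, z; };

// Interleave the low 10 bits of v with two zero bits between each bit.
static uint64_t spreadBits(uint64_t v) {
    v &= 0x3ff;
    v = (v | v << 16) & 0x30000ff;
    v = (v | v << 8)  & 0x300f00f;
    v = (v | v << 4)  & 0x30c30c3;
    v = (v | v << 2)  & 0x9249249;
    return v;
}
static uint64_t mortonCode(uint32_t x, uint32_t y, uint32_t z) {
    return spreadBits(x) | (spreadBits(y) << 1) | (spreadBits(z) << 2);
}

// Sort sample indices by the l x l x l block they lie in (blocks in natural
// x-y-z order), breaking ties inside a block by the Morton code of the
// sample's local coordinates.
void orderSamples(const std::vector<Sample>& pos, std::vector<int>& order,
                  int l, int nbx, int nby)  // nbx, nby: blocks per axis
{
    auto key = [&](int s) -> uint64_t {
        uint32_t x = (uint32_t)pos[s].x, y = (uint32_t)pos[s].y,
                 z = (uint32_t)pos[s].z;
        uint64_t block = ((uint64_t)(z / l) * nby + y / l) * nbx + x / l;
        uint64_t local = mortonCode(x % l, y % l, z % l);
        return (block << 30) | local;   // local Morton code fits in 30 bits
    };
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return key(a) < key(b); });
}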
Fig. 8.
2D illustration of caching for different block sizes.
Assume each warp has 8 threads. In both (a) and (b), the black points represent solid samples processed by threads in one warp, with one thread assigned to each point. When ℓ = 1 as in (a), the points are dispersed, whereas for ℓ = 2 as in (b) the points are more localized. Now, assume 3 fluid nodes are stored in each memory segment. In (a) and (b), each rectangle marks the fluid nodes stored in one memory segment. In (a), four memory segments need to be read to load the fluid data needed by the solid samples, while in (b) only three segments are needed due to greater sample locality. This shows how varying ℓ can improve memory bandwidth and caching performance.

To explain the utility of our proposed sample ordering for improving memory performance, we refer to Fig. 8 for a 2D illustration. Assume in this example that a GPU warp contains 8 threads. In Fig. 8(a) we use ℓ = 1, whereas in Fig. 8(b) we use ℓ = 2. The black points in both subfigures show the solid samples processed by one warp, with one thread assigned to each sample. Note that due to the larger ℓ, samples are more localized in Fig. 8(b). Now, assume that 3 fluid nodes are stored in each segment of GPU memory. Fig. 8(a) shows that 4 memory segments, marked by the orange rectangles, need to be read to load the fluid data needed by the warp of solid samples when ℓ = 1, whereas in Fig. 8(b) only 3 memory segments need to be read for ℓ = 2. This shows that differences in the block size can cause variations, sometimes quite significant, in the amount of memory bandwidth consumed and cache hits obtained.
As stated earlier, the use of the IB method removes the conditional branching which occurs during boundary treatment in the bounce-back schemes used in many kinetic solvers. However, a baseline implementation of the IB method requires different numbers of loop iterations for different fluid nodes around the solid boundary, with the variation often being quite large, and this leads to substantial load imbalance between the threads. In this section, we propose additional changes to the traditional IB implementation to improve load balancing.
We first show how a baseline implementation of the IB method can result in load imbalance among threads. Consider the force spreading step described in Sec. 3.1 (cf. Eq. (8)). Here, threads are assigned to fluid nodes, and the penalty force at a fluid node is computed by summing together all the penalty forces, computed immediately after interpolation by Eq. (6), at nearby solid samples (this step is also called "gathering" [103]), weighted by a kernel K over a 2 × 2 (2 × 2 × 2 in 3D) neighborhood. The number of nearby samples varies from node to node, resulting in large variance in the number of samples in the force summation at different fluid nodes (see Table 1 for the variation in the number of loop iterations in an example computation). This problem is especially severe when the solid geometry is complex and the local geometric roughness changes with location.

TABLE 1
Comparison of the number of loop iterations using "gathering" and "scattering", and the improvement in performance.

                                      number of loops
                                      min.    max.    time (ms)
gathering with grid acceleration        2     754      53.9
scattering with atomic addition         8      27       6.1
While we saw above that parallelizing across fluid nodes can result in substantial load imbalance, the imbalance can be significantly reduced if we instead parallelize over the solid samples. This technique is also called "scattering", and has been used in other GPU-based particle simulations, e.g. [103]. In particular, we assign one GPU thread to process each solid sample. In the interpolation process, each sample interpolates fluid velocities from nearby grid nodes, while in the force spreading process, each solid sample redistributes its penalty force to nearby fluid nodes, as shown in the red box in Fig. 4. Eq. (8) is equivalent to adding the kernel-weighted penalty force K(‖x_s − x_f^i‖) g_{s→f}(x_s) to each grid node x_f^i in the same 2 × 2 (2 × 2 × 2 in 3D) neighborhood. While the number of fluid nodes still varies for each solid sample, the variation is much less than the variation in the number of solid samples in each neighborhood, leading to greatly reduced load imbalance (see Table 1 for the reduction in load imbalance using scattering). Note that since multiple solid samples may try to add forces to the same fluid node, we use atomic additions to avoid race conditions.
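A sketch of the scattering kernel for Eq. (8) follows, reusing the hypothetical kernelK and 2 × 2 × 2 support from the earlier IB sketch; the force field is stored as three scalar arrays so that per-component atomic additions can be used.

__global__ void spreadPenaltyForces(const float3* __restrict__ samplePos,
                                    const float3* __restrict__ sampleForce,
                                    float* gx, float* gy, float* gz,
                                    int numSamples, int nx, int ny)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per sample
    if (s >= numSamples) return;

    float3 xs = samplePos[s], g = sampleForce[s];
    int x0 = (int)floorf(xs.x), y0 = (int)floorf(xs.y), z0 = (int)floorf(xs.z);

    for (int dz = 0; dz <= 1; ++dz)
    for (int dy = 0; dy <= 1; ++dy)
    for (int dx = 0; dx <= 1; ++dx) {
        int k   = ((z0 + dz) * ny + (y0 + dy)) * nx + (x0 + dx);
        float w = kernelK(xs.x - (x0 + dx)) * kernelK(xs.y - (y0 + dy))
                * kernelK(xs.z - (z0 + dz));
        // Race-free accumulation: several samples may hit the same node.
        atomicAdd(&gx[k], w * g.x);
        atomicAdd(&gy[k], w * g.y);
        atomicAdd(&gz[k], w * g.z);
    }
}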
The method described above significantly reduces load imbalance for the IB method at solid boundaries. However, special treatment is still needed for the domain boundary. Since the IB method is difficult to apply at domain boundaries, at these locations we resort to the traditional bounce-back scheme. Recall that the main drawback of this method is that it causes warp divergence due to conditional branching. To mitigate this effect, we perform bounce-back using 6 different kernels for the 3D case (4 different kernels for the 2D case), one for each face of the domain boundary. For each face, threads only need to check one condition, namely whether they border the face being processed, while the distribution functions which need bounce-back treatment are known in advance and do not need to be determined using conditionals. Overall, performance is improved using multiple kernel launches instead of a single one.
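As an illustration of the per-face strategy, the sketch below handles only the x = 0 face: the nine unknown directions (those streamed in from outside the domain, i.e. with c_ix = +1 in D3Q27) form a fixed table known in advance, so the kernel contains no per-thread solid/fluid test. Plain SoA indexing is used for brevity instead of the CSoA mapping of Eq. (9), and the index tables are assumptions.

__constant__ int IN_X[9];   // the 9 directions with c_ix = +1 (unknown at x = 0)
__constant__ int OPP[27];   // opposite directions: c_{OPP[i]} = -c_i

__global__ void bounceBackLeftFace(float* f, int nx, int ny, int nz, int N)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    int z = blockIdx.y * blockDim.y + threadIdx.y;
    if (y >= ny || z >= nz) return;   // the only conditional: face bounds

    int k = (z * ny + y) * nx;        // node (0, y, z) on the left face
    for (int j = 0; j < 9; ++j) {
        int i = IN_X[j];
        // Replace each unknown post-streaming f_i by its opposite direction.
        f[i * N + k] = f[OPP[i] * N + k];
    }
}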
GPU occupancy refers to the number of active GPU threads, which should be maximized in order to make full use of the processing capabilities of the GPU. One factor which limits occupancy is the amount of resources, such as registers or shared memory, needed by each thread. The ACM-MRT-based kinetic solver we employ produces significantly higher-quality results compared to other kinetic methods. However, this comes at the cost of needing to perform very lengthy arithmetic computations in the collision process, which in turn requires a large number of GPU registers and results in low occupancy. Thus, to improve performance, it is important that we relieve register pressure.
One straightforward strategy for reducing the number of required registers is to separate the most complex part of the collision computation, namely the calculation of the Ω_i (i ∈ {0, . . . , 26}) values, into multiple kernel launches. That is, each kernel launch only computes a portion of the Ω_i values. As we increase the number of kernel launches, the number of registers needed by each thread decreases. However, each kernel launch requires reading and writing partially computed f_i values from and to the GPU main memory. Thus, a tradeoff exists between the higher occupancy enabled by a larger number of kernel launches and the cost of a larger number of memory accesses. In practice, we found that using two kernel launches typically achieved the best balance and highest performance on our GPU system.

In addition to separating the computation of different Ω_i's into multiple kernels, another strategy to reduce the number of required registers is to simplify the lengthy arithmetic expressions used in computing collision. The specific form of the expression for one of the collision terms used in the ACM-MRT model is given in the supplementary document and will not be covered in detail here. However, a key step involves the computation of M⁻¹D(m − m^eq). For example, the expression for Ω_0 expands into a long weighted sum of monomials built from the f̃_i and the velocity components,

Ω_0 = Σ_j ω_j f̃_{i_j} u_x^{p_j} u_y^{q_j} u_z^{r_j},   (10)

with orders p_j, q_j, r_j ∈ {0, 1, 2} and small rational weights ω_j (e.g., ±1/12, ±1/16, ±1/24); even one sixth of the fully expanded expression runs to dozens of such terms (the complete expansion is given in the supplementary document). Here, f̃_i = [DM(f − f^eq)]_i, and [·]_i picks the i'th element out of a vector. Computing this and similar expressions requires dozens of registers for each GPU thread.

Fig. 9. An example of a jet flow passing over a slanted airplane obstacle, used in the performance comparisons in Figs. 11 to 15.

TABLE 2
Example of monomials and their corresponding weights used in computing M⁻¹D(m − m^eq) in the ACM-MRT collision model.

This register count can be reduced with the observation that the expressions for different Ω_i's share many similar terms, with the only difference being the coefficients. In general, all the terms appearing in an expression can be written as sums of monomials of the form ω(ρ) ∏_{α=0}^{2} u_α^{p_α}, where p_α ∈ {0, 1, 2} is the order of the α'th component. Different Ω_i's may have different combinations of orders with different weights ω(ρ). Within each Ω_i, any terms having the same orders p_0, p_1, p_2 but different weights ω(ρ) can be merged, as shown in Table 2. Across different Ω_i's, we first compute the values of the shared terms and store them; later, we simply read back the values needed for each Ω_i.
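The sketch below illustrates the shared-term idea on a device function: the velocity monomials are evaluated once per node, and each Ω_i then reduces to a short weighted sum over the cached products. The indices and coefficients shown are placeholders; the actual weights ω(ρ) of the ACM-MRT model are given in the supplementary document.

__device__ void assembleOmega(const float ft[27],  // f~_i = [DM(f - f^eq)]_i
                              float ux, float uy, float uz,
                              float omega[27])
{
    // Cache the shared monomials u_a^p (p in {0, 1, 2}) once per node.
    float pw[3][3] = {{1.f, ux, ux * ux},
                      {1.f, uy, uy * uy},
                      {1.f, uz, uz * uz}};

    // Each Omega_i is then a short sum of cached products with constant
    // weights; placeholder indices and coefficients are shown for two terms.
    float t0 = ft[1] * pw[1][1];              // ~ f~ * u_y        (placeholder)
    float t1 = ft[2] * pw[0][1] * pw[2][1];   // ~ f~ * u_x * u_z  (placeholder)
    omega[0] = t0 / 16.f - t1 / 12.f;         // placeholder weights
    // ... remaining omega[i] assembled from the same cached monomials ...
}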
Based on our experiments for the test case shown in Fig. 9, using a grid resolution of … × … × … with 45,127 solid samples, collision performance is improved by around 35% using a combination of multiple kernel launches and simplified collision terms, while arithmetic expression simplification alone improves performance by 6%.

In the preceding subsections, we described a number of different GPU optimizations for improving the performance of our kinetic solver. While the optimizations may appear to be largely independent, they are in fact interconnected. In particular, the solver's performance is strongly affected by the memory layout, which in turn is determined by the parameters ℓ, giving the size of blocks that solid samples are grouped into, and α, giving the granularity of the CSoA layout for fluids. We therefore combine these parameters into a metric H_t(ℓ, α) measuring the cost to simulate a time step starting from time t. Note that H_t(ℓ, α) may vary for different t, as the immersed solid may lie at different orientations over time, leading to different costs to implement the IB method.
Fig. 10.
Performance variations from changing model parameters for different solid geometries. Given a simple geometry such as the ball in (a), the simulation cost does not change much as we vary the sample block size ℓ and the CSoA parameter α within a specific range (ℓ ∈ [2, …], α ≥ …). However, for more complex geometries such as the dragon, the cost can vary significantly based on (b) the degree of concavity in the geometry compared to (a), and (c) the number of solid samples compared with (b) ((c) contains around 10 times more solid samples than (b)). This indicates that searching for the optimal ℓ and α is important for complex geometries containing a large number of solid samples.

To obtain more stable measurements, we consider measuring the average cost over a time interval [t_s, t_e]. We frame this as searching for optimal values of ℓ and α such that the following cost is minimized:

argmin_{ℓ,α} (1/N_{t_s→t_e}) Σ_{t=t_s}^{t_e} H_t(ℓ, α),   (11)

where N_{t_s→t_e} is the number of time steps between t_s and t_e. Note that we may use different numbers of time steps in order to obtain optimal parameters balancing different cases in a simulation. For static solids, N_{t_s→t_e} is typically small (e.g. N_{t_s→t_e} = 10), while for dynamic solids, N_{t_s→t_e} should be large enough to cover many different orientations of the solid. Since translation of the solid usually does not strongly affect the objective, we consider N_{t_s→t_e} only when rotating the solid object by an angle of π in 3D, leading to a value of around 500 for N_{t_s→t_e}. This minimization is performed once before the start of every simulation with a different setting for the grid resolution, geometry shape, etc.

To minimize Eq. (11), we enumerate all possible combinations of ℓ and α and search for the combination with minimal cost. Since ℓ is the side length of the subdivided blocks, it is upper bounded by the smallest edge length L_m of the solid's bounding box, and thus we search for ℓ ∈ {1, 2, . . . , L_m}. For the CSoA parameter α, we only consider powers of 2, since we want α to be divisible by or divisible into the thread warp size of 32. We thus search for α ∈ {2⁰, 2¹, . . . , 2^⌊log₂ N_f⌋}, where N_f is the total number of fluid nodes.

Fig. 10 shows the simulation cost in seconds for certain values of ℓ and α and for different types of geometries. We note that after minimizing Eq. (11), we can obtain the best parameters for the scenario we want to simulate. But if the scenario changes, e.g. if we use a different size for the fluid domain, a different solid geometry, or if the solid is sampled with a different number of points, we should perform the minimization again. The cost of the procedure is usually not excessive, e.g. 5 to 15 minutes at a grid resolution of … × … × …, compared to a total simulation time of several hours. As the simulator can sometimes be several times faster using a good parameter setting compared to a poor one, it is usually well worthwhile to perform optimal parameter search before starting a simulation.

Fig. 11. Comparison of streaming performance with (blue bar) and without (red bar) our optimizations.

Fig. 12. Comparison of performance for calculating macroscopic variables with (blue bar) and without (red bar) our optimizations.
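The search itself is a small exhaustive loop; in the sketch below, measureAverageStepTime is a hypothetical helper that reorganizes the layouts for a candidate (ℓ, α) and returns the average per-step cost over N time steps (Eq. (11)).

double bestCost = 1e30;
int bestL = 1, bestAlpha = 1;
for (int l = 1; l <= Lm; ++l) {                    // Lm: smallest bounding-box edge
    for (int alpha = 1; alpha <= Nf; alpha *= 2) { // powers of two up to N_f
        double cost = measureAverageStepTime(l, alpha, N);
        if (cost < bestCost) { bestCost = cost; bestL = l; bestAlpha = alpha; }
    }
}
// bestL and bestAlpha are then used to lay out solid samples and fluid data.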
We summarize our entire optimized kinetic solver in Alg. 2. In the rest of this section, we analyze the performance of different parts of the solver. We compared the performance of our optimized solver with the baseline solver described in Sec. 3.2, which is for the most part a faithful implementation of the algorithm described in [1], except that [1] did not use the immersed boundary method. To make our analysis as detailed as possible, we decomposed our solver into its main subroutines (cf. Sec. 3.2), consisting of streaming, macroscopic field calculation, the IB method, collision, and domain boundary treatment. We analyzed each of these components using key GPU performance metrics including thread occupancy, memory load/store efficiency, warp divergence and warp execution efficiency.

Figs. 11 to 15 show performance comparisons of different parts of our optimized solver and the baseline solver using the scene setup shown in Fig. 9. We see that our parameterized data layout significantly improves memory access efficiency. Occupancy was also improved several fold during the IB and collision computations, while warp execution efficiency was improved by a factor of three during the IB computation. The overall performance was improved by a factor of 5 to 10 compared to the straightforward baseline implementation.
Algorithm 2
Our GPU-optimized IB-based kinetic solver

// Find optimal parameters for ℓ and α.
SearchOptimalParameters(ℓ, α);                    ▷ Sec. 4.5
ReorganizeMemory(ℓ, α);                           ▷ Eq. (9) for arrays of vectors
while time step ≤ maximum time step do
    // (a): Perform parallel streaming.           ▷ Sec. 3.2
    parallel for each fluid node x_f do
        for each i ∈ [0, 26] of node x_f do
            j = RemapIndex(x_f, i);               ▷ Eq. (9)
            j_n = RemapIndex(x_f − c_i, i);       ▷ Eq. (9)
            f_t[j] = f_{t−1}[j_n];                ▷ Eq. (3)
        end for
    end for
    // (b): Compute macroscopic variables in parallel.  ▷ Sec. 3.2
    parallel for each fluid node x_f do
        UpdateDensityAndVelocity(x_f);            ▷ Eq. (2)
        ResetForce: g[x_f] = 0;
    end for
    // (c): Immersed boundary treatment.          ▷ Sec. 4.3
    parallel for each solid sample x_s do
        ComputePenaltyForce(x_s);                 ▷ Eq. (7)
    end for
    parallel for each solid sample x_s do
        SpreadPenaltyForce(x_s);                  ▷ Eq. (8) (atomic addition)
    end for
    ParallelTotalForceAndTorqueCalculation();
    ParallelSolidSampleUpdates();
    // (d): Perform parallel collision.           ▷ Sec. 4.4
    parallel for each fluid node x_f do
        Compute m[x_f] and m^eq[x_f];
    end for
    parallel for each fluid node x_f do
        Compute the first group of Ω_i[x_f] and add the corresponding G_i[x_f];   ▷ Eq. (5)
    end for
    parallel for each fluid node x_f do
        Compute the second group of Ω_i[x_f] and add the corresponding G_i[x_f];  ▷ Eq. (5)
    end for
    // (e): Process domain boundary in parallel.  ▷ Sec. 4.3.3
    ParallelLeftBoundaryTreatment();
    ParallelRightBoundaryTreatment();
    ParallelUpBoundaryTreatment();
    ParallelDownBoundaryTreatment();
    ParallelForwardBoundaryTreatment();
    ParallelBackBoundaryTreatment();
end while
Fig. 13. Comparison of immersed boundary computation performance with (blue bar) and without (red bar) our optimizations.
Fig. 14. Comparison of collision performance with (blue bar) and without (red bar) our optimizations.

Fig. 15. Comparison of domain boundary performance with (blue bar) and without (red bar) our optimizations.
The optimizations described in the previous sections were performed on a single GPU, limiting the resolution at which the kinetic solver can simulate. To support high-resolution scenarios for large-scale simulations, we modify the algorithm to run in parallel on multiple GPUs. Since the kinetic solver uses only local information from one-hop neighbors, we can also easily extend the optimization techniques described earlier to multi-GPU systems.

We consider a setting where m GPUs (in our case m = 4) are installed on a single cluster node (the multi-node case is more complex and we leave it for future work). We first subdivide the fluid nodes in the simulation domain into multiple regions along one dimension (e.g. the z-dimension), as shown in Fig. 16, while ensuring that each region contains roughly the same number of fluid nodes. For each pair of neighboring regions, we extend both regions by a one-cell-wide "ghost region" (marked in red in Fig. 16), holding data from the shared face (marked in blue) with its neighbor. Data is updated (copied) between neighboring GPUs after every simulation time step. This communication can be done using GPU-to-GPU memory transfer when supported by hardware. However, due to limitations of our current hardware and for the sake of implementation portability, our current implementation performs data transfer through the CPU. Note that this may substantially reduce performance, especially for large m, and that with faster inter-GPU communication mechanisms all the results described below are expected to improve. The optimizations described earlier can be performed independently within each region. When the immersed solid does not move, we can subdivide the solid samples into m groups based on the regions they lie in. However, when the solid is allowed to move, its sample points may reside in different regions over time. To deal with this situation, we store all solid samples on every GPU, thus avoiding the need to copy solid samples among GPUs, while incurring the cost of a slight increase in memory usage.

RESULTS AND DISCUSSIONS
We implemented the single-GPU version of our solver on an NVIDIA TITAN Xp GPU with 12 GB of memory, installed on an 8-core Intel Xeon E5-2650 workstation with 64 GB of memory. For the multi-GPU version of our algorithm, we used a system with 4 NVIDIA Tesla P40 GPUs per node, each having 24 GB of memory. The system also had 4 Intel Xeon E5-2620 CPUs (8 cores per CPU) with 384 GB of memory. Optimizing model parameters, as described in Sec. 4.5, usually took 2 to 5 minutes for flows with static solids and 5 to 15 minutes for flows with dynamic solids at a grid resolution of … × … × …. To render the simulated flows, we usually injected smoke particles, which were traced in parallel on the GPU. We then projected the particle densities onto a uniform volume grid, and used Mitsuba [104] to perform high-quality rendering. The specific parameter settings for the simulations shown in this paper and the related performance statistics are reported in Table 3.
Fig. 16.
Multi-GPU simulation.
The whole fluid domain is subdivided into m (m ≤ 4 in our case) regions of roughly equal size along the z-dimension. We extend the shared face between neighboring GPUs by one cell along the z-direction and copy data from the neighboring GPUs (the blue regions), with arrows indicating the direction of copying. n_x, n_y and n_z are the numbers of fluid grid nodes in the x-, y- and z-directions.

Fig. 17. Accuracy comparison for kinetic solvers using different collision models. Visualization of the velocity magnitude on a cross-section of a 3D velocity field for a jet flow impinging on a ball obstacle: (a) using the BGK model with the smallest stable viscosity (… × 10⁻…), (b) using the MRT model with the smallest stable viscosity (10⁻…), and (c) using the ACM-MRT model with the same viscosity as (b). The color bar shows the ranges of velocity magnitudes in LBE scale.

Recent GPU-optimized kinetic solvers have typically employed the BGK or RM-MRT [105] collision models. In this section, we compare the quality of simulations obtained using these models with that using the ACM-MRT model. We also compare the performance of these models.

To ensure fairness in the quality comparisons, we use the D3Q27 lattice for all three collision models. Fig. 17 shows velocity fields obtained from the D3Q27 BGK, RM-MRT and ACM-MRT collision models. Since the BGK model is not stable at high Reynolds numbers, we set its Reynolds number as high as possible while remaining stable. Despite this, we still obtain an unnaturally smooth result, as shown in Fig. 17 (a). The RM-MRT model can be stable at high Reynolds numbers but is very dispersive, as seen in Fig. 17 (b), and exhibits unnatural oscillation artifacts, as shown in the magnified inset.
Fig. 18.
Single GPU performance comparisons.
Red bars: the baseline kinetic solver (cf. Sec. 3.2). Green bars: the modified baseline kinetic solver using the CSoA data layout for fluid nodes. Blue bars: our GPU-optimized kinetic solver. The comparisons were done with simulations at different grid resolutions and with resolution-matched numbers of solid samples in order to maintain simulation accuracy.

Fig. 19.
Performance comparisons for different collision models.
We show timings for the collision step only, as the other parts of the different models employ the same optimizations and thus have nearly the same performance. At different resolutions, our GPU-optimized ACM-MRT model performs nearly as efficiently as the basic BGK model, but produces much more accurate results for turbulent flows.

Finally, the ACM-MRT model can support very high Reynolds numbers with very limited diffusion and dispersion, and produces the most visually appealing result, as shown in Fig. 17 (c).

When comparing performance, we first note that to our knowledge, there are no GPU-optimized kinetic solvers using the D3Q27 lattice structure, and also no GPU-based solvers using the immersed boundary method. For these reasons, it is difficult to find suitable existing works against which to compare the performance of our optimized solver. Instead, we have chosen to implement D3Q27 versions of the BGK and RM-MRT models using an optimized CSoA data layout to compare with our solver. We now give a detailed analysis showing the performance advantages of our solver in both single and multi-GPU settings for various resolutions of the simulation domain.
We first compare the performance of our solver on a single GPU with the baseline solver described in Sec. 3.2, and a solver using an optimized CSoA data layout for fluid nodes but employing none of the other optimizations described in Sec. 4. Fig. 18 shows the performance of the three solvers at different grid resolutions. The baseline implementation suffers from uncoalesced memory accesses, and its cost increases significantly as resolution increases. The CSoA-based solver is able to coalesce memory accesses for fluid nodes but does not optimize the IB method, so it also shows limited performance as the resolution increases. Finally, our optimized solver shows increasing performance improvements relative to the other two solvers as resolution increases, and is over …× and …× faster than the CSoA and baseline solvers, respectively, at the highest resolution we tested.

We also present another comparison of the BGK, RM-MRT and ACM-MRT implementations in which we employed the same optimizations described in Sec. 4 for all three models, and only vary how the collision operation is performed.
Figures. Grid resolution No. of solid samples (cid:96) log α No. of GPUs Storage time / ∆ t total time steps total time costFig. 1 1200 × ×
840 3,688,157 2 28 4 58.4 GB 1.21 sec. 76,600 1544.8 min.Fig. 23 (a) 400 × ×
400 48,336 2 9 1 7.4 GB 0.56 sec. 20,000 186.7 min.Fig. 23 (b) 400 × ×
400 53,104 4 16 1 7.4 GB 0.58 sec. 25,600 247.5 min.Fig. 24 2240 × ×
480 1,971,941 2 29 4 79.5 GB 1.70 sec. 26,300 745.2 min.Fig. 25 2100 × ×
600 909,788 2 28 4 69.9 GB 1.43 sec. 41,500 989.1 min.
TABLE 3Parameter settings and performance statistics for different simulations shown in the paper. three models, and only vary how the collision operationis performed. The results are shown in Fig. 19. The plotshows that our optimized ACM-MRT implementation hasnearly the same computational efficiency as the significantlysimpler BGK model, while producing much more visuallyappealing outputs. In general, the cost of arithmetic cal-culations occupies a relatively small portion of the overallrunning time, indicating that efficient memory accesses playa key role in optimizing kinetic solvers on the GPU.
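To make the layout differences behind Fig. 18 concrete, the sketch below contrasts the address computations of three common layouts for the Q per-cell distributions. It is illustrative only; the CSoA form follows one common definition of the clustered structure-of-arrays layout (cf. [19]), with the cluster length K being an assumed tunable parameter.

    // AoS: all Q values of one cell are adjacent; a warp reading direction q
    // touches addresses Q floats apart, so accesses are uncoalesced.
    __device__ size_t idx_aos(size_t cell, int q) { return cell * Q + q; }

    // SoA: one array per direction; consecutive threads read consecutive
    // floats, giving fully coalesced accesses for fluid nodes.
    __device__ size_t idx_soa(size_t cell, int q, size_t n_cells) {
        return (size_t)q * n_cells + cell;
    }

    // CSoA: cells are grouped into clusters of length K, and the Q arrays are
    // interleaved per cluster, keeping a cell's directions within a small,
    // cache-friendly footprint while preserving coalescing within a cluster.
    __device__ size_t idx_csoa(size_t cell, int q, int K) {
        size_t cluster = cell / K, lane = cell % K;
        return (cluster * Q + (size_t)q) * K + lane;
    }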
In this section, we analyze the performance of our optimized kinetic solver running on different numbers of GPUs at the same grid resolution. As stated earlier, communication among GPUs is currently done by copying data through the CPU, and we expect that direct GPU-to-GPU communication can lead to substantial performance improvements. Fig. 20 (a) shows the change in runtime when we perform a simulation at a fixed grid resolution using 1 to 4 GPUs. We observed nearly perfect scaling when going from 1 to 2 GPUs. However, as the number of GPUs increases, the communication cost and serialization at the CPU start to dominate, leading to reduced scaling. To better understand the impact of communication on performance, we also performed a larger simulation on 2 to 4 GPUs, as shown in Fig. 20 (b). Since we subdivided the simulation domain along the z-axis, the amount of communication between neighboring GPUs doubled in the new simulation while the total amount of computation quadrupled. Thus, communication is expected to have a smaller impact on overall performance. This is indeed the observed behavior: scaling from 2 to 3 and 4 GPUs yields larger improvements in the larger-scale simulation than in the smaller-scale experiment (the exact factors are given in Fig. 20). Thus, we expect that the scaling behavior can be further improved as we perform simulations at even larger scales.

Fig. 20. Performance comparison for multi-GPU simulations. In (a), we use a fixed grid resolution and 1,579,468 solid samples on 1 to 4 GPUs. In (b), we increase the simulation size with the same number of solid samples, and run on 2 to 4 GPUs. Data copying has a larger impact on the smaller simulation, while the higher-resolution simulation shows improved scalability.
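The staged exchange described above amounts to two ordinary copies per shared face. The sketch below is illustrative only: d_f, the slice offsets, and the pinned staging buffer are assumed names, and cudaMemcpyPeer is noted as the direct alternative we expect to be faster.

    // One z-slice of distributions shared between GPUs g and g+1.
    size_t slice_bytes = (size_t)nx * ny * Q * sizeof(float);
    float* halo;                                   // pinned host staging buffer
    cudaMallocHost((void**)&halo, slice_bytes);

    // Route the top interior slice of GPU g through the CPU into the
    // bottom ghost slice of GPU g+1 (the reverse copy is symmetric).
    cudaSetDevice(g);
    cudaMemcpy(halo, d_f[g] + top_interior_offset, slice_bytes, cudaMemcpyDeviceToHost);
    cudaSetDevice(g + 1);
    cudaMemcpy(d_f[g + 1] + bottom_ghost_offset, halo, slice_bytes, cudaMemcpyHostToDevice);
    // With peer access enabled, a single cudaMemcpyPeer would remove the host round trip.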
To further highlight the advantages in performance and accuracy of our GPU-optimized kinetic solver, we compared it with the recently proposed advection-reflection MacCormack (MC+R) solver [7], which we implemented on the GPU using an SoA data layout and NVIDIA's optimized cuSPARSE library [106] for solving the sparse linear systems. We note that a newer method called "BiMocq" [8] can better preserve turbulence details and possibly offer enhanced performance compared to MC+R on the GPU. However, there is no publicly available full GPU implementation of the BiMocq method, making it difficult to compare against. In addition, based on the results from [8] at similar grid resolutions, the method does not produce higher visual quality compared to our kinetic method. Thus, in the following performance analysis we base our comparison against the GPU-based MC+R INSE solver.

To ensure accuracy, especially for turbulent flow simulations, the preconditioned conjugate gradient (PCG) algorithm [107] was used to solve the sparse linear systems with a sufficient number of iterations. At the resolution used in this comparison, we found that at least 300 iterations were needed for proper simulation of turbulent flows when balancing accuracy and efficiency. The performance of INSE solvers can be improved by using larger time step sizes while still producing visually plausible results, especially in scenarios such as real-time applications. However, we found that in turbulent flow simulations, larger time step sizes sacrifice temporal accuracy and lead to results which deviate substantially from the reference output. Below, we compare our simulation with those from the MC+R INSE solver at different time step sizes.

Fig. 21 shows a visual comparison between our kinetic solver and the MC+R INSE solver using different time step sizes. Clearly, using a time step size which is many times (Fig. 21 (a)) or 10 times (Fig. 21 (b)) larger than that used in our kinetic solver (Fig. 21 (d)) can improve the INSE solver's performance to a degree that it becomes comparable to ours. However, these step sizes also produce flow patterns which deviate substantially from the reference and lose large amounts of turbulence details. The INSE simulation in Fig. 21 (c) uses the same time step size as ours and produces similar flow patterns, which indicates the accuracy of our solver. However, our solver has far superior performance, as demonstrated in Fig. 22 for different grid resolutions and with different time step sizes.

We note that the performance of INSE solvers can be further improved by techniques such as multi-grid methods. Based on an existing report [78], multi-grid methods can typically accelerate INSE solvers by 10 to 20 times on a multi-core CPU, and can be expected to run even faster on a GPU. However, a multi-grid INSE solver is still less scalable over multiple GPUs than our kinetic solver, and thus less efficient when the grid resolution is very high. In addition, our solver converges faster than many existing INSE solvers used in the graphics domain (as shown in, e.g., [2]), implying that we can use lower grid resolutions while still producing visually similar results. Fig. 21 (e) shows a simulation with our solver using only half the resolution in each dimension (and thus 8× fewer grid nodes) while producing patterns very similar to the higher-resolution results in (c) and (d). Note that at the lower grid resolution, our solver is even faster than the MC+R INSE solver using a much larger time step size, demonstrating that our solver can both run faster and produce higher-quality results than state-of-the-art INSE solvers. Lastly, our solver exhibits better scalability than GPU-based INSE solvers, enabling fast simulations at high grid resolutions.
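To clarify where the INSE solver spends its time, the sketch below gives the per-iteration structure of the Jacobi-preconditioned CG loop used for the pressure solve in our comparison; apply_laplacian, dot, axpy, jacobi_precondition and update_direction stand for simple custom kernels with illustrative names, not a specific library API.

    // r = b - A x, z = M^{-1} r, and p = z are computed before the loop.
    float rz = dot(r, z, n);
    for (int it = 0; it < max_iters; ++it) {    // >= 300 iterations in our turbulence tests
        apply_laplacian(Ap, p, n);              // Ap = A p: sparse matvec over the whole grid
        float alpha = rz / dot(p, Ap, n);       // global reduction across all cells
        axpy( alpha, p,  x, n);                 // x += alpha * p
        axpy(-alpha, Ap, r, n);                 // r -= alpha * Ap
        if (dot(r, r, n) < tol * tol) break;    // another global reduction per iteration
        jacobi_precondition(z, r, n);           // z = M^{-1} r (diagonal scaling)
        float rz_new = dot(r, z, n);
        update_direction(p, z, rz_new / rz, n); // p = z + beta * p
        rz = rz_new;
    }

Each iteration couples the entire grid through reductions and a matvec, which is exactly the nonlocal work that a kinetic solver avoids by construction.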
Fig. 21. Visual comparison of fluid solvers at different settings. (a)-(c) MC+R INSE solver with time step sizes (a) and (b) larger than that of our solver (10 times larger in (b)), and (c) equal to that of our solver. (d) Output of our kinetic solver using the ACM-MRT model. (e) Output of our kinetic solver using a coarser grid resolution (half the resolution in each dimension). For the simulations in (a) to (d), the grid resolution is the same (200 × × ), and the MC+R runs in (a) to (c) are respectively slower than our solver (cf. Fig. 22). In addition, due to its higher convergence rate, our kinetic solver operating at a lower grid resolution in (e) still produces smoke patterns similar to those shown in (d), but runs 6 times faster than (d) and 300 times faster than (c). These results demonstrate the strength of our kinetic solver in both performance and visual quality.

Fig. 22. Performance comparison with the MC+R INSE solver using different time step sizes. We show the timings (for simulating the same one physical second) of a GPU-based MC+R INSE solver with varying time step sizes and of our GPU-optimized kinetic solver at different grid resolutions. Due to the large variations in timing, we plot the graph on a log scale. Note the high scalability of our kinetic solver.

Using our high-performance kinetic solver, we can efficiently generate a variety of fluid simulation results. Fig. 23 shows two flow simulations: colliding vortices around a ball obstacle immersed in the center of the domain, and a rotating torus disturbing the surrounding smoke. To enable high-resolution simulations, we ran our solver on the 4-GPU system. Figs. 1, 24 and 25 show high-resolution simulations involving complex solid objects (both static and dynamic), computed on the multi-GPU system within a relatively short amount of time and displaying detailed and realistic-looking turbulence.
Fig. 23. Smoke simulations at a normal resolution. (a) Simulation of vortices colliding around a ball; (b) simulation of a torus rotating in the air. Both simulations use a grid resolution of 400 × × 400 (cf. Table 3).

The optimal values of the model parameters ℓ and α may change depending on the shape of the immersed solid as well as the number of solid samples within each fluid cell. For simple geometries such as a ball or a torus, the cost for different ℓ and α values is similar, as shown in Fig. 10 (a): the optimal ℓ lies in a small range starting at 2, the optimal α satisfies α ≥ 64 = 2^6, and picking any ℓ and α in this range leads to acceptable performance. However, for complex geometries with high degrees of concavity, the cost surface exhibits significant variations with a number of local peaks, as shown in Fig. 10 (b). If such a geometry also contains a large number of solid samples, the cost surface becomes even more unpredictable, as shown in Fig. 10 (c). Thus, conducting an automated search for optimal parameter values before the start of a simulation is very useful in complex simulation scenarios.
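A brute-force version of this search is easy to state. The sketch below assumes a hypothetical helper time_step_ms(l, alpha) that configures the layout, runs a few warm-up simulation steps, and returns the average step time measured with CUDA events; in our solver, the parametric cost model of Sec. 4.5 guides this search so that it completes within the few-minute budgets reported above.

    #include <cfloat>

    int   best_l = 0, best_log_alpha = 0;
    float best_ms = FLT_MAX;
    for (int l = 1; l <= l_max; ++l) {                 // candidate values of the layout parameter l
        for (int la = 0; la <= log_alpha_max; ++la) {  // candidate alpha = 2^la
            float ms = time_step_ms(l, 1 << la);       // measure a few warm-up steps
            if (ms < best_ms) { best_ms = ms; best_l = l; best_log_alpha = la; }
        }
    }
    // Run the full simulation with (best_l, best_log_alpha).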
Fig. 24. High-resolution simulation of airflow around a car. We employed 4 GPUs to simulate the airflow over a car at a high resolution, displaying realistic wake turbulence details behind the car.

Fig. 25. High-resolution simulation of a moving airplane. We employed 4 GPUs to simulate an airplane moving through high-resolution smoke and generating detailed wake turbulence.

While our kinetic solver is in general much faster than INSE solvers and produces equal or superior visual results, one of its drawbacks is the need for larger amounts of memory, typically 3 times more than the MC+R INSE solver at the same grid resolution. In addition, in order to increase GPU occupancy, we employed multiple kernel launches, which further increases the memory requirement. In situations where memory is limited, we can eliminate multiple kernel launches to save memory, but at the cost of lower occupancy and reduced simulation efficiency. However, we note that even in these cases, our solver is still much faster than GPU-based INSE solvers. Furthermore, for high-resolution simulations with dynamic solids, we currently store all solid samples on each GPU to avoid communication, which again leads to a moderate increase in memory usage. Finally, our GPU optimizations are currently not suitable for liquid simulations with kinetic solvers. These simulations would need modifications to the kinetic algorithm, such as the use of a volume-of-fluid method, and would thus require revising our memory layout. These limitations point toward further research on more versatile and memory-efficient GPU-based kinetic solvers.
CONCLUSION
In this paper, we proposed GPU optimization techniques to significantly improve the performance of an IB-based kinetic solver using the recently developed ACM-MRT model. Our solver is faster than all other currently available fluid simulation methods, especially for turbulent flow simulations over complex solid geometries. It also produces equal or superior visual quality compared to state-of-the-art INSE solvers. Our optimizations are based on a parametric data layout which considers both fluid nodes and solid samples to improve memory coalescing and bandwidth use. We also proposed a GPU-optimized parallel algorithm for the immersed boundary method to handle boundary conditions, which significantly reduces load imbalance and thread divergence. We used multiple kernel launches with a simplified formulation of the ACM-MRT collision model to improve GPU occupancy. Finally, we combined these optimizations into an integrated cost model to automatically search for optimal parameter settings to obtain the best performance. We conducted comprehensive comparisons against a range of state-of-the-art solution methods and solvers to validate the high efficiency and visual fidelity of our method. In the future, we will further optimize our solver, including improving its performance on multi-GPU systems and scaling to larger problem sizes.

ACKNOWLEDGMENTS
We thank all the reviewers for their constructive comments. We also thank Yihui Ma, Chenqi Luo and Chaoyang Lyu from the FLARE Lab at ShanghaiTech University for helping with video preparations. We are grateful to YOKE Intelligence for providing the 3D reconstruction in Fig. 1 for our large-scale simulation. This work was supported by the startup fund of ShanghaiTech University and the National Natural Science Foundation of China (Grant No. 61976138).

REFERENCES

[1] W. Li, K. Bai, and X. Liu, "Continuous-scale kinetic fluid simulation," IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 9, pp. 2694-2709, Sep. 2019.
[2] W. Li, Y. Chen, M. Desbrun, C. Zheng, and X. Liu, "Fast and scalable turbulent flow simulation with two-way coupling," ACM Transactions on Graphics (SIGGRAPH 2020), vol. 39, no. 4, 2020.
[3] P. Mullen, K. Crane, D. Pavlov, Y. Tong, and M. Desbrun, "Energy-preserving integrators for fluid animation," in ACM Transactions on Graphics (SIGGRAPH 2009), vol. 28, no. 3. ACM, 2009, p. 38.
[4] B. Zhu, W. Lu, M. Cong, B. Kim, and R. Fedkiw, "A new grid structure for domain extension," ACM Transactions on Graphics (SIGGRAPH 2013), vol. 32, no. 4, p. 63, 2013.
[5] M. Ihmsen, J. Orthmann, B. Solenthaler, A. Kolb, and M. Teschner, "SPH fluids in computer graphics," 2014.
[6] C. Fu, Q. Guo, T. Gast, C. Jiang, and J. Teran, "A polynomial particle-in-cell method," ACM Transactions on Graphics, vol. 36, no. 6, p. 222, 2017.
[7] J. Zehnder, R. Narain, and B. Thomaszewski, "An advection-reflection solver for detail-preserving fluid simulation," ACM Transactions on Graphics (SIGGRAPH 2018), vol. 37, no. 4, p. 85, 2018.
[8] Z. Qu, X. Zhang, M. Gao, C. Jiang, and B. Chen, "Efficient and conservative fluids using bidirectional mapping," ACM Transactions on Graphics (SIGGRAPH 2019), vol. 38, no. 4, p. 128, 2019.
[9] X. Shan, X.-F. Yuan, and H. Chen, "Kinetic theory representation of hydrodynamics: A way beyond the Navier-Stokes equation," Journal of Fluid Mechanics, vol. 550, pp. 413-441, 2006.
[10] Y. Guo, X. Liu, and X. Xu, "A unified detail-preserving liquid simulation by two-phase lattice Boltzmann modeling," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 5, pp. 1479-1491, May 2017.
[11] X. Wei, Y. Zhao, Z. Fan, W. Li, F. Qiu, S. Yoakum-Stover, and A. E. Kaufman, "Lattice-based flow field modeling," IEEE Transactions on Visualization and Computer Graphics, vol. 10, no. 6, pp. 719-729, 2004.
[12] N. Thuerey, K. Iglberger, and U. Ruede, "Free surface flows with moving and deforming objects for LBM," Proceedings of Vision, Modeling and Visualization 2006, pp. 193-200, Nov. 2006.
[13] N. Thuerey and U. Ruede, "Stable free surface flows with the lattice Boltzmann method on adaptively coarsened grids," Computing and Visualization in Science, vol. 12 (5), 2009.
[14] S. Chen and G. D. Doolen, "Lattice Boltzmann method for fluid flows," Annual Review of Fluid Mechanics, vol. 30, no. 1, pp. 329-364, 1998.
[15] D. d'Humières, "Multiple-relaxation-time lattice Boltzmann models in three dimensions," Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 360, no. 1792, pp. 437-451, 2002.
[16] X. Liu, W.-M. Pang, J. Qin, and C.-W. Fu, "Turbulence simulation by adaptive multi-relaxation lattice Boltzmann modeling," IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 2, pp. 289-302, 2012.
[17] C. S. Peskin, "Flow patterns around heart valves: A numerical method," Journal of Computational Physics, vol. 10, no. 2, pp. 252-271, 1972.
[18] G. Herschlag, S. Lee, J. S. Vetter, and A. Randles, "GPU data access on complex geometries for D3Q19 lattice Boltzmann method," in 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2018, pp. 825-834.
[19] E. Calore, A. Gabbana, S. F. Schifano, and R. Tripiccione, "Optimization of lattice Boltzmann simulations on heterogeneous computers," The International Journal of High Performance Computing Applications, vol. 33, no. 1, pp. 124-139, 2019.
[20] J. Tölke and M. Krafczyk, "Teraflop computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, pp. 443-456, 2008.
[21] S. Chen and G. D. Doolen, "Lattice Boltzmann method for fluid flows," Annual Review of Fluid Mechanics, vol. 30, no. 1, pp. 329-364, 1998.
[22] J. Stam, "Stable fluids," in SIGGRAPH, vol. 99, 1999, pp. 121-128.
[23] B. Kim, Y. Liu, I. Llamas, and J. R. Rossignac, "FlowFixer: Using BFECC for fluid simulation," Georgia Institute of Technology, Tech. Rep., 2005.
[24] R. Wang, H. Feng, and R. J. Spiteri, "Observations on the fifth-order WENO method with non-uniform meshes," Applied Mathematics and Computation, vol. 196, no. 1, pp. 433-447, 2008.
[25] A. Selle, R. Fedkiw, B. Kim, Y. Liu, and J. Rossignac, "An unconditionally stable MacCormack method," Journal of Scientific Computing, vol. 35, no. 2-3, pp. 350-371, 2008.
[26] Q. Cui, P. Sen, and T. Kim, "Scalable Laplacian eigenfluids," ACM Transactions on Graphics (SIGGRAPH 2018), vol. 37, no. 4, p. 87, 2018.
[27] R. Fedkiw, J. Stam, and H. W. Jensen, "Visual simulation of smoke," in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 2001).
[28] S. I. Park and M. J. Kim, "Vortex fluid for gaseous phenomena," in Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. ACM, 2005, pp. 261-270.
[29] S. Weißmann and U. Pinkall, "Filament-based smoke with vortex shedding and variational reconnection," in ACM Transactions on Graphics (SIGGRAPH 2010), vol. 29, no. 4. Citeseer, 2010, p. 115.
[30] T. Pfaff, N. Thuerey, J. Cohen, S. Tariq, and M. Gross, "Scalable fluid simulation using anisotropic turbulence particles," in ACM Transactions on Graphics (SIGGRAPH Asia 2010), vol. 29, no. 6. ACM, 2010, p. 174.
[31] T. Brochu, T. Keeler, and R. Bridson, "Linear-time smoke animation with vortex sheet meshes," in Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 2012, pp. 87-95.
[32] A. Golas, R. Narain, J. Sewall, P. Krajcevski, P. Dubey, and M. Lin, "Large-scale fluid simulation using velocity-vorticity domain decomposition," ACM Transactions on Graphics (SIGGRAPH Asia 2012), vol. 31, no. 6, p. 148, 2012.
[33] X. Zhang and R. Bridson, "A PPPM fast summation method for fluids and beyond," ACM Transactions on Graphics (SIGGRAPH 2014), vol. 33, no. 6, p. 206, 2014.
[34] X. Zhang, R. Bridson, and C. Greif, "Restoring the missing vorticity in advection-projection fluid solvers," ACM Transactions on Graphics (SIGGRAPH 2015), vol. 34, no. 4, pp. 52:1-52:8, Jul. 2015.
[35] R. Bridson, J. Houriham, and M. Nordenstam, "Curl-noise for procedural fluid flow," in ACM Transactions on Graphics (SIGGRAPH 2007), vol. 26, no. 3. ACM, 2007, p. 46.
[36] T. Kim, N. Thürey, D. James, and M. Gross, "Wavelet turbulence for fluid simulation," in ACM Transactions on Graphics (SIGGRAPH 2008), vol. 27, no. 3. ACM, 2008, p. 50.
[37] S. Jeong, B. Solenthaler, M. Pollefeys, M. Gross et al., "Data-driven fluid simulations using regression forests," ACM Transactions on Graphics (SIGGRAPH 2015), vol. 34, no. 6, p. 199, 2015.
[38] M. Chu and N. Thuerey, "Data-driven synthesis of smoke flows with CNN-based feature descriptors," ACM Transactions on Graphics (SIGGRAPH 2017), vol. 36, no. 4, p. 69, 2017.
[39] Y. Xie, E. Franz, M. Chu, and N. Thuerey, "TempoGAN: A temporally coherent, volumetric GAN for super-resolution fluid flow," ACM Transactions on Graphics (SIGGRAPH 2018), vol. 37, no. 4, p. 95, 2018.
[40] K. Bai, W. Li, M. Desbrun, and X. Liu, "Dynamic upsampling of smoke through dictionary-based learning," ACM Transactions on Graphics, vol. 40, no. 1, Sep. 2020. [Online]. Available: https://doi.org/10.1145/3412360
[41] F. Losasso, F. Gibou, and R. Fedkiw, "Simulating water and smoke with an octree data structure," in ACM Transactions on Graphics (SIGGRAPH 2004), vol. 23, no. 3. ACM, 2004, pp. 457-462.
[42] R. Setaluri, M. Aanjaneya, S. Bauer, and E. Sifakis, "SPGrid: A sparse paged grid structure applied to adaptive smoke simulation," ACM Transactions on Graphics (SIGGRAPH 2014), vol. 33, no. 6, p. 205, 2014.
[43] P. Clausen, M. Wicke, J. R. Shewchuk, and J. F. O'Brien, "Simulating liquids and solid-liquid interactions with Lagrangian meshes," ACM Transactions on Graphics, vol. 32, no. 2, p. 17, 2013.
[44] R. Ando, N. Thürey, and C. Wojtan, "Highly adaptive liquid simulations on tetrahedral meshes," ACM Transactions on Graphics (SIGGRAPH 2013), vol. 32, no. 4, p. 103, 2013.
[45] M. Becker and M. Teschner, "Weakly compressible SPH for free surface flows," in Proceedings of the 2007 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 2007, pp. 209-217.
[46] B. Solenthaler and R. Pajarola, "Predictive-corrective incompressible SPH," in ACM Transactions on Graphics (SIGGRAPH Asia 2009), vol. 28, no. 3. ACM, 2009, p. 40.
[47] N. Akinci, M. Ihmsen, G. Akinci, B. Solenthaler, and M. Teschner, "Versatile rigid-fluid coupling for incompressible SPH," ACM Transactions on Graphics (SIGGRAPH 2012), vol. 31, no. 4, p. 62, 2012.
[48] R. Winchenbach, H. Hochstetter, and A. Kolb, "Infinite continuous adaptivity for incompressible SPH," ACM Transactions on Graphics (SIGGRAPH 2017), vol. 36, no. 4, p. 102, 2017.
[49] F. de Goes, C. Wallez, J. Huang, D. Pavlov, and M. Desbrun, "Power particles: An incompressible fluid solver based on power diagrams," ACM Transactions on Graphics (SIGGRAPH 2015), vol. 34, no. 4, pp. 50-1, 2015.
[50] J. Bender and D. Koschier, "Divergence-free SPH for incompressible and viscous fluids," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 3, pp. 1193-1206, 2016.
[51] S. Band, C. Gissler, and M. Teschner, "Moving least squares boundaries for SPH fluids," in Proceedings of the 13th Workshop on Virtual Reality Interactions and Physical Simulations. Eurographics Association, 2017, pp. 21-28.
[52] S. Band, C. Gissler, M. Ihmsen, J. Cornelis, A. Peer, and M. Teschner, "Pressure boundaries for implicit incompressible SPH," ACM Transactions on Graphics (SIGGRAPH 2018), vol. 37, no. 2, pp. 14:1-14:11, Feb. 2018. [Online]. Available: http://doi.acm.org/10.1145/3180486
[53] C. Jiang, C. Schroeder, A. Selle, J. Teran, and A. Stomakhin, "The affine particle-in-cell method," ACM Transactions on Graphics (SIGGRAPH 2015), vol. 34, no. 4, p. 51, 2015.
[54] X. Zhang, M. Li, and R. Bridson, "Resolving fluid boundary layers with particle strength exchange and weak adaptivity," ACM Transactions on Graphics (SIGGRAPH 2016), vol. 35, no. 4, p. 76, 2016.
[55] Y. Zhu and R. Bridson, "Animating sand as a fluid," in ACM Transactions on Graphics (SIGGRAPH 2005), vol. 24, no. 3. ACM, 2005, pp. 965-972.
[56] M. Geier, A. Greiner, and J. G. Korvink, "Cascaded digital lattice Boltzmann automata for high Reynolds number flow," Physical Review E, vol. 73, no. 6, p. 066705, 2006.
[57] D. Lycett-Brown, K. H. Luo, R. Liu, and P. Lv, "Binary droplet collision simulations by a multiphase cascaded lattice Boltzmann method," Physics of Fluids, vol. 26, no. 2, p. 023303, 2014.
[58] M. Geier, A. Greiner, and J. G. Korvink, "A factorized central moment lattice Boltzmann method," The European Physical Journal Special Topics, vol. 171, no. 1, pp. 55-61, 2009.
[59] X. Shan et al., "Central-moment-based Galilean-invariant multiple-relaxation-time collision model," Physical Review E, vol. 100, no. 4, p. 043308, 2019.
[60] A. De Rosis, "Nonorthogonal central-moments-based lattice Boltzmann scheme in three dimensions," Physical Review E, vol. 95, no. 1, p. 013310, 2017.
[61] M. Rohde, D. Kandhai, J. Derksen, and H. Van den Akker, "Improved bounce-back methods for no-slip walls in lattice-Boltzmann schemes: Theory and simulations," Physical Review E, vol. 67, no. 6, p. 066703, 2003.
[62] M. Sbragaglia and S. Succi, "Analytical calculation of slip flow in lattice Boltzmann models with kinetic boundary conditions," Physics of Fluids, vol. 17, no. 9, p. 093602, 2005.
[63] J. C. Verschaeve and B. Müller, "A curved no-slip boundary condition for the lattice Boltzmann method," Journal of Computational Physics, vol. 229, no. 19, pp. 6781-6803, 2010.
[64] R. Mei, L.-S. Luo, and W. Shyy, "An accurate curved boundary treatment in the lattice Boltzmann method," Journal of Computational Physics, vol. 155, no. 2, pp. 307-330, 1999.
[65] Z.-G. Feng and E. E. Michaelides, "The immersed boundary-lattice Boltzmann method for solving fluid-particles interaction problems," Journal of Computational Physics, vol. 195, no. 2, pp. 602-628, 2004.
[66] J. Wu and C. Shu, "An improved immersed boundary-lattice Boltzmann method for simulating three-dimensional incompressible flows," Journal of Computational Physics, vol. 229, no. 13, pp. 5022-5042, 2010.
[67] Y. Sato, T. Hino, and K. Ohashi, "Parallelization of an unstructured Navier-Stokes solver using a multi-color ordering method for OpenMP," Computers & Fluids, vol. 88, pp. 496-509, 2013.
[68] X. Guo, M. Lange, G. Gorman, L. Mitchell, and M. Weiland, "Developing a scalable hybrid MPI/OpenMP unstructured finite element model," Computers & Fluids, vol. 110, pp. 227-234, 2015.
[69] O. Mashayekhi, C. Shah, H. Qu, A. Lim, and P. Levis, "Automatically distributing Eulerian and hybrid fluid simulations in the cloud," ACM Transactions on Graphics (SIGGRAPH 2018), vol. 37, no. 2, p. 24, 2018.
[70] T. Brandvik and G. Pullan, "Acceleration of a 3D Euler solver using commodity graphics hardware," 2008, p. 607.
[71] Y. Liu, X. Liu, and E. Wu, "Real-time 3D fluid simulation on GPU with complex obstacles," IEEE, 2004, pp. 247-256.
[72] J. Krüger and R. Westermann, "Linear algebra operators for GPU implementation of numerical algorithms," in ACM Transactions on Graphics (SIGGRAPH 2003), vol. 22, no. 3. ACM, 2003, pp. 908-916.
[73] "CUDA," parallel computing platform. [Online]. Available: https://developer.nvidia.com/cuda-zone
[74] "cuSPARSE," sparse linear algebra. [Online]. Available: https://developer.nvidia.com/cusparse
[75] J. Thibault and I. Senocak, "CUDA implementation of a Navier-Stokes solver on multi-GPU desktop platforms for incompressible flows," 2009, p. 758.
[76] M. Griebel and P. Zaspel, "A multi-GPU accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations," Computer Science-Research and Development, vol. 25, no. 1-2, pp. 65-73, 2010.
[77] R. Henniger, D. Obrist, and L. Kleiser, "High-order accurate solution of the incompressible Navier-Stokes equations on massively parallel computers," Journal of Computational Physics, vol. 229, no. 10, pp. 3543-3572, 2010.
[78] A. McAdams, E. Sifakis, and J. Teran, "A parallel multigrid Poisson solver for fluids simulation on large grids," ser. SCA '10. Goslar, DEU: Eurographics Association, 2010, pp. 65-74.
[79] M. Gao, X. Wang, K. Wu, A. Pradhana, E. Sifakis, C. Yuksel, and C. Jiang, "GPU optimization of material point methods," in ACM Transactions on Graphics (SIGGRAPH Asia 2018). ACM, 2018, p. 254.
[80] K. Wu, N. Truong, C. Yuksel, and R. Hoetzlein, "Fast fluid simulations with sparse volumes on the GPU," in Computer Graphics Forum, vol. 37, no. 2. Wiley Online Library, 2018, pp. 157-167.
[81] Y. Hu, T.-M. Li, L. Anderson, J. Ragan-Kelley, and F. Durand, "Taichi: A language for high-performance computation on spatially sparse data structures," ACM Transactions on Graphics (SIGGRAPH 2019), vol. 38, no. 6, pp. 1-16, 2019.
[82] G. Alfonsi, S. A. Ciliberti, M. Mancini, and L. Primavera, "Performances of Navier-Stokes solver on a hybrid CPU/GPU computing system," in International Conference on Parallel Computing Technologies. Springer, 2011, pp. 404-416.
[83] Y. Wang, M. Baboulin, K. Rupp, O. Le Maître, and Y. Fraigneau, "Solving 3D incompressible Navier-Stokes equations on hybrid CPU/GPU systems," in Proceedings of the High Performance Computing Symposium. Society for Computer Simulation International, 2014, p. 12.
[84] S. Posey, "Considerations for GPU acceleration of parallel CFD," Procedia Engineering, vol. 61, pp. 388-391, 2013.
[85] W. Li, X. Wei, and A. Kaufman, "Implementing lattice Boltzmann computation on graphics hardware," The Visual Computer, vol. 19, no. 7-8, pp. 444-456, 2003.
[86] Y. Zhao, F. Qiu, Z. Fan, and A. Kaufman, "Flow simulation with locally-refined LBM," in Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games. ACM, 2007, pp. 181-188.
[87] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers & Fluids, vol. 35, no. 8-9, pp. 910-919, 2006.
[88] K. Mattila, J. Hyväluoma, T. Rossi, M. Aspnäs, and J. Westerholm, "An efficient swap algorithm for the lattice Boltzmann method," Computer Physics Communications, vol. 176, no. 3, pp. 200-210, 2007.
[89] N. Delbosc, J. L. Summers, A. Khan, N. Kapur, and C. J. Noakes, "Optimized implementation of the lattice Boltzmann method on a graphics processing unit towards real-time fluid simulation," Computers & Mathematics with Applications, vol. 67, no. 2, pp. 462-475, 2014.
[90] J. Tölke, "Implementation of a lattice Boltzmann kernel using the compute unified device architecture developed by NVIDIA," Computing and Visualization in Science, vol. 13, no. 1, p. 29, 2010.
[91] P. Bailey, J. Myre, S. D. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," IEEE, 2009, pp. 550-557.
[92] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on NVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5, pp. 266-272, 2011.
[93] M. Bernaschi, M. Fatica, S. Melchionna, S. Succi, and E. Kaxiras, "A flexible high-performance lattice Boltzmann GPU code for the simulations of fluid flows in complex geometries," Concurrency and Computation: Practice and Experience, vol. 22, no. 1, pp. 1-14, 2010.
[94] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "Multi-GPU implementation of the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 65, no. 2, pp. 252-261, 2013.
[95] J. Myre, S. D. Walsh, D. Lilja, and M. O. Saar, "Performance analysis of single-phase, multiphase, and multicomponent lattice-Boltzmann fluid flow simulations on GPU clusters," Concurrency and Computation: Practice and Experience, vol. 23, no. 4, pp. 332-350, 2011.
[96] S. Harris, An Introduction to the Theory of the Boltzmann Equation. Courier Corporation, 2004.
[97] S. Girimaji, "Lattice Boltzmann method: Fundamentals and engineering applications with computer codes," 2012.
[98] C. Yuksel, "Sample elimination for generating Poisson disk sample sets," in Computer Graphics Forum, vol. 34, no. 2. Wiley Online Library, 2015, pp. 25-32.
[99] T. Krüger, F. Varnik, and D. Raabe, "Efficient and accurate simulations of deformable particles immersed in a fluid using a combined immersed boundary lattice Boltzmann finite element method," Computers & Mathematics with Applications, vol. 61, no. 12, pp. 3485-3505, 2011.
[100] K. Mattila, J. Hyväluoma, J. Timonen, and T. Rossi, "Comparison of implementations of the lattice-Boltzmann method," Computers & Mathematics with Applications, vol. 55, no. 7, pp. 1514-1524, 2008.
[101] P. Valero-Lara, F. D. Igual, M. Prieto-Matías, A. Pinelli, and J. Favier, "Accelerating fluid-solid simulations (lattice-Boltzmann & immersed-boundary) on heterogeneous architectures," Journal of Computational Science, vol. 10, pp. 249-261, 2015.
[102] G. M. Morton, "A computer oriented geodetic data base and a new technique in file sequencing," 1966.
[103] A. Heron and J. Adam, "Particle code optimization on vector computers," Journal of Computational Physics.
[104] W. Jakob, "Mitsuba renderer," 2010. [Online]. Available: http://www.mitsuba-renderer.org
[105] K. Suga, Y. Kuwata, K. Takashima, and R. Chikasue, "A D3Q27 multiple-relaxation-time lattice Boltzmann method for turbulent flows," Computers & Mathematics with Applications, vol. 69, no. 6, pp. 518-529, 2015.
[106] M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi, "cuSPARSE Library," in GPU Technology Conference, 2010.
[107] R. Helfenstein and J. Koko, "Parallel preconditioned conjugate gradient algorithm on GPU,"