lbmpy: Automatic code generation for efficient parallel lattice Boltzmann methods
Martin Bauer
Chair for System Simulation, FAU Erlangen-Nürnberg
[email protected]

Harald Köstler
Chair for System Simulation, FAU Erlangen-Nürnberg
[email protected]

Ulrich Rüde
Chair for System Simulation, FAU Erlangen-Nürnberg
CERFACS, 31057 Toulouse Cedex 1, France
[email protected]
Abstract—Lattice Boltzmann methods are a popular mesoscopic alternative to classical computational fluid dynamics based on the macroscopic equations of continuum mechanics. Many variants of lattice Boltzmann methods have been developed that vary in complexity, accuracy, and computational cost. Extensions are available to simulate multi-phase, multi-component, turbulent, and non-Newtonian flows. In this work we present lbmpy, a code generation package that supports a wide variety of different lattice Boltzmann methods. Additionally, lbmpy provides a generic development environment for new schemes. A high-level domain-specific language allows the user to formulate, extend, and test various lattice Boltzmann methods. In all cases, the lattice Boltzmann method can be specified in symbolic form. Transformations that operate on this symbolic representation yield highly efficient compute kernels. This is achieved by automatically parallelizing the methods, and by various application-specific automated steps that optimize the resulting code. This pipeline of transformations can be applied to a wide range of lattice Boltzmann variants, including single- and two-relaxation-time schemes, multi-relaxation-time methods, as well as the more advanced cumulant methods and entropically stabilized methods. lbmpy can be integrated into high-performance computing frameworks to enable massively parallel, distributed simulations. This is demonstrated using the waLBerla multiphysics package to conduct scaling experiments on the SuperMUC-NG supercomputing system on up to 147 456 compute cores.
I. INTRODUCTION
Computational science and engineering is an interdisciplinary field. The workflow of creating computational models starts at (physical) reality and ends with the production of efficiently executable computer code [47]. However, on the route from physical phenomena to machine code lies the formulation of mathematical models and their discretization, the construction of time stepping schemes and solution methods, the design and analysis of parallel algorithms, the realization of complex software systems, and finally the transformation to code that can be executed on a given hardware. The target computer architecture may be a massively parallel system, possibly heterogeneous and using accelerators that can only be exploited by special programming techniques. During development, many choices must be made and alternatives considered. Thus, creating computational science software is a work-intensive, time-consuming, and error-prone task, whose complexity is easily underestimated, despite its fundamental relevance for extracting reliable predictions from scientific principles. In this article, we will present progress towards the systematic design of scientific software based on the automatic derivation of methods, including the automatic design of efficient software. When designing scientific software with conventional programming techniques, it is often difficult to find the right balance between the flexibility of an approach and its performance, since these are often conflicting goals. Often the specialization to a restricted class of problems would permit special optimizations and can thus lead to more efficient codes. However, a flexible software design may lead to more extensible and more generally usable software.
Additionally, using general-purpose libraries for subtasks, such as for the solution of linear systems, may reduce development time, but may also lead to overheads when special structures that would lead to faster algorithms cannot be exploited. Furthermore, it is typical that computational software outlives the computer systems that it was originally designed for. This leads to the problem of performance portability. The choice of a specific algorithm and a specific software design may have been good for efficient execution on older computer architectures, but these design choices may turn out to be a major bottleneck on the accelerator-based architectures that dominate high-end computing today. A complete rewrite would be necessary, but this is often too time-consuming and expensive. Legacy code therefore may stay in use even though it severely underperforms on modern hardware. A generic approach to overcome these difficulties is more advanced abstractions. For example, libraries such as Kokkos [13] or programming systems such as OpenACC [15] present abstractions of fine-granular concurrent execution on modern hardware that can help to alleviate the problems of performance portability. An alternative to such libraries can be metaprogramming techniques and the usage of domain-specific languages (DSLs). Here, machine-optimized code is generated utilizing program transformations and compiler technology. Using DSLs opens additional possibilities based on application-specific abstractions. A prominent and successful example of a finite-element-specific DSL is UFL (unified form language) [1], which is embedded in Python. It is used, e.g., in FEniCS [40] or Firedrake [45]. These automated computing platforms permit the generation of computational models based on a wide variety of partial differential equations that can be expressed in a DSL. For stencil-based computations, there exist a number of different DSLs, e.g., [54], [31], [29], [27], that work on
structured grids, and, e.g., the ExaSlang DSL that can also work on block-structured grids [38]. These DSLs succeed in supporting a wide variety of mathematical models with high flexibility. Additionally, most of them also support parallel program execution. Such approaches can exploit application knowledge represented in the DSL design and can thus use optimization techniques that are more powerful than optimizing compilers for general-purpose languages. In the present article, we will focus alternatively on the lattice Boltzmann method (LBM) as a promising mesoscopic modeling paradigm. Our scope of modeling thus supports kinetic schemes as an alternative to continuum mechanics for fluid dynamics. Kinetic schemes are often performance-hungry, but they also offer a high degree of parallelism. Therefore the scalability and performance on modern computer architectures is a central goal of our work. In particular, we will attempt to reach the performance of the best manually optimized LBM codes by developing a new application-specific code generation technology. To achieve this, we will leverage long-term experience in manually optimizing general stencil codes and systematic performance engineering for LBM methods [59], [57], [58], [22], [17]. In particular, we will build on waLBerla as a state-of-the-art LBM software framework with excellent performance characteristics [5], [39]. However, the current article goes beyond developing tools that help to optimize given kinetic simulation algorithms for a variety of computer architectures. While this alone is also useful, we here extend the code transformation approach to the earlier stages of the computational science workflow. In particular, we point out that the design and derivation of lattice Boltzmann models follows a complex but systematic methodology. The development of a specific LBM for a given application is characterized by various options and choices that the method designer must make.
This involves selecting a lattice model, defining momenta and relaxation rates, and many more, as will be elaborated in detail below. Based on these choices, a very wide variety of LBMs can be derived. While any such model could be constructed manually and then optimized using program generation technology, this article goes an essential step further. The manual development of advanced LBMs can also be time-consuming and error-prone. We will present here how the derivation of the models themselves can be conceptualized so that it becomes amenable to automation. In lbmpy, the tedious mechanical steps of the LBM development can be performed by automatic symbolic manipulations, saving precious developer time and making the methods more reliable and reproducible. In summary, lbmpy jointly with waLBerla becomes a computing platform for LBMs that is equivalent to what FEniCS is for finite elements. lbmpy helps to realize highly efficient LBM implementations and makes it easy for the developer to experiment with different variants of the methods. Fully grown implementations of many different LBMs can be generated with a single mouse click. We emphasize here specifically that our code generation technology has the unique capability to generate highly efficient parallel code that is ready to run with optimal scalability on the largest supercomputers. The LBM is based on concepts from statistical mechanics. The fluid is modeled on the mesoscopic level using distribution functions that represent the statistical behavior of the particles constituting the fluid. Compared to traditional computational fluid dynamics (CFD), which describes the fluid macroscopically with the Navier-Stokes equations, the mesoscopic description permits higher modeling expressivity, leading to a variety of LBMs, e.g., for porous media or multi-phase flows. Its local data access pattern makes the method very well suited for modern hardware architectures that draw their computational power from ever-increasing concurrency.
To run code efficiently on these architectures, parallelism on different levels must be exploited, starting from single-instruction-multiple-data (SIMD) instruction sets, utilizing multiple cores per node with OpenMP, to distributed memory parallelism with MPI. Optimizing LB compute kernels to get the best possible performance requires hardware-specific adaptation. This process costs significant development time. Unfortunately, it must be repeated for each new hardware platform. Furthermore, the optimization process often leads to code that is hard to read and maintain. In practice, this can lead to the effect that for prototyping and development, a slow but flexible code is used, and only a few proven methods are re-implemented in a highly optimized way. Over time a large variety of LBMs have been proposed [34], starting from the simple, but widely used, BGK single-relaxation-time (SRT) operator that relaxes the current state linearly to equilibrium with a single relaxation rate, over two-relaxation-time methods (TRT) [21], to general multi-relaxation-time (MRT) [11] methods. All these methods relax to equilibrium in moment space and can be viewed as special cases of MRT. Then, there are multiple advanced methods like cumulant [20] or entropic LBMs [10], where a different set of statistical quantities is chosen for relaxation or the discrete equilibrium is modified. All LB versions have in common that they are parameterized by a set of relaxation rates, which can either be chosen constant or adapted locally. Computing relaxation parameters from local quantities, e.g., shear rates, is used to implement turbulence models or to simulate non-Newtonian fluids.
Entropic KBC-type models [10], [2], [18] can also be placed into this group since they are constructed based on an MRT method with two relaxation rates, where one relaxation rate is chosen subject to a local entropy condition. Additionally, LBMs can be extended with different forcing schemes [53], [26], and the equilibrium can be approximated up to different orders, either in the so-called incompressible [28] or the standard compressible version. During the implementation phase, the space of options grows further still. Different storage patterns for the distribution functions can be chosen [3], [19]. Hardware-dependent optimizations, like loop splitting and non-temporal stores, further increase the performance on recent CPU architectures. Faced with this space of physical models and implementation/optimization options, developers of LB frameworks have two choices: either pick a small set of LB methods and optimize them for a specific target architecture, repeating the full process whenever a new LB method or hardware platform must be supported, or try to abstract and automate the development task. In this work, we show that it is indeed possible to automate many tedious development tasks, like reformulating equations to save floating-point operations, splitting loops for better memory access behavior, or fusing stream and collision kernels. We present the lattice Boltzmann code generation package lbmpy that solves these problems by automating large parts of the LBM development and optimization process. Its source code and documentation are available open-source under the GNU AGPLv3 license. lbmpy is a system for the development of computational fluid dynamics codes based on the LBM. In this regard, it is comparable to other LBM frameworks such as OpenLB [30], [42], Palabos [36], [43], elbe [41], [14], LB3D [24], [50], [37], and HemeLB [25]. Some LB frameworks like Sailfish [48] and TCLB [55] also use metaprogramming techniques, mainly for portability to GPUs.
While these frameworks support many LB methods, systematic performance evaluations and optimizations are predominantly available for simple LB collision operators like single- and two-relaxation-time methods [60], [61], [58], [22]. In this work, we first give an overview of the specific LBM design workflow and the associated code generation pipeline in section II. Then we describe the formalism for LB method specification in section III, which comprises the two upper abstraction layers. The transformation to an algorithmic description and its optimization is discussed in section IV. Finally, we present performance and scaling results in section V.

II. OVERVIEW: LB METAPROGRAMMING PIPELINE
In this section, we first give an overview of the various abstraction layers of lbmpy's metaprogramming pipeline. In the following sections, we then discuss each layer in detail. All abstraction layers are implemented based on the computer algebra system sympy. The code generation system itself is implemented in Python, but the generated code is produced in C/C++/LLVM for CPUs and in CUDA or OpenCL for GPUs. As illustrated in fig. 1, the most abstract layer represents an LBM in q-dimensional collision space for a DdQq stencil. This collision space is either based on moments or cumulants [20]. To specify an LB scheme, the user defines a basis of the collision space as well as an equilibrium value and a relaxation parameter for each component. Relaxation parameters do not have to be constants but can be chosen as symbolic expressions that depend on local quantities like shear rates. This permits the formulation of models for non-Newtonian fluids, turbulence models, or entropic stabilization. lbmpy transforms this high-level representation into a symbolic description of the collision operator. The collision operator is stored as a symbolic function Ω : R^q → R^q, where q is the number of discrete distribution functions. In the next step, the collision operator is mapped to a computational kernel. This stage allows the generation of pure collision kernels as well as fused stream-collide kernels. Additionally, a concrete storage pattern for the distribution functions is selected, e.g., two-array push or pull patterns or more elaborate single-array storage schemes. Optionally one can integrate boundary handling or output of macroscopic quantities at this layer as well. The output of this stage is a symbolic stencil representation of an LB compute kernel. This stencil representation is passed to the pystencils package [7] that produces the actual code for CPUs or GPUs.

Figure 1: Abstraction layers of the metaprogramming pipeline.

(lbmpy source code and documentation: https://i10git.cs.fau.de/pycodegen/lbmpy)
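The idea behind this pipeline can be illustrated with plain sympy, independent of lbmpy's actual API (all names and the D1Q3 setup below are invented for this sketch): a collision operator is first assembled as symbolic expressions and then "compiled" into an executable kernel, here a BGK collision with an equilibrium truncated at first order in the velocity.

```python
# Sketch of the pipeline idea (not lbmpy's actual API): a collision
# operator is built symbolically, then turned into an executable kernel.
import sympy as sp

c = [0, 1, -1]                                  # D1Q3 lattice velocities
w = [sp.Rational(2, 3), sp.Rational(1, 6), sp.Rational(1, 6)]
f = sp.symbols('f0:3')                          # pre-collision populations
omega = sp.Symbol('omega')

rho = sum(f)                                    # zeroth moment
u = sum(ci * fi for ci, fi in zip(c, f)) / rho  # first moment
cs2 = sp.Rational(1, 3)
f_eq = [wi * rho * (1 + ci * u / cs2) for wi, ci in zip(w, c)]

# symbolic collision operator Omega: R^q -> R^q
collided = [omega * fe + (1 - omega) * fi for fe, fi in zip(f_eq, f)]

# "code generation" step: compile the symbolic update into a callable
kernel = sp.lambdify((f, omega), collided)

out = kernel((1.0, 0.2, 0.1), 1.0)  # omega = 1 relaxes fully to equilibrium
```

Mass and momentum are conserved by the collision, which can be checked directly on the compiled kernel; the real pipeline differs in that it emits C/CUDA/OpenCL instead of a Python callable.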
pystencils is augmented with custom optimization passes that add domain-specific knowledge to the code-generation pass. In this work we focus on the CPU backend; GPU-specific optimizations and performance results are part of a second publication.

III. MODEL DESCRIPTION IN COLLISION SPACE
In this section we first give a short overview of the theory that is used to define LBMs on the highest abstraction level. Then we describe the "LB method layer" in more detail, followed by examples showing how common collision operators can be specified on this layer.
A. Collision in moment space
All LBMs considered here discretize the computational domain using a Cartesian, uniformly spaced grid. The discretization is specified in stencil notation as DdQq, describing a d-dimensional domain where each cell contains q particle distribution functions (PDFs) labeled f_q(x_i, t) with q ∈ {0, ..., q−1}. The PDF f_q represents the mass fraction of particles moving along a lattice velocity c_q. In the following, we use lattice units, i.e., the positions x_i and the time t are integers. The most basic and probably still the most commonly used collision operator for LBMs is the single-relaxation-time (SRT) or BGK operator. It relaxes each population to its equilibrium distribution using a single relaxation parameter ω. The SRT LBM can be succinctly written as

    f_q(x + c_q, t + 1) = ω f_q^(eq)(x, t) + (1 − ω) f_q(x, t),   (1)

describing a single time step of the LBM, evolving the system from time t to t + 1. It can be split into a local collision step that computes a convex combination of the current state with the equilibrium state and a non-local streaming step that copies PDFs to neighboring cells. The collision is formulated using a relaxation rate ω, which is the inverse of the relaxation time τ, i.e., ω = 1/τ. The local density ρ and velocity u are computed from the distribution functions as

    ρ = Σ_q f_q   and   u_i = (1/ρ) Σ_q c_qi f_q.   (2)

With these macroscopic values the equilibrium is given as

    f_q^(eq) = w_q ρ + w_q ρ_0 [ c_qα u_α / c_s²
               + u_α u_β (c_qα c_qβ − c_s² δ_αβ) / (2 c_s⁴)
               + u_α u_β u_γ (c_qα c_qβ c_qγ − c_s² c_qα δ_βγ) / (2 c_s⁶) ].   (3)

The reference density ρ_0 can either be chosen constant to obtain a so-called incompressible LBM [28]; for the standard compressible LBM, set ρ_0 = ρ. Note that both versions approximate the incompressible Navier-Stokes equations (NSE) [34]. Equation (3) shows a third-order equilibrium approximation.
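As a small numerical illustration of eq. (2) (not lbmpy code; the stencil ordering is an arbitrary choice made for this sketch), the density and velocity of a single D2Q9 cell are the zeroth and first moments of its populations:

```python
# Eq. (2) for a single D2Q9 cell: rho = sum_q f_q,
# u_i = (1/rho) * sum_q c_qi * f_q. Stencil ordering is illustrative.
import numpy as np

stencil = np.array([(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
                    (1, 1), (-1, 1), (1, -1), (-1, -1)])
f = np.full(9, 1.0 / 9.0)   # uniform populations
f[1] += 0.01                # slight excess of mass moving in +x direction

rho = f.sum()                                  # zeroth moment: density
u = (stencil * f[:, None]).sum(axis=0) / rho   # first moment: velocity
```

Here rho = 1.01 and u points in the +x direction, as expected from the perturbed population.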
To obtain the NSE in the macroscopic limit, the equilibrium is required only up to second order in u [34]. While the BGK collision operator is still widely used in practice, we aim for a more generic description of LBMs. To develop a high-level description of LBMs that is used as the input to the metaprogramming pipeline, we rely on the formalism of multi-relaxation-time methods. It includes the BGK operator and the popular two-relaxation-time method as special cases. With this approach lbmpy is able to generate all LB schemes whose collision operator can be written in linear matrix form. This covers the majority of LBMs, with the most notable exception being the cumulant collision operator, which will be treated in section III-B. The MRT formalism also allows us to derive the discrete equilibrium (3) from its continuous counterpart instead of manually specifying it. The collision operator of MRT methods first transforms the PDFs from population space into moment space via a moment matrix M. The components of this moment vector m = M f are relaxed to equilibrium values m^(eq) using a diagonal matrix of relaxation rates S. One collide-stream step of an MRT method then reads

    f_q(x + c_q, t + 1) = ( M⁻¹ [ S m^(eq) + (I − S) M f ] )_q   (4)

with I denoting the identity matrix. To fully specify the method, we have to define a concrete moment matrix M, a vector of corresponding equilibrium values m^(eq), and a diagonal relaxation matrix S. We now discuss how each of these three ingredients can be specified in lbmpy.
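A quick symbolic consistency check of the MRT form (4), sketched with plain sympy: if all relaxation rates are equal, i.e. S = ωI, the MRT collision reduces to the BGK rule (1) for any invertible moment matrix M (the D1Q3 monomial matrix below is chosen only for illustration):

```python
# If S = omega*I, then M^{-1}[S m_eq + (I - S) M f]
# equals omega*f_eq + (1 - omega)*f, regardless of M.
import sympy as sp

q = 3
f = sp.Matrix(sp.symbols('f0:3'))
f_eq = sp.Matrix(sp.symbols('g0:3'))   # placeholder equilibrium populations
omega = sp.Symbol('omega')

# monomial moments 1, c, c^2 for the D1Q3 velocities (0, 1, -1)
M = sp.Matrix([[1, 1, 1], [0, 1, -1], [0, 1, 1]])
S = omega * sp.eye(q)
m_eq = M * f_eq                        # equilibrium in moment space

mrt = M.inv() * (S * m_eq + (sp.eye(q) - S) * M * f)
bgk = omega * f_eq + (1 - omega) * f

assert sp.simplify(mrt - bgk) == sp.zeros(q, 1)
```

This is exactly why the BGK and TRT operators can be treated as special cases of MRT, as stated above.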
1) Moment Space:
We begin with the transformation from population space to moment space via the moment matrix M. To derive an invertible transformation M, a set of q independent moments is required. Moments can be identified with polynomials in the lattice velocities P(c_q) : R^d → R. For example, the zeroth moment, i.e., the density, is given by the constant polynomial 1. The first moments, i.e., the x, y, and z momentum densities, are represented by c_qx, c_qy, and c_qz. Second-order polynomials describe viscosity modes; for example in 2D, c_qx² + c_qy² is a mode related to bulk viscosity, whereas c_qx · c_qy and c_qx² − c_qy² are modes related to shear viscosity. The moment value is then computed as

    Π_P(f) = Σ_{c_q ∈ S} P(c_q) f_q   (5)

with a stencil S that is given by a sequence of directions with integer components. Given a sequence of moment polynomials P_1, ..., P_k, the elements of the moment matrix are computed as M_kq = P_k(c_q). So we have to select q moment polynomials that yield an invertible moment matrix M. In lbmpy, there are various options to provide these moment polynomials. The most basic but also most flexible option is to list them explicitly. Then lbmpy automatically computes the moment matrix and checks that it is invertible. This option is useful if code for a given MRT method from the literature has to be generated. One central goal of lbmpy is to derive LB methods automatically, without having to pass in, for example, the discrete equilibrium or even the moment basis. Thus, our system additionally offers routines to construct the moment basis for first-neighborhood stencils automatically. First, these routines have to find q monomials that lead to an invertible moment matrix M, which can then be orthogonalized in a second step. To illustrate this procedure, we first consider the D3Q27 stencil. Since the velocity vector components only contain the values {−1, 0, 1}, moments with velocity powers larger than 2 alias a lower-order moment.
For example,

    Σ_{c_q ∈ S} c_qi³ f_q = Σ_{c_q ∈ S} c_qi f_q   if c_qi ∈ {−1, 0, 1}.   (6)

Similarly, moments with even exponents larger than or equal to 4 are aliased by corresponding moments with exponent 2. Potentially non-aliased moments are thus c_qx^e1 c_qy^e2 c_qz^e3 with e_i ∈ {0, 1, 2}. These systematically constructed monomial moments can be used for the D3Q27 stencil to construct an invertible moment matrix. Similarly, in 2D, this strategy also yields an invertible moment matrix for the D2Q9 stencil. For D3Q15 and D3Q19 the situation is slightly more complex. For D3Q19, we start with the possible monomial moments and discard moments that produce a zero line in the moment matrix, like, e.g., the moment defined by the polynomial c_qx c_qy c_qz. For this stencil, 8 out of the 27 possible monomial moments lead to zero lines, leaving 19 independent rows. For the D3Q15 stencil this procedure has to be further refined. There, some monomial moments produce the same non-zero row in the moment matrix and are thus trivially linearly dependent. lbmpy groups moments together that yield the same row, resulting in 15 groups. One group of moments, for example, is [c_qx² c_qy, c_qy c_qz², c_qx² c_qy c_qz²]. In each group, we keep only the lowest-order moments. If more than one moment remains, their sum is used. In the case of the above example this leads to c_qx² c_qy + c_qy c_qz². This systematic procedure constructs moment matrices that span the same space as the MRT matrices reported in [11], [12], [49]. However, the constructed q independent moments are not orthogonal yet. For MRT methods, typically, an orthogonal moment set is required. Using a symbolic Gram-Schmidt procedure, lbmpy can orthogonalize the moments, either utilizing the standard scalar product or a scalar product weighted by the lattice weights. The exact outcome of the Gram-Schmidt orthogonalization depends on the ordering of the non-orthogonal moments that are put in.
For reproducible results, lbmpy sorts the input by moment order and within each order lexicographically. Before the orthogonalization, the second-order moments are also manually split into a bulk and a shear part.
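The construction described above can be sketched with plain sympy (an illustration, not lbmpy's implementation): build the moment matrix M_kq = P_k(c_q) from the nine D2Q9 monomials, verify that it is invertible, and orthogonalize the moments with a Gram-Schmidt procedure weighted by the lattice weights:

```python
# D2Q9 monomial moment basis x^i * y^j with i, j in {0, 1, 2},
# followed by a weighted Gram-Schmidt orthogonalization.
import sympy as sp
from itertools import product

x, y = sp.symbols('x y')
stencil = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
           (1, 1), (-1, 1), (1, -1), (-1, -1)]
weights = [sp.Rational(4, 9)] + [sp.Rational(1, 9)] * 4 + [sp.Rational(1, 36)] * 4

moments = [x**i * y**j for i, j in product((0, 1, 2), repeat=2)]
M = sp.Matrix([[m.subs({x: cx, y: cy}) for cx, cy in stencil]
               for m in moments])
assert M.det() != 0     # the monomial basis yields an invertible matrix

def dot(p1, p2):
    # scalar product weighted by the lattice weights
    return sum(w * p1.subs({x: cx, y: cy}) * p2.subs({x: cx, y: cy})
               for w, (cx, cy) in zip(weights, stencil))

ortho = []
for m in moments:
    for o in ortho:
        m = sp.expand(m - dot(m, o) / dot(o, o) * o)
    ortho.append(m)

# all pairs are now orthogonal w.r.t. the weighted scalar product
assert all(dot(a, b) == 0 for i, a in enumerate(ortho) for b in ortho[:i])
```

Because everything is exact rational arithmetic, the orthogonality check holds identically, mirroring the symbolic procedure used in lbmpy.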
2) Equilibrium State:
The second element necessary for the construction of an MRT method is the equilibrium. It can either be given in population space (3) or directly as a vector of equilibrium moments m^(eq). In this mode, the user has full control over the equilibrium values and can create LBMs not only for the Navier-Stokes equations but also for other partial differential equations. Beyond this, our metaprogramming approach attempts to derive as much as possible from a more general formulation. Therefore, we provide functionality in lbmpy to derive equilibrium values for hydrodynamic LBMs automatically. One way to obtain a hydrodynamic discrete equilibrium is to compute the equilibrium moments directly from the continuous Maxwell-Boltzmann distribution

    f^(MB)(ρ, u, ξ) = ρ / (2π c_s²)^(D/2) · exp( −‖ξ − u‖² / (2 c_s²) )   (7)

with continuous moments

    Π_P^(MB) = ∫ P(ξ) f^(MB)(ρ, u, ξ) dξ.   (8)

In this case, the user first chooses a value for the speed of sound c_s, typically c_s = 1/√3 for first-neighborhood stencils; then the integral (8) is evaluated symbolically with the help of sympy. The resulting continuous moments of the Maxwellian are used in the equilibrium moment vector m^(eq). Optionally the moments can be truncated to a given order in the macroscopic velocity u. If the equilibrium is required in population space, it can easily be transformed using the assembled moment matrix as M⁻¹ m^(eq). For the Cartesian product stencils D2Q9 and D3Q27, this method yields exactly the standard equilibrium (3). For the D3Q15 and D3Q19 velocity sets, however, a different equilibrium is obtained. A comparison of the standard equilibrium and the equilibrium obtained with this moment-matching technique can be found in [9]. The standard equilibrium for D3Q15 and D3Q19, including the weights, can also be derived by lbmpy using a Hermite projection of (7).
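The moment-integration step of eq. (8) can be reproduced in a few lines of sympy (a one-dimensional sketch with c_s² = 1/3; lbmpy performs the analogous multi-dimensional integrals):

```python
# Continuous moments of the 1D Maxwell-Boltzmann distribution (7),
# evaluated symbolically as in eq. (8), with c_s^2 = 1/3.
import sympy as sp

xi, u, rho = sp.symbols('xi u rho', real=True)
cs2 = sp.Rational(1, 3)
f_mb = rho / sp.sqrt(2 * sp.pi * cs2) * sp.exp(-(xi - u)**2 / (2 * cs2))

def continuous_moment(p):
    return sp.simplify(sp.integrate(p * f_mb, (xi, -sp.oo, sp.oo)))

m0 = continuous_moment(1)      # density: rho
m1 = continuous_moment(xi)     # momentum: rho*u
m2 = continuous_moment(xi**2)  # rho*(u**2 + c_s**2)
```

Truncating m2 to a given order in u (here there is nothing to truncate) yields the equilibrium moment entries used in m^(eq).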
3) Relaxation rates:
The third building block needed to fully define the method is the set of relaxation parameters. In lbmpy, the user can specify a relaxation rate separately for each of the previously selected moments. Each relaxation rate can either be a constant value, a symbol, or an arbitrary expression of local or neighboring values. In the simplest case, the relaxation rate is a compile-time constant and equal for all time steps and lattice cells. This case allows the computer algebra system to pre-evaluate and simplify expressions containing constants only, thus leading to significant savings. If the relaxation parameter is chosen as a symbol, it becomes a run-time parameter of the generated kernel function. In this case it can be changed, e.g., in a configuration file of the final application without requiring re-compilation. Of course this possibly comes at the cost of executing more FLOPs compared to the pre-evaluated version. The third option, where the relaxation rate is given as a symbolic expression, gives the most modeling power and flexibility. The expression may contain any local or neighboring quantities like equilibrium or non-equilibrium moments. This allows the formulation of a wide range of turbulence models, where the relaxation rate needs to be adapted depending on shear rates. Entropically stabilized schemes like the KBC-type models [10] can also be described in this way. The relaxation rate expression may also contain values of other arrays, allowing for easy coupling of multiple LB schemes, e.g., for multiphase or thermal flows.
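The difference between a compile-time constant and a symbolic relaxation rate can be illustrated with a minimal sympy sketch (not lbmpy code): a constant rate lets the update expression collapse during code generation, while a symbolic rate keeps it a run-time parameter.

```python
# Constant vs. symbolic relaxation rate in the BGK-style update
# omega*f_eq + (1 - omega)*f.
import sympy as sp

f, f_eq = sp.symbols('f f_eq')
omega_sym = sp.Symbol('omega')

update_symbolic = omega_sym * f_eq + (1 - omega_sym) * f
update_constant = update_symbolic.subs(omega_sym, 1)  # e.g. omega fixed to 1

# with omega fixed, the CAS folds the constants; here the whole
# expression collapses to f_eq before any code is generated
assert sp.simplify(update_constant) == f_eq

# with a symbolic omega, it stays a parameterized expression
assert omega_sym in update_symbolic.free_symbols
```

The same mechanism applies when omega is an expression in local quantities: it is simply substituted into the update before simplification.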
B. Collision in cumulant space
Recently an alternative collision space has been proposed in [20]. Before collision, cumulants of the distribution function are calculated, which are relaxed against their respective equilibrium values. Conceptually, cumulant collision operators are realized in lbmpy similarly to collision operators in moment space. The user specifies a set of cumulants, together with relaxation rates. The cumulant equilibrium values are obtained from the continuous Maxwellian. This allows the formulation of cumulant methods not only for D3Q27 and D2Q9 but also for D3Q19 and D3Q15 stencils. Cumulants can be succinctly defined through the cumulant-generating function

    K(ξ) = ln Σ_{c_q ∈ S} f_q exp(c_q · ξ).   (9)

The cumulants are computed by multi-differentiation of (9) and evaluating the derivatives at zero. For example, the "bulk cumulant", which we associate with the polynomial c_qx² + c_qy², is computed as

    ∂²K(ξ)/∂ξ_x² |_{ξ=0} + ∂²K(ξ)/∂ξ_y² |_{ξ=0}.   (10)

Originally, we implemented the cumulant transformation with this approach in lbmpy; however, the resulting expressions get very elaborate, especially for large stencils. sympy's common subexpression elimination capabilities then run an impractically long time manipulating these expressions. Thus we have developed an alternative multi-step transformation, where the populations are first transformed to moment space and then to cumulants. The moments are intermediate quantities that serve as common subexpressions. In lbmpy, we use Faà di Bruno's formula [46] to derive the transformation of raw moments to cumulants and vice versa.

C. Collision Model Examples
In this section, we demonstrate how LBMs can be formulated in lbmpy by constructing collision operators of varying complexity.
1) SRT, TRT:
We start with the single- and two-relaxation-time collision operators. Even though these methods are typically not derived in moment space, we use the MRT formalism for these operators as well, to avoid introducing special cases. The challenge with this general approach is, however, that the simplification system needs to be able to reduce the resulting expressions to their short form. The following code example shows the definition of a D2Q9 TRT method. Stencils are represented by a tuple of discrete directions with integer components. Common stencils, like the D2Q9, can be obtained by their name. This stencil is used to construct a set of independent raw moments using the algorithm described above.

    d2q9 = get_stencil("D2Q9")
    moments = independent_raw_moments(d2q9)
    ω_e, ω_o = symbols("ω_e, ω_o")
    ωs = [ω_e if is_even_moment(m) else ω_o for m in moments]
    m_eq = maxwellian_moments(moments, dim=2, c_s=1/sqrt(3))
    trt = create_method(d2q9, moments, ωs, m_eq)

In this example, the moment equilibrium values are computed from the continuous Maxwellian, and the relaxation rates are defined for each moment. lbmpy offers various classification functions for moments, like the is_even_moment function used here. Other functions can determine the order of a moment, or whether it is related to shear or bulk viscosity. Putting these elements together, the method is fully defined and can be displayed to the user in a Jupyter notebook [33] in tabular form, as shown below.

    Moment | Equilibrium               | Relaxation rate
    1      | ρ                         | ω_e
    x      | ρu_x                      | ω_o
    y      | ρu_y                      | ω_o
    x²     | ρu_x² + ρ/3               | ω_e
    y²     | ρu_y² + ρ/3               | ω_e
    xy     | ρu_x u_y                  | ω_e
    x²y    | ρu_y/3                    | ω_o
    xy²    | ρu_x/3                    | ω_o
    x²y²   | ρu_x²/3 + ρu_y²/3 + ρ/9   | ω_e

For better readability we denote moment polynomials using the variables x, y, and z instead of c_qx, c_qy, and c_qz. Note that no explicit equilibrium formulation similar to (3) was necessary to construct this method. Only the stencil, the continuous Maxwellian, and a systematically constructed set of independent raw moments have been used to derive this method.
2) MRT:
Next, we show how to construct a generic MRT method in lbmpy. We stick with the D2Q9 stencil to keep the listing of the method tableaus short. Similar to the TRT method above, we start with a set of independent raw moments. For MRT methods, the moments have to be orthogonalized. In lbmpy we provide an orthogonalization routine based on the Gram-Schmidt procedure. This routine uses either the standard or a weighted scalar product. A common choice is to use a scalar product weighted with the lattice weights, which we demonstrate in the code example below. If we want to control bulk and shear viscosities using different relaxation rates, the second-order moments must be modified before the orthogonalization. This is handled by the split_shear_bulk_moments function. We pass in the list of all raw moments, containing the second-order moments x², y², and xy. These are split into the bulk moment x² + y² and the remaining xy and x² − y² moments. In 3D, this function works analogously.

    moments = independent_raw_moments(d2q9)
    moments = split_shear_bulk_moments(moments)
    moments = gram_schmidt(moments, d2q9, weights=get_weights(d2q9))
    ω = symbols("ω_:4")
    ωs = [0 if get_order(m) < 2 else
          ω[0] if is_shear_moment(m) else
          ω[1] if is_bulk_moment(m) else
          ω[get_order(m)-1] for m in moments]
    m_eq = maxwellian_moments(moments, dim=2, c_s=1/sqrt(3))
    mrt = create_method(d2q9, moments, ωs, to_incompressible(m_eq))

The Gram-Schmidt orthogonalization step then produces the moments listed in the first column of the following table. Then a list is constructed that defines the relaxation rate for each moment. Moments of order less than two are conserved, and their relaxation rate can be chosen arbitrarily. In this example, the relaxation rate is set to zero for these moments. Having split up the second-order bulk and shear moments, we can pick separate relaxation rates ω_0 and ω_1 for these.
In this example, we choose a common relaxation rate for the third- and fourth-order moments.

    Moment                     Equilibrium       Relaxation rate
    1                          ρ                 0
    x                          u_x               0
    y                          u_y               0
    x² − y²                    u_x² − u_y²       ω_0
    xy                         u_x u_y           ω_0
    3x² + 3y² − 2              3u_x² + 3u_y²     ω_1
    3x²y − y                   0                 ω_2
    3xy² − x                   0                 ω_2
    9x²y² − 3x² − 3y² + 1      0                 ω_3

Compared to the TRT example, we have made another modification here. The equilibrium moments are modified to yield a so-called incompressible equilibrium [28]. The incompressible equilibrium moments are obtained by writing them as polynomials in the velocity u and substituting ρ = 1 in all terms that contain at least one velocity component, e.g., ρ/3 + ρu_x² → ρ/3 + u_x².

Having full information about an LB method in the form of the moment table, as shown above, enables us to analyze the method symbolically using a Chapman-Enskog procedure, as long as the relaxation rates are chosen constant. The primary input for this analysis are the moment equilibrium values. The automated analysis can show the user the approximated PDE as well as higher-order error terms. Additionally, it can derive the connection between relaxation parameters and macroscopic parameters, e.g., viscosities. The following snippet shows the analysis of the MRT method defined here.

```python
>>> ce = ChapmanEnskogAnalysis(mrt)
>>> ce.get_bulk_viscosity()
-1/9 - 1/(3*ω_1) + 5/(9*ω_0)
>>> ce.get_macroscopic_equations()[0]
∂_t ρ + ∂_0 u_0 + ∂_1 u_1
```
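The substitution that produces the incompressible equilibrium can be sketched in a few lines of sympy. The source uses lbmpy's to_incompressible for this; the function below is our own illustrative stand-in, not the lbmpy implementation:

```python
import sympy as sp

rho, ux, uy = sp.symbols("rho u_x u_y")

def to_incompressible(expr):
    # sketch of the substitution described above: set rho -> 1 in every
    # additive term that contains at least one velocity component
    expr = sp.expand(expr)
    terms = expr.args if expr.is_Add else (expr,)
    return sp.Add(*[t.subs(rho, 1) if t.has(ux) or t.has(uy) else t
                    for t in terms])

eq_x2 = rho / 3 + rho * ux**2    # compressible equilibrium of x^2
print(to_incompressible(eq_x2))  # -> rho/3 + u_x**2 (up to term order)
```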
3) Boundary Conditions:
Similar to the collision operator, boundary conditions are also described in symbolic form. Boundary conditions have to specify the value of a population that is streamed in from a boundary lattice cell. Here is an example of a velocity bounce-back boundary that models a moving wall.

```python
def vel_bounce_back(f, c, method, vel):
    c_s = method.speed_of_sound
    w_q = method.weights[method.stencil.idx(c)]
    vel_term = 2 / c_s**2 * c * vel * w_q
    return f.center(c) - vel_term
```
In the boundary definition, the user has access to the method definition, which offers properties like the speed of sound or the lattice weights. The lattice direction c is an integer vector pointing from the fluid to the boundary cell. With this information, an expression for the missing population is constructed. The population field f and macroscopic properties can be accessed using relative addressing, where the center is the fluid cell.

Additional information can be used, like the velocity of the moving wall in this example. This data can be supplied by various sources. In the simplest case, it is a compile-time constant value. It can also be an expression that depends on spatial coordinates, the time step, local population values, or macroscopic quantities. It can also be supplied at runtime; in this case the data is read from a field or a sparse list data structure that stores this information for every connected boundary cell. More details will be covered in the section on the algorithmic treatment of boundary conditions. However, in all these cases, the boundary definition, as shown in the example above, does not change at all. The definition and the implementation are strictly separated. Boundary conditions are defined per lattice link, not per lattice cell. That means, for example, that a different velocity can be prescribed for each link.
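For a concrete link, the correction term of the moving-wall boundary above is a small population shift. The following standalone sketch evaluates it in plain Python for a D2Q9 diagonal link, reading the symbolic product c * vel as the scalar product c · u_wall; the wall velocity is an illustrative value, not taken from the source:

```python
# D2Q9 constants: c_s^2 = 1/3, weight of a diagonal direction is 1/36
c_s_sq = 1.0 / 3.0
w_q = 1.0 / 36.0
c = (1, 1)                # lattice direction pointing into the wall
u_wall = (0.05, 0.0)      # illustrative wall velocity in lattice units

c_dot_u = sum(ci * ui for ci, ui in zip(c, u_wall))
vel_term = 2.0 / c_s_sq * c_dot_u * w_q
print(vel_term)  # 6 * 0.05 / 36 = 0.00833...
```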
4) Turbulence models:
Up to now, we have presented methods with constant relaxation rates. The modeling power of LBMs stems partially from the ability to vary relaxation rates on a cell-by-cell basis, depending on local quantities. With this technique, one can, for example, model non-Newtonian fluids or implement turbulence models. To provide this modeling power to the user, lbmpy does not only allow for compile- and run-time constants as relaxation rates. It can also take arbitrary expressions of neighboring distribution functions or macroscopic quantities as relaxation rates. We illustrate this for the example of a Smagorinsky subgrid turbulence model. This model adds an eddy viscosity ν_t to represent energy damping on unresolved scales [32]. The eddy viscosity is calculated from the local strain rate tensor as

    ν_t = (C_S Δ)² |S|                                             (11)

where C_S is a constant and Δ is a filter length chosen as 1 in lattice coordinates. |S| = √(S_ij S_ij) is the Frobenius norm of the local strain rate tensor

    S_ij = 1/2 (∂_i u_j + ∂_j u_i) = − 3ω/(2ρ) Π^(neq)_ij.         (12)

This uses the fortunate property of LBMs that the strain rate tensor can be computed from local quantities, using only the second non-equilibrium moment [35]

    Π^(neq)_ij = Σ_q c_qi c_qj (f_q − f^(eq)_q).                   (13)

Equation (12) contains the total relaxation rate ω, which is computed from the total viscosity ν = ν_0 + ν_t, which again depends on the eddy viscosity ν_t that we want to determine:

    ω = 2 / (6 (C_S Δ)² |S| + 6 ν_0 + 1).                          (14)

Thus we have a system of two equations in ω and |S| that we now want to solve for ω.
Here it pays off that lbmpy is based on the computer algebra system sympy, where these steps can be performed automatically:

```python
S, ω = symbols("|S|, ω", positive=True)
f_neq = pre_collision_symbols() - equilibrium_symbols()
Π = frobenius_norm(second_order_moment_matrix(f_neq))
eqs = [Eq(ω, ω_from_ν(ν_from_ω(ω_0) + C_S**2 * S)),
       Eq(S, 3 * ω / 2 * Π)]
effective_ω = solve(eqs, [ω, S])[ω]
```

The resulting symbolic expression for the effective ω value can be used in all places where in the previous examples a constant has been used. Thus, one can construct MRT or cumulant methods where some or all relaxation rates vary locally, using potentially different expressions for different relaxation rates. For brevity, we have shown here only the construction of a simple turbulence model. The possibility to employ arbitrary expressions as relaxation rates in lbmpy can be used to realize more advanced turbulence models with only little programming effort. Additionally, lbmpy comes with several pre-defined turbulence models, so the user does not have to perform the steps outlined above manually if only a common turbulence model is required.
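The solve step can be reproduced outside lbmpy with plain sympy. In the sketch below, the helper names ν_from_ω and ω_from_ν mirror the listing above, but their bodies are our own assumption (the standard relations for c_s² = 1/3), not lbmpy internals:

```python
import sympy as sp

ω, S, Π = sp.symbols("omega |S| Pi", positive=True)
ω_0, C_S = sp.symbols("omega_0 C_S", positive=True)

# assumed relations between relaxation rate and lattice viscosity
def ν_from_ω(w):
    return (2 - w) / (6 * w)

def ω_from_ν(n):
    return 2 / (6 * n + 1)

eqs = [sp.Eq(ω, ω_from_ν(ν_from_ω(ω_0) + C_S**2 * S)),  # eq. (14), Δ = 1
       sp.Eq(S, sp.Rational(3, 2) * ω * Π)]             # from eq. (12)
sol = sp.solve(eqs, [ω, S], dict=True)
```

Eliminating S turns the system into a quadratic equation in ω, so solve returns a closed form that can be evaluated per cell without Newton iterations.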
5) Entropic KBC Models:
Another important class of models where relaxation rates are varied locally are entropic LB schemes. In this section, we show how entropic MRT methods, labeled KBC models by the authors of [10], are realized in lbmpy. The central idea of these methods is to maximize a discrete entropy measure S of the post-collision state. The free variable that is tuned to obtain maximum entropy is a relaxation rate associated with higher-order moments. Single-relaxation-time entropic methods also change the effective viscosity by varying this single rate to maximize entropy. KBC models present an improvement by using two relaxation rates: one rate ω_s for the shear moments, and a second relaxation rate ω_h that controls higher-order moments. Only ω_h is changed according to the entropy condition. The shear relaxation rate ω_s is not altered, and thus also the viscosity remains constant. By choosing which moment is relaxed by which relaxation rate, one obtains different KBC variants, labeled by the authors KBC-N1 up to KBC-N4.

In the original work [10], the notion of mirror states and corresponding relaxation parameters is used. Here we use a different notation that is closer to the formalism of MRT methods. We start with an arbitrary MRT method that uses two symbolic relaxation rates ω_s and ω_h. The rate ω_s must include the shear moments if the shear viscosity should remain constant. The collision operator in population space is then of the form

    f′_q = f_q − ω_s Δs_q − ω_h Δh_q,                              (15)

where f′_q is the post-collision state, f_q the pre-collision state, and Δs_q, Δh_q the coefficients multiplying the relaxation rates. Then we need to maximize the entropy

    S(f′) = − Σ_q f′_q(ω_h) ln( f′_q(ω_h) / f^(eq)_q )             (16)

in every cell at every time step by varying ω_h. Taking the first derivative of (16) w.r.t. ω_h, we get the optimality condition

    Σ_q Δh_q [ ln( f′_q(ω_h) / f^(eq)_q ) + 1 ] = 0.               (17)

This condition could be solved numerically in every cell using Newton's method. However, in this case, a more efficient way can be devised. We expect f′ to be close to f^(eq) and approximate the logarithm around 1 up to first order with ln(x) ≈ x − 1. The optimality condition then simplifies to

    Σ_q Δh_q f′_q(ω_h) / f^(eq)_q = 0.                             (18)

Inserting the post-collision value f′_q = f_q − ω_s Δs_q − ω_h Δh_q, and introducing the entropic scalar product ⟨a, b⟩_E := Σ_q a_q b_q (f^(eq)_q)^(−1), we can solve for ω_h and obtain

    ω_h = 1 + (1 − ω_s) ⟨Δs, Δh⟩_E / ⟨Δh, Δh⟩_E.                   (19)

To obtain this result, one has to replace f_q = f^(eq)_q + Δs_q + Δh_q and use Σ_q Δh_q = 0, which holds because of the mass conservation property of the collision operator. All steps leading to (19) are implemented using sympy, to obtain an automatic derivation of KBC methods from high-level principles.

Using this technique, we can construct a wide range of entropically stabilized methods, not only for D3Q27 stencils as in [10] but for D3Q19 and D3Q15 stencils as well. Furthermore, we offer a more costly but also more general numerical maximization procedure for the post-collision state entropy, which is based on Newton's method. This can, e.g., be used for cumulant methods where the update is not linear in the relaxation rates any more, as in (15), but has the quadratic form f′_q = f_q − a_1 ω_s − a_2 ω_s² − b_1 ω_h − b_2 ω_h².

IV. COMPUTE KERNEL GENERATION
All steps described up to now produce a symbolic representation of the collision operator Ω: ℝ^q → ℝ^q. Together with the stencil, represented as a list of q discrete velocities with integer components, an efficient LB compute kernel must be generated for various hardware platforms. This process is discussed in the following section.

A. Simplification
To obtain an efficient formulation of the resulting compute kernel, the symbolic collision operator must be rewritten in a form that requires as few floating point operations (FLOPs) as possible to compute the post-collision values. This is a very challenging task, since the automatic operator derivation yields a highly inefficient formulation by default. Consider, for example, the case of an SRT model that is derived by transforming populations to moment space, relaxing with a single rate, and transforming back. The matrix products produce lengthy expressions that are mathematically equivalent to the usual SRT formulation but are now expressed using many more FLOPs. With standard mathematical techniques, like expanding or factoring, terms can already be simplified considerably. However, the most significant reduction in the number of FLOPs is achieved with common subexpression elimination (CSE). General CSE algorithms implemented in computer algebra systems are not guaranteed to find the global optimum and have to rely on heuristics to find a reasonably good solution. These algorithms do not just identify common subtrees, as is typically done as a compiler optimization, but they also try to rewrite the expressions in a form where they have more common subtrees.

For an illustration of the optimization possible in lbmpy we present here results for the D3Q19 BGK method. The first row of Table I shows the number of FLOPs in the expressions as they are produced by the automatic derivation. In total, this initial automatically generated version needs 1263 operations. To reduce cost, we first employ the simplification and CSE capabilities of the sympy computer algebra system directly. The results are labeled “Only CSE” in the table. This reduces the number of operations significantly, down to only 261. However, a manually optimized implementation of the same method developed by the authors requires only 204 FLOPs. A value around 200 FLOPs is also reported by Wellein et al. [58]. The default simplification and CSE of sympy is thus unable to produce code as good as manually tuned code, since it needs about 30% more FLOPs than the best known solutions. We also tested the simplification capabilities of other computer algebra systems, including Maple and Mathematica, which also could not find simplifications competitive with hand-tuned code. Therefore, it was necessary to develop a new set of custom transformations that rewrite the equations before they are passed to the CSE function of sympy. These transformations are listed in the order of application in the lower part of Table I as “Custom” transformations. Next to the name of each transformation we display the number of FLOPs after the transformation has been applied. Some of these transformations use LBM application knowledge, e.g., they treat density and velocity symbols differently.

                                       Additions  Muls  Divs  Total
    Only CSE:  initial                    686      574    3    1263
               sympy CSE                  199       61    1     261
    Custom:    initial                    686      574    3    1263
               expand                     173      423    3     599
               quadratic velocity prod.   203      447    3     653
               expand                     179      423    3     605
               factor ω's                 179      305    3     487
               common quadratic term      131      161    3     295
               substitute existing subexpr. 119    119    3     241
               sympy CSE                  119       73    1     193

Table I: Detailed simplification results for the compressible D3Q19 SRT (BGK) method.

We now study these transformations one by one and describe them in detail. The initial formulation is first expanded, i.e., transformed into a sum of products using a function provided by sympy. The next transformation, called “quadratic velocity products”, is specifically developed for LBM simplification. It picks out mixed quadratic terms in macroscopic velocity components u_i u_j and replaces them by (e² − u_i² − u_j²)/2, with a new subexpression e := u_i + u_j. This transformation may seem counterintuitive, since it increases the number of FLOPs. However, it helps the following transformations to obtain better results. This is an example of a transformation that requires domain knowledge, since this replacement may only be applied to the velocity symbols. Next, a standard expansion is performed. The previously introduced subexpression e prevents this expansion from undoing the previous transformation. The next transformation uses a generic sympy function to factor out relaxation rates. The “common quadratic term” step introduces a subexpression that is obtained by taking the expression for the center point, setting all pre-collision values to zero, and the relaxation rates to one. For the TRT method this yields ρ − 3/2 ρ(u_x² + u_y² + u_z²). After this transformation, already existing subexpressions like density and velocity are searched for in the equations, and finally, a CSE from sympy is performed.

With this custom simplification pipeline, we eventually arrive at a kernel that costs only 193 FLOPs. Note that this is even slightly better than the previously known and carefully hand-optimized version. Compared to using only the generic simplification (261 FLOPs), the custom pipeline thus saves about 26% of the operations.
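The effect of the final CSE step can be illustrated with a toy standalone example (not the lbmpy pipeline): sympy's cse applied to two equilibrium-like expressions for opposing directions that share their quadratic part, with count_ops measuring the operation count before and after:

```python
import sympy as sp

u0, u1, u2, rho = sp.symbols("u_0 u_1 u_2 rho")
usq = u0**2 + u1**2 + u2**2

# two opposite-direction, equilibrium-like expressions (illustrative)
exprs = [rho * (1 + 3*u0 + sp.Rational(9, 2)*u0**2 - sp.Rational(3, 2)*usq),
         rho * (1 - 3*u0 + sp.Rational(9, 2)*u0**2 - sp.Rational(3, 2)*usq)]

before = sum(sp.count_ops(e) for e in exprs)
subexprs, reduced = sp.cse(exprs)
after = sum(sp.count_ops(rhs) for _, rhs in subexprs) + \
        sum(sp.count_ops(e) for e in reduced)
print(before, after)  # the CSE form needs fewer operations
```

Substituting the extracted subexpressions back recovers the original expressions, so the rewrite is exact.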
Note also that the same automatic simplification and optimization steps can now be performed for other LBM methods.

Table II shows the total number of FLOPs for a selection of LB schemes that lbmpy can generate and optimize. We compare methods that use the so-called compressible and incompressible equilibrium [28]. We also compare SRT, TRT, MRT, and SRT with a Smagorinsky turbulence model. The table does not display in detail which types of FLOPs each method is composed of. However, all methods require only additions and multiplications, with the following exceptions: all compressible models have one division by the density, and the Smagorinsky methods additionally require two square root operations. The results in the first column are obtained by using only the sympy CSE. The third column uses the custom simplification strategy introduced above. The second column also uses the custom simplification strategy, with a modified CSE step at the end. In this CSE step, we first search for subexpressions in terms that update opposing lattice directions. By construction, these contain terms that differ only in sign and are good common subexpression candidates. This step is then again followed by a global CSE. This approach is labeled “direction CSE”, since lattice directions are taken into account. The lowest FLOP count is marked for each method in boldface.

Let us first discuss the results for all methods with one or two relaxation rates. We see that for SRT and TRT methods the custom simplification pipeline consistently leads to better results. It is generic enough to work not only for the SRT method it was designed for, but leads to good results for TRT methods as well. Turbulence models built on top of these collision operators are also simplified better by the custom strategy, as shown in the table with the Smagorinsky example. Whether the direction-aware CSE is beneficial depends on the stencil. For the D3Q27, it gives the best or equal result across methods; for D2Q9 and D3Q19 it is helpful only for the SRT operators. We have also chosen two example MRT methods: one that uses the standard scalar product for moment orthogonalization, and one with moments that are orthogonal w.r.t. the weighted scalar product. Second-order shear and bulk moments are relaxed with different rates, and for each order larger than 2 a separate relaxation rate is chosen. Table II shows that the custom simplification pipeline cannot handle MRT methods. A straight application of CSE obtains much better results for all tested MRT methods, regardless of the stencil.

Currently we employ these three simplification options for each method and then automatically select the best one.
In the future we plan to also use machine learning techniques to optimize the application order of transformations, or to find new transformations.

                          Only CSE   Custom with      Custom,
                                     direction CSE    default CSE
    D2Q9
    compr. SRT               113          90              90
    incompr. SRT             107          75              75
    compr. Smag.             137         122             122
    compr. TRT               114         110               –
    incompr. TRT             108         103               –
    compr. MRT               349         317               –
    compr. weighted MRT      350         325               –
    D3Q19
    compr. SRT               261         193             193
    incompr. SRT             252         162             162
    compr. Smag.             306         251             251
    compr. TRT               262         233               –
    incompr. TRT             253         225               –
    compr. MRT               947         903               –
    D3Q27
    compr. SRT               444         370             370
    compr. TRT               446           –               –

Table II: Total number of FLOPs for different LB schemes. The “Only CSE” column runs only a CSE from sympy. The “Custom with direction CSE” column runs the custom simplification pipeline, then a PDF-direction-aware CSE followed by a standard CSE. “Custom, default CSE” is similar, but without the direction-aware CSE.
B. Collision Operator to Stencil

1) Streaming and Collision:
After simplification we have the collision operator given as a function ℝ^q → ℝ^q. It is represented by a list of q symbolic expressions for the post-collision population values, accompanied by a set of subexpressions. The next stage transforms this formulation into a stencil representation.

The stencil representation and all following low-level transformations are part of the pystencils package (https://i10git.cs.fau.de/pycodegen/pystencils), which is also developed by the authors [7]. pystencils generates stencil kernels, i.e., routines that iterate over arrays, applying the same operation on every cell. It distinguishes between spatial and index dimensions. Only spatial dimensions are iterated over, while index dimensions are used to address values stored inside a cell, e.g., the q populations of a PDF array or the components of a vector field. The central concepts of pystencils are fields and field accesses. A field is defined by a name and its number of spatial and index coordinates. Fields are indexed relatively, so the field access f[1][0](q), for example, refers to the q'th population value of the east neighbor cell in a 2D setup. pystencils is built on top of SymPy, and field accesses can be used just like a built-in symbol. The collision operator can be transformed into a stencil representation by replacing the pre- and post-collision symbols by field accesses. Two additional pieces of information are required for this process. The user has to choose the data layout of the population array and a kernel type that describes the operations done inside a kernel call.

Figure 2: Visualization of the stream-pull-collide update pattern using two arrays. The left part encodes the reads of pre-collision values, the right part shows where post-collision values are written to.

lbmpy supports three different population storage options. The simplest approach is to have two arrays, where, during one kernel invocation, one array is read-only and the second array is write-only.
For this storage pattern the system can generate a fused stream-pull-collide, a fused collide-stream-push, or a collision-only kernel. For a pure LBM simulation that is not coupled to other simulation models, a stream-pull-collide kernel typically gives the best performance. In lbmpy the data access patterns are encoded by the field accesses from which pre-collision values are loaded, and the field accesses to which post-collision values are written. These are visualized in fig. 2 for a stream-pull-collide kernel. This mechanism cleanly separates the LB method definition from algorithmic and data structure aspects, avoiding any code duplication.

Besides the simple two-array swapping technique, lbmpy also supports more advanced storage patterns that operate on a single PDF array and thus require only half the memory. Supported single-array schemes are the AA pattern [3] and the esoteric twist (EsoTwist) update scheme [19]. Single-array storage schemes that introduce a data dependency between cell updates, like [44], are not supported.

Figure 3: AA update pattern. The two leftmost schematics show the even time step, consisting of an in-place collision with inversed storage of populations. The odd time step (right) is a fused stream-pull, collide, stream-push step.

To be able to process all cells in parallel while only having a single array for PDF storage, the AA pattern needs two different access patterns for even and odd time steps (fig. 3). This also leads to two different kernels that have to be run in an alternating fashion. The different data layout after even and odd steps may complicate boundary handling and coupling the LBM to other solvers when traditional implementation techniques are used. With our code generation approach this additional complexity can be handled automatically. The symbolic, high-level formulation of method, boundaries, and update scheme is sufficient to, e.g., generate boundary handling for even and odd steps automatically.

Figure 4: Esoteric twist split into even (left) and odd (right) time step kernels. Black arrows indicate reads, red arrows writes.

The esoteric twist pattern also requires only a single array. In contrast to the AA pattern, it was designed to not require an even and an odd time step. If the populations for different lattice velocities are stored in separate arrays, a pointer swapping technique can be used for streaming. In lbmpy, however, we do not use this technique, in order to keep a common kernel interface with a single population array. Instead we use an even and an odd time step for the EsoTwist pattern as well (fig. 4).
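The pull pattern itself is easy to sketch outside the code generator. The following standalone numpy snippet (illustrative only, not lbmpy's generated code) performs a periodic stream-pull step on a D2Q9 array using the two-array storage pattern:

```python
import numpy as np

# D2Q9 discrete velocities (rest direction first)
stencil = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
           (1, 1), (-1, -1), (1, -1), (-1, 1)]

src = np.random.rand(16, 16, len(stencil))  # read-only during the sweep
dst = np.empty_like(src)                    # write-only during the sweep

for q, (cx, cy) in enumerate(stencil):
    # pull: each cell reads the pre-collision value of direction q from
    # its upstream neighbor; np.roll realizes the periodic shift
    dst[:, :, q] = np.roll(src[:, :, q], shift=(cx, cy), axis=(0, 1))
# (the collision step is omitted; a generated kernel fuses it here)
```

A generated stream-pull-collide kernel performs the same data movement cell by cell, fused with the collision, instead of direction by direction over the whole array.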
2) Boundary Conditions:
In this section we discuss in more detail how boundary conditions are realized algorithmically. One option is to leave the LB kernel unchanged and run separate boundary handling kernels beforehand. These kernels prepare the population array by writing the values that will be streamed in from boundary cells. lbmpy can take symbolic boundary definitions, as shown above, and generate one kernel per boundary. In the simplest case, these boundary kernels operate on a rectangular subdomain, e.g., at the borders of the computational domain. This very simple but also very common case can thus be handled in the most efficient way possible.

For more general boundary shapes a flag field is used. The flags store a bitmask in every cell that encodes the type of boundary. The flag field is initialized by the user using image/voxel data or with the help of surface meshes. The boundary condition kernel could iterate over the full domain, masking out cells, but especially if boundary conditions are static, this would be rather inefficient. Thus, lbmpy also offers an alternative approach: instead of iterating over all cells, a pre-processing step extracts boundary cell coordinates from the flag field and creates a list of indices for each boundary condition. This index list contains the spatial coordinate of the boundary cell together with the lattice direction to the neighboring fluid cell. Consequently, this list has one entry per boundary link. For each boundary link, custom boundary data can additionally be stored, e.g., the wall velocity for a velocity bounce-back boundary. The flag field then only acts as a convenient way to set up boundaries. During the simulation itself, it is not required any more, since all necessary information is stored in the list data structure to accelerate the processing.

So far we have considered time steps where the boundaries are treated in a separate kernel. With lbmpy, boundaries can also be compiled directly into the LB compute kernel.
Then a conditional is added to the kernel that determines the cell type, either by using a flag field or by a boolean expression that depends on the spatial coordinates. Using the information about the data access pattern, the symbolic formulation is transformed to obtain a concrete boundary assignment. We point out that it is particularly beneficial to automate the error-prone implementation of boundary conditions for the single-array patterns AA and EsoTwist.

The boundary treatment options discussed above are not new; they have already been implemented in existing frameworks or applications. The new contribution here is that code for all these options can be generated automatically, avoiding tedious manual coding and debugging. Since all versions can be generated easily, it becomes possible to benchmark all of them and choose the fastest version for a specific setup. Furthermore, there is no trade-off between flexibility and performance any more. It is now possible to compile a specific boundary treatment into a kernel to get an application-specific implementation with the best performance. Different from a hand-tuned version of the same method, the lbmpy approach keeps the system maintainable and extensible. The separation of concerns is realized on the symbolic abstraction level.
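The pre-processing step that turns a flag field into a per-link index list can be sketched in a few lines of plain Python. This is an illustrative standalone version with a reduced four-direction stencil, not lbmpy's implementation:

```python
import numpy as np

FLUID, NOSLIP = 1, 2
nx, ny = 8, 8
flags = np.full((nx, ny), FLUID, dtype=np.uint8)
flags[0, :] = NOSLIP                           # wall along one border

stencil = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # reduced stencil for brevity

# one entry per boundary link: boundary cell plus direction to fluid cell
index_list = []
for x in range(nx):
    for y in range(ny):
        if flags[x, y] != NOSLIP:
            continue
        for d, (cx, cy) in enumerate(stencil):
            fx, fy = x + cx, y + cy
            if 0 <= fx < nx and 0 <= fy < ny and flags[fx, fy] == FLUID:
                index_list.append((x, y, d))

print(len(index_list))  # 8: one link per wall cell towards the fluid
```

During the simulation, a boundary kernel then iterates over index_list only, instead of over the full domain.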
C. Transformations in the Intermediate Representation
In this section we describe low-level optimizations executed on the intermediate representation of pystencils, with the goal of further accelerating the LB compute kernels. pystencils is designed as a modular package that allows the user to write custom code transformations specific to the application.
1) Splitting inner loop:
For LB kernels we expect the memory interface to be the performance-limiting factor if the domains exceed the capacity of the outer-level cache. The first optimization we discuss here aims to increase the maximum attainable bandwidth of the kernel by reducing the number of parallel load/store streams from/to memory. A standard LB kernel iterates over all cells, loads all q pre-collision values at once, computes the post-collision values, and stores all q of them. This leads to q parallel load and store streams. Reducing the number of parallel streams to memory can increase the obtained bandwidth [57]. Therefore we developed an automatic transformation that splits the innermost loop into multiple smaller loops. To avoid the re-computation of common subexpressions in every inner loop, buffer arrays are introduced. The first inner loop computes density and velocity and writes them to the buffer arrays. The following loops then each handle only two lattice direction updates and thus have only two parallel load and store streams. Algorithm 1 shows the state after the transformation in pseudo-code. It assumes a simple two-field storage pattern with source and destination array.

Algorithm 1 Stream-collide kernel with split inner loops
    for all slices y, z do
        ρ_arr ← array[x-size]
        u_arr ← array[x-size]
        for line x do
            f ← src[x, y, z]
            ρ_arr[x] ← ρ(f)
            u_arr[x] ← u(f)
            dst[x, y, z, center] ← Ω(f, ρ_arr[x], u_arr[x])
        end for
        for line x do
            f ← src[x, y, z]
            dst[x, y, z, east] ← Ω(f, ρ_arr[x], u_arr[x])
            dst[x, y, z, west] ← Ω(f, ρ_arr[x], u_arr[x])
        end for
        for line x do
            f ← src[x, y, z]
            dst[x, y, z, north west] ← Ω(f, ρ_arr[x], u_arr[x])
            dst[x, y, z, south east] ← Ω(f, ρ_arr[x], u_arr[x])
        end for
        ... (more loops for the remaining directions)
    end for

There are different options on how exactly to perform this transformation. One free parameter is the number of directions that are updated in the inner loops. In the example, we update two opposing directions at once, since these updates share many common subexpressions. One could also create a separate inner loop for each direction, or group more than two directions together. This transformation is also parametrized by the common subexpressions that are pre-computed in temporary arrays. lbmpy can introduce additional temporary arrays for other subexpressions besides density and velocity as well. A heuristic is used to determine subexpressions that are compute-intensive enough to justify introducing a temporary array for them. It is important that all temporary arrays fit into the inner-level cache, so that they do not generate additional pressure on the memory interface. In the simple example shown in Algorithm 1, the temporary arrays grow with the domain size in x-direction. To make this optimization work for arbitrary domain sizes, the inner loop is blocked before splitting it up. The chunk size can be selected such that the arrays fit into the L1 cache.
2) OpenMP and SIMD vectorization:
All LB kernels are designed in a way that cells can be updated in parallel. pystencils uses the fact that loop iterations are independent to automatically parallelize the kernel with OpenMP. By default the outer loop is parallelized using a static scheduling strategy. If the domain size is known at compile time and the outer dimension is very small, pystencils uses OpenMP collapse to increase the number of parallel iterations.

Knowing that iterations are independent also allows pystencils to vectorize the code. To have full control over the vectorization process, we do not rely on compiler auto-vectorization or pragma-based approaches, but generate C code with SIMD intrinsics. pystencils currently supports the SSE, AVX, AVX2 and AVX512 vector instruction sets. If the data layout and alignment of the population array is known at compile time, we generate aligned load/store instructions where possible.
3) Non-temporal stores:
The intrinsics-based vectorization allows us to explicitly use non-temporal (NT) stores, also called streaming stores, in kernels that use two population arrays. This optimization reduces the total amount of data that has to be transferred from/to memory. By default, modern CPUs have a “write-allocate” or “read for ownership” cache policy [61]. This means that a store operation causes the respective cache line to be read into the cache first and thus generates twice the memory traffic that is actually required. Per lattice cell, this causes 3q values to be transferred over the memory interface. The custom vectorization allows us to change the store instructions from default to streaming stores that bypass the cache. Then only 2q values have to be loaded and stored per cell. This optimization can thus increase the performance of two-array LB kernels by a factor of 1.5, assuming they are memory-bound and the PDF array does not fit into the outer-level cache.

D. Framework Integration
The intermediate representation of the compute and boundary kernels is finally transformed by a backend into C, CUDA, or OpenCL code. For each kernel a C function with a well-defined interface is generated. Arrays are passed in as raw pointers, together with shape and stride information that defines the memory layout of the arrays. Symbolic quantities that have not been replaced during the code generation process automatically become parameters of the generated C function, e.g., values for relaxation rates or constant external forces. This simple interface was chosen so that the generated kernels can be called from a variety of different languages and can be easily integrated into existing frameworks. In this section we describe different ways of utilizing lbmpy.

The first option allows the user to work completely in a Python environment, preferably an interactive Jupyter notebook for a convenient display of symbolic expressions. There, the user derives the LBM symbolically and passes the method definition to lbmpy. After automatic simplification and optimization, the generated C/CUDA/OpenCL code is automatically compiled and dynamically loaded as a Python module. The compilation process is fully transparent to the user. The optimized, shared-memory parallel kernel can then be called directly from Python. Data is stored in numpy arrays for CPU simulations or in gpuarray's from the pycuda package for GPU simulations. In this mode lbmpy offers a flexible and fast prototyping environment for LB methods, where simulations can be run on a single node or a single GPU.

For distributed memory parallelization we use the waLBerla framework [16], a multiphysics software system that is optimized for massively parallel simulations with stencil codes. The distributed memory parallelization uses a block-structured domain partitioning based on a forest of octrees and is characterized by excellent scalability, since it uses no central or globally shared data structures, so that no global communication is necessary [51]. This fully parallel data structure enables adaptive grid refinement and dynamic load balancing between MPI processes [52], [5]. waLBerla has Python bindings [8] that allow for simple distributed simulations with lbmpy-generated kernels directly from Python on a uniform grid. For advanced use cases, e.g., those that require grid refinement, the user has to switch to C++ as the driving language. Integrations of lbmpy into the CMake build system of waLBerla control the generation of LB compute kernels, boundary kernels, and packing/unpacking kernels for distributed memory MPI communication. The reason why we also generate communication kernels are the single-field AA and EsoTwist storage patterns. Manually determining which values have to be sent to neighboring processes is tedious and error-prone in these cases. Since the compute kernels are available in symbolic form, we can extract that information and generate the necessary communication routines automatically.

V. PERFORMANCE RESULTS
In this section we present benchmark results using the automatically generated LB kernels and compare them to manual implementations with different optimization levels.
A. Single Node Benchmark

1) Hardware:
We first investigate the single-node performance on two test systems. We scale all kernels on one socket of two Intel Xeon processors with different microarchitectures. The first system is an Intel Xeon E5-2695v3 Haswell system with 14 physical cores per socket. For benchmarking, we deactivate the turbo mode of this processor and set the frequency to a fixed value of 2.3 GHz using the likwid tool suite [56]. The second system is an Intel Xeon Gold 6148 Skylake CPU that has 20 cores per socket, running at a fixed frequency of 2.4 GHz. The sub-NUMA clustering features of both systems are switched off to have one NUMA domain per socket. We use transparent huge pages and disable automatic NUMA balancing in the Linux kernel. For all benchmarks we use a domain size that is too large to fit into the outer-level cache of any of the tested systems.

To find an upper bound for the possible performance, assuming kernels are memory-bound, we use bandwidth measurements for both systems from [60]. Table III shows the measured copy bandwidth for both machines, where write-allocates have already been taken into account. Additionally, we use bandwidth measurements of scenarios that closely mimic the memory access behaviour of the D3Q19 LB kernels. Kernels with two-array population storage are compared to a stream benchmark with 19 parallel streams and non-temporal stores, labeled copy-19-nt-sl. Kernels with a single-array update pattern are compared to a benchmark that updates 19 values in place, called update-19. To obtain the upper bound for the kernels, measured in million lattice updates per second (MLUP/s), the bandwidth is divided by the number of bytes that have to be transferred per lattice cell. We assume double precision for all kernels; thus each cell update requires 2 · q · 8 bytes.

processor            Xeon E5-2695v3    Xeon Gold 6148
microarchitecture    Haswell           Skylake
cores per socket     14                20
frequency            2.3 GHz           2.4 GHz
measured bandwidths  copy / copy-19-nt-sl / update-19
Table III: Test system specification with measured bandwidths from [60]. copy uses one load and one store stream; write-allocate is already taken into account. copy-19-nt-sl uses 19 load and store streams and non-temporal stores. update-19 updates 19 values in place.

The benchmark codes are compiled with Intel compiler 19.0.2 if not specified otherwise. Where explicitly noted, GCC 7.4.0 is used. For both compilers we set the optimization flag -O3, enable AVX512 on Skylake and AVX2 on Haswell, and switch on fast-math flags that allow the compiler to reorder floating-point operations. For the Intel compiler these are -fp-model fast=2 -no-prec-sqrt -no-prec-div, and for GCC -ffast-math. All kernels are parallelized with OpenMP.
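The roofline estimate described here reduces to a one-line computation. The sketch below reproduces it for a D3Q19 kernel; the 92 GB/s saturated bandwidth is an illustrative assumption, not a value from the measurements in [60]:

```python
# Roofline upper bound for memory-bound LB kernels:
# MLUP/s = bandwidth / bytes transferred per lattice cell update.
# The 92 GB/s bandwidth below is an illustrative assumption.

def mlups_upper_bound(bandwidth_bytes_per_s, q, bytes_per_value=8,
                      write_allocate=False):
    # q loads + q stores per cell; a write-allocate cache policy adds
    # another q loads, i.e. 3*q instead of 2*q values per cell
    values_per_cell = (3 if write_allocate else 2) * q
    bytes_per_cell = values_per_cell * bytes_per_value
    return bandwidth_bytes_per_s / bytes_per_cell / 1e6

bw = 92e9  # assumed saturated socket bandwidth
nt = mlups_upper_bound(bw, q=19)                       # NT stores
wa = mlups_upper_bound(bw, q=19, write_allocate=True)  # default stores
print(round(nt), round(wa))  # prints: 303 202
```

The ratio of the two bounds is exactly 1.5, the advantage of non-temporal stores discussed in the previous section.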
2) Two-array kernels and comparison to manual implementations:
As discussed before, there is a trade-off between code quality and performance when developing LB kernels manually in C/C++: code is optimized by specializing it for a particular scenario. To illustrate this trade-off we compare lbmpy-generated kernels with two manual implementations. The code of the manual implementations can be accessed at [4]. For this test we restrict ourselves to an SRT collision operator. The first manually implemented code is written in a stencil-agnostic way, where lattice velocities and stencil weights are abstracted away through template meta-programming. Theoretically, the compiler should be able to resolve these indirections fully at compile time. Relaxation rates are also passed in via a templated functor, to enable a flexible integration of turbulence models. While the main aim of this kernel is to be as generic as possible, it is still restricted to a single collision operator and a single two-array population storage pattern. But it is easily readable and extensible.

The second manual implementation is written specifically for a D3Q19 stencil. All loops over lattice directions are manually unrolled, expressions are simplified by leaving out multiplications with zero lattice direction components, and common subexpressions are eliminated. These steps lead to code duplication and decreased readability, but may lead to better performance. These optimization steps should not be necessary, since the compiler should be able to perform them automatically. The unrolled stencil-specific version is, however, the basis of further optimizations like loop splitting.

Figure 5: Comparison of kernels with different optimization levels on Skylake using a BGK method with two-array population storage. Left panel: GCC; right panel: Intel compiler. Series: manual generic, manual stencil-specific, and lbmpy with NT stores and loop splitting each switched off/on. Horizontal lines indicate the roofline estimate using the measured copy-19-nt-sl (lower) and copy bandwidth (higher).

Figure 5 shows benchmark results on the Skylake system for both manually implemented kernels using GCC and the Intel compiler. We can see that the Intel compiler (right) was indeed able to resolve the compile-time abstractions, such that the generic version is as fast as the stencil-specific one. GCC, however, cannot optimize the generic code automatically, obtaining only about half the performance. The manual implementations scale perfectly, but are far from utilizing the available bandwidth of the system. The generated kernel without loop splitting and non-temporal stores already performs better than the manual implementations. This kernel is explicitly vectorized with AVX512 SIMD intrinsics and uses pointer arithmetic to access the population arrays, whereas the manual implementations use getter/setter methods of an array class. Splitting the inner loops lets the kernel saturate at about 200 MLUP/s. Due to the write-allocate strategy, in total 1.5 times more data is moved across the memory interface than necessary. As can be seen in the plot, the performance of this kernel is consequently also about a factor of 1.5 worse than the best kernel with NT stores. Activating NT stores results in the expected performance of about 300 MLUP/s on this system, very close to the maximal 304 MLUP/s predicted by the roofline estimate obtained with the copy-19-nt-sl bandwidth. So the loop splitting and non-temporal store optimizations are indeed necessary to obtain the best possible performance on this system. All manual implementations where these optimizations have been applied are lengthy and hard to read, due to the unrolled loops and the usage of SIMD intrinsics. These hand-optimized codes are not only difficult and time-consuming to develop; their maintainability and flexibility have also been sacrificed for performance. With code generation it is possible to resolve these conflicting goals.
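The kind of stencil-specific simplification described above (unrolled direction loops with zero lattice-velocity components dropped) can be illustrated on the momentum moments of a D2Q9 stencil. This is a hand-written sketch of the idea, not code produced by lbmpy:

```python
# D2Q9 lattice velocities; many components are zero, so unrolled,
# stencil-specific code can drop the corresponding multiplications.
C = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
     (1, 1), (-1, 1), (1, -1), (-1, -1)]

def momentum_generic(f):
    # stencil-agnostic version: loops over all directions,
    # including multiplications with zero velocity components
    mx = sum(f[q] * C[q][0] for q in range(9))
    my = sum(f[q] * C[q][1] for q in range(9))
    return mx, my

def momentum_unrolled(f):
    # stencil-specific version: zero terms eliminated by hand;
    # lbmpy derives this automatically from the symbolic form
    mx = f[1] - f[2] + f[5] - f[6] + f[7] - f[8]
    my = f[3] - f[4] + f[5] + f[6] - f[7] - f[8]
    return mx, my

f = [1, 2, 3, 4, 5, 6, 7, 8, 9]
assert momentum_generic(f) == momentum_unrolled(f)
```

Both variants compute the same moments, but the unrolled one issues six additions per component instead of nine multiply-adds; lbmpy performs this elimination, together with common-subexpression elimination, on the symbolic representation.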
Figure 5 also shows that the generated code performs consistently across different compilers, since all abstractions are already transformed to perform with the best possible efficiency by the code generation system, leaving only standard optimizations to the backend compiler.

Figure 6 shows the corresponding results for the Haswell system. Overall, the behavior of this older system is similar to Skylake, with the exception that the system is apparently not able to handle parallel non-temporal store streams, as the version with NT stores but without loop splitting performs very poorly.

Figure 6: Two-field BGK D3Q19 kernels on Haswell (Intel compiler). Configuration and roofline indicators as in fig. 5.
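The loop-splitting transformation discussed above can be sketched schematically as follows. This Python mock-up only shows the structure of the transformation (pystencils applies it to the generated C loops, buffering common intermediate results); all names and the update formula are illustrative:

```python
# Schematic sketch of inner-loop splitting: instead of one loop that
# writes all Q population streams per cell, the updates are split into
# several loops that each touch only a subset of the streams, reducing
# the number of concurrent store streams the hardware must sustain.
Q, NX = 6, 64  # illustrative direction count and line length

def update_fused(src):
    dst = [[0.0] * NX for _ in range(Q)]
    for x in range(NX):            # one inner loop, Q store streams
        for q in range(Q):
            dst[q][x] = 0.5 * src[q][x] + 1.0
    return dst

def update_split(src, chunk=2):
    dst = [[0.0] * NX for _ in range(Q)]
    for q0 in range(0, Q, chunk):  # several loops, `chunk` streams each
        for x in range(NX):
            for q in range(q0, q0 + chunk):
                dst[q][x] = 0.5 * src[q][x] + 1.0
    return dst
```

Splitting leaves the results unchanged but lowers the number of concurrent store streams per inner loop, which is what makes the difference on hardware with limited write-combining resources.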
3) Kernels with AA pattern and boundary handling:
Next, we show performance results for single-array kernels that use the AA update pattern. We use the TRT collision operator for these benchmarks. SRT and TRT kernels have very similar performance characteristics, because they require about the same number of FLOPs. In fig. 7 we compare the best two-field version, with split loops and non-temporal stores, to the corresponding AA kernel. The two-array kernel saturates at about 13 cores; the AA version reaches its highest performance already with 6 cores. The additional development effort that is required for the AA pattern pays off not only in half the memory consumption but also in single-core performance. For the AA kernels the NT-store optimization is not applicable, since all values are updated in place. The inner loop splitting, however, may be beneficial. Since there are two different kernels for even and odd time steps, there are in total four options, where the splitting transformation is applied to none, only one, or both kernels. We find that loop splitting does not help in this case: all four options yield almost equal performance. Thus, fig. 7 only shows the version where neither of the two kernels has been split.

Figure 7: AA pattern, TRT on Skylake. Series: AA D3Q19 lbmpy, AA D3Q27 lbmpy, two-array D3Q19 lbmpy (NT:1, Split:1), and the list-based AA D3Q19 kernel of Wittmann et al. The roofline estimate uses the update-19 bandwidth.

We can also see that the roofline limit based on the measured update-19 bandwidth is a very good model for the performance on the full socket. The D3Q27 version saturates very close to the expected value, which is a factor of 19/27 lower than that of the D3Q19 stencil. Figure 7 also shows performance results of the TRT LB benchmark kernel by Wittmann et al. [60]. From this benchmark we use the fastest kernel, list-aa-pv-soa, on a channel geometry. It also uses the AA pattern in an SoA layout. In contrast to the lbmpy kernels, it operates on a sparse list data structure, such that only populations in fluid cells have to be stored. Also, the benchmark kernels have boundary handling built in, while the lbmpy results in fig. 7 show the performance of the compute kernel only.

The boundary handling performance of lbmpy is investigated in fig. 8. It shows the lbmpy kernels with two different boundary handling approaches for a channel scenario, where non-periodic boundaries are set on all sides.
The simplest option is to generate separate, external kernels that handle the boundaries. Since cache lines containing populations at the border must be loaded twice during a time step, the final performance obtained at the full socket is decreased by about 15%. The second approach introduces conditionals in the compute kernel. With this approach we obtain about the same performance on the full socket as with the pure compute kernel, but the performance on a few cores is much lower: the intrinsics-based SIMD vectorization in pystencils cannot yet handle the conditionals in an optimal way. This limitation is expected to be remedied in future work.

Figure 8: TRT collision operator on Skylake using the AA pattern with different boundary options: D3Q19 lbmpy with external boundary kernels, D3Q19 lbmpy with compiled-in boundaries, and the sparse D3Q19 kernel of Wittmann et al.
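The in-place property of the AA pattern can be made concrete with a minimal one-dimensional, streaming-only sketch (collision omitted, periodic boundaries). This is a didactic reconstruction of the update rule, not generated code:

```python
# Minimal 1D AA-pattern sketch: two populations per cell, an
# east-moving (+1) and a west-moving (-1) one, periodic domain.
N = 8
E, W = 0, 1  # slot indices for east-/west-moving populations

def aa_even(f):
    # even time step: fully cell-local; results are written to the
    # *opposite* direction slot of the same cell
    for x in range(N):
        f[x][E], f[x][W] = f[x][W], f[x][E]

def aa_odd(f):
    # odd time step: each cell reads the two slots where the even step
    # left its incoming values and writes its results back to exactly
    # those slots, so the update is safely in place
    for x in range(N):
        xm, xp = (x - 1) % N, (x + 1) % N
        fe, fw = f[xm][W], f[xp][E]   # gather (collision would go here)
        f[xp][E], f[xm][W] = fe, fw   # scatter to the slots just read

def reference(fE, fW, steps=2):
    # conventional two-array pull streaming for comparison
    for _ in range(steps):
        fE = [fE[(x - 1) % N] for x in range(N)]
        fW = [fW[(x + 1) % N] for x in range(N)]
    return fE, fW

f = [[x, 10 + x] for x in range(N)]
refE, refW = reference([x for x in range(N)], [10 + x for x in range(N)])
aa_even(f)
aa_odd(f)
assert [c[E] for c in f] == refE and [c[W] for c in f] == refW
```

Because every memory slot touched in the odd step is written by exactly the cell that read it, both half-steps operate on a single population array, which halves the memory footprint compared to the two-array scheme.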
4) Advanced collision operators:
The goal of code generation in lbmpy is not to make a single collision operator fast, but to provide a framework that can obtain good performance for a wide range of different LBMs. Figure 9 shows results for different D3Q19 LB schemes on the test systems using the AA update pattern. The TRT results from above are included again for reference. A slightly more complex scheme is the BGK operator with an included Smagorinsky turbulence model. In this kernel, the relaxation rate is determined on a cell-by-cell basis. The computation of the adapted rate is done on the fly inside the kernel, so as not to introduce additional memory accesses. Schemes with variable relaxation rates are often implemented, for flexibility reasons, such that the rates are computed in a separate kernel and stored in an additional array. This is not necessary in lbmpy, so the Smagorinsky kernel obtains performance identical to the TRT kernel on the full socket. On Skylake the performance on small core counts is also almost identical to the TRT kernel, whereas on Haswell the additional computational complexity, e.g., the two sqrt operations per cell, leads to lower single-core performance compared to TRT.

Next, we investigate the performance characteristics of an MRT kernel with weighted orthogonal moments. It has four relaxation rates: two for controlling shear and bulk viscosity separately, one for third-, and one for fourth-order moments. All relaxation rates remain symbolic at compile time and become run-time parameters. lbmpy is capable of optimizing this model so that it runs almost as fast as the TRT model on both test architectures. This is also true for other MRT models that the system can generate, e.g., with weighted/unweighted moment orthogonalization or compressible/incompressible equilibria.
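For reference, the cell-local relaxation rate of the Smagorinsky model has the standard form below, stated here generically in lattice units (the exact variant implemented may differ in its constants):

```latex
\nu_t = (C_S\,\Delta x)^2\,|\bar S|, \qquad
\omega_{\text{eff}} = \frac{1}{3\,(\nu_0 + \nu_t) + \tfrac{1}{2}}
```

with ν₀ the molecular viscosity, C_S the Smagorinsky constant, and |S̄| the local strain-rate magnitude, which can be evaluated directly from the second-order non-equilibrium moments of the populations. This evaluation contributes the square-root operations mentioned above and avoids storing an extra relaxation-rate field.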
Figure 9: Comparison of different LB collision operators (TRT, Smagorinsky, MRT, cumulant, entropic) on Haswell (left) and Skylake (right). All kernels use the AA pattern and a D3Q19 stencil.

Figure 9 also contains measurements for a D3Q19 cumulant method. The non-linear transformation to cumulant space makes this collision operator more compute-intensive than MRT methods. Nonetheless, lbmpy can optimize the cumulant kernel such that it saturates the available memory bandwidth on both systems. Additionally, we try an entropic method of KBC type. Shear and bulk viscosity are kept fixed; the relaxation rate for higher-order moments is chosen adaptively to maximize entropy. This method is too compute-intensive to be memory-bound on Haswell, but on a full Skylake socket it achieves performance similar to the simpler methods.

Summarizing our findings: after careful optimization there is no performance penalty for using complex LB collision operators. On modern CPU architectures all LBM implementations are memory-bound when they are properly optimized. This optimization, however, is only achievable by tedious manual coding by experts or by using automatic code generation with tools like lbmpy.

B. Scaling Benchmark
Integrating the generated lbmpy kernels into the waLBerla framework allows us to run large-scale simulations on distributed-memory systems. We use the MPI communication capabilities of waLBerla together with generated serialization/deserialization kernels to run large parallel simulations. In contrast to previous work [23], [6], where scaling results for manual implementations of TRT two-field kernels have been shown, we demonstrate here the performance of a more complex MRT kernel with AA pattern. As shown above, this kernel runs as fast as an SRT or TRT collision operator on the full node when properly optimized.

As test system the SuperMUC-NG supercomputer in Munich is used. It consists of Intel Xeon Platinum 8174 processors with Skylake architecture. Each node has two sockets with 24 physical cores each. We run a weak scaling setup on up to the 3072 compute nodes that we have access to. This is about half of the full machine size.

Figure 10: Weighted orthogonal MRT method with AA update pattern scaled on SuperMUC-NG up to 147,456 cores (3072 nodes) using a channel geometry, for block sizes (300, 100, 100) and (16, 16, 16).

Figure 10 shows the scaling results for a channel geometry. The domain is partitioned into equally sized blocks, and each block is assigned to a physical core. The machine is best utilized when choosing large block sizes, since the communication overhead is kept small in this case. For a block size of (300, 100, 100) we observe perfect scalability up to all 147,456 cores used. In this configuration we obtain about 1972 GLUP/s on half of SuperMUC-NG. Besides this weak scaling scenario demonstrating maximal GLUP/s rates, we may alternatively want to maximize the number of time steps per second. To illustrate the capabilities of lbmpy with waLBerla we therefore also study a scaling scenario with a much smaller block size. Figure 10 shows the scaling behavior for a block size of (16, 16, 16). Here we still achieve good scalability up to 1024 nodes; then the performance in GLUP/s drops to approximately 40% of the performance that was observed for the large block size. Note, however, that in this configuration we can execute 1327 time steps per second. Note also that on the 3072 nodes of SuperMUC-NG this is still an LBM grid of roughly 6 × 10^8 cells (16³ cells per core on 147,456 cores).

VI. Conclusion and Outlook
In this article, we presented a programming system named lbmpy that supports the flexible creation of highly optimized parallel LBMs. The scope of lbmpy comprises moment-based MRT methods plus cumulant and entropically stabilized collision operators. All methods can be created with locally varying relaxation parameters, so that various turbulence models or models for non-Newtonian fluids can be realized. lbmpy automatically optimizes the compute kernels with domain-specific transformations, and it can produce code that employs memory-efficient single-array population storage. Even for complex LBM models, the kernels generated by lbmpy can reach the same performance as the manually optimized state-of-the-art TRT implementations of [60]. After automatic optimization, all methods are memory-bound on a recent Skylake system. Thus, advanced collision operators can also be used without performance penalty, provided that large domain sizes are used and memory bandwidth is the bottleneck. Through integration into the HPC framework waLBerla, the lbmpy-generated methods can be executed on large-scale distributed-memory systems with excellent scalability.

The lbmpy/waLBerla programming system supports the computational science workflow. It automates the tedious and error-prone development steps. The support of lbmpy is vertically integrated: it starts with the development of advanced kinetic schemes, it assists the development of scalable parallel codes, and it includes advanced hardware-specific code optimizations. Note that the derivation of modern LB schemes will usually require mathematical and physical expertise from specialists in LBMs, while node-level kernel optimization requires a detailed understanding of modern CPU microarchitectures. In this sense lbmpy/waLBerla is an exercise in interdisciplinary co-design to create advanced simulation software for future extreme-scale computing.
The code-generation paradigm permits a higher level of abstraction than can be realized with conventional software engineering methods. These new, application-specific abstractions can only be realized with automatic code generation. They help to resolve the fundamental conflict between the flexibility of software and the need for hardware-specific optimizations. The approach taken in lbmpy offers a road to performance portability and thus improves the sustainability of scientific software.

VII. Acknowledgements
The authors are grateful for funding received through the project HPC2SE (01ICH16003D) from the Bundesministerium für Bildung und Forschung (BMBF). Additionally, we are grateful to the Regionales Rechenzentrum Erlangen, the Leibniz-Rechenzentrum in Garching, and the Swiss National Supercomputing Centre in Lugano for providing compute time on their HPC systems.

REFERENCES

[1] Alnæs, M. S.; Logg, A.; Ølgaard, K. B.; Rognes, M. E.; Wells, G. N.: Unified form language: A domain-specific language for weak formulations of partial differential equations. In: ACM Trans. Math. Software 40 (2014), Nr. 2, S. 9
[2] Ansumali, S.; Karlin, I. V.; Öttinger, H. C.: Minimal entropic kinetic models for hydrodynamics. In: Europhysics Letters 63 (2003), Nr. 6, S. 798–804. http://dx.doi.org/10.1209/epl/i2003-00496-6
[3] Bailey, P.; Myre, J.; Walsh, S. D.; Lilja, D. J.; Saar, M. O.: Accelerating lattice Boltzmann fluid flow simulations using graphics processors. In: IEEE, 2009, S. 550–557
[4] Bauer, M.: Implementation manual LB kernels. https://github.com/lssfau/walberla/blob/48bd19800b0c46030ae7f5e510e896a9154d78b8/apps/benchmarks/UniformGridGenerated/ManualKernels.h, 2020. – [Online; accessed 24-January-2020]
[5] Bauer, M.; Eibl, S.; Godenschwager, C.; Kohl, N.; Kuron, M.; Rettinger, C.; Schornbaum, F.; Schwarzmeier, C.; Thönnes, D.; Köstler, H. u. a.: waLBerla: A block-structured high-performance framework for multiphysics simulations. In: Computers & Mathematics with Applications (2020)
[6] Bauer, M.; Eibl, S.; Godenschwager, C.; Kohl, N.; Kuron, M.; Rettinger, C.; Schornbaum, F.; Schwarzmeier, C.; Thönnes, D.; Köstler, H.; Rüde, U.: waLBerla: A block-structured high-performance framework for multiphysics simulations. In: Computers & Mathematics with Applications (2020). http://dx.doi.org/10.1016/j.camwa.2020.01.007
[7] Bauer, M.; Hötzer, J.; Ernst, D.; Hammer, J.; Seiz, M.; Hierl, H.; Hönig, J.; Köstler, H.; Wellein, G.; Nestler, B. u. a.: Code generation for massively parallel phase-field simulations. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2019, S. 59
[8] Bauer, M.; Schornbaum, F.; Godenschwager, C.; Markl, M.; Anderl, D.; Köstler, H.; Rüde, U.: A Python extension for the massively parallel multiphysics simulation framework waLBerla. In: International Journal of Parallel, Emergent and Distributed Systems 31 (2016), Nr. 6, S. 529–542. http://dx.doi.org/10.1080/17445760.2015.1118478
[9] Bauer, M.; Silva, G.; Rüde, U.: Truncation errors of the D3Q19 lattice model for the lattice Boltzmann method. In: Journal of Computational Physics 405 (2020), 109111. http://dx.doi.org/10.1016/j.jcp.2019.109111
[10] Bösch, F.; Chikatamarla, S. S.; Karlin, I. V.: Entropic multirelaxation lattice Boltzmann models for turbulent flows. In: Physical Review E - Statistical, Nonlinear, and Soft Matter Physics (2015). http://dx.doi.org/10.1103/PhysRevE.92.043309
[11] d'Humières, D.; Ginzburg, I.; Krafczyk, M.; Lallemand, P.; Luo, L.-S.: Multiple-relaxation-time lattice Boltzmann models in three dimensions. In: Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 360 (2002), Nr. 1792, S. 437–451. http://dx.doi.org/10.1098/rsta.2001.0955
[12] Dünweg, B.; Schiller, U. D.; Ladd, A. J. C.: Statistical mechanics of the fluctuating lattice Boltzmann equation. In: Physical Review E - Statistical, Nonlinear, and Soft Matter Physics 76 (2007), Nr. 3, S. 1–10. http://dx.doi.org/10.1103/PhysRevE.76.036704
[13] Edwards, H. C.; Trott, C. R.; Sunderland, D.: Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. In: Journal of Parallel and Distributed Computing 74 (2014), Nr. 12, S. 3202–3216
[14] ELBE: . – accessed on 2020-01-21
[15] Farber, R.: Parallel programming with OpenACC. Cambridge: Morgan Kaufmann, 2017
[16] Feichtinger, C.; Donath, S.; Köstler, H.; Götz, J.; Rüde, U.: WaLBerla: HPC software design for computational engineering simulations. In: Journal of Computational Science
[17] Feichtinger, C.; Habich, J.; Köstler, H.; Rüde, U.; Aoki, T.: Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU–GPU clusters. In: Parallel Computing 46 (2015), S. 1–13
[18] Frapolli, N.; Karlin, I. V.: Entropic Lattice Boltzmann Models for Thermal and Compressible Flows. (2017). http://dx.doi.org/10.3929/ethz-a-010782581
[19] Geier, M.; Schönherr, M.: Esoteric Twist: An Efficient in-Place Streaming Algorithmus for the Lattice Boltzmann Method on Massively Parallel Hardware. In: Computation
[20] Geier, M.; Schönherr, M.; Pasquali, A.; Krafczyk, M.: The cumulant lattice Boltzmann equation in three dimensions: Theory and validation. In: Computers and Mathematics with Applications (2015). http://dx.doi.org/10.1016/j.camwa.2015.05.001
[21] Ginzburg, I.; Verhaeghe, F.; d'Humières, D.: Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. In: Communications in Computational Physics
[22] Godenschwager, C.; Schornbaum, F.; Bauer, M.; Köstler, H.; Rüde, U.: A framework for hybrid parallel flow simulations with a trillion cells in complex geometries. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013, S. 35
[23] Godenschwager, C.; Schornbaum, F.; Bauer, M.; Köstler, H.; Rüde, U.: A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. New York, NY, USA: ACM, 2013 (SC '13). – ISBN 978-1-4503-2378-9, S. 35:1–35:12
[24] Groen, D.; Henrich, O.; Janoschek, F.; Coveney, P.; Harting, J.: Lattice-Boltzmann methods in fluid dynamics: Turbulence and complex colloidal fluids. In: Jülich Blue Gene/P Extreme Scaling Workshop, 2011, S. 17
[25] Groen, D.; Hetherington, J.; Carver, H. B.; Nash, R. W.; Bernabeu, M. O.; Coveney, P. V.: Analysing and modelling the performance of the HemeLB lattice-Boltzmann simulation environment. In: Journal of Computational Science
[26] Guo, Z.; Zheng, C.; Shi, B.: Discrete lattice effects on the forcing term in the lattice Boltzmann method. In: Physical Review E 65 (2002), Nr. 4, S. 046308
[27] Gysi, T.; Osuna, C.; Fuhrer, O.; Bianco, M.; Schulthess, T. C.: STELLA: a domain-specific tool for structured grid methods in weather and climate models. In: Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2015, S. 41
[28] He, X.; Luo, L.-S.: Lattice Boltzmann Model for the Incompressible Navier–Stokes Equation. In: Journal of Statistical Physics 88 (1997), Nr. 3/4, S. 927–944. http://dx.doi.org/10.1023/B:JOSS.0000015179.12689.e4
[29] Henretty, T.; Veras, R.; Franchetti, F.; Pouchet, L.-N.; Ramanujam, J.; Sadayappan, P.: A stencil compiler for short-vector SIMD architectures. In: Proc. of the 27th international ACM conference on International conference on supercomputing. ACM, 2013, S. 13–24
[30] Heuveline, V.; Latt, J.: The OpenLB project: an open source and object oriented implementation of lattice Boltzmann methods. In: International Journal of Modern Physics C 18 (2007), Nr. 04, S. 627–634. http://dx.doi.org/10.1142/S0129183107010875
[31] Holewinski, J.; Pouchet, L.-N.; Sadayappan, P.: High-performance code generation for stencil computations on GPU architectures. In: Proc. of the 26th ACM international conference on Supercomputing. ACM, 2012, S. 311–320
[32] Hou, S.; Sterling, J.; Chen, S.; Doolen, G. D.: A Lattice Boltzmann Subgrid Model for High Reynolds Number Flows. In: Fields Institute Communications
[33] Kluyver, T.; Ragan-Kelley, B.; Pérez, F.; Granger, B.; Bussonnier, M.; Frederic, J.; Kelley, K.; Hamrick, J.; Grout, J.; Corlay, S.; Ivanov, P.; Avila, D.; Abdalla, S.; Willing, C.: Jupyter Notebooks – a publishing format for reproducible computational workflows. In: Loizides, F. (Hrsg.); Schmidt, B. (Hrsg.); IOS Press (Veranst.): Positioning and Power in Academic Publishing: Players, Agents and Agendas. IOS Press, 2016, S. 87–90
[34] Krüger, T.; Kusumaatmaja, H.; Kuzmin, A.; Shardt, O.; Silva, G.; Viggen, E. M.: The Lattice Boltzmann Method - Principles and Practice. Springer, 2017. – ISBN 978-3-319-44649-3
[35] Krüger, T.; Varnik, F.; Raabe, D.: Shear stress in lattice Boltzmann simulations. In: Physical Review E - Statistical, Nonlinear, and Soft Matter Physics 79 (2009), Nr. 4, S. 1–14. http://dx.doi.org/10.1103/PhysRevE.79.046704
[36] Lagrava, D.; Malaspinas, O.; Latt, J.; Chopard, B.: Advances in multi-domain lattice Boltzmann grid refinement. In: Journal of Computational Physics 231 (2012), Nr. 14, S. 4808–4822. http://dx.doi.org/10.1016/j.jcp.2012.03.015
[37] LB3D: http://ccs.chem.ucl.ac.uk/lb3d. – accessed on 2020-01-21
[38] Lengauer, C.; Apel, S.; Bolten, M.; Chiba, S.; Rüde, U.; Teich, J.; Grösslinger, A.; Hannig, F.; Köstler, H.; Claus, L. u. a.: ExaStencils – Advanced Multigrid Solver Generation. In: Software for Exascale Computing – SPPEXA 2nd phase (2020)
[39] Liu, Z.; Chu, X.; Lv, X.; Meng, H.; Shi, S.; Han, W.; Xu, J.; Fu, H.; Yang, G.: SunwayLB: Enabling extreme-scale lattice Boltzmann method based computing fluid dynamics simulations on Sunway TaihuLight. In: IEEE, 2019, S. 557–566
[40] Logg, A.; Mardal, K.-A.; Wells, G.: Automated solution of differential equations by the finite element method: The FEniCS book. Bd. 84. Springer Science & Business Media, 2012
[41] Mierke, D.; Janssen, C.; Rung, T.: An efficient algorithm for the calculation of sub-grid distances for higher-order LBM boundary conditions in a GPU simulation environment. In: Computers & Mathematics with Applications (2018). http://dx.doi.org/10.1016/j.camwa.2018.04.022
[42] OpenLB: . – accessed on 2020-01-21
[43] Palabos: . – accessed on 2020-01-21
[44] Pohl, T.; Kowarschik, M.; Wilke, J.; Iglberger, K.; Rüde, U.: Optimization and profiling of the cache performance of parallel lattice Boltzmann codes. In: Parallel Processing Letters 13 (2003), Nr. 04, S. 549–560
[45] Rathgeber, F.; Ham, D. A.; Mitchell, L.; Lange, M.; Luporini, F.; McRae, A. T.; Bercea, G.-T.; Markall, G. R.; Kelly, P. H.: Firedrake: automating the finite element method by composing abstractions. In: ACM Transactions on Mathematical Software (TOMS) 43 (2016), Nr. 3, S. 1–27
[46] Roman, S.: The formula of Faa di Bruno. In: The American Mathematical Monthly 87 (1980), Nr. 10, S. 805–809
[47] Rüde, U.; Willcox, K.; McInnes, L. C.; Sterck, H. D.: Research and education in computational science and engineering. In: SIAM Review 60 (2018), Nr. 3, S. 707–754
[48] Sailfish: https://github.com/sailfish-team/sailfish. – accessed on 2020-01-21
[49] Schiller: Documents/50/thesis
[50] Schmieschek, S.; Shamardin, L.; Frijters, S.; Krüger, T.; Schiller, U. D.; Harting, J.; Coveney, P. V.: LB3D: A parallel implementation of the Lattice-Boltzmann method for simulation of interacting amphiphilic fluids. In: Computer Physics Communications 17 (2017), S. 149–161. http://dx.doi.org/10.1016/j.cpc.2017.03.013
[51] Schornbaum, F.; Rüde, U.: Massively parallel algorithms for the lattice Boltzmann method on nonuniform grids. In: SIAM Journal on Scientific Computing 38 (2016), Nr. 2, S. C96–C126
[52] Schornbaum, F.; Rüde, U.: Extreme-scale block-structured adaptive mesh refinement. In: SIAM Journal on Scientific Computing 40 (2018), Nr. 3, S. C358–C387
[53] Silva, G.; Semiao, V.: First- and second-order forcing expansions in a lattice Boltzmann method reproducing isothermal hydrodynamics in artificial compressibility form. In: Journal of Fluid Mechanics
[54] Tang, Y.; Chowdhury, R. A.; Kuszmaul, B. C.; Luk, C.-K.; Leiserson, C. E.: The Pochoir stencil compiler. In: Proc. of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures. ACM, 2011, S. 117–128
[55] TCLB: https://github.com/CFD-GO/TCLB. – accessed on 2020-01-21
[56] Treibig, J.; Hager, G.; Wellein, G.: Likwid: A lightweight performance-oriented tool suite for x86 multicore environments. In: IEEE, 2010, S. 207–216
[57] Wellein, G.; Lammers, P.; Hager, G.; Donath, S.; Zeiser, T.: Towards optimal performance for lattice Boltzmann applications on terascale computers. In: Parallel Computational Fluid Dynamics 2005. Elsevier, 2006, S. 31–40
[58] Wellein, G.; Zeiser, T.; Hager, G.; Donath, S.: On the single processor performance of simple lattice Boltzmann kernels. In:
Computersand Fluids
35 (2006), Nr. 8-9, S. 910–919. http://dx.doi.org/10.1016/j.compfluid.2005.02.008. – DOI 10.1016/j.compfluid.2005.02.008. – ISBN0045–7930[59] W
ILKE , J. ; P
OHL , T. ; K
OWARSCHIK , M. ; R ¨
UDE , U. : Cacheperformance optimizations for parallel lattice Boltzmann codes. In:
European Conference on Parallel Processing
Springer, 2003, S. 441–450[60] W
ITTMANN , M. ; H
AAG , V. ; Z
EISER , T. ; K ¨
OSTLER , H. ; W
ELLEIN ,G. : Lattice Boltzmann benchmark kernels as a testbed for performanceanalysis. In:
Computers & Fluids
172 (2018), S. 582–592[61] W
ITTMANN , M. ; Z
EISER , T. ; H
AGER , G. ; W
ELLEIN , G. : Com-parison of different propagation steps for lattice Boltzmann methods.In: