lbmpy: Automatic code generation for efficient parallel lattice Boltzmann methods
Martin Bauer
Chair for System Simulation, FAU Erlangen-Nürnberg
[email protected]

Harald Köstler
Chair for System Simulation, FAU Erlangen-Nürnberg
[email protected]

Ulrich Rüde
Chair for System Simulation, FAU Erlangen-Nürnberg
CERFACS, 31057 Toulouse Cedex 1, France
[email protected]
Abstract—Lattice Boltzmann methods are a popular mesoscopic alternative to classical computational fluid dynamics based on the macroscopic equations of continuum mechanics. Many variants of lattice Boltzmann methods have been developed that vary in complexity, accuracy, and computational cost. Extensions are available to simulate multi-phase, multi-component, turbulent, and non-Newtonian flows. In this work we present lbmpy, a code generation package that supports a wide variety of different lattice Boltzmann methods. Additionally, lbmpy provides a generic development environment for new schemes. A high-level domain-specific language allows the user to formulate, extend, and test various lattice Boltzmann methods. In all cases, the lattice Boltzmann method can be specified in symbolic form. Transformations that operate on this symbolic representation yield highly efficient compute kernels. This is achieved by automatically parallelizing the methods, and by various application-specific automated steps that optimize the resulting code. This pipeline of transformations can be applied to a wide range of lattice Boltzmann variants, including single- and two-relaxation-time schemes, multi-relaxation-time methods, as well as the more advanced cumulant methods and entropically stabilized methods. lbmpy can be integrated into high-performance computing frameworks to enable massively parallel, distributed simulations. This is demonstrated using the waLBerla multiphysics package to conduct scaling experiments on the SuperMUC-NG supercomputing system on up to 147 456 compute cores.
I. INTRODUCTION
Computational science and engineering is an interdisciplinary field. The workflow of creating computational models starts at (physical) reality and ends with the production of efficiently executable computer code [47]. However, on the route from physical phenomena to machine code lies the formulation of mathematical models and their discretization, the construction of time stepping schemes and solution methods, the design and analysis of parallel algorithms, the realization of complex software systems, and finally the transformation to code that can be executed on a given hardware. The target computer architecture may be a massively parallel system, possibly heterogeneous and using accelerators that can only be exploited by special programming techniques. During development, many choices must be made and alternatives considered. Thus, creating computational science software is a work-intensive, time-consuming, and error-prone task, whose complexity is easily underestimated, despite its fundamental relevance for extracting reliable predictions from scientific principles. In this article, we will present progress towards the systematic design of scientific software based on the automatic derivation of methods, including the automatic design of efficient software. When designing scientific software with conventional programming techniques, it is often difficult to find the right balance between the flexibility of an approach and its performance, since these are often conflicting goals. Often the specialization to a restricted class of problems would permit special optimizations and can thus lead to more efficient codes. However, a flexible software design may lead to more extensible and more generally usable software.
Additionally, using general-purpose libraries for subtasks, such as for the solution of linear systems, may reduce development time, but may also lead to overheads when special structures that would lead to faster algorithms cannot be exploited. Furthermore, it is typical that computational software outlives the computer systems that it was originally designed for. This leads to the problem of performance portability. The choice of a specific algorithm and a specific software design may have been good for efficient execution on older computer architectures, but these design choices may turn out to be a major bottleneck on the accelerator-based architectures that dominate high-end computing today. A complete rewrite would be necessary, but this is often too time-consuming and expensive. Legacy code therefore may stay in use even though it severely underperforms on modern hardware. A generic approach to overcome these difficulties is more advanced abstractions. For example, libraries such as Kokkos [13] or programming systems such as OpenACC [15] present abstractions of fine-granular concurrent execution on modern hardware that can help to alleviate the problems of performance portability. An alternative to such libraries can be metaprogramming techniques and the usage of domain-specific languages (DSLs). Here, machine-optimized code is generated utilizing program transformations and compiler technology. Using DSLs opens additional possibilities based on application-specific abstractions. A prominent and successful example of a finite-element-specific DSL is UFL (unified form language) [1], which is embedded in Python. It is used, e.g., in FEniCS [40] or Firedrake [45]. These automated computing platforms permit the generation of computational models based on a wide variety of partial differential equations that can be expressed in a DSL. For stencil-based computations, there exist a number of different DSLs, e.g., [54], [31], [29], [27], that work on
structured grids, and, e.g., the ExaSlang DSL that can also work on block-structured grids [38]. These DSLs succeed in supporting a wide variety of mathematical models with high flexibility. Additionally, most of them also support parallel program execution. Such approaches can exploit application knowledge represented in the DSL design and can thus use optimization techniques that are more powerful than optimizing compilers for general-purpose languages. In the present article, we will focus alternatively on the lattice Boltzmann method (LBM) as a promising mesoscopic modeling paradigm. Our scope of modeling thus supports kinetic schemes as an alternative to continuum mechanics for fluid dynamics. Kinetic schemes are often performance-hungry, but they also offer a high degree of parallelism. Therefore the scalability and performance on modern computer architectures is a central goal of our work. In particular, we will attempt to reach the performance of the best manually optimized LBM codes by developing a new application-specific code generation technology. To achieve this, we will leverage long-term experience in manually optimizing general stencil codes and systematic performance engineering for LBM methods [59], [57], [58], [22], [17]. In particular, we will build on waLBerla as a state-of-the-art LBM software framework with excellent performance characteristics [5], [39]. However, the current article goes beyond developing tools that help to optimize given kinetic simulation algorithms for a variety of computer architectures. While this alone is also useful, we here extend the code transformation approach to the earlier stages of the computational science workflow. In particular, we point out that the design and derivation of lattice Boltzmann models follows a complex but systematic methodology. The development of a specific LBM for a given application is characterized by various options and choices that the method designer must make.
This involves selecting a lattice model, defining momenta and relaxation rates, and many more, as will be elaborated in detail below. Based on these choices, a very wide variety of LBMs can be derived. While any such model could be constructed manually and then optimized using program generation technology, this article goes an essential step further. The manual development of advanced LBMs can also be time-consuming and error-prone. We will present here how the derivation of the models themselves can be conceptualized so that it becomes amenable to automation. In lbmpy, the tedious mechanical steps of the LBM development can be performed by automatic symbolic manipulations, saving precious developer time and making the methods more reliable and reproducible. In summary, lbmpy jointly with waLBerla becomes a computing platform for LBMs that is equivalent to what FEniCS is for finite elements. lbmpy helps to realize highly efficient LBM implementations and makes it easy for the developer to experiment with different variants of the methods. Fully grown implementations of many different LBMs can be generated with a single mouse click. We emphasize here specifically that our code generation technology has the unique capability to generate highly efficient parallel code that is ready to run with optimal scalability on the largest supercomputers. The LBM is based on concepts from statistical mechanics. The fluid is modeled on the mesoscopic level using distribution functions that represent the statistical behavior of the particles constituting the fluid. Compared to traditional computational fluid dynamics (CFD), which describes the fluid macroscopically with the Navier-Stokes equations, the mesoscopic description permits higher modeling expressivity, leading to a variety of LBMs, e.g., for porous media or multi-phase flows. Its local data access pattern makes the method very well suited for modern hardware architectures that draw their computational power from ever-increasing concurrency.
To run code efficiently on these architectures, parallelism on different levels must be exploited, starting from single-instruction-multiple-data (SIMD) instruction sets, utilizing multiple cores per node with OpenMP, to distributed memory parallelism with MPI. Optimizing LB compute kernels to get the best possible performance requires hardware-specific adaptation. This process costs significant development time. Unfortunately, it must be repeated for each new hardware platform. Furthermore, the optimization process often leads to code that is hard to read and maintain. In practice, this can lead to the effect that for prototyping and development, a slow but flexible code is used, and only a few proven methods are re-implemented in a highly optimized way. Over time a large variety of LBMs have been proposed [34], starting from the simple, but widely used, BGK single-relaxation-time (SRT) operator that relaxes the current state linearly to equilibrium with a single relaxation rate, over two-relaxation-time methods (TRT) [21], to general multi-relaxation-time (MRT) [11] methods. All these methods relax to equilibrium in moment space and can be viewed as special cases of MRT. Then, there are multiple advanced methods like cumulant [20] or entropic LBMs [10], where a different set of statistical quantities is chosen for relaxation or the discrete equilibrium is modified. All LB versions have in common that they are parameterized by a set of relaxation rates, which can either be chosen constant or adapted locally. Computing relaxation parameters from local quantities, e.g., shear rates, is used to implement turbulence models or to simulate non-Newtonian fluids.
Entropic KBC-type models [10], [2], [18] can also be placed into this group since they are constructed based on an MRT method with two relaxation rates, where one relaxation rate is chosen subject to a local entropy condition. Additionally, LBMs can be extended with different forcing schemes [53], [26], and the equilibrium can be approximated up to different orders, either in the so-called incompressible [28] or the standard compressible version. During the implementation phase, the space of options grows further still. Different storage patterns for the distribution functions can be chosen [3], [19]. Hardware-dependent optimizations, like loop splitting and non-temporal stores, further increase the performance on recent CPU architectures. Faced with this space of physical models and implementation/optimization options, developers of LB frameworks have two choices: either pick a small set of LB methods and optimize them for a specific target architecture, repeating the full process whenever a new LB method or hardware platform must be supported, or try to abstract and automate the development task. In this work, we show that it is indeed possible to automate many tedious development tasks, like reformulating equations to save floating-point operations, splitting loops for better memory access behavior, or fusing stream and collision kernels. We present the lattice Boltzmann code generation package lbmpy that solves these problems by automating large parts of the LBM development and optimization process. Its source code and documentation are available open-source under the GNU AGPLv3 license. lbmpy is a system for the development of computational fluid dynamics codes based on the LBM. In this regard, it is comparable to other LBM frameworks such as OpenLB [30], [42], Palabos [36], [43], elbe [41], [14], LB3D [24], [50], [37], and HemeLB [25]. Some LB frameworks like Sailfish [48] and TCLB [55] also use metaprogramming techniques, mainly for portability to GPUs.
While these frameworks support many LB methods, systematic performance evaluations and optimizations are predominantly available for simple LB collision operators like single- and two-relaxation-time methods [60], [61], [58], [22]. In this work, we first give an overview of the specific LBM design workflow and the associated code generation pipeline in section II. Then we describe the formalism for LB method specification in section III, which comprises the two upper abstraction layers. The transformation to an algorithmic description and its optimization is discussed in section IV. Finally, we present performance and scaling results in section V.

II. OVERVIEW: LB METAPROGRAMMING PIPELINE
In this section, we first give an overview of the various abstraction layers of lbmpy's metaprogramming pipeline. In the following sections, we then discuss each layer in detail. All abstraction layers are implemented based on the computer algebra system sympy. The code generation system itself is implemented in Python, but the generated code is produced in C/C++/LLVM for CPUs and in CUDA or OpenCL for GPUs. As illustrated in fig. 1, the most abstract layer represents an LBM in q-dimensional collision space for a DdQq stencil. This collision space is either based on moments or cumulants [20]. To specify an LB scheme, the user defines a basis of the collision space as well as an equilibrium value and a relaxation parameter for each component. Relaxation parameters do not have to be constants but can be chosen as symbolic expressions that depend on local quantities like shear rates. This permits the formulation of models for non-Newtonian fluids, turbulence models, or entropic stabilization. lbmpy transforms this high-level representation into a symbolic description of the collision operator. The collision operator is stored as a symbolic function Ω : R^q → R^q, where q is the number of discrete distribution functions. In the next step, the collision operator is mapped to a computational kernel. This stage allows the generation of pure collision kernels as well as fused stream-collide kernels. Additionally, a concrete storage pattern for the distribution functions is selected, e.g., two-array push or pull patterns or more elaborate single-array storage schemes. Optionally one can integrate boundary handling or output of macroscopic quantities at this layer as well. The output of this stage is a symbolic stencil representation of an LB compute kernel. This stencil representation is passed to the pystencils package [7] that produces the actual code for CPUs or GPUs.

Figure 1: Abstraction layers of the metaprogramming pipeline.

(lbmpy source code and documentation: https://i10git.cs.fau.de/pycodegen/lbmpy)
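The idea behind this pipeline can be illustrated with plain sympy, independent of lbmpy's actual API (all names and the D1Q3 setup below are invented for this sketch): a collision operator is first assembled as symbolic expressions and then "compiled" into an executable kernel, here a BGK collision with an equilibrium truncated at first order in the velocity.

```python
# Sketch of the pipeline idea (not lbmpy's actual API): a collision
# operator is built symbolically, then turned into an executable kernel.
import sympy as sp

c = [0, 1, -1]                                  # D1Q3 lattice velocities
w = [sp.Rational(2, 3), sp.Rational(1, 6), sp.Rational(1, 6)]
f = sp.symbols('f0:3')                          # pre-collision populations
omega = sp.Symbol('omega')

rho = sum(f)                                    # zeroth moment
u = sum(ci * fi for ci, fi in zip(c, f)) / rho  # first moment
cs2 = sp.Rational(1, 3)
f_eq = [wi * rho * (1 + ci * u / cs2) for wi, ci in zip(w, c)]

# symbolic collision operator Omega: R^q -> R^q
collided = [omega * fe + (1 - omega) * fi for fe, fi in zip(f_eq, f)]

# "code generation" step: compile the symbolic update into a callable
kernel = sp.lambdify((f, omega), collided)

out = kernel((1.0, 0.2, 0.1), 1.0)  # omega = 1 relaxes fully to equilibrium
```

Mass and momentum are conserved by the collision, which can be checked directly on the compiled kernel; the real pipeline differs in that it emits C/CUDA/OpenCL instead of a Python callable.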
pystencils is augmented with custom optimization passes that add domain-specific knowledge to the code-generation pass. In this work we focus on the CPU backend; GPU-specific optimizations and performance results are part of a second publication.

III. MODEL DESCRIPTION IN COLLISION SPACE
In this section we first give a short overview of the theory that is used to define LBMs on the highest abstraction level. Then we describe the "LB method layer" in more detail, followed by examples showing how common collision operators can be specified on this layer.
A. Collision in moment space
All LBMs considered here discretize the computational domain using a Cartesian, uniformly spaced grid. The discretization is specified in stencil notation as DdQq, describing a d-dimensional domain where each cell contains q particle distribution functions (PDFs) labeled f_q(x_i, t) with q ∈ {0, ..., q−1}. The PDF f_q represents the mass fraction of particles moving along a lattice velocity c_q. In the following, we use lattice units, i.e., the positions x_i and the time t are integers. The most basic and probably still the most commonly used collision operator for LBMs is the single-relaxation-time (SRT) or BGK operator. It relaxes each population to its equilibrium distribution using a single relaxation parameter ω. The SRT LBM can be succinctly written as

    f_q(x + c_q, t + 1) = ω f_q^(eq)(x, t) + (1 − ω) f_q(x, t),   (1)

describing a single time step of the LBM, evolving the system from time t to t + 1. It can be split into a local collision step that computes a convex combination of the current state with the equilibrium state and a non-local streaming step that copies PDFs to neighboring cells. The collision is formulated using a relaxation rate ω, which is the inverse of the relaxation time τ, i.e., ω = 1/τ. The local density ρ and velocity u are computed from the distribution functions as

    ρ = Σ_q f_q   and   u_i = (1/ρ) Σ_q c_qi f_q.   (2)

With these macroscopic values the equilibrium is given as

    f_q^(eq) = w_q ρ + w_q ρ_0 [ c_qα u_α / c_s²
               + u_α u_β (c_qα c_qβ − c_s² δ_αβ) / (2 c_s⁴)
               + u_α u_β u_γ (c_qα c_qβ c_qγ − c_s² c_qα δ_βγ) / (2 c_s⁶) ].   (3)

The reference density ρ_0 can either be chosen constant to obtain a so-called incompressible LBM [28]; for the standard compressible LBM, set ρ_0 = ρ. Note that both versions approximate the incompressible Navier-Stokes equations (NSE) [34]. Equation (3) shows a third-order equilibrium approximation.
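As a small numerical illustration of eq. (2) (not lbmpy code; the stencil ordering is an arbitrary choice made for this sketch), the density and velocity of a single D2Q9 cell are the zeroth and first moments of its populations:

```python
# Eq. (2) for a single D2Q9 cell: rho = sum_q f_q,
# u_i = (1/rho) * sum_q c_qi * f_q. Stencil ordering is illustrative.
import numpy as np

stencil = np.array([(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
                    (1, 1), (-1, 1), (1, -1), (-1, -1)])
f = np.full(9, 1.0 / 9.0)   # uniform populations
f[1] += 0.01                # slight excess of mass moving in +x direction

rho = f.sum()                                  # zeroth moment: density
u = (stencil * f[:, None]).sum(axis=0) / rho   # first moment: velocity
```

Here rho = 1.01 and u points in the +x direction, as expected from the perturbed population.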
To obtain the NSE in the macroscopic limit, the equilibrium is required only up to second order in u [34]. While the BGK collision operator is still widely used in practice, we aim for a more generic description of LBMs. To develop a high-level description of LBMs that is used as the input to the metaprogramming pipeline, we rely on the formalism of multi-relaxation-time methods. It includes the BGK operator and the popular two-relaxation-time method as special cases. With this approach lbmpy is able to generate all LB schemes whose collision operator can be written in linear matrix form. This covers the majority of LBMs, with the most notable exception being the cumulant collision operator, which will be treated in section III-B. The MRT formalism also allows us to derive the discrete equilibrium (3) from its continuous counterpart instead of manually specifying it. The collision operator of MRT methods first transforms the PDFs from population space into moment space via a moment matrix M. The components of this moment vector m = M f are relaxed to equilibrium values m^(eq) using a diagonal matrix of relaxation rates S. One collide-stream step of an MRT method then reads

    f_q(x + c_q, t + 1) = ( M⁻¹ [ S m^(eq) + (I − S) M f ] )_q   (4)

with I denoting the identity matrix. To fully specify the method, we have to define a concrete moment matrix M, a vector of corresponding equilibrium values m^(eq), and a diagonal relaxation matrix S. We now discuss how each of these three ingredients can be specified in lbmpy.
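A quick symbolic consistency check of the MRT form (4), sketched with plain sympy: if all relaxation rates are equal, i.e. S = ωI, the MRT collision reduces to the BGK rule (1) for any invertible moment matrix M (the D1Q3 monomial matrix below is chosen only for illustration):

```python
# If S = omega*I, then M^{-1}[S m_eq + (I - S) M f]
# equals omega*f_eq + (1 - omega)*f, regardless of M.
import sympy as sp

q = 3
f = sp.Matrix(sp.symbols('f0:3'))
f_eq = sp.Matrix(sp.symbols('g0:3'))   # placeholder equilibrium populations
omega = sp.Symbol('omega')

# monomial moments 1, c, c^2 for the D1Q3 velocities (0, 1, -1)
M = sp.Matrix([[1, 1, 1], [0, 1, -1], [0, 1, 1]])
S = omega * sp.eye(q)
m_eq = M * f_eq                        # equilibrium in moment space

mrt = M.inv() * (S * m_eq + (sp.eye(q) - S) * M * f)
bgk = omega * f_eq + (1 - omega) * f

assert sp.simplify(mrt - bgk) == sp.zeros(q, 1)
```

This is exactly why the BGK and TRT operators can be treated as special cases of MRT, as stated above.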
1) Moment Space:
We begin with the transformation from population space to moment space via the moment matrix M. To derive an invertible transformation M, a set of q independent moments is required. Moments can be identified with polynomials in the lattice velocities P(c_q) : R^d → R. For example, the zeroth moment, i.e., the density, is given by the constant polynomial 1. The first moments, i.e., the x, y, and z momentum densities, are represented by c_qx, c_qy, and c_qz. Second-order polynomials describe viscosity modes; for example in 2D, c_qx² + c_qy² is a mode related to bulk viscosity, whereas c_qx · c_qy and c_qx² − c_qy² are modes related to shear viscosity. The moment value is then computed as

    Π_P(f) = Σ_{c_q ∈ S} P(c_q) f_q   (5)

with a stencil S that is given by a sequence of directions with integer components. Given a sequence of moment polynomials P_1, ..., P_k, the elements of the moment matrix are computed as M_kq = P_k(c_q). So we have to select q moment polynomials that yield an invertible moment matrix M. In lbmpy, there are various options to provide these moment polynomials. The most basic but also most flexible option is to list them explicitly. Then lbmpy automatically computes the moment matrix and checks that it is invertible. This option is useful if code for a given MRT method from the literature has to be generated. One central goal of lbmpy is to derive LB methods automatically, without having to pass in, for example, the discrete equilibrium or even the moment basis. Thus, our system additionally offers routines to construct the moment basis for first-neighborhood stencils automatically. First, these routines have to find q monomials that lead to an invertible moment matrix M, which can then be orthogonalized in a second step. To illustrate this procedure, we first consider the D3Q27 stencil. Since the velocity vector components only contain the values {−1, 0, 1}, moments with velocity powers larger than 2 alias a lower-order moment.
For example,

    Σ_{c_q ∈ S} c_qi³ f_q = Σ_{c_q ∈ S} c_qi f_q   if c_qi ∈ {−1, 0, 1}.   (6)

Similarly, moments with even exponents larger than or equal to 4 are aliased by corresponding moments with exponent 2. Potentially non-aliased moments are thus c_qx^e1 c_qy^e2 c_qz^e3 with e_i ∈ {0, 1, 2}. These systematically constructed monomial moments can be used for the D3Q27 stencil to construct an invertible moment matrix. Similarly, in 2D, this strategy also yields an invertible moment matrix for the D2Q9 stencil. For D3Q15 and D3Q19 the situation is slightly more complex. For D3Q19, we start with the possible monomial moments and discard moments that produce a zero line in the moment matrix, like, e.g., the moment defined by the polynomial c_qx c_qy c_qz. For this stencil, 8 out of the 27 possible monomial moments lead to zero lines, leaving 19 independent rows. For the D3Q15 stencil this procedure has to be further refined. There, some monomial moments produce the same non-zero row in the moment matrix and are thus trivially linearly dependent. lbmpy groups moments together that yield the same row, resulting in 15 groups. One group of moments, for example, is [c_qx² c_qy, c_qy c_qz², c_qx² c_qy c_qz²]. In each group, we keep only the lowest-order moments. If more than one moment remains, their sum is used. In the case of the above example this leads to c_qx² c_qy + c_qy c_qz². This systematic procedure constructs moment matrices that span the same space as the MRT matrices reported in [11], [12], [49]. However, the constructed q independent moments are not orthogonal yet. For MRT methods, typically, an orthogonal moment set is required. Using a symbolic Gram-Schmidt procedure, lbmpy can orthogonalize the moments, either utilizing the standard scalar product or a scalar product weighted by the lattice weights. The exact outcome of the Gram-Schmidt orthogonalization depends on the ordering of the non-orthogonal moments that are put in.
For reproducible results, lbmpy sorts the input by moment order and within each order lexicographically. Before the orthogonalization, the second-order moments are also manually split into a bulk and a shear part.
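The construction described above can be sketched with plain sympy (an illustration, not lbmpy's implementation): build the moment matrix M_kq = P_k(c_q) from the nine D2Q9 monomials, verify that it is invertible, and orthogonalize the moments with a Gram-Schmidt procedure weighted by the lattice weights:

```python
# D2Q9 monomial moment basis x^i * y^j with i, j in {0, 1, 2},
# followed by a weighted Gram-Schmidt orthogonalization.
import sympy as sp
from itertools import product

x, y = sp.symbols('x y')
stencil = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
           (1, 1), (-1, 1), (1, -1), (-1, -1)]
weights = [sp.Rational(4, 9)] + [sp.Rational(1, 9)] * 4 + [sp.Rational(1, 36)] * 4

moments = [x**i * y**j for i, j in product((0, 1, 2), repeat=2)]
M = sp.Matrix([[m.subs({x: cx, y: cy}) for cx, cy in stencil]
               for m in moments])
assert M.det() != 0     # the monomial basis yields an invertible matrix

def dot(p1, p2):
    # scalar product weighted by the lattice weights
    return sum(w * p1.subs({x: cx, y: cy}) * p2.subs({x: cx, y: cy})
               for w, (cx, cy) in zip(weights, stencil))

ortho = []
for m in moments:
    for o in ortho:
        m = sp.expand(m - dot(m, o) / dot(o, o) * o)
    ortho.append(m)

# all pairs are now orthogonal w.r.t. the weighted scalar product
assert all(dot(a, b) == 0 for i, a in enumerate(ortho) for b in ortho[:i])
```

Because everything is exact rational arithmetic, the orthogonality check holds identically, mirroring the symbolic procedure used in lbmpy.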
2) Equilibrium State:
The second element necessary for the construction of an MRT method is the equilibrium. It can either be given in population space (3) or directly as a vector of equilibrium moments m^(eq). In this mode, the user has full control over the equilibrium values and can create LBMs not only for the Navier-Stokes equations but also for other partial differential equations. Beyond this, our metaprogramming approach attempts to derive as much as possible from a more general formulation. Therefore, we provide functionality in lbmpy to derive equilibrium values for hydrodynamic LBMs automatically. One way to obtain a hydrodynamic discrete equilibrium is to compute the equilibrium moments directly from the continuous Maxwell-Boltzmann distribution

    f^(MB)(ρ, u, ξ) = ρ / (2π c_s²)^(D/2) · exp( −‖ξ − u‖² / (2 c_s²) )   (7)

with continuous moments

    Π_P^(MB) = ∫ P(ξ) f^(MB)(ρ, u, ξ) dξ.   (8)

In this case, the user first chooses a value for the speed of sound c_s, typically c_s = 1/√3 for first-neighborhood stencils; then the integral (8) is evaluated symbolically with the help of sympy. The resulting continuous moments of the Maxwellian are used in the equilibrium moment vector m^(eq). Optionally the moments can be truncated to a given order in the macroscopic velocity u. If the equilibrium is required in population space, it can easily be transformed using the assembled moment matrix as M⁻¹ m^(eq). For the Cartesian product stencils D2Q9 and D3Q27, this method yields exactly the standard equilibrium (3). For the D3Q15 and D3Q19 velocity sets, however, a different equilibrium is obtained. A comparison of the standard equilibrium and the equilibrium obtained with this moment-matching technique can be found in [9]. The standard equilibrium for D3Q15 and D3Q19, including the weights, can also be derived by lbmpy using a Hermite projection of (7).
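The moment-integration step of eq. (8) can be reproduced in a few lines of sympy (a one-dimensional sketch with c_s² = 1/3; lbmpy performs the analogous multi-dimensional integrals):

```python
# Continuous moments of the 1D Maxwell-Boltzmann distribution (7),
# evaluated symbolically as in eq. (8), with c_s^2 = 1/3.
import sympy as sp

xi, u, rho = sp.symbols('xi u rho', real=True)
cs2 = sp.Rational(1, 3)
f_mb = rho / sp.sqrt(2 * sp.pi * cs2) * sp.exp(-(xi - u)**2 / (2 * cs2))

def continuous_moment(p):
    return sp.simplify(sp.integrate(p * f_mb, (xi, -sp.oo, sp.oo)))

m0 = continuous_moment(1)      # density: rho
m1 = continuous_moment(xi)     # momentum: rho*u
m2 = continuous_moment(xi**2)  # rho*(u**2 + c_s**2)
```

Truncating m2 to a given order in u (here there is nothing to truncate) yields the equilibrium moment entries used in m^(eq).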
3) Relaxation rates:
The third building block needed to fully define the method is the set of relaxation parameters. In lbmpy, the user can specify a relaxation rate separately for each of the previously selected moments. Each relaxation rate can either be a constant value, a symbol, or an arbitrary expression of local or neighboring values. In the simplest case, the relaxation rate is a compile-time constant and equal for all time steps and lattice cells. This case allows the computer algebra system to pre-evaluate and simplify expressions containing constants only, thus leading to significant savings. If the relaxation parameter is chosen as a symbol, it becomes a run-time parameter of the generated kernel function. In this case it can be changed, e.g., in a configuration file of the final application without requiring re-compilation. Of course this possibly comes at the cost of executing more FLOPs compared to the pre-evaluated version. The third option, where the relaxation rate is given as a symbolic expression, gives the most modeling power and flexibility. The expression may contain any local or neighboring quantities like equilibrium or non-equilibrium moments. This allows the formulation of a wide range of turbulence models, where the relaxation rate needs to be adapted depending on shear rates. Entropically stabilized schemes like the KBC-type models [10] can also be described in this way. The relaxation rate expression may also contain values of other arrays, allowing for easy coupling of multiple LB schemes, e.g., for multiphase or thermal flows.
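The difference between a compile-time constant and a symbolic relaxation rate can be illustrated with a minimal sympy sketch (not lbmpy code): a constant rate lets the update expression collapse during code generation, while a symbolic rate keeps it a run-time parameter.

```python
# Constant vs. symbolic relaxation rate in the BGK-style update
# omega*f_eq + (1 - omega)*f.
import sympy as sp

f, f_eq = sp.symbols('f f_eq')
omega_sym = sp.Symbol('omega')

update_symbolic = omega_sym * f_eq + (1 - omega_sym) * f
update_constant = update_symbolic.subs(omega_sym, 1)  # e.g. omega fixed to 1

# with omega fixed, the CAS folds the constants; here the whole
# expression collapses to f_eq before any code is generated
assert sp.simplify(update_constant) == f_eq

# with a symbolic omega, it stays a parameterized expression
assert omega_sym in update_symbolic.free_symbols
```

The same mechanism applies when omega is an expression in local quantities: it is simply substituted into the update before simplification.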
B. Collision in cumulant space
Recently an alternative collision space has been proposed in [20]. Before collision, cumulants of the distribution function are calculated, which are relaxed against their respective equilibrium values. Conceptually, cumulant collision operators are realized in lbmpy similarly to collision operators in moment space. The user specifies a set of cumulants, together with relaxation rates. The cumulant equilibrium values are obtained from the continuous Maxwellian. This allows the formulation of cumulant methods not only for D3Q27 and D2Q9 but also for D3Q19 and D3Q15 stencils. Cumulants can be succinctly defined through the cumulant-generating function

    K(ξ) = ln Σ_{c_q ∈ S} f_q exp(c_q · ξ).   (9)

The cumulants are computed by multi-differentiation of (9) and evaluating the derivatives at zero. For example, the "bulk cumulant", which we associate with the polynomial c_qx² + c_qy², is computed as

    ∂²K(ξ)/∂ξ_x² |_{ξ=0} + ∂²K(ξ)/∂ξ_y² |_{ξ=0}.   (10)

Originally, we implemented the cumulant transformation with this approach in lbmpy; however, the resulting expressions get very elaborate, especially for large stencils. sympy's common subexpression elimination capabilities then run an impractically long time manipulating these expressions. Thus we have developed an alternative multi-step transformation, where the populations are first transformed to moment space and then to cumulants. The moments are intermediate quantities that serve as common subexpressions. In lbmpy, we use Faà di Bruno's formula [46] to derive the transformation of raw moments to cumulants and vice versa.

C. Collision Model Examples
In this section, we demonstrate how LBMs can be formulated in lbmpy by constructing collision operators of varying complexity.
1) SRT, TRT:
We start with the single- and two-relaxation-time collision operators. Even though these methods are typically not derived in moment space, we use the MRT formalism for these operators as well, to avoid introducing special cases. The challenge with this general approach is, however, that the simplification system needs to be able to reduce the resulting expressions to their short form. The following code example shows the definition of a D2Q9 TRT method. Stencils are represented by a tuple of discrete directions with integer components. Common stencils, like the D2Q9, can be obtained by their name. This stencil is used to construct a set of independent raw moments using the algorithm described above.

    d2q9 = get_stencil("D2Q9")
    moments = independent_raw_moments(d2q9)
    ω_e, ω_o = symbols("ω_e, ω_o")
    ωs = [ω_e if is_even_moment(m) else ω_o for m in moments]
    m_eq = maxwellian_moments(moments, dim=2, c_s=1/sqrt(3))
    trt = create_method(d2q9, moments, ωs, m_eq)

In this example, the moment equilibrium values are computed from the continuous Maxwellian, and the relaxation rates are defined for each moment. lbmpy offers various classification functions for moments, like the is_even_moment function used here. Other functions can determine the order of a moment, or whether it is related to shear or bulk viscosity. Putting these elements together, the method is fully defined and can be displayed to the user in a Jupyter notebook [33] in tabular form, as shown below.

    Moment | Equilibrium               | Relaxation rate
    1      | ρ                         | ω_e
    x      | ρu_x                      | ω_o
    y      | ρu_y                      | ω_o
    x²     | ρu_x² + ρ/3               | ω_e
    y²     | ρu_y² + ρ/3               | ω_e
    xy     | ρu_x u_y                  | ω_e
    x²y    | ρu_y/3                    | ω_o
    xy²    | ρu_x/3                    | ω_o
    x²y²   | ρu_x²/3 + ρu_y²/3 + ρ/9   | ω_e

For better readability we denote moment polynomials using the variables x, y, and z instead of c_qx, c_qy, and c_qz. Note that no explicit equilibrium formulation similar to (3) was necessary to construct this method. Only the stencil, the continuous Maxwellian, and a systematically constructed set of independent raw moments have been used to derive this method.
2) MRT:
Next, we show how to construct a generic MRT method in lbmpy. We stick with the D2Q9 stencil to keep the listing of the method tableaus short. Similar to the TRT method above, we start with a set of independent raw moments. For MRT methods, the moments have to be orthogonalized. In lbmpy we provide an orthogonalization routine based on the Gram-Schmidt procedure. This routine uses either the standard or a weighted scalar product. A common choice is to use a scalar product weighted with the lattice weights, which we demonstrate in the code example below. If we want to control bulk and shear viscosities using different relaxation rates, the second-order moments must be modified before the orthogonalization. This is handled by the split_shear_bulk_moments function. We pass in the list of all raw moments, containing the second-order moments x², y², and xy. These are split into the bulk moment x² + y² and the remaining xy and x² − y² moments. In 3D, this function works analogously.

    moments = independent_raw_moments(d2q9)
    moments = split_shear_bulk_moments(moments)
    moments = gram_schmidt(moments, d2q9, weights=get_weights(d2q9))
    ω = symbols("ω_:4")
    ωs = [0 if get_order(m) < 2 else
          ω[0] if is_shear_moment(m) else
          ω[1] if is_bulk_moment(m) else
          ω[get_order(m)-1] for m in moments]
    m_eq = maxwellian_moments(moments, dim=2, c_s=1/sqrt(3))
    mrt = create_method(d2q9, moments, ωs, to_incompressible(m_eq))

The Gram-Schmidt orthogonalization step then produces the moments listed in the first column of the following table. Then a list is constructed that defines the relaxation rate for each moment. Moments of order less than two are conserved, and their relaxation rate can be chosen arbitrarily. In this example, the relaxation rate is set to zero for these moments. Having split up the second-order bulk and shear moments, we can pick separate relaxation rates ω_0 and ω_1 for these.
In this example, we choose a common relaxation rate for the third- and fourth-order moments.

    Moment                     Equilibrium       Relaxation rate
    1                          ρ                 0
    x                          u_x               0
    y                          u_y               0
    x² − y²                    u_x² − u_y²       ω_0
    xy                         u_x u_y           ω_0
    3x² + 3y² − 2              3u_x² + 3u_y²     ω_1
    3x²y − y                   0                 ω_2
    3xy² − x                   0                 ω_2
    9x²y² − 3x² − 3y² + 1      0                 ω_3

Compared to the TRT example, we have made another modification here. The equilibrium moments are modified to yield a so-called incompressible equilibrium [28]. The incompressible equilibrium moments are obtained by writing them as polynomials in the velocity u and substituting ρ = 1 in all terms that contain at least one velocity component, e.g., ρ/3 + ρu_x² → ρ/3 + u_x².

Having full information about an LB method in the form of the moment table, as shown above, enables us to analyze the method symbolically using a Chapman-Enskog procedure, as long as the relaxation rates are chosen constant. The primary input for this analysis are the moment equilibrium values. The automated analysis can show the user the approximated PDE as well as higher-order error terms. Additionally, it can derive the connection between relaxation parameters and macroscopic parameters, e.g., viscosities. The following snippet shows the analysis of the MRT method defined here.

```python
>>> ce = ChapmanEnskogAnalysis(mrt)
>>> ce.get_bulk_viscosity()
-1/9 - 1/(3*ω_1) + 5/(9*ω_0)
>>> ce.get_macroscopic_equations()[0]
∂_t ρ + ∂_0 u_0 + ∂_1 u_1
```
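The substitution that produces the incompressible equilibrium can be sketched in a few lines of sympy. The source uses lbmpy's to_incompressible for this; the function below is our own illustrative stand-in, not the lbmpy implementation:

```python
import sympy as sp

rho, ux, uy = sp.symbols("rho u_x u_y")

def to_incompressible(expr):
    # sketch of the substitution described above: set rho -> 1 in every
    # additive term that contains at least one velocity component
    expr = sp.expand(expr)
    terms = expr.args if expr.is_Add else (expr,)
    return sp.Add(*[t.subs(rho, 1) if t.has(ux) or t.has(uy) else t
                    for t in terms])

eq_x2 = rho / 3 + rho * ux**2    # compressible equilibrium of x^2
print(to_incompressible(eq_x2))  # -> rho/3 + u_x**2 (up to term order)
```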
3) Boundary Conditions:
Similar to the collision operator, boundary conditions are also described in symbolic form. Boundary conditions have to specify the value of a population that is streamed in from a boundary lattice cell. Here is an example of a velocity bounce-back boundary that models a moving wall.

```python
def vel_bounce_back(f, c, method, vel):
    c_s = method.speed_of_sound
    w_q = method.weights[method.stencil.idx(c)]
    vel_term = 2 / c_s**2 * c * vel * w_q
    return f.center(c) - vel_term
```
In the boundary definition, the user has access to the method definition, which offers properties like the speed of sound or the lattice weights. The lattice direction c is an integer vector pointing from the fluid to the boundary cell. With this information, an expression for the missing population is constructed. The population field f and macroscopic properties can be accessed using relative addressing, where the center is the fluid cell.

Additional information can be used, like the velocity of the moving wall in this example. This data can be supplied by various sources. In the simplest case, it is a compile-time constant value. It can also be an expression that depends on spatial coordinates, the time step, local population values, or macroscopic quantities. It can also be supplied at runtime; in this case the data is read from a field or a sparse list data structure that stores this information for every connected boundary cell. More details will be covered in the section on the algorithmic treatment of boundary conditions. However, in all these cases, the boundary definition, as shown in the example above, does not change at all. The definition and the implementation are strictly separated. Boundary conditions are defined per lattice link, not per lattice cell. That means, for example, that a different velocity can be prescribed for each link.
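For a concrete link, the correction term of the moving-wall boundary above is a small population shift. The following standalone sketch evaluates it in plain Python for a D2Q9 diagonal link, reading the symbolic product c * vel as the scalar product c · u_wall; the wall velocity is an illustrative value, not taken from the source:

```python
# D2Q9 constants: c_s^2 = 1/3, weight of a diagonal direction is 1/36
c_s_sq = 1.0 / 3.0
w_q = 1.0 / 36.0
c = (1, 1)                # lattice direction pointing into the wall
u_wall = (0.05, 0.0)      # illustrative wall velocity in lattice units

c_dot_u = sum(ci * ui for ci, ui in zip(c, u_wall))
vel_term = 2.0 / c_s_sq * c_dot_u * w_q
print(vel_term)  # 6 * 0.05 / 36 = 0.00833...
```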
4) Turbulence models:
Up to now, we have presented methods with constant relaxation rates. The modeling power of LBMs stems partially from the ability to vary relaxation rates on a cell-by-cell basis, depending on local quantities. With this technique, one can, for example, model non-Newtonian fluids or implement turbulence models. To provide this modeling power to the user, lbmpy does not only allow for compile- and run-time constants as relaxation rates. It can also take arbitrary expressions of neighboring distribution functions or macroscopic quantities as relaxation rates. We illustrate this for the example of a Smagorinsky subgrid turbulence model. This model adds an eddy viscosity ν_t to represent energy damping on unresolved scales [32]. The eddy viscosity is calculated from the local strain rate tensor as

    ν_t = (C_S Δ)² |S|                                             (11)

where C_S is a constant and Δ is a filter length chosen as 1 in lattice coordinates. |S| = √(S_ij S_ij) is the Frobenius norm of the local strain rate tensor

    S_ij = 1/2 (∂_i u_j + ∂_j u_i) = − 3ω/(2ρ) Π^(neq)_ij.         (12)

This uses the fortunate property of LBMs that the strain rate tensor can be computed from local quantities, using only the second non-equilibrium moment [35]

    Π^(neq)_ij = Σ_q c_qi c_qj (f_q − f^(eq)_q).                   (13)

Equation (12) contains the total relaxation rate ω, which is computed from the total viscosity ν = ν_0 + ν_t, which again depends on the eddy viscosity ν_t that we want to determine:

    ω = 2 / (6 (C_S Δ)² |S| + 6 ν_0 + 1).                          (14)

Thus we have a system of two equations in ω and |S| that we now want to solve for ω.
Here it pays off that lbmpy is based on the computer algebra system sympy, where these steps can be performed automatically:

```python
S, ω = symbols("|S|, ω", positive=True)
f_neq = pre_collision_symbols() - equilibrium_symbols()
Π = frobenius_norm(second_order_moment_matrix(f_neq))
eqs = [Eq(ω, ω_from_ν(ν_from_ω(ω_0) + C_S**2 * S)),
       Eq(S, 3 * ω / 2 * Π)]
effective_ω = solve(eqs, [ω, S])[ω]
```

The resulting symbolic expression for the effective ω value can be used in all places where in the previous examples a constant has been used. Thus, one can construct MRT or cumulant methods where some or all relaxation rates vary locally, using potentially different expressions for different relaxation rates. For brevity, we have shown here only the construction of a simple turbulence model. The possibility to employ arbitrary expressions as relaxation rates in lbmpy can be used to realize more advanced turbulence models with only little programming effort. Additionally, lbmpy comes with several pre-defined turbulence models, so the user does not have to perform the steps outlined above manually if only a common turbulence model is required.
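The solve step can be reproduced outside lbmpy with plain sympy. In the sketch below, the helper names ν_from_ω and ω_from_ν mirror the listing above, but their bodies are our own assumption (the standard relations for c_s² = 1/3), not lbmpy internals:

```python
import sympy as sp

ω, S, Π = sp.symbols("omega |S| Pi", positive=True)
ω_0, C_S = sp.symbols("omega_0 C_S", positive=True)

# assumed relations between relaxation rate and lattice viscosity
def ν_from_ω(w):
    return (2 - w) / (6 * w)

def ω_from_ν(n):
    return 2 / (6 * n + 1)

eqs = [sp.Eq(ω, ω_from_ν(ν_from_ω(ω_0) + C_S**2 * S)),  # eq. (14), Δ = 1
       sp.Eq(S, sp.Rational(3, 2) * ω * Π)]             # from eq. (12)
sol = sp.solve(eqs, [ω, S], dict=True)
```

Eliminating S turns the system into a quadratic equation in ω, so solve returns a closed form that can be evaluated per cell without Newton iterations.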
5) Entropic KBC Models:
Another important class of models where relaxation rates are varied locally are entropic LB schemes. In this section, we show how entropic MRT methods, labeled KBC models by the authors of [10], are realized in lbmpy. The central idea of these methods is to maximize a discrete entropy measure S of the post-collision state. The free variable that is tuned to obtain maximum entropy is a relaxation rate associated with higher-order moments. Single-relaxation-time entropic methods also change the effective viscosity by varying this single rate to maximize entropy. KBC models present an improvement by using two relaxation rates: one rate ω_s for the shear moments, and a second relaxation rate ω_h that controls higher-order moments. Only ω_h is changed according to the entropy condition. The shear relaxation rate ω_s is not altered, and thus also the viscosity remains constant. By choosing which moment is relaxed by which relaxation rate, one obtains different KBC variants, labeled by the authors KBC-N1 up to KBC-N4.

In the original work [10], the notion of mirror states and corresponding relaxation parameters is used. Here we use a different notation that is closer to the formalism of MRT methods. We start with an arbitrary MRT method that uses two symbolic relaxation rates ω_s and ω_h. The rate ω_s must include the shear moments if the shear viscosity should remain constant. The collision operator in population space is then of the form

    f′_q = f_q − ω_s Δs_q − ω_h Δh_q,                              (15)

where f′_q is the post-collision state, f_q the pre-collision state, and Δs_q, Δh_q the coefficients multiplying the relaxation rates. Then we need to maximize the entropy

    S(f′) = − Σ_q f′_q(ω_h) ln( f′_q(ω_h) / f^(eq)_q )             (16)

in every cell at every time step by varying ω_h. Taking the first derivative of (16) w.r.t. ω_h, we get the optimality condition

    Σ_q Δh_q [ ln( f′_q(ω_h) / f^(eq)_q ) + 1 ] = 0.               (17)

This condition could be solved numerically in every cell using Newton's method. However, in this case, a more efficient way can be devised. We expect f′ to be close to f^(eq) and approximate the logarithm around 1 up to first order with ln(x) ≈ x − 1. The optimality condition then simplifies to

    Σ_q Δh_q f′_q(ω_h) / f^(eq)_q = 0.                             (18)

Inserting the post-collision value f′_q = f_q − ω_s Δs_q − ω_h Δh_q, and introducing the entropic scalar product ⟨a, b⟩_E := Σ_q a_q b_q (f^(eq)_q)^(−1), we can solve for ω_h and obtain

    ω_h = 1 + (1 − ω_s) ⟨Δs, Δh⟩_E / ⟨Δh, Δh⟩_E.                   (19)

To obtain this result, one has to replace f_q = f^(eq)_q + Δs_q + Δh_q and use Σ_q Δh_q = 0, which holds because of the mass conservation property of the collision operator. All steps leading to (19) are implemented using sympy, to obtain an automatic derivation of KBC methods from high-level principles.

Using this technique, we can construct a wide range of entropically stabilized methods, not only for D3Q27 stencils as in [10] but for D3Q19 and D3Q15 stencils as well. Furthermore, we offer a more costly but also more general numerical maximization procedure for the post-collision state entropy, which is based on Newton's method. This can, e.g., be used for cumulant methods where the update is not linear in the relaxation rates any more, as in (15), but has the quadratic form f′_q = f_q − a_1 ω_s − a_2 ω_s² − b_1 ω_h − b_2 ω_h².

IV. COMPUTE KERNEL GENERATION
All steps described up to now produce a symbolic representation of the collision operator Ω: ℝ^q → ℝ^q. Together with the stencil, represented as a list of q discrete velocities with integer components, an efficient LB compute kernel must be generated for various hardware platforms. This process is discussed in the following section.

A. Simplification
To obtain an efficient formulation of the resulting compute kernel, the symbolic collision operator must be rewritten in a form that requires as few floating point operations (FLOPs) as possible to compute the post-collision values. This is a very challenging task, since the automatic operator derivation yields a highly inefficient formulation by default. Consider, for example, the case of an SRT model that is derived by transforming populations to moment space, relaxing with a single rate, and transforming back. The matrix products produce lengthy expressions that are mathematically equivalent to the usual SRT formulation but are now expressed using many more FLOPs. With standard mathematical techniques, like expanding or factoring, terms can already be simplified considerably. However, the most significant reduction in the number of FLOPs is achieved with common subexpression elimination (CSE). General CSE algorithms implemented in computer algebra systems are not guaranteed to find the global optimum and have to rely on heuristics to find a reasonably good solution. These algorithms do not just identify common subtrees, as is typically done as a compiler optimization, but they also try to rewrite the expressions in a form where they have more common subtrees.

For an illustration of the optimization possible in lbmpy we present here results for the D3Q19 BGK method. The first row of Table I shows the number of FLOPs in the expressions as they are produced by the automatic derivation. In total, this initial automatically generated version needs 1263 operations. To reduce cost, we first employ the simplification and CSE capabilities of the sympy computer algebra system directly. The results are labeled “Only CSE” in the table. This reduces the number of operations significantly, down to only 261. However, a manually optimized implementation of the same method developed by the authors requires only 204 FLOPs. A value around 200 FLOPs is also reported by Wellein et al. [58]. The default simplification and CSE of sympy is thus unable to produce code as good as manually tuned code, since it needs about 30% more FLOPs than the best known solutions. We also tested the simplification capabilities of other computer algebra systems, including Maple and Mathematica, which also could not find simplifications competitive with hand-tuned code. Therefore, it was necessary to develop a new set of custom transformations that rewrite the equations before they are passed to the CSE function of sympy. These transformations are listed in the order of application in the lower part of Table I as “Custom” transformations. Next to the name of each transformation we display the number of FLOPs after the transformation has been applied. Some of these transformations use LBM application knowledge, e.g., they treat density and velocity symbols differently.

                                       Additions  Muls  Divs  Total
    Only CSE:  initial                    686      574    3    1263
               sympy CSE                  199       61    1     261
    Custom:    initial                    686      574    3    1263
               expand                     173      423    3     599
               quadratic velocity prod.   203      447    3     653
               expand                     179      423    3     605
               factor ω's                 179      305    3     487
               common quadratic term      131      161    3     295
               substitute existing subexpr. 119    119    3     241
               sympy CSE                  119       73    1     193

Table I: Detailed simplification results for the compressible D3Q19 SRT (BGK) method.

We now study these transformations one by one and describe them in detail. The initial formulation is first expanded, i.e., transformed into a sum of products using a function provided by sympy. The next transformation, called “quadratic velocity products”, is specifically developed for LBM simplification. It picks out mixed quadratic terms in macroscopic velocity components u_i u_j and replaces them by (e² − u_i² − u_j²)/2, with a new subexpression e := u_i + u_j. This transformation may seem counterintuitive, since it increases the number of FLOPs. However, it helps the following transformations to obtain better results. This is an example of a transformation that requires domain knowledge, since this replacement may only be applied to the velocity symbols. Next, a standard expansion is performed. The previously introduced subexpression e prevents this expansion from undoing the previous transformation. The next transformation uses a generic sympy function to factor out relaxation rates. The “common quadratic term” step introduces a subexpression that is obtained by taking the expression for the center point, setting all pre-collision values to zero, and the relaxation rates to one. For the TRT method this yields ρ − 3/2 ρ(u_x² + u_y² + u_z²). After this transformation, already existing subexpressions like density and velocity are searched for in the equations, and finally, a CSE from sympy is performed.

With this custom simplification pipeline, we eventually arrive at a kernel that costs only 193 FLOPs. Note that this is even slightly better than the previously known and carefully hand-optimized version. Compared to using only the generic simplification (261 FLOPs), the custom pipeline thus saves about 26% of the operations.
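The effect of the final CSE step can be illustrated with a toy standalone example (not the lbmpy pipeline): sympy's cse applied to two equilibrium-like expressions for opposing directions that share their quadratic part, with count_ops measuring the operation count before and after:

```python
import sympy as sp

u0, u1, u2, rho = sp.symbols("u_0 u_1 u_2 rho")
usq = u0**2 + u1**2 + u2**2

# two opposite-direction, equilibrium-like expressions (illustrative)
exprs = [rho * (1 + 3*u0 + sp.Rational(9, 2)*u0**2 - sp.Rational(3, 2)*usq),
         rho * (1 - 3*u0 + sp.Rational(9, 2)*u0**2 - sp.Rational(3, 2)*usq)]

before = sum(sp.count_ops(e) for e in exprs)
subexprs, reduced = sp.cse(exprs)
after = sum(sp.count_ops(rhs) for _, rhs in subexprs) + \
        sum(sp.count_ops(e) for e in reduced)
print(before, after)  # the CSE form needs fewer operations
```

Substituting the extracted subexpressions back recovers the original expressions, so the rewrite is exact.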
Note also that the same automatic simplification and optimization steps can now be performed for other LBM methods.

Table II shows the total number of FLOPs for a selection of LB schemes that lbmpy can generate and optimize. We compare methods that use the so-called compressible and incompressible equilibrium [28]. We also compare SRT, TRT, MRT, and SRT with a Smagorinsky turbulence model. The table does not display in detail which types of FLOPs each method is composed of. However, all methods require only additions and multiplications, with the following exceptions: all compressible models have one division by the density, and the Smagorinsky methods additionally require two square root operations. The results in the first column are obtained by using only the sympy CSE. The third column uses the custom simplification strategy introduced above. The second column also uses the custom simplification strategy, with a modified CSE step at the end. In this CSE step, we first search for subexpressions in terms that update opposing lattice directions. By construction, these contain terms that differ only in sign and are good common subexpression candidates. This step is then again followed by a global CSE. This approach is labeled “direction CSE”, since lattice directions are taken into account. The lowest FLOP count is marked for each method in boldface.

Let us first discuss the results for all methods with one or two relaxation rates. We see that for SRT and TRT methods the custom simplification pipeline consistently leads to better results. It is generic enough to work not only for the SRT method it was designed for, but leads to good results for TRT methods as well. Turbulence models built on top of these collision operators are also simplified better by the custom strategy, as shown in the table with the Smagorinsky example. Whether the direction-aware CSE is beneficial depends on the stencil. For the D3Q27, it gives the best or equal result across methods; for D2Q9 and D3Q19 it is helpful only for the SRT operators. We have also chosen two example MRT methods: one that uses the standard scalar product for moment orthogonalization, and one with moments that are orthogonal w.r.t. the weighted scalar product. Second-order shear and bulk moments are relaxed with different rates, and for each order larger than 2 a separate relaxation rate is chosen. Table II shows that the custom simplification pipeline cannot handle MRT methods. A straight application of CSE obtains much better results for all tested MRT methods, regardless of the stencil.

Currently we employ these three simplification options for each method and then automatically select the best one.
In the future we plan to also use machine learning techniques to optimize the application order of transformations, or to find new transformations.

                          Only CSE   Custom with      Custom,
                                     direction CSE    default CSE
    D2Q9
    compr. SRT               113          90              90
    incompr. SRT             107          75              75
    compr. Smag.             137         122             122
    compr. TRT               114         110               –
    incompr. TRT             108         103               –
    compr. MRT               349         317               –
    compr. weighted MRT      350         325               –
    D3Q19
    compr. SRT               261         193             193
    incompr. SRT             252         162             162
    compr. Smag.             306         251             251
    compr. TRT               262         233               –
    incompr. TRT             253         225               –
    compr. MRT               947         903               –
    D3Q27
    compr. SRT               444         370             370
    compr. TRT               446           –               –

Table II: Total number of FLOPs for different LB schemes. The “Only CSE” column runs only a CSE from sympy. The “Custom with direction CSE” column runs the custom simplification pipeline, then a PDF-direction-aware CSE followed by a standard CSE. “Custom, default CSE” is similar, but without the direction-aware CSE.
B. Collision Operator to Stencil

1) Streaming and Collision:
After simplification we have the collision operator given as a function ℝ^q → ℝ^q. It is represented by a list of q symbolic expressions for the post-collision population values, accompanied by a set of subexpressions. The next stage transforms this formulation into a stencil representation.

The stencil representation and all following low-level transformations are part of the pystencils package (https://i10git.cs.fau.de/pycodegen/pystencils), which is also developed by the authors [7]. pystencils generates stencil kernels, i.e., routines that iterate over arrays, applying the same operation on every cell. It distinguishes between spatial and index dimensions. Only spatial dimensions are iterated over, while index dimensions are used to address values stored inside a cell, e.g., the q populations of a PDF array or the components of a vector field. The central concepts of pystencils are fields and field accesses. A field is defined by a name and its number of spatial and index coordinates. Fields are indexed relatively, so the field access f[1][0](q), for example, refers to the q'th population value of the east neighbor cell in a 2D setup. pystencils is built on top of SymPy, and field accesses can be used just like a built-in symbol. The collision operator can be transformed into a stencil representation by replacing the pre- and post-collision symbols by field accesses. Two additional pieces of information are required for this process. The user has to choose the data layout of the population array and a kernel type that describes the operations done inside a kernel call.

Figure 2: Visualization of the stream-pull-collide update pattern using two arrays. The left part encodes the reads of pre-collision values, the right part shows where post-collision values are written to.

lbmpy supports three different population storage options. The simplest approach is to have two arrays, where, during one kernel invocation, one array is read-only and the second array is write-only.
For this storage pattern the system can generate a fused stream-pull-collide, a fused collide-stream-push, or a collision-only kernel. For a pure LBM simulation that is not coupled to other simulation models, a stream-pull-collide kernel typically gives the best performance. In lbmpy the data access patterns are encoded by the field accesses from which pre-collision values are loaded, and the field accesses to which post-collision values are written. These are visualized in fig. 2 for a stream-pull-collide kernel. This mechanism cleanly separates the LB method definition from algorithmic and data structure aspects, avoiding any code duplication.

Besides the simple two-array swapping technique, lbmpy also supports more advanced storage patterns that operate on a single PDF array and thus require only half the memory. Supported single-array schemes are the AA pattern [3] and the esoteric twist (EsoTwist) update scheme [19]. Single-array storage schemes that introduce a data dependency between cell updates, like [44], are not supported.

Figure 3: AA update pattern. The two leftmost schematics show the even time step, consisting of an in-place collision with inversed storage of populations. The odd time step (right) is a fused stream-pull, collide, stream-push step.

To be able to process all cells in parallel while only having a single array for PDF storage, the AA pattern needs two different access patterns for even and odd time steps (fig. 3). This also leads to two different kernels that have to be run in an alternating fashion. The different data layout after even and odd steps may complicate boundary handling and coupling the LBM to other solvers when traditional implementation techniques are used. With our code generation approach this additional complexity can be handled automatically. The symbolic, high-level formulation of method, boundaries, and update scheme is sufficient to, e.g., generate boundary handling for even and odd steps automatically.

Figure 4: Esoteric twist split into even (left) and odd (right) time step kernels. Black arrows indicate reads, red arrows writes.

The esoteric twist pattern also requires only a single array. In contrast to the AA pattern, it was designed to not require an even and an odd time step. If the populations for different lattice velocities are stored in separate arrays, a pointer swapping technique can be used for streaming. In lbmpy, however, we do not use this technique, in order to keep a common kernel interface with a single population array. Instead we use an even and an odd time step for the EsoTwist pattern as well (fig. 4).
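The pull pattern itself is easy to sketch outside the code generator. The following standalone numpy snippet (illustrative only, not lbmpy's generated code) performs a periodic stream-pull step on a D2Q9 array using the two-array storage pattern:

```python
import numpy as np

# D2Q9 discrete velocities (rest direction first)
stencil = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
           (1, 1), (-1, -1), (1, -1), (-1, 1)]

src = np.random.rand(16, 16, len(stencil))  # read-only during the sweep
dst = np.empty_like(src)                    # write-only during the sweep

for q, (cx, cy) in enumerate(stencil):
    # pull: each cell reads the pre-collision value of direction q from
    # its upstream neighbor; np.roll realizes the periodic shift
    dst[:, :, q] = np.roll(src[:, :, q], shift=(cx, cy), axis=(0, 1))
# (the collision step is omitted; a generated kernel fuses it here)
```

A generated stream-pull-collide kernel performs the same data movement cell by cell, fused with the collision, instead of direction by direction over the whole array.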
2) Boundary Conditions:
In this section we discuss in more detail how boundary conditions are realized algorithmically. One option is to leave the LB kernel unchanged and run separate boundary handling kernels beforehand. These kernels prepare the population array by writing the values that will be streamed in from boundary cells. lbmpy can take symbolic boundary definitions, as shown above, and generate one kernel per boundary. In the simplest case, these boundary kernels operate on a rectangular subdomain, e.g., at the borders of the computational domain. This very simple but also very common case can thus be handled in the most efficient way possible.

For more general boundary shapes a flag field is used. The flags store a bitmask in every cell that encodes the type of boundary. The flag field is initialized by the user using image/voxel data or with the help of surface meshes. The boundary condition kernel could iterate over the full domain, masking out cells, but especially if boundary conditions are static, this would be rather inefficient. Thus, lbmpy also offers an alternative approach: instead of iterating over all cells, a pre-processing step extracts boundary cell coordinates from the flag field and creates a list of indices for each boundary condition. This index list contains the spatial coordinate of the boundary cell together with the lattice direction to the neighboring fluid cell. Consequently, this list has one entry per boundary link. For each boundary link, custom boundary data can additionally be stored, e.g., the wall velocity for a velocity bounce-back boundary. The flag field then only acts as a convenient way to set up boundaries. During the simulation itself, it is not required any more, since all necessary information is stored in the list data structure to accelerate the processing.

So far we have considered time steps where the boundaries are treated in a separate kernel. With lbmpy, boundaries can also be compiled directly into the LB compute kernel.
Then a conditional is added to the kernel that determines the cell type, either by using a flag field or by a boolean expression that depends on the spatial coordinates. Using the information about the data access pattern, the symbolic formulation is transformed to obtain a concrete boundary assignment. We point out that it is particularly beneficial to automate the error-prone implementation of boundary conditions for the single-array patterns AA and EsoTwist.

The boundary treatment options discussed above are not new; they have already been implemented in existing frameworks or applications. The new contribution here is that code for all these options can be generated automatically, avoiding tedious manual coding and debugging. Since all versions can be generated easily, it becomes possible to benchmark all of them and choose the fastest version for a specific setup. Furthermore, there is no trade-off between flexibility and performance any more. It is now possible to compile a specific boundary treatment into a kernel to get an application-specific implementation with the best performance. Different from a hand-tuned version of the same method, the lbmpy approach keeps the system maintainable and extensible. The separation of concerns is realized on the symbolic abstraction level.
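The pre-processing step that turns a flag field into a per-link index list can be sketched in a few lines of plain Python. This is an illustrative standalone version with a reduced four-direction stencil, not lbmpy's implementation:

```python
import numpy as np

FLUID, NOSLIP = 1, 2
nx, ny = 8, 8
flags = np.full((nx, ny), FLUID, dtype=np.uint8)
flags[0, :] = NOSLIP                           # wall along one border

stencil = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # reduced stencil for brevity

# one entry per boundary link: boundary cell plus direction to fluid cell
index_list = []
for x in range(nx):
    for y in range(ny):
        if flags[x, y] != NOSLIP:
            continue
        for d, (cx, cy) in enumerate(stencil):
            fx, fy = x + cx, y + cy
            if 0 <= fx < nx and 0 <= fy < ny and flags[fx, fy] == FLUID:
                index_list.append((x, y, d))

print(len(index_list))  # 8: one link per wall cell towards the fluid
```

During the simulation, a boundary kernel then iterates over index_list only, instead of over the full domain.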
C. Transformations in the Intermediate Representation
In this section we describe low-level optimizations executed on the intermediate representation of pystencils, with the goal of further accelerating the LB compute kernels. pystencils is designed as a modular package that allows the user to write custom code transformations specific to the application.
1) Splitting inner loop:
For LB kernels we expect the memory interface to be the performance-limiting factor if the domains exceed the capacity of the outer-level cache. The first optimization we discuss here aims to increase the maximum attainable bandwidth of the kernel by reducing the number of parallel load/store streams from/to memory. A standard LB kernel iterates over all cells, loads all q pre-collision values at once, computes the post-collision values, and stores all q of them. This leads to q parallel load and store streams. Reducing the number of parallel streams to memory can increase the obtained bandwidth [57]. Therefore we developed an automatic transformation that splits the innermost loop into multiple smaller loops. To avoid the re-computation of common subexpressions in every inner loop, buffer arrays are introduced. The first inner loop computes density and velocity and writes them to the buffer arrays. The following loops then each handle only two lattice direction updates and thus have only two parallel load and store streams. Algorithm 1 shows the state after the transformation in pseudo-code. It assumes a simple two-field storage pattern with source and destination array.

Algorithm 1 Stream-collide kernel with split inner loops
    for all slices y, z do
        ρ_arr ← array[x-size]
        u_arr ← array[x-size]
        for line x do
            f ← src[x, y, z]
            ρ_arr[x] ← ρ(f)
            u_arr[x] ← u(f)
            dst[x, y, z, center] ← Ω(f, ρ_arr[x], u_arr[x])
        end for
        for line x do
            f ← src[x, y, z]
            dst[x, y, z, east] ← Ω(f, ρ_arr[x], u_arr[x])
            dst[x, y, z, west] ← Ω(f, ρ_arr[x], u_arr[x])
        end for
        for line x do
            f ← src[x, y, z]
            dst[x, y, z, north west] ← Ω(f, ρ_arr[x], u_arr[x])
            dst[x, y, z, south east] ← Ω(f, ρ_arr[x], u_arr[x])
        end for
        ... (more loops for the remaining directions)
    end for

There are different options on how exactly to perform this transformation. One free parameter is the number of directions that are updated in the inner loops. In the example, we update two opposing directions at once, since these updates share many common subexpressions. One could also create a separate inner loop for each direction, or group more than two directions together. This transformation is also parametrized by the common subexpressions that are pre-computed in temporary arrays. lbmpy can introduce additional temporary arrays for other subexpressions besides density and velocity as well. A heuristic is used to determine subexpressions that are compute-intensive enough to justify introducing a temporary array for them. It is important that all temporary arrays fit into the inner-level cache, so that they do not generate additional pressure on the memory interface. In the simple example shown in Algorithm 1, the temporary arrays grow with the domain size in x-direction. To make this optimization work for arbitrary domain sizes, the inner loop is blocked before splitting it up. The chunk size can be selected such that the arrays fit into the L1 cache.
2) OpenMP and SIMD vectorization:
All LB kernels are designed in a way that cells can be updated in parallel. pystencils uses the fact that loop iterations are independent to automatically parallelize the kernel with OpenMP. By default the outer loop is parallelized using a static scheduling strategy. If the domain size is known at compile time and the outer dimension is very small, pystencils uses OpenMP collapse to increase the number of parallel iterations.

Knowing that iterations are independent also allows pystencils to vectorize the code. To have full control over the vectorization process, we do not rely on compiler auto-vectorization or pragma-based approaches, but generate C code with SIMD intrinsics. pystencils currently supports the SSE, AVX, AVX2 and AVX512 vector instruction sets. If the data layout and alignment of the population array is known at compile time, we generate aligned load/store instructions where possible.
3) Non-temporal stores:
The intrinsics-based vectorization allows us to explicitly use non-temporal (NT) stores, also called streaming stores, in kernels that use two population arrays. This optimization reduces the total amount of data that has to be transferred from/to memory. By default, modern CPUs have a “write-allocate” or “read for ownership” cache policy [61]. This means that a store operation causes the respective cache line to be read into the cache first and thus generates twice the memory traffic that is actually required. Per lattice cell, this causes 3q values to be transferred over the memory interface. The custom vectorization allows us to change the store instructions from default to streaming stores that bypass the cache. Then only 2q values have to be loaded and stored per cell. This optimization can thus increase the performance of two-array LB kernels by a factor of 1.5, assuming they are memory-bound and the PDF array does not fit into the outer-level cache.

D. Framework Integration
The intermediate representation of the compute and boundary kernels is finally transformed by a backend into C, CUDA, or OpenCL code. For each kernel a C function with a well-defined interface is generated. Arrays are passed in as raw pointers, together with shape and stride information that defines the memory layout of the arrays. Symbolic quantities that have not been replaced during the code generation process automatically become parameters of the generated C function, e.g., values for relaxation rates or constant external forces. This simple interface was chosen so that the generated kernels can be called from a variety of different languages and can be easily integrated into existing frameworks. In this section we describe different ways of utilizing lbmpy.

The first option allows the user to work completely in a Python environment, preferably an interactive Jupyter notebook for a convenient display of symbolic expressions. There, the user derives the LBM symbolically and passes the method definition to lbmpy. After automatic simplification and optimization, the generated C/CUDA/OpenCL code is automatically compiled and dynamically loaded as a Python module. The compilation process is fully transparent to the user. The optimized, shared-memory parallel kernel can then be called directly from Python. Data is stored in numpy arrays for CPU simulations or in gpuarray's from the pycuda package for GPU simulations. In this mode lbmpy offers a flexible and fast prototyping environment for LB methods, where simulations can be run on a single node or a single GPU.

For distributed memory parallelization we use the waLBerla framework [16], a multiphysics software system that is optimized for massively parallel simulations with stencil codes. The distributed memory parallelization uses a block-structured domain partitioning based on a forest of octrees and is characterized by excellent scalability, since it uses no central or globally shared data structures, so that no global communication is necessary [51]. This fully parallel data structure enables adaptive grid refinement and dynamic load balancing between MPI processes [52], [5]. waLBerla has Python bindings [8] that allow for simple distributed simulations with lbmpy-generated kernels directly from Python on a uniform grid. For advanced use cases, e.g., those that require grid refinement, the user has to switch to C++ as the driving language. Integrations of lbmpy into the CMake build system of waLBerla control the generation of LB compute kernels, boundary kernels, and packing/unpacking kernels for distributed memory MPI communication. The reason why we also generate communication kernels are the single-field AA and EsoTwist storage patterns. Manually determining which values have to be sent to neighboring processes is tedious and error-prone in these cases. Since the compute kernels are available in symbolic form, we can extract that information and generate the necessary communication routines automatically.

V. PERFORMANCE RESULTS
In this section we present benchmark results using the automatically generated LB kernels and compare them to manual implementations with different optimization levels.
A. Single Node Benchmark

1) Hardware:
We first investigate the single-node performance on two test systems. We scale all kernels on one socket of two Intel Xeon processors with different microarchitectures. The first system is an Intel Xeon E5-2695v3 Haswell system with 14 physical cores per socket. For benchmarking, we deactivate the turbo mode of this processor and set the frequency to a fixed value of 2.3 GHz using the likwid tool suite [56]. The second system is an Intel Xeon Gold 6148 Skylake CPU that has 20 cores per socket, running at a fixed frequency of 2.4 GHz. The sub-NUMA clustering features of both systems are switched off to have one NUMA domain per socket. We use transparent huge pages and disable automatic NUMA balancing in the Linux kernel. For all benchmarks we use a domain size that is too large to fit into the outer-level cache of any of the tested systems.

To find an upper bound for the possible performance, assuming kernels are memory-bound, we use bandwidth measurements for both systems from [60]. Table III shows the measured copy bandwidth for both machines, where write-allocates have already been taken into account. Additionally, we use bandwidth measurements of scenarios that closely mimic the memory access behaviour of the D3Q19 LB kernels. Kernels with two-array population storage are compared to a stream benchmark with 19 parallel streams and non-temporal stores, labeled copy-19-nt-sl. Kernels with a single-array update pattern are compared to a benchmark that updates 19 values in place, called update-19. To obtain the upper bound for the kernels, measured in million lattice updates per second (MLUP/s), the bandwidth is divided by the number of bytes that have to be transferred per lattice cell. We assume double precision for all kernels; thus each cell update requires 2 · q · 8 bytes.

processor            Xeon E5-2695v3    Xeon Gold 6148
microarchitecture    Haswell           Skylake
cores per socket     14                20
frequency            2.3 GHz           2.4 GHz
measured bandwidths  copy / copy-19-nt-sl / update-19
Table III: Test system specification with measured bandwidths from [60]. copy uses one load and one store stream; write-allocate is already taken into account. copy-19-nt-sl uses 19 load and store streams and non-temporal stores. update-19 updates 19 values in place.

The benchmark codes are compiled with Intel compiler 19.0.2 if not specified otherwise. Where explicitly noted, GCC 7.4.0 is used. For both compilers we set the optimization flag -O3, enable AVX512 on Skylake and AVX2 on Haswell, and switch on fast-math flags that allow the compiler to reorder floating-point operations. For the Intel compiler these are -fp-model fast=2 -no-prec-sqrt -no-prec-div, and for GCC -ffast-math. All kernels are parallelized with OpenMP.
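The roofline estimate described here reduces to a one-line computation. The sketch below reproduces it for a D3Q19 kernel; the 92 GB/s saturated bandwidth is an illustrative assumption, not a value from the measurements in [60]:

```python
# Roofline upper bound for memory-bound LB kernels:
# MLUP/s = bandwidth / bytes transferred per lattice cell update.
# The 92 GB/s bandwidth below is an illustrative assumption.

def mlups_upper_bound(bandwidth_bytes_per_s, q, bytes_per_value=8,
                      write_allocate=False):
    # q loads + q stores per cell; a write-allocate cache policy adds
    # another q loads, i.e. 3*q instead of 2*q values per cell
    values_per_cell = (3 if write_allocate else 2) * q
    bytes_per_cell = values_per_cell * bytes_per_value
    return bandwidth_bytes_per_s / bytes_per_cell / 1e6

bw = 92e9  # assumed saturated socket bandwidth
nt = mlups_upper_bound(bw, q=19)                       # NT stores
wa = mlups_upper_bound(bw, q=19, write_allocate=True)  # default stores
print(round(nt), round(wa))  # prints: 303 202
```

The ratio of the two bounds is exactly 1.5, the advantage of non-temporal stores discussed in the previous section.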
2) Two-array kernels and comparison to manual implementations:
As discussed before, there is a trade-off between code quality and performance when developing LB kernels manually in C/C++: code is optimized by specializing it for a particular scenario. To illustrate this trade-off we compare lbmpy-generated kernels with two manual implementations. The code of the manual implementations can be accessed at [4]. For this test we restrict ourselves to an SRT collision operator. The first manually implemented code is written in a stencil-agnostic way, where lattice velocities and stencil weights are abstracted away through template meta-programming. Theoretically, the compiler should be able to resolve these indirections fully at compile time. Relaxation rates are also passed in via a templated functor, to enable a flexible integration of turbulence models. While the main aim of this kernel is to be as generic as possible, it is still restricted to a single collision operator and a single two-array population storage pattern. But it is easily readable and extensible.

The second manual implementation is written specifically for a D3Q19 stencil. All loops over lattice directions are manually unrolled, expressions are simplified by leaving out multiplications with zero lattice direction components, and common subexpressions are eliminated. These steps lead to code duplication and decreased readability, but may lead to better performance. These optimization steps should not be necessary, since the compiler should be able to perform them automatically. The unrolled stencil-specific version is, however, the basis of further optimizations like loop splitting.

Figure 5: Comparison of kernels with different optimization levels on Skylake using a BGK method with two-array population storage. Left panel: GCC; right panel: Intel compiler. Series: manual generic, manual stencil-specific, and lbmpy with NT stores and loop splitting each switched off/on. Horizontal lines indicate the roofline estimate using the measured copy-19-nt-sl (lower) and copy bandwidth (higher).

Figure 5 shows benchmark results on the Skylake system for both manually implemented kernels using GCC and the Intel compiler. We can see that the Intel compiler (right) was indeed able to resolve the compile-time abstractions, such that the generic version is as fast as the stencil-specific one. GCC, however, cannot optimize the generic code automatically, obtaining only about half the performance. The manual implementations scale perfectly, but are far from utilizing the available bandwidth of the system. The generated kernel without loop splitting and non-temporal stores already performs better than the manual implementations. This kernel is explicitly vectorized with AVX512 SIMD intrinsics and uses pointer arithmetic to access the population arrays, whereas the manual implementations use getter/setter methods of an array class. Splitting the inner loops lets the kernel saturate at about 200 MLUP/s. Due to the write-allocate strategy, in total 1.5 times more data is moved across the memory interface than necessary. As can be seen in the plot, the performance of this kernel is consequently also about a factor of 1.5 worse than the best kernel with NT stores. Activating NT stores results in the expected performance of about 300 MLUP/s on this system, very close to the maximal 304 MLUP/s predicted by the roofline estimate obtained with the copy-19-nt-sl bandwidth. So the loop splitting and non-temporal store optimizations are indeed necessary to obtain the best possible performance on this system. All manual implementations where these optimizations have been applied are lengthy and hard to read, due to the unrolled loops and the usage of SIMD intrinsics. These hand-optimized codes are not only difficult and time-consuming to develop; their maintainability and flexibility have also been sacrificed for performance. With code generation it is possible to resolve these conflicting goals.
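The kind of stencil-specific simplification described above (unrolled direction loops with zero lattice-velocity components dropped) can be illustrated on the momentum moments of a D2Q9 stencil. This is a hand-written sketch of the idea, not code produced by lbmpy:

```python
# D2Q9 lattice velocities; many components are zero, so unrolled,
# stencil-specific code can drop the corresponding multiplications.
C = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
     (1, 1), (-1, 1), (1, -1), (-1, -1)]

def momentum_generic(f):
    # stencil-agnostic version: loops over all directions,
    # including multiplications with zero velocity components
    mx = sum(f[q] * C[q][0] for q in range(9))
    my = sum(f[q] * C[q][1] for q in range(9))
    return mx, my

def momentum_unrolled(f):
    # stencil-specific version: zero terms eliminated by hand;
    # lbmpy derives this automatically from the symbolic form
    mx = f[1] - f[2] + f[5] - f[6] + f[7] - f[8]
    my = f[3] - f[4] + f[5] + f[6] - f[7] - f[8]
    return mx, my

f = [1, 2, 3, 4, 5, 6, 7, 8, 9]
assert momentum_generic(f) == momentum_unrolled(f)
```

Both variants compute the same moments, but the unrolled one issues six additions per component instead of nine multiply-adds; lbmpy performs this elimination, together with common-subexpression elimination, on the symbolic representation.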
Figure 5 also shows that the generated code performs consistently across different compilers, since all abstractions are already transformed to perform with the best possible efficiency by the code generation system, leaving only standard optimizations to the backend compiler.

Figure 6 shows the corresponding results for the Haswell system. Overall, the behavior of this older system is similar to Skylake, with the exception that the system is apparently not able to handle parallel non-temporal store streams, as the version with NT stores but without loop splitting performs very poorly.

Figure 6: Two-field BGK D3Q19 kernels on Haswell (Intel compiler). Configuration and roofline indicators as in fig. 5.
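The loop-splitting transformation discussed above can be sketched schematically as follows. This Python mock-up only shows the structure of the transformation (pystencils applies it to the generated C loops, buffering common intermediate results); all names and the update formula are illustrative:

```python
# Schematic sketch of inner-loop splitting: instead of one loop that
# writes all Q population streams per cell, the updates are split into
# several loops that each touch only a subset of the streams, reducing
# the number of concurrent store streams the hardware must sustain.
Q, NX = 6, 64  # illustrative direction count and line length

def update_fused(src):
    dst = [[0.0] * NX for _ in range(Q)]
    for x in range(NX):            # one inner loop, Q store streams
        for q in range(Q):
            dst[q][x] = 0.5 * src[q][x] + 1.0
    return dst

def update_split(src, chunk=2):
    dst = [[0.0] * NX for _ in range(Q)]
    for q0 in range(0, Q, chunk):  # several loops, `chunk` streams each
        for x in range(NX):
            for q in range(q0, q0 + chunk):
                dst[q][x] = 0.5 * src[q][x] + 1.0
    return dst
```

Splitting leaves the results unchanged but lowers the number of concurrent store streams per inner loop, which is what makes the difference on hardware with limited write-combining resources.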
3) Kernels with AA pattern and boundary handling:
Next, we show performance results for single-array kernels that use the AA update pattern. We use the TRT collision operator for these benchmarks. SRT and TRT kernels have very similar performance characteristics, because they require about the same number of FLOPs. In fig. 7 we compare the best two-field version, with split loops and non-temporal stores, to the corresponding AA kernel. The two-array kernel saturates at about 13 cores; the AA version reaches its highest performance already with 6 cores. The additional development effort that is required for the AA pattern pays off not only in half the memory consumption but also in single-core performance. For the AA kernels the NT-store optimization is not applicable, since all values are updated in place. The inner loop splitting, however, may be beneficial. Since there are two different kernels for even and odd time steps, there are in total four options, where the splitting transformation is applied to none, only one, or both kernels. We find that loop splitting does not help in this case: all four options yield almost equal performance. Thus, fig. 7 only shows the version where neither of the two kernels has been split.

Figure 7: AA pattern, TRT on Skylake. Series: AA D3Q19 lbmpy, AA D3Q27 lbmpy, two-array D3Q19 lbmpy (NT:1, Split:1), and the list-based AA D3Q19 kernel of Wittmann et al. The roofline estimate uses the update-19 bandwidth.

We can also see that the roofline limit based on the measured update-19 bandwidth is a very good model for the performance on the full socket. The D3Q27 version saturates very close to the expected value, which is a factor of 19/27 lower than that of the D3Q19 stencil. Figure 7 also shows performance results of the TRT LB benchmark kernel by Wittmann et al. [60]. From this benchmark we use the fastest kernel, list-aa-pv-soa, on a channel geometry. It also uses the AA pattern in an SoA layout. In contrast to the lbmpy kernels, it operates on a sparse list data structure, such that only populations in fluid cells have to be stored. Also, the benchmark kernels have boundary handling built in, while the lbmpy results in fig. 7 show the performance of the compute kernel only.

The boundary handling performance of lbmpy is investigated in fig. 8. It shows the lbmpy kernels with two different boundary handling approaches for a channel scenario, where non-periodic boundaries are set on all sides.
The simplest option is to generate separate, external kernels that handle the boundaries. Since cache lines containing populations at the border must be loaded twice during a time step, the final performance obtained at the full socket is decreased by about 15%. The second approach introduces conditionals in the compute kernel. With this approach we obtain about the same performance on the full socket as with the pure compute kernel, but the performance on a few cores is much lower: the intrinsics-based SIMD vectorization in pystencils cannot yet handle the conditionals in an optimal way. This limitation is expected to be remedied in future work.

Figure 8: TRT collision operator on Skylake using the AA pattern with different boundary options: D3Q19 lbmpy with external boundary kernels, D3Q19 lbmpy with compiled-in boundaries, and the sparse D3Q19 kernel of Wittmann et al.
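The in-place property of the AA pattern can be made concrete with a minimal one-dimensional, streaming-only sketch (collision omitted, periodic boundaries). This is a didactic reconstruction of the update rule, not generated code:

```python
# Minimal 1D AA-pattern sketch: two populations per cell, an
# east-moving (+1) and a west-moving (-1) one, periodic domain.
N = 8
E, W = 0, 1  # slot indices for east-/west-moving populations

def aa_even(f):
    # even time step: fully cell-local; results are written to the
    # *opposite* direction slot of the same cell
    for x in range(N):
        f[x][E], f[x][W] = f[x][W], f[x][E]

def aa_odd(f):
    # odd time step: each cell reads the two slots where the even step
    # left its incoming values and writes its results back to exactly
    # those slots, so the update is safely in place
    for x in range(N):
        xm, xp = (x - 1) % N, (x + 1) % N
        fe, fw = f[xm][W], f[xp][E]   # gather (collision would go here)
        f[xp][E], f[xm][W] = fe, fw   # scatter to the slots just read

def reference(fE, fW, steps=2):
    # conventional two-array pull streaming for comparison
    for _ in range(steps):
        fE = [fE[(x - 1) % N] for x in range(N)]
        fW = [fW[(x + 1) % N] for x in range(N)]
    return fE, fW

f = [[x, 10 + x] for x in range(N)]
refE, refW = reference([x for x in range(N)], [10 + x for x in range(N)])
aa_even(f)
aa_odd(f)
assert [c[E] for c in f] == refE and [c[W] for c in f] == refW
```

Because every memory slot touched in the odd step is written by exactly the cell that read it, both half-steps operate on a single population array, which halves the memory footprint compared to the two-array scheme.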
4) Advanced collision operators:
The goal of code generation in lbmpy is not to make a single collision operator fast, but to provide a framework that can obtain good performance for a wide range of different LBMs. Figure 9 shows results for different D3Q19 LB schemes on the test systems using the AA update pattern. The TRT results from above are included again for reference. A slightly more complex scheme is the BGK operator with an included Smagorinsky turbulence model. In this kernel, the relaxation rate is determined on a cell-by-cell basis. The computation of the adapted rate is done on the fly inside the kernel, so as not to introduce additional memory accesses. Schemes with variable relaxation rates are often implemented, for flexibility reasons, such that the rates are computed in a separate kernel and stored in an additional array. This is not necessary in lbmpy, so the Smagorinsky kernel obtains performance identical to the TRT kernel on the full socket. On Skylake the performance on small core counts is also almost identical to the TRT kernel, whereas on Haswell the additional computational complexity, e.g., the two sqrt operations per cell, leads to lower single-core performance compared to TRT.

Next, we investigate the performance characteristics of an MRT kernel with weighted orthogonal moments. It has four relaxation rates: two for controlling shear and bulk viscosity separately, one for third-, and one for fourth-order moments. All relaxation rates remain symbolic at compile time and become run-time parameters. lbmpy is capable of optimizing this model so that it runs almost as fast as the TRT model on both test architectures. This is also true for other MRT models that the system can generate, e.g., with weighted/unweighted moment orthogonalization or compressible/incompressible equilibria.
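For reference, the cell-local relaxation rate of the Smagorinsky model has the standard form below, stated here generically in lattice units (the exact variant implemented may differ in its constants):

```latex
\nu_t = (C_S\,\Delta x)^2\,|\bar S|, \qquad
\omega_{\text{eff}} = \frac{1}{3\,(\nu_0 + \nu_t) + \tfrac{1}{2}}
```

with ν₀ the molecular viscosity, C_S the Smagorinsky constant, and |S̄| the local strain-rate magnitude, which can be evaluated directly from the second-order non-equilibrium moments of the populations. This evaluation contributes the square-root operations mentioned above and avoids storing an extra relaxation-rate field.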
Figure 9: Comparison of different LB collision operators (TRT, Smagorinsky, MRT, cumulant, entropic) on Haswell (left) and Skylake (right). All kernels use the AA pattern and a D3Q19 stencil.

Figure 9 also contains measurements for a D3Q19 cumulant method. The non-linear transformation to cumulant space makes this collision operator more compute-intensive than MRT methods. Nonetheless, lbmpy can optimize the cumulant kernel such that it saturates the available memory bandwidth on both systems. Additionally, we try an entropic method of KBC type. Shear and bulk viscosity are kept fixed; the relaxation rate for higher-order moments is chosen adaptively to maximize entropy. This method is too compute-intensive to be memory-bound on Haswell, but on a full Skylake socket it achieves performance similar to the simpler methods.

Summarizing our findings: after careful optimization there is no performance penalty for using complex LB collision operators. On modern CPU architectures all LBM implementations are memory-bound when they are properly optimized. This optimization, however, is only achievable by tedious manual coding by experts or by using automatic code generation with tools like lbmpy.

B. Scaling Benchmark
Integrating the generated lbmpy kernels into the waLBerla framework allows us to run large-scale simulations on distributed-memory systems. We use the MPI communication capabilities of waLBerla together with generated serialization/deserialization kernels to run large parallel simulations. In contrast to previous work [23], [6], where scaling results for manual implementations of TRT two-field kernels have been shown, we demonstrate here the performance of a more complex MRT kernel with AA pattern. As shown above, this kernel runs as fast as an SRT or TRT collision operator on the full node when properly optimized.

As test system the SuperMUC-NG supercomputer in Munich is used. It consists of Intel Xeon Platinum 8174 processors with Skylake architecture. Each node has two sockets with 24 physical cores each. We run a weak scaling setup on up to the 3072 compute nodes that we have access to. This is about half of the full machine size.

Figure 10: Weighted orthogonal MRT method with AA update pattern scaled on SuperMUC-NG up to 147,456 cores (3072 nodes) using a channel geometry, for block sizes (300, 100, 100) and (16, 16, 16).

Figure 10 shows the scaling results for a channel geometry. The domain is partitioned into equally sized blocks, and each block is assigned to a physical core. The machine is best utilized when choosing large block sizes, since the communication overhead is kept small in this case. For a block size of (300, 100, 100) we observe perfect scalability up to all 147,456 cores used. In this configuration we obtain about 1972 GLUP/s on half of SuperMUC-NG. Besides this weak scaling scenario demonstrating maximal GLUP/s rates, we may alternatively want to maximize the number of time steps per second. To illustrate the capabilities of lbmpy with waLBerla we therefore also study a scaling scenario with a much smaller block size. Figure 10 shows the scaling behavior for a block size of (16, 16, 16). Here we still achieve good scalability up to 1024 nodes; then the performance in GLUP/s drops to approximately 40% of the performance that was observed for the large block size. Note, however, that in this configuration we can execute 1327 time steps per second. Note also that on the 3072 nodes of SuperMUC-NG this is still an LBM grid of roughly 6 × 10^8 cells (16³ cells per core on 147,456 cores).

VI. Conclusion and Outlook
In this article, we presented a programming system named lbmpy that supports the flexible creation of highly optimized parallel LBMs. The scope of lbmpy comprises moment-based MRT methods plus cumulant and entropically stabilized collision operators. All methods can be created with locally varying relaxation parameters, so that various turbulence models or models for non-Newtonian fluids can be realized. lbmpy automatically optimizes the compute kernels with domain-specific transformations, and it can produce code that employs memory-efficient single-array population storage. Even for complex LBM models, the kernels generated by lbmpy can reach the same performance as the manually optimized state-of-the-art TRT implementations of [60]. After automatic optimization, all methods are memory-bound on a recent Skylake system. Thus, advanced collision operators can also be used without performance penalty, provided that large domain sizes are used and memory bandwidth is the bottleneck. Through integration into the HPC framework waLBerla, the lbmpy-generated methods can be executed on large-scale distributed-memory systems with excellent scalability.

The lbmpy/waLBerla programming system supports the computational science workflow. It automates the tedious and error-prone development steps. The support of lbmpy is vertically integrated: it starts with the development of advanced kinetic schemes, it assists the development of scalable parallel codes, and it includes advanced hardware-specific code optimizations. Note that the derivation of modern LB schemes will usually require mathematical and physical expertise from specialists in LBMs, while node-level kernel optimization requires a detailed understanding of modern CPU microarchitectures. In this sense lbmpy/waLBerla is an exercise in interdisciplinary co-design to create advanced simulation software for future extreme-scale computing.
The code-generation paradigm permits a higher level of abstraction than can be realized with conventional software engineering methods. These new, application-specific abstractions can only be realized with automatic code generation. They help to resolve the fundamental conflict between the flexibility of software and the need for hardware-specific optimizations. The approach taken in lbmpy offers a road to performance portability and thus improves the sustainability of scientific software.

VII. Acknowledgements
The authors are grateful for funding received through the project HPC2SE (01ICH16003D) from the Bundesministerium für Bildung und Forschung (BMBF). Additionally, we are grateful to the Regionales Rechenzentrum Erlangen, the Leibniz-Rechenzentrum in Garching, and the Swiss National Supercomputing Centre in Lugano for providing compute time on their HPC systems.

REFERENCES

[1] Alnæs, M. S.; Logg, A.; Ølgaard, K. B.; Rognes, M. E.; Wells, G. N.: Unified form language: A domain-specific language for weak formulations of partial differential equations. In: ACM Trans. Math. Software 40 (2014), Nr. 2, S. 9
[2] Ansumali, S.; Karlin, I. V.; Öttinger, H. C.: Minimal entropic kinetic models for hydrodynamics. In: Europhysics Letters 63 (2003), Nr. 6, S. 798–804. http://dx.doi.org/10.1209/epl/i2003-00496-6
[3] Bailey, P.; Myre, J.; Walsh, S. D.; Lilja, D. J.; Saar, M. O.: Accelerating lattice Boltzmann fluid flow simulations using graphics processors. In: IEEE, 2009, S. 550–557
[4] Bauer, M.: Implementation manual LB kernels. https://github.com/lssfau/walberla/blob/48bd19800b0c46030ae7f5e510e896a9154d78b8/apps/benchmarks/UniformGridGenerated/ManualKernels.h, 2020. – [Online; accessed 24-January-2020]
[5] Bauer, M.; Eibl, S.; Godenschwager, C.; Kohl, N.; Kuron, M.; Rettinger, C.; Schornbaum, F.; Schwarzmeier, C.; Thönnes, D.; Köstler, H. u. a.: waLBerla: A block-structured high-performance framework for multiphysics simulations. In: Computers & Mathematics with Applications (2020)
[6] Bauer, M.; Eibl, S.; Godenschwager, C.; Kohl, N.; Kuron, M.; Rettinger, C.; Schornbaum, F.; Schwarzmeier, C.; Thönnes, D.; Köstler, H.; Rüde, U.: waLBerla: A block-structured high-performance framework for multiphysics simulations. In: Computers & Mathematics with Applications (2020). http://dx.doi.org/10.1016/j.camwa.2020.01.007
[7] Bauer, M.; Hötzer, J.; Ernst, D.; Hammer, J.; Seiz, M.; Hierl, H.; Hönig, J.; Köstler, H.; Wellein, G.; Nestler, B. u. a.: Code generation for massively parallel phase-field simulations. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2019, S. 59
[8] Bauer, M.; Schornbaum, F.; Godenschwager, C.; Markl, M.; Anderl, D.; Köstler, H.; Rüde, U.: A Python extension for the massively parallel multiphysics simulation framework waLBerla. In: International Journal of Parallel, Emergent and Distributed Systems 31 (2016), Nr. 6, S. 529–542. http://dx.doi.org/10.1080/17445760.2015.1118478
[9] Bauer, M.; Silva, G.; Rüde, U.: Truncation errors of the D3Q19 lattice model for the lattice Boltzmann method. In: Journal of Computational Physics 405 (2020), 109111. http://dx.doi.org/10.1016/j.jcp.2019.109111
[10] Bösch, F.; Chikatamarla, S. S.; Karlin, I. V.: Entropic multirelaxation lattice Boltzmann models for turbulent flows. In: Physical Review E - Statistical, Nonlinear, and Soft Matter Physics (2015). http://dx.doi.org/10.1103/PhysRevE.92.043309
[11] d'Humières, D.; Ginzburg, I.; Krafczyk, M.; Lallemand, P.; Luo, L.-S.: Multiple-relaxation-time lattice Boltzmann models in three dimensions. In: Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 360 (2002), Nr. 1792, S. 437–451. http://dx.doi.org/10.1098/rsta.2001.0955
[12] Dünweg, B.; Schiller, U. D.; Ladd, A. J. C.: Statistical mechanics of the fluctuating lattice Boltzmann equation. In: Physical Review E - Statistical, Nonlinear, and Soft Matter Physics 76 (2007), Nr. 3, S. 1–10. http://dx.doi.org/10.1103/PhysRevE.76.036704
[13] Edwards, H. C.; Trott, C. R.; Sunderland, D.: Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. In: Journal of Parallel and Distributed Computing 74 (2014), Nr. 12, S. 3202–3216
[14] ELBE: . – accessed on 2020-01-21
[15] Farber, R.: Parallel programming with OpenACC. Cambridge: Morgan Kaufmann, 2017
[16] Feichtinger, C.; Donath, S.; Köstler, H.; Götz, J.; Rüde, U.: WaLBerla: HPC software design for computational engineering simulations. In: Journal of Computational Science
[17] Feichtinger, C.; Habich, J.; Köstler, H.; Rüde, U.; Aoki, T.: Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU–GPU clusters. In: Parallel Computing 46 (2015), S. 1–13
[18] Frapolli, N.; Karlin, I. V.: Entropic Lattice Boltzmann Models for Thermal and Compressible Flows. (2017). http://dx.doi.org/10.3929/ethz-a-010782581
[19] Geier, M.; Schönherr, M.: Esoteric Twist: An Efficient in-Place Streaming Algorithmus for the Lattice Boltzmann Method on Massively Parallel Hardware. In: Computation
[20] Geier, M.; Schönherr, M.; Pasquali, A.; Krafczyk, M.: The cumulant lattice Boltzmann equation in three dimensions: Theory and validation. In: Computers and Mathematics with Applications (2015). http://dx.doi.org/10.1016/j.camwa.2015.05.001
[21] Ginzburg, I.; Verhaeghe, F.; d'Humières, D.: Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. In: Communications in Computational Physics
[22] Godenschwager, C.; Schornbaum, F.; Bauer, M.; Köstler, H.; Rüde, U.: A framework for hybrid parallel flow simulations with a trillion cells in complex geometries. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013, S. 35
[23] Godenschwager, C.; Schornbaum, F.; Bauer, M.; Köstler, H.; Rüde, U.: A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. New York, NY, USA: ACM, 2013 (SC '13). – ISBN 978-1-4503-2378-9, S. 35:1–35:12
[24] Groen, D.; Henrich, O.; Janoschek, F.; Coveney, P.; Harting, J.: Lattice-Boltzmann methods in fluid dynamics: Turbulence and complex colloidal fluids. In: Jülich Blue Gene/P Extreme Scaling Workshop, 2011, S. 17
[25] Groen, D.; Hetherington, J.; Carver, H. B.; Nash, R. W.; Bernabeu, M. O.; Coveney, P. V.: Analysing and modelling the performance of the HemeLB lattice-Boltzmann simulation environment. In: Journal of Computational Science
[26] Guo, Z.; Zheng, C.; Shi, B.: Discrete lattice effects on the forcing term in the lattice Boltzmann method. In: Physical Review E 65 (2002), Nr. 4, S. 046308
[27] Gysi, T.; Osuna, C.; Fuhrer, O.; Bianco, M.; Schulthess, T. C.: STELLA: a domain-specific tool for structured grid methods in weather and climate models. In: Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2015, S. 41
[28] He, X.; Luo, L.-S.: Lattice Boltzmann Model for the Incompressible Navier–Stokes Equation. In: Journal of Statistical Physics 88 (1997), Nr. 3/4, S. 927–944. http://dx.doi.org/10.1023/B:JOSS.0000015179.12689.e4
[29] Henretty, T.; Veras, R.; Franchetti, F.; Pouchet, L.-N.; Ramanujam, J.; Sadayappan, P.: A stencil compiler for short-vector SIMD architectures. In: Proc. of the 27th international ACM conference on International conference on supercomputing. ACM, 2013, S. 13–24
[30] Heuveline, V.; Latt, J.: The OpenLB project: an open source and object oriented implementation of lattice Boltzmann methods. In: International Journal of Modern Physics C 18 (2007), Nr. 04, S. 627–634. http://dx.doi.org/10.1142/S0129183107010875
[31] Holewinski, J.; Pouchet, L.-N.; Sadayappan, P.: High-performance code generation for stencil computations on GPU architectures. In: Proc. of the 26th ACM international conference on Supercomputing. ACM, 2012, S. 311–320
[32] Hou, S.; Sterling, J.; Chen, S.; Doolen, G. D.: A Lattice Boltzmann Subgrid Model for High Reynolds Number Flows. In: Fields Institute Communications
[33] Kluyver, T.; Ragan-Kelley, B.; Pérez, F.; Granger, B.; Bussonnier, M.; Frederic, J.; Kelley, K.; Hamrick, J.; Grout, J.; Corlay, S.; Ivanov, P.; Avila, D.; Abdalla, S.; Willing, C.: Jupyter Notebooks – a publishing format for reproducible computational workflows. In: Loizides, F. (Hrsg.); Schmidt, B. (Hrsg.); IOS Press (Veranst.): Positioning and Power in Academic Publishing: Players, Agents and Agendas. IOS Press, 2016, S. 87–90
[34] Krüger, T.; Kusumaatmaja, H.; Kuzmin, A.; Shardt, O.; Silva, G.; Viggen, E. M.: The Lattice Boltzmann Method - Principles and Practice. Springer, 2017. – ISBN 978-3-319-44649-3
[35] Krüger, T.; Varnik, F.; Raabe, D.: Shear stress in lattice Boltzmann simulations. In: Physical Review E - Statistical, Nonlinear, and Soft Matter Physics 79 (2009), Nr. 4, S. 1–14. http://dx.doi.org/10.1103/PhysRevE.79.046704
[36] Lagrava, D.; Malaspinas, O.; Latt, J.; Chopard, B.: Advances in multi-domain lattice Boltzmann grid refinement. In: Journal of Computational Physics 231 (2012), Nr. 14, S. 4808–4822. http://dx.doi.org/10.1016/j.jcp.2012.03.015
[37] LB3D: http://ccs.chem.ucl.ac.uk/lb3d. – accessed on 2020-01-21
[38] Lengauer, C.; Apel, S.; Bolten, M.; Chiba, S.; Rüde, U.; Teich, J.; Grösslinger, A.; Hannig, F.; Köstler, H.; Claus, L. u. a.: ExaStencils – Advanced Multigrid Solver Generation. In: Software for Exascale Computing – SPPEXA 2nd phase (2020)
[39] Liu, Z.; Chu, X.; Lv, X.; Meng, H.; Shi, S.; Han, W.; Xu, J.; Fu, H.; Yang, G.: SunwayLB: Enabling extreme-scale lattice Boltzmann method based computing fluid dynamics simulations on Sunway TaihuLight. In: IEEE, 2019, S. 557–566
[40] Logg, A.; Mardal, K.-A.; Wells, G.: Automated solution of differential equations by the finite element method: The FEniCS book. Bd. 84. Springer Science & Business Media, 2012
[41] Mierke, D.; Janssen, C.; Rung, T.: An efficient algorithm for the calculation of sub-grid distances for higher-order LBM boundary conditions in a GPU simulation environment. In: Computers & Mathematics with Applications (2018). http://dx.doi.org/10.1016/j.camwa.2018.04.022
[42] OpenLB: . – accessed on 2020-01-21
[43] Palabos: . – accessed on 2020-01-21
[44] Pohl, T.; Kowarschik, M.; Wilke, J.; Iglberger, K.; Rüde, U.: Optimization and profiling of the cache performance of parallel lattice Boltzmann codes. In: Parallel Processing Letters 13 (2003), Nr. 04, S. 549–560
[45] Rathgeber, F.; Ham, D. A.; Mitchell, L.; Lange, M.; Luporini, F.; McRae, A. T.; Bercea, G.-T.; Markall, G. R.; Kelly, P. H.: Firedrake: automating the finite element method by composing abstractions. In: ACM Transactions on Mathematical Software (TOMS) 43 (2016), Nr. 3, S. 1–27
[46] Roman, S.: The formula of Faa di Bruno. In: The American Mathematical Monthly 87 (1980), Nr. 10, S. 805–809
[47] Rüde, U.; Willcox, K.; McInnes, L. C.; Sterck, H. D.: Research and education in computational science and engineering. In: SIAM Review 60 (2018), Nr. 3, S. 707–754
[48] Sailfish: https://github.com/sailfish-team/sailfish. – accessed on 2020-01-21
[49] Schiller: Documents/50/thesis
[50] Schmieschek, S.; Shamardin, L.; Frijters, S.; Krüger, T.; Schiller, U. D.; Harting, J.; Coveney, P. V.: LB3D: A parallel implementation of the Lattice-Boltzmann method for simulation of interacting amphiphilic fluids. In: Computer Physics Communications 17 (2017), S. 149–161. http://dx.doi.org/10.1016/j.cpc.2017.03.013
[51] Schornbaum, F.; Rüde, U.: Massively parallel algorithms for the lattice Boltzmann method on nonuniform grids. In: SIAM Journal on Scientific Computing 38 (2016), Nr. 2, S. C96–C126
[52] Schornbaum, F.; Rüde, U.: Extreme-scale block-structured adaptive mesh refinement. In: SIAM Journal on Scientific Computing 40 (2018), Nr. 3, S. C358–C387
[53] Silva, G.; Semiao, V.: First- and second-order forcing expansions in a lattice Boltzmann method reproducing isothermal hydrodynamics in artificial compressibility form. In: Journal of Fluid Mechanics
[54] Tang, Y.; Chowdhury, R. A.; Kuszmaul, B. C.; Luk, C.-K.; Leiserson, C. E.: The Pochoir stencil compiler. In: Proc. of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures. ACM, 2011, S. 117–128
[55] TCLB: https://github.com/CFD-GO/TCLB. – accessed on 2020-01-21
[56] Treibig, J.; Hager, G.; Wellein, G.: Likwid: A lightweight performance-oriented tool suite for x86 multicore environments. In: IEEE, 2010, S. 207–216
[57] Wellein, G.; Lammers, P.; Hager, G.; Donath, S.; Zeiser, T.: Towards optimal performance for lattice Boltzmann applications on terascale computers. In: Parallel Computational Fluid Dynamics 2005. Elsevier, 2006, S. 31–40
[58] Wellein, G.; Zeiser, T.; Hager, G.; Donath, S.: On the single processor performance of simple lattice Boltzmann kernels. In:
Computersand Fluids
35 (2006), Nr. 8-9, S. 910–919. http://dx.doi.org/10.1016/j.compfluid.2005.02.008. – DOI 10.1016/j.compfluid.2005.02.008. – ISBN0045–7930[59] W
ILKE , J. ; P
OHL , T. ; K
OWARSCHIK , M. ; R ¨
UDE , U. : Cacheperformance optimizations for parallel lattice Boltzmann codes. In:
European Conference on Parallel Processing
Springer, 2003, S. 441–450[60] W
ITTMANN , M. ; H
AAG , V. ; Z
EISER , T. ; K ¨
OSTLER , H. ; W
ELLEIN ,G. : Lattice Boltzmann benchmark kernels as a testbed for performanceanalysis. In:
Computers & Fluids
172 (2018), S. 582–592[61] W
ITTMANN , M. ; Z
EISER , T. ; H
AGER , G. ; W
ELLEIN , G. : Com-parison of different propagation steps for lattice Boltzmann methods.In: