On the design of Monte-Carlo particle coagulation solver interface: a CPU/GPU Super-Droplet Method case study with PySDM
Piotr Bartman and Sylwester Arabas
Jagiellonian University, Kraków, Poland
[email protected], [email protected]
Abstract.
The Super-Droplet Method (SDM) is a probabilistic Monte-Carlo-type model of the particle coagulation process, an alternative to the mean-field formulation of Smoluchowski. As an algorithm, SDM has linear computational complexity with respect to the state vector length, the state vector length is constant throughout the simulation, and most of the algorithm steps are readily parallelisable. This paper discusses the design and implementation of two number-crunching backends for SDM implemented in PySDM, a new free and open-source Python package for simulating the dynamics of atmospheric aerosol, cloud and rain particles (https://github.com/atmos-cloud-sim-uj/PySDM). The two backends share their application programming interface (API) but leverage distinct parallelism paradigms, target different hardware, and are built on top of different lower-level routine sets. The first offers multi-threaded CPU computations and is based on Numba (using Numpy arrays). The second offers GPU computations and is built on top of ThrustRTC and CURandRTC (and does not use Numpy arrays). The paper puts forward a proposal of the Super-Droplet Kernels API featuring data structures and backend-level routines (computational kernels) suitable for performing the computational steps SDM consists of. The presented discussion covers: data dependencies across steps, parallelisation opportunities, CPU and GPU implementation nuances, and the algorithm workflow. Example simulations suitable for validating implementations of the API are presented.
Keywords:
Monte-Carlo · coagulation · Super-Droplet Method · GPU.
* Funding: Foundation for Polish Science (POIR.04.04.00-00-5E1C/18-00).

1 Introduction

The Super-Droplet Method (SDM) introduced in [18] is a computationally efficient Monte-Carlo type algorithm for modelling the process of collisional growth (coagulation) of particles. SDM was introduced in the context of atmospheric modelling, in particular for simulating the formation of cloud and rain (hence the algorithm name) through particle-based simulations. Such simulations couple a grid-based computational fluid-dynamics (CFD) core with a probabilistic, so-called super-particle (hence the algorithm name), representation of the particulate phase, and constitute a unique tool for comprehensive modelling of aerosol-cloud interactions (see, e.g., [3,13]). The probabilistic character of SDM is embodied, among other aspects, in the assumption of each super-particle representing a multiple number of modelled droplets with the same attributes (including particle size and position in space). The super-droplets are thus a coarse-grained view of droplets both in physical space and in attribute space.

The probabilistic description of collisional growth has a wide range of applications across different domains of computational sciences (e.g., astrophysics, aerosol/hydrosol technology including combustion). While focused on SDM and depicted with atmospheric phenomena examples, the material presented herein is generally applicable in the development of software implementing other Monte-Carlo type schemes for coagulation (e.g., Weighted Flow Algorithms [8] and other, see Sect. 1 in [19]), particularly those sharing the concept of super-particles.

The original algorithm description [18], the relevant patent applications (e.g., [20]) and several subsequent works scrutinising SDM (e.g., [2,17,9,23]) expounded upon the algorithm characteristics from the users' (physicists') point of view. The aim of this work is to discuss the algorithm from the software developer's perspective. To this end, the scope of the discussion covers: data dependencies, parallelisation opportunities, state vector structure and helper variable requirements, the minimal set of computational kernels needed to be implemented, and the overall algorithm workflow. Throughout the paper, these concepts are depicted with pseudo-code-mimicking Python snippets (syntactically correct and runnable), but the solutions introduced are not bound to a particular language. In contrast, the gist of the paper is the language-agnostic API (i.e., application programming interface) proposal put forward with the very aim of capturing the implementation-relevant nuances of the algorithm which are, arguably, tricky to discern from the existing literature on SDM, yet which have a profound impact on simulation performance. The API provides an indirection layer separating higher-level physically-relevant concepts from lower-level computational kernels.

Validity of the proposed API design has been demonstrated with two distinct backend implementations included in the newly developed simulation package PySDM [5]. The two backends share the programming interface, while differing substantially in the underlying software components and the targeted hardware (CPU vs. GPU). PySDM is free and open source software; the presented results are readily reproducible with examples included in the v1.1 release of PySDM.

The remainder of this paper is structured as follows. Section 2 briefly introduces the SDM algorithm through a juxtaposition against the alternative classic Smoluchowski's coagulation equation (SCE). Section 3 defines the proposed backend API. Section 4 presents simulations performed with the PySDM package, based on benchmark setups from the literature and used to depict the algorithm scalability and CPU vs. GPU performance.
Section 5 concludes the work enumerating the key points brought out in the paper.
2 Collisional growth: from the Smoluchowski equation to the Super-Droplet Method

2.1 Smoluchowski's coagulation equation (SCE)

The population balance equation which describes collisional growth is historically known as Smoluchowski's coagulation equation (SCE) and was introduced in [22,21] (for classic and recent overviews, see e.g. [7] and [14], respectively). Notably, it is formulated under the mean-field assumptions of a sufficiently large system and of neglected correlations between the numbers of droplets of different sizes (for discussion, see [9]).

Let the function c(x, t): R+ × R+ → R+ correspond to the particle size spectrum and describe the average concentration of particles of size x at time t in a well-mixed volume of air V. Smoluchowski's coagulation equation describes the evolution of the particle size spectrum in time due to collisions. For convenience, t is skipped in the notation: c(x) = c(x, t), while ċ denotes the partial derivative with respect to time:

$$\dot{c}(x) = \frac{1}{2}\int_0^x a(y, x-y)\,c(y)\,c(x-y)\,dy - \int_0^\infty a(y, x)\,c(y)\,c(x)\,dy \qquad (1)$$

where a(x_1, x_2) is the so-called kernel which defines the rate of collisions (and coagulation) between particles of sizes x_1 and x_2, and a(x_1, x_2) = a(x_2, x_1). The first term on the right-hand side is the production of particles of size x by coalescence of two smaller particles, and the factor 1/2 is for avoiding double counting. The second term represents the reduction in the number of colliding particles due to coagulation.

The Smoluchowski equation has an alternative form that is discrete in size space. Let x_1 be the smallest considered difference of size between particles, x_i = i·x_1, i ∈ N and c_i = c(x_i); then:

$$\dot{c}_i = \frac{1}{2}\sum_{k=1}^{i-1} a(x_k, x_{i-k})\,c_k\,c_{i-k} - \sum_{k=1}^{\infty} a(x_k, x_i)\,c_k\,c_i \qquad (2)$$

Analytic solutions to the equation are known only for simple kernels ([14]), such as: constant a(x_1, x_2) = 1, additive a(x_1, x_2) = x_1 + x_2 (Golovin's kernel) or multiplicative a(x_1, x_2) = x_1·x_2. Taking atmospheric physics as an example, collisions of droplets within a cloud occur through differential movement of particles caused by a combination of gravitational, electrical, or aerodynamic forces, among which the gravitational effects dominate. As such, sophisticated kernels are needed to describe these phenomena, and hence numerical methods are required for solving the coagulation problem. However, when multiple properties of particles (such as volume, temperature, concentration of chemical compounds, etc.) need to be taken into account, the numerical methods for SCE suffer from the curse of dimensionality due to the need to distinguish particles of the same size x but different properties.

Additionally, it is worth highlighting that, in practice, the assumptions of the Smoluchowski equation may be difficult to meet. First, the particle size changes at the same time due to processes other than coalescence (condensation/evaporation). Second, it is assumed that the system is large enough and the droplets inside are uniformly distributed, which in turn is only true for a small volume in the atmosphere. Moreover, using the Smoluchowski equation, which describes the evolution of the mean state of the system, leads to deterministic simulations. The alternative to the Smoluchowski equation are Monte-Carlo methods based on a stochastic model of the collision-coalescence process.
Embracing the stochastic approach enables simulating ensembles of realisations of the process, aiding studies of rare (far-from-the-mean) phenomena like rapid precipitation onset (see discussion in [13]).

Despite the above limitations, the Smoluchowski equation and the analytical solutions of SCE are important for validation of alternative methods and are used for this purpose herein.
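To illustrate eq. (2), a minimal explicit-Euler integration sketch for the discrete SCE with the additive (Golovin) kernel is given below; it is not part of the proposed API, and the kernel constant b, the grid layout and the time step handling are illustrative assumptions:

import numpy as np

def sce_golovin_step(c, x1, b, dt):
    # one explicit-Euler step of eq. (2) for the additive kernel a(x, y) = b * (x + y);
    # c[i] approximates c_i = c(x_i) with x_i = (i + 1) * x1 (0-based array indexing)
    size = (np.arange(len(c)) + 1) * x1
    dc = np.zeros_like(c)
    for i in range(len(c)):
        # production of size-x_i particles from pairs (x_k, x_{i-k});
        # the 1/2 factor avoids double counting
        gain = 0.5 * sum(b * (size[k] + size[i - k - 1]) * c[k] * c[i - k - 1]
                         for k in range(i))
        # loss of size-x_i particles due to collisions with particles of any size
        loss = c[i] * sum(b * (size[k] + size[i]) * c[k] for k in range(len(c)))
        dc[i] = gain - loss
    return c + dt * dc

Note the O(n²) cost per step of such a grid-based discretisation, to be contrasted with the O(n_sd) cost of SDM discussed below.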
2.2 The Super-Droplet Method (SDM)

All the mentioned problems related to the Smoluchowski equation were the reason to look for alternative methods. The stochastic coalescence model for cloud droplet growth was already analysed by Gillespie in [11]. There were, however, still challenges (described below) related to computational and memory complexity, which were solved and applied by Shima et al. in [18] (and extended to cover clouds featuring the ice phase of water in [19]). The Monte-Carlo numerical scheme introduced in [18] is referred to as the Super-Droplet Method (SDM).

Tracking all particles in large simulations has a prohibitively high computational cost, which is why the notion of super-particles is introduced; still, in many cases a super-particle can be treated in the same way as a real particle. For convenience, let Ξ_part and Ξ_sd be indexed sets of real droplets and super-droplets, respectively. Each element of these sets has a unique index i ∈ [0, 1, ..., n_part − 1] or i ∈ [0, 1, ..., n_sd − 1], where n_part and n_sd denote the numbers of considered droplets and super-droplets, respectively. Attributes related to a specific (super-)droplet i are denoted by attr[i]. Example droplet attributes are: position, volume v of the droplet, volume of substances dissolved in the particle.

In the probabilistic particle-based approach, each super-droplet represents a multiple number of droplets with the same attributes (including position) and is assigned a multiplicity value, denoted by ξ ∈ N, which can be different for each super-droplet. In SDM, the multiplicity ξ[i] of a super-droplet of size x[i] is conceptually related to c_i = c(x_i) from equation (2).

The number of super-droplets must be sufficient to cover the phase space of particle properties, and the higher the number, the smaller the multiplicities and the higher the fidelity of the discretisation (see discussion in [18,19]).

The algorithm steps, as defined in [18], can be outlined as follows, and Table 1 summarises the main differences between SCE and SDM.

Step 1: cell-wise shuffling
In SCE, all considered droplets are compared with each other, and a fraction of droplets of size x[i] collides with a fraction of droplets of size x[j] in each time step. In contrast, in SDM a random set of ⌊n_sd/2⌋ non-overlapping pairs is considered in each time step, where n_sd is the number of super-droplets.

To ensure linear computational complexity, the mechanism for choosing random non-overlapping pairs must also have at most linear computational complexity. It can be obtained by permuting an indexed array of super-droplets, where each pair consists of super-droplets with indices (2j) and (2j + 1), j ∈ [0, 1, 2, ..., ⌊n_sd/2⌋) (see appendix A in [18]).
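For the single-cell (zero-dimensional) case, this permutation-based pair selection can be sketched as follows (a Numpy-based illustration; the names are not part of the proposed API):

import numpy as np

def sample_pairs(n_sd, rng):
    # a random permutation yields floor(n_sd / 2) non-overlapping pairs
    # (idx[2j], idx[2j+1]) in O(n_sd) operations
    idx = rng.permutation(n_sd)
    return idx[:2 * (n_sd // 2)].reshape(-1, 2)

pairs = sample_pairs(7, np.random.default_rng(seed=44))  # 3 pairs, 1 droplet unpaired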
Step 2: collision probability evaluation

If a fraction of droplets were allowed to collide, a collision of two super-droplets could produce a third super-droplet of a different size. To represent such a collision event, a new super-droplet would need to be added to the system, which leads to: an increasing number of super-droplets in the system, smaller and smaller multiplicities, and increasing memory demands during the simulation. Moreover, it is hard to predict at the beginning of a simulation how much memory is needed (see discussion in [15]). Adding new super-droplets can be avoided by mapping colliding super-droplets to a multi-dimensional grid (e.g., 2D for particle volume and its dry size), simulating coalescence on this grid using an SCE-based approach, and discretising the results back into Lagrangian particles, as proposed in [1]. Such an approach, however, not only entails additional computational cost, but most importantly is characterised by the very curse of dimensionality and numerical diffusion that were meant to be avoided by using the particle-based technique.

SDM avoids the issue of unpredictable memory complexity stemming from the need of allocating space for a freshly collided super-particle of size that differs from the size of the colliding ones. To this end, in SDM only collisions of all of min{ξ[i], ξ[j]} droplets are considered, and thus a collision of two super-droplets always produces 1 or 2 super-droplets. This means there is no need for adding extra super-droplets to the system during the simulation.

For each pair, the probability of collision of the super-droplets is determined by up-scaling the probability of collision of two real droplets by the number of real droplets represented by the super-droplet with the higher multiplicity (max{ξ[i], ξ[j]}). As a result, the computational complexity is O(n_sd) instead of O(n_sd²), which is a significant gain.

The evaluated probability of collision of super-droplets requires further up-scaling due to the reduced number of sampled candidate pairs, as outlined in Step 1 above. To this end, the probability is multiplied by the ratio of the number of all non-overlapping pairs to the number of sampled pairs: ((n_sd² − n_sd)/2) / ⌊n_sd/2⌋.

For each selected pair of j-th and k-th super-droplets, the probability p of collision of the super-droplets within the time interval Δt is thus evaluated as:

$$p = a(v^{[j]}, v^{[k]})\,\max\{\xi^{[j]}, \xi^{[k]}\}\,\frac{(n_{sd}^2 - n_{sd})/2}{\lfloor n_{sd}/2 \rfloor}\,\frac{\Delta t}{\Delta V} \qquad (3)$$

where a is a coalescence kernel (e.g. based on collision cross-section geometry) and ΔV is the considered volume (of air) assumed to be mixed well enough to neglect the positions of particles when evaluating the probability of collision.
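The per-pair evaluation of eq. (3) may be sketched as below (vectorised Numpy, single-cell case; the additive kernel and all names here are illustrative assumptions, not the proposed API):

import numpy as np

def collision_probability(xi, v, pairs, dt, dv, b=1500.0):
    # eq. (3): kernel value, up-scaled by the higher multiplicity and by the
    # ratio of the number of all non-overlapping pairs to the sampled ones
    n_sd = len(xi)
    j, k = pairs[:, 0], pairs[:, 1]
    kernel = b * (v[j] + v[k])                       # additive (Golovin) kernel example
    upscale = (n_sd * (n_sd - 1) / 2) / (n_sd // 2)  # pair-sampling up-scaling factor
    return kernel * np.maximum(xi[j], xi[k]) * upscale * dt / dv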
Table 1.
Comparison of SCE and SDM approaches to modelling collisional growth.
                           SCE (mean-field)                   SDM (probabilistic)
considered pairs           all (i, j) pairs                   random set of ⌊n_sd/2⌋ pairs
                                                              (up-scaled by ((n_sd² − n_sd)/2) / ⌊n_sd/2⌋)
computational complexity   O(n_sd²)                           O(n_sd)
collisions                 a fraction of ξ[i], ξ[j] collide   all of min{ξ[i], ξ[j]} collide
                                                              (all or nothing)
collisions triggered       every time step                    by comparing probability with
                                                              a random number

Step 3: collision triggering and attribute updates
In the spirit of Monte-Carlo methods, the collisions are triggered with a random number φ_γ ∼ Uniform[0, 1), and the number of collision events γ (per timestep) is:

$$\gamma = \lceil p - \phi_\gamma \rceil \qquad (4)$$

with γ ∈ N. Noting that the rate γ can be greater than 1, a further adjustment is introduced with the aim of representing multiple collisions of super-particles:

$$\tilde{\gamma} = \min\{\gamma,\ \lfloor \xi^{[j]}/\xi^{[k]} \rfloor\} \qquad (5)$$

where ξ[j] > ξ[k] was assumed (without loss of generality).

Coalescence of two droplets results in the creation of a new droplet with new attributes. Two kinds of attributes are distinguished: extensive A_ex ∈ R^{n_ex} and intensive A_in ∈ R^{n_in}, where n_ex and n_in are the numbers of extensive and intensive attributes, respectively. During coalescence, the values of extensive attributes of the collided droplets add up. In the intensive case, the result is a volume-weighted average of the pre-collision values. Examples of extensive attributes are: volume, dry volume or heat content; examples of intensive attributes are: temperature or concentration of chemical compounds. Values of attributes of super-droplets after the collision are denoted by a hat. There are two cases:

1. ξ[j] − γ̃ξ[k] > 0:

$$\hat{\xi}^{[j]} = \xi^{[j]} - \tilde{\gamma}\xi^{[k]}, \quad \hat{A}_{ex}^{[j]} = A_{ex}^{[j]}, \quad \hat{A}_{in}^{[j]} = A_{in}^{[j]}$$
$$\hat{\xi}^{[k]} = \xi^{[k]}, \quad \hat{A}_{ex}^{[k]} = A_{ex}^{[k]} + \tilde{\gamma} A_{ex}^{[j]}, \quad \hat{A}_{in}^{[k]} = \frac{A_{in}^{[k]} v^{[k]} + \tilde{\gamma} A_{in}^{[j]} v^{[j]}}{v^{[k]} + \tilde{\gamma} v^{[j]}} \qquad (6)$$

2. ξ[j] − γ̃ξ[k] = 0:

$$\hat{\xi}^{[j]} = \lfloor \xi^{[k]}/2 \rfloor, \quad \hat{A}_{ex}^{[j]} = \hat{A}_{ex}^{[k]}, \quad \hat{A}_{in}^{[j]} = \hat{A}_{in}^{[k]}$$
$$\hat{\xi}^{[k]} = \xi^{[k]} - \lfloor \xi^{[k]}/2 \rfloor, \quad \hat{A}_{ex}^{[k]} = A_{ex}^{[k]} + \tilde{\gamma} A_{ex}^{[j]}, \quad \hat{A}_{in}^{[k]} = \frac{A_{in}^{[k]} v^{[k]} + \tilde{\gamma} A_{in}^{[j]} v^{[j]}}{v^{[k]} + \tilde{\gamma} v^{[j]}} \qquad (7)$$

If in eq. (7) the multiplicity ξ̂[j] = 0, the j-th super-droplet is removed from the system (this happens only if, as a result of a collision, one single real droplet is created). The conceptual view of a collision of two super-droplets is intuitively depicted in Fig. 1 in [18].

3 The proposed backend API

The proposed API is composed of four data structures (classes) and nine library routines (involving computational kernels). The description below outlines both the general, implementation-independent structure of the API, as well as selected aspects pertaining to the experience from implementing the CPU and GPU backends in the PySDM package. These were built on top of the Numba [16] and ThrustRTC [25] Python packages, respectively. Numba is a just-in-time (JIT) compiler for Python code; it features extensive support for Numpy constructs and offers multi-threading constructs akin to the OpenMP infrastructure. ThrustRTC uses the NVIDIA CUDA real-time compilation infrastructure, offering a high-level Python interface for execution of both built-in and custom computational kernels on GPU.

In multi-dimensional simulations coupled with a grid-based CFD fluid flow solver, the positions of droplets within the physical space are used to split the super-particle population among grid cells (the CFD solver mesh). Since the positions of droplets change based on the fluid flow and the droplet mass/shape characteristics, the particle-cell mapping changes throughout the simulation. Collisions are considered only among particles belonging to the same grid cell, and the varying number of particles within a cell needs to be tracked at each timestep to evaluate the super-particle collision rates. Figure 1 outlines the setting.
Fig. 1.
Schematic of how a CFD grid is populated with immersed super-particles.
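As an aside, the particle-cell mapping depicted in Fig. 1 can be computed by flooring particle positions by the grid spacing; a minimal Numpy sketch assuming a uniform grid (names and signature are illustrative, not part of the proposed API):

import numpy as np

def recalculate_cell_id(position, grid_step, grid_shape):
    # position: (n_dims, n_sd) array; returns a ravelled (scalar) cell id
    # per super-particle, usable as the sort-by-key key discussed below
    ij = np.floor(position / grid_step).astype(int)
    return np.ravel_multi_index(tuple(ij), grid_shape)

cell_id = recalculate_cell_id(np.array([[0.5, 1.5], [0.25, 0.75]]),
                              grid_step=1.0, grid_shape=(2, 1))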
The Storage class is a base container which is intended to adapt the interfaces of the underlying implementation-dependent array containers (in PySDM: Numpy or ThrustRTC containers for the CPU and GPU backends, respectively). This removes the dependency on the underlying storage layer in the code utilising the Super-Droplet Kernels API.

The Storage class has three attributes: data (in PySDM: an instance of Numpy ndarray or an instance of ThrustRTC DVVector), shape (which specifies size and dimension) and dtype (data type: float or int). The proposed API employs one- and two-dimensional arrays; implementations of the Storage class feature an indirection layer handling the multi-dimensionality in case the underlying library supports one-dimensional indexing only (as in ThrustRTC).
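A minimal CPU-side sketch of the Storage contract described above (Numpy-backed; the actual PySDM implementation additionally handles, e.g., host-device transfers on the GPU backend):

import numpy as np

class Storage:
    def __init__(self, shape, dtype):
        assert dtype in (float, int)
        self.shape = shape if isinstance(shape, tuple) else (shape,)
        self.dtype = dtype
        self.data = np.zeros(self.shape, dtype=dtype)  # allocated once per simulation

    def __getitem__(self, item):
        # debug-friendly indexing; Numpy slices are views, preserving the
        # memory-view (not copy) semantics required for non-debug usage
        result = Storage.__new__(Storage)
        result.data = self.data[item]
        result.shape, result.dtype = result.data.shape, self.dtype
        return result

The volume_view = extensive[0:1, :] line in Fig. 2 below relies on such view semantics.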
idx            = Index(N_SD, int)
multiplicities = IndexedStorage(idx, N_SD, int)
intensive      = IndexedStorage(idx, (N_IA, N_SD), float)
extensive      = IndexedStorage(idx, (N_EA, N_SD), float)
volume_view    = extensive[0:1, :]
cell_id        = IndexedStorage(idx, N_SD, int)
cell_idx       = Index(N_CELL, int)
cell_start     = Storage(N_CELL + 1, int)
pair_prob      = Storage(N_SD // 2, float)
pair_flag      = PairIndicator(N_SD, bool)
u01            = Storage(N_SD, float)

Fig. 2.
Simulation state example with N_SD super-particles, N_CELL grid cells, N_IA intensive and N_EA extensive attributes.
Storage handles memory allocation and, optionally, the host-device (CPU-accessible and GPU-accessible memory) data transfers. Equipping Storage with an override of the [ ] operator, as done in PySDM, can be helpful for debugging and unit testing (in Python, Storage instances may then be directly used with max(), min(), etc.); care needs to be taken to ensure memory-view semantics for non-debug usage, though. Explicit allocation is done only once per simulation.

The IndexedStorage subclass of Storage is intended as a container for super-particle attributes. In SDM, at each step of the simulation, a different order of particles needs to be considered. To avoid repeated permutations of the attribute values, the Index subclass of Storage is introduced. One instance of Index is shared between IndexedStorage instances and is referenced by the idx field. The Index class features permutation and array-shrinking logic (allowing for removal of super-droplets from the system). To generate a permutation of particles, it is enough to permute the Index only. To handle multi-dimensional simulations, Index features sort-by-key logic where the cell_id attribute is used as the key.

The PairIndicator class provides an abstraction layer facilitating pairwise operations inherent to several steps of the SDM workflow. In principle, it represents a Boolean flag per each super-particle indicating whether, in the current state (affected by random shuffling and physical displacement of particles), a given particle is considered as first-in-a-pair. When updating PairIndicator, it must be ensured that the next particle according to a given Index is the second one in the pair – i.e., resides in the same grid cell. The PairIndicator is also used to seamlessly handle odd and even counts of super-particles within a cell; a sketch of such pair-flagging logic is given below.

Figure 2 lists a minimal set of instances of the data structures constituting an SDM-based simulation state, elaborated on in the following section.
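A serial CPU sketch of the logic behind such pair flagging (simplified signature; traversal-order control via cell_idx and parallelisation, both present in PySDM, are omitted here for brevity):

import numpy as np

def flag_pairs(pair_flag, cell_start):
    # pair_flag[i] refers to the i-th position in the cell-sorted idx ordering;
    # every second particle within a cell opens a pair, and the trailing
    # particle in a cell with an odd particle count remains unflagged
    pair_flag[:] = False
    for cell in range(len(cell_start) - 1):
        begin, end = cell_start[cell], cell_start[cell + 1]
        for i in range(begin, end - 1, 2):
            pair_flag[i] = True

pair_flag = np.empty(6, dtype=bool)
flag_pairs(pair_flag, cell_start=np.array([0, 3, 6]))  # two 3-particle cells:
                                                       # pairs at (0,1) and (3,4)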
The algorithm workflow, encoded in Python using the proposed data structures and divided into the same algorithm steps as outlined in Sec. 2, is given in Fig. 3. An additional Step 0 is introduced to account for handling the removal of zero-multiplicity super-particles at the beginning of each timestep.
The cell_id attribute represents the particle-cell mapping. Furthermore, the cell_idx helper Index instance can be used to specify the order in which grid cells are traversed – for parallel execution scheduling purposes. Both cell_id and cell_idx are updated before entering the presented block of code.

Step 1 logic begins with the generation of random numbers used for the random permutation of the particles (in PySDM, CURandRTC [24] is used for parallel execution). It serves as a mechanism for random selection of super-particle pairs in each grid cell. First, a shuffle-per-cell step is done in which the cell_start array is updated to indicate the location within the cell-sorted idx where the set of super-particles belonging to a given grid cell starts. Second, the pair_flag indicator is updated using the particle-cell mapping embodied in cell_start.

It is worth noting that the shuffle-per-cell operation can be implemented using either (i) a domain-global permutation followed by a global sort-by-cell (used for GPU in PySDM) or (ii) a cell-local permutation, in which case the global sort-by-cell operation is done beforehand (used for CPU in PySDM). The strategy depends on the computational complexity of the employed permutation and sorting algorithms (solutions such as MergeShuffle [4] are to be tested). For the sort-by-cell operation, PySDM features a parallel counting sort with linear complexity.

Instructions in the Step 2 and Step 3 blocks correspond to the evaluation of the subsequent terms of equation (3) and to the collision event handling. To exemplify the way the GPU and CPU backends are engineered in PySDM, the implementations of the times_max routine with ThrustRTC and of the update_attributes routine with Numba are given in Fig. 4.

remove_if_equal_0(idx, multiplicities)                       # Step 0 (inout: idx)
urand(u01)                                                   # Step 1 (out: u01)
shuffle_per_cell(cell_start, idx, cell_idx, cell_id, u01)    #   (out: cell_start; inout: idx)
flag_pairs(pair_flag, cell_id, cell_idx, cell_start)         #   (out: pair_flag)
coalescence_kernel(pair_prob, pair_flag, volume_view)        # Step 2 (out: pair_prob)
times_max(pair_prob, multiplicities, pair_flag)              #   (inout: pair_prob)
normalize(pair_prob, dt, dv, cell_id, cell_idx, cell_start)  #   (inout: pair_prob)
urand(u01[:N_SD // 2])                                       # Step 3 (out: u01)
compute_gamma(pair_prob, u01[:N_SD // 2])                    #   (inout: pair_prob)
update_attributes(multiplicities, intensive, extensive,      #   (inout: multiplicities,
                  volume_view, pair_prob)                    #    intensive, extensive)

Fig. 3. Algorithm workflow within a timestep; in/out comments mark argument intent.

import ThrustRTC as trtc

_kernel = trtc.For(['data_out', 'perm_in', 'pair_flag'], "i", '''
    if (pair_flag[i])
        data_out[i/2] *= max(perm_in[i], perm_in[i + 1]);
''')

def _launch(length, data_out, data_in, pair_flag, idx):
    perm_in = trtc.DVPermutation(data_in, idx)
    _kernel.launch_n(length, [data_out, perm_in, pair_flag])

def times_max(this, other, pair_flag):
    _launch(len(other.idx), this.data, other.data,
            pair_flag.indicator.data, other.idx.data)

import numba
from numba import prange

@numba.njit(parallel=True, error_model='numpy')
def _update_attributes(length, n, intensive, extensive, volume, idx, gamma):
    for i in prange(length // 2):
        j = idx[2 * i]
        k = idx[2 * i + 1]
        if n[j] < n[k]:
            j, k = k, j
        g = min(int(gamma[i]), int(n[j] / n[k]))  # eq. (5)
        if g == 0:
            continue
        new_n = n[j] - g * n[k]
        if new_n > 0:  # eq. (6)
            n[j] = new_n
            for attr in range(0, len(intensive)):
                intensive[attr, k] = (
                    intensive[attr, k] * volume[k]
                    + intensive[attr, j] * volume[j] * g
                ) / (volume[k] + g * volume[j])
            for attr in range(0, len(extensive)):
                extensive[attr, k] += g * extensive[attr, j]
        else:  # eq. (7)
            n[j] = n[k] // 2
            n[k] = n[k] - n[j]
            for attr in range(0, len(intensive)):
                intensive[attr, j] = (
                    intensive[attr, k] * volume[k]
                    + intensive[attr, j] * volume[j] * g
                ) / (volume[k] + g * volume[j])
                intensive[attr, k] = intensive[attr, j]
            for attr in range(0, len(extensive)):
                extensive[attr, j] = extensive[attr, j] * g \
                    + extensive[attr, k]
                extensive[attr, k] = extensive[attr, j]

def update_attributes(n, intensive, extensive, volume, gamma):
    _update_attributes(len(n.idx), n.data, intensive.data, extensive.data,
                       volume.data, n.idx.data, gamma.data)
Fig. 4. GPU (top panel) and CPU (bottom panel) routine implementation examples.
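Complementing Fig. 4, the compute_gamma routine from the Fig. 3 workflow can be sketched as follows (a Numpy-based illustration; eq. (4) is evaluated in place, while the eq. (5) limiting is applied later within update_attributes, as in the bottom panel of Fig. 4):

import numpy as np

def compute_gamma(pair_prob, u01):
    # in place: pair_prob holds the probability p on input and the number
    # of collision events gamma = ceil(p - phi) on output (eq. 4)
    pair_prob[:] = np.ceil(pair_prob - u01)

For p < 1 this yields 1 with probability p and 0 otherwise, while p > 1 triggers multiple-collision events.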
4 Example simulations

This section outlines a set of test cases useful in validating implementations of the proposed API. First, a simulation constructed analogously to that reported in Fig. 2 in the original SDM paper [18] is proposed. The aim is to corroborate results obtained with SDM against Golovin's [12] analytical solution to the Smoluchowski equation for the additive a(x_1, x_2) = x_1 + x_2 coalescence kernel. Results obtained with PySDM are presented in Figure 5.

Figure 6 documents simulation wall times for the above test case, measured as a function of the number of super-particles employed. Wall times for the CPU (Numba) and GPU (ThrustRTC) backends are compared, depicting a five-fold GPU-to-CPU speedup for large state vectors (tested on commodity hardware: Intel Core i7-10750H CPU and NVIDIA GeForce RTX 2070 Super Max-Q GPU). Furthermore, the expected asymptotic linear scaling is hinted at.

Figure 7 exemplifies simulations with more realistic coalescence kernels (see figure caption). Since the analytic solution is not known in such cases, the results are juxtaposed with figures reproduced from a classic paper on the subject [6]. In all plots, the results of SDM are plotted by aggregating super-particle multiplicities onto a grid of ca. 100 bins, and smoothing the result twice with a running average with a centred five-bin window.

Fig. 5. Particle mass spectrum dm/dln(r) as a function of particle radius, at t = 0, 1200, 2400 and 3600 s: SDM results (colour) vs. analytic solution (black).

Fig. 6. Simulation wall time as a function of state vector size (number of super-particles) for the ThrustRTC and Numba backends.
Fig. 7. Left panels show Figs. 5 and 8 from [6]; right panels depict solutions obtained with PySDM (output plotted at 100 s intervals, up to t = 1000 s in the top and t = 850 s in the bottom panels). Top: gravitational kernel; bottom: kernel modelling electric field effects.
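The spectrum-plotting procedure described in this section can be sketched as follows (the bin layout and the water-density constant are assumptions of this illustration):

import numpy as np

def mass_spectrum(radius, multiplicity, volume, dv, n_bins=100):
    # aggregate super-particle multiplicities onto a log-spaced radius grid,
    # yielding dm/dln(r) per unit volume of air
    bins = np.logspace(np.log10(radius.min()), np.log10(radius.max()), n_bins + 1)
    water_mass = multiplicity * volume * 1e3  # kg, assuming 1e3 kg/m3 water density
    hist, _ = np.histogram(radius, bins=bins, weights=water_mass)
    return bins, hist / np.diff(np.log(bins)) / dv

def smooth(spectrum, passes=2, window=5):
    # centred running average applied repeatedly (twice in the plots above)
    for _ in range(passes):
        spectrum = np.convolve(spectrum, np.ones(window) / window, mode='same')
    return spectrum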
5 Conclusions

This paper has introduced the Super-Droplet Kernels API – a set of data structures and computational kernels constituting a number-crunching backend API for the Super-Droplet Method, a Monte-Carlo algorithm for representing collisional growth (coagulation) of particles. Employment of the API assures separation of concerns, in particular the separation of the parallelisation logic embedded in the backend-level routines from the domain logic pertinent to the algorithm workflow. This improves code readability and paves the way for modular design and testability, which all contribute to the maintainability of the code base.

The presented SDM algorithm and API descriptions discern data dependencies across the steps of the algorithm (including in/out parameter "intent") and highlight parallelisation opportunities in the different steps. Most of the discerned steps of the SDM algorithm are characterised by some degree of freedom in terms of their implementation. Embracing the API shall help in introducing a dependency injection mechanism allowing unit testing of selected steps and profiling performance with different variants of implementation.

The design of the API has been proved, in the sense of the reused abstraction principle, within the PySDM project [5], where two backends sharing the API offer contrasting implementations for CPU and GPU computations. Both backends are implemented in Python; however: (i) they are targeting different hardware (CPU vs. GPU), (ii) they are based on different underlying technology (Numba: LLVM-based JIT compiler, and ThrustRTC: NVRTC-based runtime compilation mechanism for CUDA), and (iii) they do not even share the de-facto-standard Python Numpy arrays as the storage layer. This highlights that the introduced API is not bound to particular implementation choices, and in particular its design is applicable to languages other than Python.

At the same time, the case of PySDM is an apt example of how the Python language can be used for high-performance computational modelling without trading off the salient features of Python:
– succinct syntax: the pseudo-code snippets presented in the paper are actually valid Python code;
– seamless portability: depicted in PySDM with continuous integration workflows for Linux, macOS and Windows, including 32-bit and 64-bit runs;
– unmatched interoperability: depicted in PySDM with Matlab and Julia usage examples which do not require any additional binding logic within PySDM;
– multifaceted ecosystem: including a vibrant market of cloud-computing solutions, depicted in PySDM with one-click execution of Jupyter notebooks on the mybinder.org and colab.research.google.com platforms;
– availability of tools for modern hybrid hardware: depicted in PySDM with the GPU backend.

It is worth noting that the above features play well together. For instance, the GPU backend of PySDM featuring the pseudo-code-like API can be leveraged through Jupyter notebooks in the cloud, as well as from within Matlab code on different operating systems.

Bringing the discussion back to the characteristics of the SDM algorithm, it is worth pointing out that, in the super-particle CFD-coupled simulation context SDM was introduced and gained attention in, the scheme is particularly well suited for leveraging modern hybrid CPU-GPU hardware. First, the algorithm is (almost) embarrassingly parallel. Second, the CPU-GPU transfer overhead is not a bottleneck when a GPU-resident dispersed-phase representation (super-particles) is coupled with CPU-computed CFD for the continuous phase (fluid flow), as only statistical moments of the size spectrum of the particles are needed for the CFD coupling (see [2]). Third, the CPU and GPU resources on hybrid hardware can be leveraged effectively, as the CFD and super-particle components of the simulation system can run simultaneously (see [10]).

Overall, the proposed API prioritises simplicity and was intentionally presented in a paradigm-agnostic, pseudo-code-mimicking way, leaving such aspects as object orientation up to the implementation. Moreover, while the presented API includes data structures and algorithm steps pertinent to multi-dimensional CFD-grid coupling, the presented examples featured zero-dimensional setups, for brevity. In PySDM, the backend API is extended to handle a general form of coalescence kernels, as well as representation of particle displacement and condensational growth of particles, and the examples shipped with the package include multi-dimensional simulations. The SDM algorithm is applicable in other domains of computational modelling, and the scope of the presented material covered the aspects relevant for coagulation modelling for a wide range of phenomena.

Acknowledgements
We thank Shin-ichiro Shima and Michael Olesik for helpful discussions. Thanks are due to the Numba and ThrustRTC developers for responding to our questions, bug reports and feature requests posted during the development of PySDM.
References
1. Andrejczuk, M., Grabowski, W.W., Reisner, J., Gadian, A.: Cloud-aerosol interactions for boundary layer stratocumulus in the Lagrangian cloud model. J. Geophys. Res. Atmos. (2010), https://doi.org/10.1029/2010JD014248
2. Arabas, S., Jaruga, A., Pawlowska, H., Grabowski, W.W.: libcloudph++ 1.0: a single-moment bulk, double-moment bulk, and particle-based warm-rain microphysics library in C++. Geosci. Model Dev. (2015), https://doi.org/10.5194/gmd-8-1677-2015
3. Arabas, S., Shima, S.: Large-eddy simulations of trade wind cumuli using particle-based microphysics with Monte Carlo coalescence. J. Atmos. Sci. (2013), https://doi.org/10.1175/JAS-D-12-0295.1
4. Bacher, A., Bodini, O., Hollender, A., Lumbroso, J.: MergeShuffle: a very fast, parallel random permutation algorithm (2015), https://arxiv.org/abs/1508.03167
5. Bartman, P., et al.: PySDM (2021), https://github.com/atmos-cloud-sim-uj/PySDM
6. Berry, E.: Cloud droplet growth by collection. J. Atmos. Sci. (1967), https://doi.org/10.1175/1520-0469(1967)024<0688:CDGBC>2.0.CO;2