Fine-sorting One-dimensional Particle-In-Cell Algorithm with Monte-Carlo Collisions on a Graphics Processing Unit
Philipp Mertmann a,∗, Denis Eremin b, Thomas Mussenbrock b, Ralf Peter Brinkmann b, Peter Awakowicz a

a Institute for Electrical Engineering and Plasma Technology, Ruhr-Universität Bochum, Universitätsstr. 150, 44801 Bochum, Germany, Tel: ++49 (0)234 3228062, Fax: ++49 (0)234 3214230
b Institute for Theoretical Electrical Engineering, Ruhr-Universität Bochum

∗ Corresponding author. Email address: [email protected] (Philipp Mertmann)
Abstract
Particle-in-cell (PIC) simulations with Monte-Carlo collisions are used in plasma science to explore a variety of kinetic effects. One major problem is the long run-time of such simulations. Even on modern computer systems, PIC codes take a considerable amount of time for convergence. Most of the computations can be massively parallelized, since particles behave independently of each other within one time step. Current graphics processing units (GPUs) offer an attractive means for execution of the parallelized code. In this contribution we show a one-dimensional PIC code running on Nvidia® GPUs using the CUDA™ environment. A distinctive feature of the code is that the size of the cells that the code uses to sort the particles with respect to their coordinates is comparable to the size of the grid cells used for discretization of the electric field. Hence, we call the corresponding algorithm "fine-sorting". Implementation details and optimization of the code are discussed and the speed-up compared to classical CPU approaches is computed.

Keywords:
GPU, PIC, Particle-In-Cell, sorting algorithm, CUDA
1. Introduction
In the past years, the development of central processing units (CPUs) for consumers changed from just increasing the clock frequency and integrating special functions into the core to the design of processors with multiple cores. Accordingly, the use of parallel computing becomes more and more important, even though most computer programs presently do not take advantage of this possibility. One branch that optimized the parallelization perfectly and also optimized their hardware for that purpose is the manufacturing industry of graphics cards. Modern 3-D applications allow for the parallel computation of many independent values, such as pixels on a display.

The number of scientific efforts that take advantage of the computational power of graphics processing units (GPUs) is growing. Recently, many publications appeared in which calculations benefit from the speed of graphics processing units [1, 2, 3, 4, 5, 6, 7]. The main convenience of this approach is a high-speed multi-core processing unit that is comparatively low priced.

Low pressure plasmas play a major role in a wide range of industrial applications, including, among other things, semiconductor processing, surface coating or sterilization processes. Characteristics of these gas discharges, such as pressure, input power, reactor sizes, length scales of particle collisions or the typical length of electric field screening (Debye length), span several orders of magnitude, depending on the actual application. This makes modeling extremely difficult, because many assumptions or simplifications of a specific model are usually only applicable to a particular case.

Fluid models, which consider velocity distribution functions (mostly Maxwellian), become improper under certain conditions, due to their nature of using integral quantities and principles like diffusion. PIC simulations [8, 9, 10] trace particles (such as electrons and ions) on their way inside a plasma, being affected by collisions, the electric field and the walls of the reactor. Thereby, distribution functions of particles are approximated by PIC simulations. By averaging (time or space integrations) over certain quantities, macroscopic and measurable values such as particle density, flux and current can be calculated. Since only a few assumptions are made, particle-in-cell (PIC) simulations retain most of the fundamental, nonlinear effects in a plasma.

In this article we present a PIC code parallelized for execution on a graphics card, using Nvidia's CUDA environment in C. In the second section we give a short introduction to particle-in-cell simulations. In section 3 the algorithm is explained in detail. In section 4 we run the code under different conditions and obtain parameters for an optimized speed-up.
2. Particle-In-Cell Basics
Particle-in-cell simulations follow the movement of charged particles inside a plasma. Actually, every simulated (super) particle represents a certain amount of real particles. Particle locations and velocities are defined in continuum space, whereas charge density, electric potential and field are defined on a spatial grid of size N_grid. This work examines one spatial and three velocity components of the particles.

Figure 1 shows the modules of the PIC algorithm that are passed through every finite time step ∆t. Beginning at the top, the charge density is calculated by weighting particles on the grid, using a linear interpolation (first order) scheme in this work. After that, Poisson's equation is solved and the electric field is calculated. Field values on the grid points are interpolated to particle positions and then particles accelerate and move inside the simulation box. Particles that left the finite box are removed from the simulation every ∆t. If secondary electrons or electron reflection are featured, electrons are inserted at the boundaries. Now collisions take place, handled by a Monte-Carlo module, and the loop is closed.

Figure 1: Computing sequence of the particle-in-cell cycle with Monte-Carlo collisions.

To render an actual plasma behavior, some conditions have to be satisfied. The finite grid spacing (distance between two grid points) has to be smaller than the Debye length to resolve field effects on the correct length scale. Additionally, the time step size has to be small enough to resolve the highest frequency processes in the system, which are oscillations with the electron plasma frequency. Recently it was shown that the number of particles inside a Debye sphere has to be larger than previously expected to reduce or avoid spurious numerical effects [11]. All this leads to an enormous amount of computational power needed by such simulations.
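The cycle of figure 1 can be summarized by the following schematic loop. The function names are placeholders introduced here for illustration only; section 3 describes how each step is actually realized and distributed between GPU and CPU.

/* Schematic PIC-MCC main loop corresponding to figure 1.
 * All function names are illustrative placeholders, not identifiers of the actual code. */
void assign_charge_density(void);          /* weight particles to the grid (first order)     */
void solve_poisson_and_field(void);        /* solve Poisson's equation, compute E = -dPhi/dx */
void accelerate_and_move_particles(void);  /* interpolate E to particles, leap-frog push     */
void handle_boundaries(void);              /* remove absorbed particles, inject electrons    */
void monte_carlo_collisions(void);         /* null-collision Monte-Carlo module              */

void pic_cycle(int n_steps)
{
    for (int step = 0; step < n_steps; step++) {
        assign_charge_density();
        solve_poisson_and_field();
        accelerate_and_move_particles();
        handle_boundaries();
        monte_carlo_collisions();
    }
}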
3. Implementation Details
Nvidia’s
Compute Unified Device Architecture (CUDA) is an architecturefor parallel computations on graphics processing units. Programs are written inthe programming language C as regular C-code with some extensions to provideaccess to the GPU [12].Graphics cards have a number of parallel multiprocessors, depending on thegeneration and model of the GPU. A multiprocessor features a certain amountof processor cores, each having its own fast register memory, but the cores ofa multiprocessor also share the on-chip shared memory space. All different3 orting cell 0 sorting cell 1 sorting cell N-1sortingcell iused unused
Additionally, there are two cached read-only memory spaces, constant memory and texture memory. Special functions, called kernels, run on the GPU in parallel, using thousands or millions of independent threads. Threads are grouped into thread blocks, whose size is chosen by the programmer. All threads of such a block run on the same physical multiprocessor, thus having access to the same shared memory space. Constant and texture memory are read-only spaces for threads and can be exclusively filled with data from outside a kernel by special functions.

Particle data is stored in a float4 array, using the x, y and z components for velocities and the w component for the location of a particle. Each particle species (such as electrons or ions) uses its own array. All arrays are divided into sorting cells, each of the same fixed size (figure 2). Each sorting cell in GPU memory belongs to a certain region in configuration space; accordingly, all particles within a sorting cell are close to each other. Information on how many particles N_i currently reside in a sorting cell is stored in an integer array, one integer for each cell. All sorting cells therefore are partially filled with the currently residing particles and the rest is free for new particles entering the cell.

Figure 2: Fine-sorted particle array. Each sorting cell i has a fixed size N_max and is filled with particle data up to a certain number N_i. Information about the current status N_i of a sorting cell has to be stored in a separate (integer) array.

Such sorting cells in memory can contain multiple grid points of the particle-in-cell spatial grid that defines the values of charge density, potential and field. The number of grid points per sorting cell N_gc can be configured by the user at the beginning of a simulation and influences the speed.

If a kernel is called, each sorting cell is handled by a single block of threads (and thus a single multiprocessor, figure 3). Using the information from the integer array (current number of particles inside a sorting cell), threads of a block can treat (e.g. push) all valid particles of a sorting cell and leave the remaining parts of the float4 array unaffected.

Figure 3: Calling a kernel function in the fine-sorted algorithm. Each cell is handled by a single thread block and thus by a single multiprocessor.

The advantage of a fine-sorting code is that this kind of code allows for a straightforward and fast implementation of nonlinear collision processes (i.e. Coulomb collisions) through binary collisions, which is difficult otherwise.
In this case, the size of the sorting cells must not exceed the Debye length. Additionally, other modules of the PIC cycle benefit from the fine-sorting approach, as stated in the following. A good overview of advantages and drawbacks of different ways of GPU implementations can be found in [13]. It is worth noting that, unlike for the codes on CPUs, such as [14], we did not observe any significant improvement of the code run-time compared to an alternative GPU algorithm where particle data is unsorted, as will be described in our future work.
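To make the fine-sorted data layout concrete, the following sketch shows how the particle array, the per-cell counters and a kernel launch with one thread block per sorting cell could look. Identifiers such as N_MAX, example_cell_kernel and d_particles are assumptions made for this illustration and are not taken from the actual source code.

#include <cuda_runtime.h>

#define N_MAX 1024    /* fixed capacity of one sorting cell (illustrative value) */

/* One float4 per particle: .x, .y, .z hold the velocity components, .w the position. */
__global__ void example_cell_kernel(float4 *particles, const int *n_in_cell)
{
    int cell  = blockIdx.x;                      /* one thread block per sorting cell  */
    int valid = n_in_cell[cell];                 /* number of valid particles N_i      */
    float4 *my_cell = particles + cell * N_MAX;  /* slice of the global particle array */

    /* stride over the valid particles of this cell; the unused rest stays untouched */
    for (int i = threadIdx.x; i < valid; i += blockDim.x) {
        float4 p = my_cell[i];    /* load particle to registers                      */
        /* ... treat the particle here (push, collide, ...) ...                      */
        my_cell[i] = p;
    }
}

void launch_example(float4 *d_particles, int *d_count, int n_cells, int block_size)
{
    example_cell_kernel<<<n_cells, block_size>>>(d_particles, d_count);
}

Because each block only reads its own counter and its own slice of the particle array, no synchronization between blocks is needed in such a kernel.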
Equations are normalized to minimize the amount of additional factors and thereby reduce the computational effort of the code. Finite difference integration of the equations of motion leads to multiplications with the time step ∆t, so normalizing time t on this time step leads to the simplest case of a multiplication with 1,

t → ∆t · t̃.

Determination of the nearest grid point p_i of a particle is an important part of all grid dependent calculations. Normalizing the spatial coordinate x on the lattice spacing ∆x,

x → ∆x · x̃,

involves a single type casting operation (float to integer) to calculate p_i. To simplify the charge density assignment, the charge density ρ is normalized on the charge density of a single super-particle,

ρ → Q/(A_e ∆x) · ρ̃,

with Q the charge of a super-particle and A_e the electrode area of the discharge. The electric potential Φ is normalized to ease Poisson's equation,

Φ → Q ∆x/(ε₀ A_e) · Φ̃,

with ε₀ the vacuum permittivity. The electric field is normalized on

E → Q/(ε₀ A_e) · Ẽ

and velocities are normalized on

v → ∆x/∆t · ṽ.

Skipping all tildes for normalized values, the equations read

∂²Φ/∂x² = −ρ,    (1)
E = −∂Φ/∂x,    (2)
∂v/∂t = Q²∆t²/(ε₀ ∆x M A_e) · E = κE,    (3)
∂x/∂t = v,    (4)

with M the mass of a super-particle.

Before the kernel is started, the number of particles in each sorting cell N_i is copied to constant memory. All threads of a block treat particles of a single sorting cell. Therefore particles of one block only contribute to a small area in the charge density array, namely the N_gc grid points belonging to that sorting cell (figure 4).

Figure 4: Ions (red) and electrons (blue) of one sorting cell only contribute to the charge density of grid points that belong to their cell (linear weighting). Each sorting cell contains 5 grid points in this case.

Threads allocate a local array of N_gc floats in register memory and initialize the array with zeros. Afterwards each thread starts loading a particle to register memory, adding its charge to the local charge density array. The i-th thread of each block thereby executes the i-th particle of the sorting cell. If there are more particles than the size of the block, this procedure is repeated in strides equal to the block size as many times as needed.

Once all particles are done, a reduction in shared memory starts. Each thread copies its whole local array to a shared memory array. Now half of the threads becomes inactive. Each active thread adds the charge density of an inactive thread to its own shared array. Again and again half of the threads becomes inactive, until only a single thread is left. This thread copies the reduced charge density of the current cell to global memory. The very left and right grid points of a sorting cell are used by the cell and the neighboring cell, respectively.

Figure 5: Charge density calculation. Threads sum rho in register memory, exemplarily shown for only four threads and a single sorting cell i. The reduction is done using shared memory; the last active thread stores the result in global memory.
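A minimal sketch of such a charge assignment kernel, written along the lines of figure 5 but with illustrative names and a hard-coded N_gc, is shown below. The shared right-hand grid point of each cell is written to a separate array, which is the subject of the next paragraph.

#define N_GC  4     /* grid points per sorting cell (illustrative value)              */
#define BLOCK 128   /* threads per block; the kernel must be launched with this size  */

__global__ void charge_density_kernel(const float4 *particles, const int *n_in_cell,
                                      float *rho_global, float *rho_right, int n_max)
{
    __shared__ float rho_sh[BLOCK * N_GC];

    int cell  = blockIdx.x;
    int tid   = threadIdx.x;
    int valid = n_in_cell[cell];
    const float4 *my_cell = particles + cell * n_max;

    /* 1. accumulate the normalized charge (1 per super-particle) in register memory */
    float rho_loc[N_GC];
    for (int g = 0; g < N_GC; g++) rho_loc[g] = 0.0f;

    for (int i = tid; i < valid; i += blockDim.x) {
        float x  = my_cell[i].w - cell * (N_GC - 1);  /* position local to the cell      */
        int   gp = (int)x;                            /* nearest lower grid point        */
        float w  = x - (float)gp;                     /* linear (first order) weighting  */
        rho_loc[gp] += 1.0f - w;
        if (gp + 1 < N_GC) rho_loc[gp + 1] += w;
    }

    /* 2. copy to shared memory and reduce, halving the number of active threads each step */
    for (int g = 0; g < N_GC; g++) rho_sh[tid * N_GC + g] = rho_loc[g];
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            for (int g = 0; g < N_GC; g++)
                rho_sh[tid * N_GC + g] += rho_sh[(tid + stride) * N_GC + g];
        __syncthreads();
    }

    /* 3. one thread stores the result: owned grid points directly, the shared right
     *    grid point into a separate array that is added in a subsequent kernel       */
    if (tid == 0) {
        for (int g = 0; g < N_GC - 1; g++)
            rho_global[cell * (N_GC - 1) + g] = rho_sh[g];
        rho_right[cell] = rho_sh[N_GC - 1];
    }
}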
To avoid global memory atomics, the right grid point of each sorting cell is stored separately in a float array in global memory and is added to the charge density in a subsequent kernel function. The whole procedure is shown in figure 5.

Discretizing Eq. (1) with finite differences leads to a tridiagonal matrix. A detailed description of including different boundary conditions can be found in [15]. Solving can be done very efficiently serially on the CPU, using the Thomas algorithm [16], an optimized Gaussian elimination scheme for tridiagonal matrices.

Since the charge density is calculated on the GPU, the array must be copied to CPU memory, Poisson's equation is solved and the electric field is calculated. After this, the electric field array is copied to GPU constant memory. Despite the amount of memory transfer between CPU and GPU, this module takes only a small part of the run-time of the PIC cycle. For one million electrons and ions, respectively, and 800 grid points for the field, data copy and solving takes about 3.55% of the run-time of a cycle.
Thus, a more complicated parallel approach that runs completely on the GPU does not seem to be worth it for a one-dimensional simulation. For a large number of grid points or a two-dimensional approach, this is of course not the case.
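For completeness, a generic Thomas algorithm for a tridiagonal system, as it could be used on the CPU for the discretized Eq. (1), is sketched below. This is the textbook scheme of [16] with illustrative names; the boundary conditions of [15] are assumed to be already contained in the matrix coefficients.

/* Thomas algorithm: solves a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], i = 0..n-1,
 * with a[0] = c[n-1] = 0. The arrays c and d are used as scratch space. */
void thomas_solve(const float *a, const float *b, float *c, float *d, float *x, int n)
{
    /* forward elimination */
    c[0] /= b[0];
    d[0] /= b[0];
    for (int i = 1; i < n; i++) {
        float m = 1.0f / (b[i] - a[i] * c[i - 1]);
        c[i] *= m;
        d[i]  = (d[i] - a[i] * d[i - 1]) * m;
    }
    /* back substitution */
    x[n - 1] = d[n - 1];
    for (int i = n - 2; i >= 0; i--)
        x[i] = d[i] - c[i] * x[i + 1];
}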
In this kernel, the current number of particles in each sorting cell N_i and the electric field are read from constant memory. Each thread loads a particle to register memory and updates the particle velocity by multiplying the interpolated field with κ (Eq. (3)). After this, the new particle location is calculated (Eq. (4)), using the new velocity. This relates to the commonly used leap-frog algorithm, where location and velocity are separated in time by half of a time step [9, 8]:

v(t + 0.5) − v(t − 0.5) = κ E(x(t)),
x(t + 1) − x(t) = v(t + 0.5).
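In normalized units the body of such a pusher kernel reduces to a few operations per particle. The following sketch uses illustrative names; the electric field is assumed to reside in constant memory and is interpolated linearly to the particle position, and only the velocity component along the discharge axis is accelerated.

#define MAX_GRID 2048                    /* illustrative upper bound for N_grid          */
__constant__ float c_efield[MAX_GRID];   /* electric field on the grid                   */

__global__ void push_kernel(float4 *particles, const int *n_in_cell,
                            float kappa, int n_max)
{
    int cell  = blockIdx.x;
    int valid = n_in_cell[cell];
    float4 *my_cell = particles + cell * n_max;

    for (int i = threadIdx.x; i < valid; i += blockDim.x) {
        float4 p = my_cell[i];            /* .x = v_x, .w = position, both normalized    */
        int   gp = (int)p.w;              /* nearest lower grid point (interior assumed) */
        float w  = p.w - (float)gp;
        float E  = (1.0f - w) * c_efield[gp] + w * c_efield[gp + 1];

        p.x += kappa * E;                 /* v(t+0.5) = v(t-0.5) + kappa * E(x(t)) */
        p.w += p.x;                       /* x(t+1)   = x(t)     + v(t+0.5)        */
        my_cell[i] = p;
    }
}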
After each pushing, some particles reside in incorrect sorting cells. Since the Courant condition [9, 8] needs to be satisfied, this is only a small fraction of particles:

v_e < ∆x/∆t < N_gc ∆x/∆t.    (5)

Here v_e is the thermal velocity of electrons, so inequality (5) basically means that only a small number of particles leaves its sorting cell during one time step. Accordingly, sorting is optimized for a "nearly sorted" particle array.

In a first kernel, threads copy incorrect particles to free positions at the end of their current sorting cell. This memory space serves as a buffer for sorting particles (see Fig. 6). Since particles of one sorting cell are checked in parallel by threads of a single block, all threads have access to the same shared memory. Hence, a counter N_index,i in shared memory can provide indices for the new particle positions at the end of the sorting cell. If a thread spots an invalid particle, it decreases N_index,i by one, using the atomicSub function, and thereby gets the new index.

Figure 6: Step 1a of the sorting algorithm for sorting cell i. Rejected particles (blue symbols) move to free positions (red crosses) at the end of sorting cell i.

After this, each sorting cell has gaps (blue symbols in Fig. 6) which have to be filled with valid particles. Threads scan particles a second time and fill the gaps with information from the last valid particle in each sorting cell, respectively (Fig. 7). Therefore a second counter in shared memory allocates the correct particle indices at position N_i, N_i − 1, and so on. The updated counters are stored in integer arrays in global memory.

Figure 7: Step 1b of the sorting algorithm for sorting cell i. Gaps are closed with valid particles from current sorting cell i.
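Step 1a can be sketched as follows; step 1b and the later steps work analogously with further counters. The ownership test, the invalid-particle marking and all identifiers are simplifications assumed for this illustration; in particular, the actual code closes the gaps with valid particles (step 1b) rather than only marking them.

/* Step 1a (figure 6): copy particles that left their sorting cell to the buffer at the
 * end of the cell. Assumes that N_max leaves enough headroom above N_i. */
__global__ void sort_step1a_kernel(float4 *particles, const int *n_in_cell,
                                   int *n_rejected, int n_max, int n_gc)
{
    __shared__ int n_index;                       /* next free slot, counted from the end */

    int cell  = blockIdx.x;
    int valid = n_in_cell[cell];
    float4 *my_cell = particles + cell * n_max;

    if (threadIdx.x == 0) n_index = n_max;        /* buffer grows downwards from N_max    */
    __syncthreads();

    for (int i = threadIdx.x; i < valid; i += blockDim.x) {
        float4 p = my_cell[i];
        int owner = (int)p.w / (n_gc - 1);        /* sorting cell the particle belongs to */
        if (owner != cell) {
            int dst = atomicSub(&n_index, 1) - 1; /* reserve a slot in the buffer         */
            my_cell[dst] = p;                     /* copy the rejected particle           */
            my_cell[i].w = -1.0f;                 /* mark the gap (closed in step 1b)     */
        }
    }
    __syncthreads();

    if (threadIdx.x == 0)
        n_rejected[cell] = n_max - n_index;       /* number of rejected particles         */
}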
The second step of sorting starts in a new kernel. Again, a thread block is started for each sorting cell i. Now, half of the threads checks the rejected particles of the left neighbor i − 1, the other half checks the right neighbor i + 1 (figure 8). A counter in shared memory defines the new particle index in sorting cell i and is incremented with the atomicAdd function. Particles from the neighboring sorting cells are just read, so no atomic operations are needed in global memory. The updated number of valid particles is stored in global memory.

Figure 8: Step 2 of the sorting algorithm. Threads of block i copy rejected particles from the neighboring sorting cells to cell i.

The last kernel of the sorting algorithm has to identify rejected particles that are not sorted into the correct sorting cells yet. These are rare particles that moved farther than one sorting cell. Threads of a thread block now check rejected particles of their own sorting cell. If such a fast particle is located, its information is copied to the correct sorting cell (Fig. 9). For assigning indices to the particle, a counter in global memory is needed. This is the sole exception for the use of atomic functions in global memory. Since threads from all blocks access the same indices, this step has to be done in global instead of shared memory. Again, the number of particles is stored in global memory. After this, all particles reside in their correct sorting cells.

Figure 9: Step 3 of the sorting algorithm. Threads of block i copy rejected particles from their own sorting cell (sc) to other cells using atomic functions in global memory for the index counter.

Monte-Carlo collisions are treated by a modified null collision method. A detailed description of the standard algorithm can be found in [17, 18]. Here, the null collision frequency ν is calculated once in the usual way, but instead of picking

N = (1 − e^(−∆t ν)) N_p = P N_p    (6)

particles randomly from the total number of particles N_p, a random number p is drawn for every particle. This number is compared with P, the null collision probability. For p < P the regular null collision algorithm is started, that is, calculating collision cross sections and probabilities for the current particle. On average, N particles run through the standard null collision algorithm, which satisfies equation (6). Using this modified null collision method, global memory atomics can be avoided, because every particle is treated by just one individual thread. If a particle is created due to ionization, it is injected into the sorting cell of the colliding electron. Thus, a counter in shared memory can allocate indices for those new particles.

Consequently, each thread needs its own random number generator (RNG). In principle, every RNG that uses only a single seed is sufficient, because memory access for initializing the RNG has to be small for a fast implementation. In this work we use a 3-xor-shift (11,7,12) generator [19], which can produce random numbers very quickly on the GPU. An unsigned integer array allocates seeds. Each thread loads its own seed from global memory, generates a certain number of random values and stores the last seed back to global memory.

CPU simulations usually use look-up tables for collision cross sections. On the GPU, simple functions, using only products and sums, can be much faster than scattered reads from memory. Therefore, cross sections for argon are calculated directly, using approximations of measured values [20, 21, 22]. Even though this implementation creates lots of branches, it is much faster than CPU collision handling.
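The per-thread random number generator and the per-particle null collision test can be sketched as follows. The shift triple (13,17,5) used here is Marsaglia's canonical example [19]; the actual code uses an (11,7,12) generator of the same family. All other identifiers are illustrative assumptions.

/* Per-thread xorshift RNG; returns a uniform number in [0,1). */
__device__ float xorshift_uniform(unsigned int *seed)
{
    unsigned int x = *seed;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *seed = x;
    return x * 2.3283064365e-10f;   /* multiply by 1/2^32 */
}

/* Modified null collision method: every particle is tested by exactly one thread,
 * so no global memory atomics are needed. seeds[] holds one seed per thread. */
__global__ void collision_kernel(float4 *particles, const int *n_in_cell,
                                 unsigned int *seeds, float P_null, int n_max)
{
    int cell  = blockIdx.x;
    int valid = n_in_cell[cell];
    float4 *my_cell = particles + cell * n_max;

    unsigned int seed = seeds[blockIdx.x * blockDim.x + threadIdx.x];

    for (int i = threadIdx.x; i < valid; i += blockDim.x) {
        if (xorshift_uniform(&seed) < P_null) {
            /* regular null collision handling for this particle: evaluate the argon
             * cross sections directly, pick the collision type, scatter, ionize, ... */
        }
    }
    seeds[blockIdx.x * blockDim.x + threadIdx.x] = seed;   /* store seed for the next call */
}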
4. Optimizing the Parameters and Speed-up Measurements
The algorithm sorts particles into sorting cells, which can contain any number of grid points of the charge density or rather the electric field array (figure 4). At the smallest possible size, there are just two grid points per sorting cell, one on the left, one on the right. The number of particles inside each sorting cell is low and all kernels have to be called for a large number of cells. It is obvious that overhead due to calling functions and kernels slows down this approach. On the other hand, the reduction algorithm is not suitable for a large number of grid points per sorting cell N_gc and also the amount of register memory can be limiting. Thus, the GPU may run slower than for smaller sorting cells. Also, there is an upper boundary on N_gc governed by the requirement that the sorting cell size does not exceed the Debye length, in case one wants to implement Coulomb collisions.

Between these two extremes one can expect an optimum. Figure 10 shows the run-time of a single cycle of the simulation as a function of N_gc. In fact, a minimum can be found for N_gc = 3. Note that for some cases the minimum is found at N_gc = 4, depending on the generation of GPU and all input values of the simulation.

Figure 10: Execution time of a single cycle for different numbers of grid points per sorting cell N_gc. Simulation runs with 500,000 ions and electrons, respectively, at a pressure of 10 Pa.

CUDA allows for choosing the size of blocks with a maximum of 512 or 1024 threads per block, for compute capability 1.2 and 2.0 hardware, respectively. In many applications, large blocks with many parallel threads are the best choice. Since CUDA hides memory access latencies of a thread by handling other threads meanwhile, small blocks are usually not the best option. For different inputs a block size of 128 had the best performance. Figure 11 shows the run-time over the block size for a system of 500,000 ions and electrons, respectively. These results were obtained with the same block size for all different kernels, but also every single module showed a similar behavior.

Figure 11: Execution time of a single cycle for different block sizes. Simulation runs with 500,000 ions and electrons, respectively, at a pressure of 10 Pa.

In this section the number of grid points N_grid of the electric field lattice is changed. The total number of particles and the number of grid points per sorting cell N_gc remain constant. Figure 12 shows a nearly linear dependence of the execution time on N_grid. By increasing N_grid, also the CUDA kernel grid size rises and a larger number of CUDA blocks has to be called. All dependencies are linear; small variances in figure 12 can be explained by the changing number of particles per sorting cell. This behavior is different from CPU implementations, where run-time is nearly independent of the electric field grid size.

Figure 12: Execution time of a single cycle for different electrical grid sizes N_grid. Simulation runs with 1,000,000 ions and electrons, respectively, at a pressure of 10 Pa.

For comparison to classical CPU approaches, all diagnostics in the CPU code xpdp1 [23] were removed, so only the plain PIC algorithm remains. For testing issues we use a GTX480 GPU and a single core of an Intel i7 870 CPU running at 2.93 GHz.
These two devices are not only comparable with regard to up-to-dateness at the time of this study, but also in respect of their costs. For the test case of 500,000 ions and electrons (respectively) at a pressure of 20 Pa, a single time step takes about 13.70 ms on the CPU and about 1.09 ms on the GPU.

Figure 13: Speed-up of the fine-sorted PIC algorithm, compared to a CPU (single core) approach as a function of the number of super-particles for different pressures. N_grid = 512.

Figure 13 shows the resulting speed-up as a function of the total number of super-particles N_total for different pressures. For increasing N_total the system behaves as expected: a low number of particles cannot utilize the GPU fully, so increasing N_total also enhances the speed-up. For lower numbers of particles (in the range of a few thousands), overhead from calling functions on the GPU and data transfer between CPU and GPU memory becomes more and more important. Hence the CPU code can become faster than the GPU for very small systems. Since PIC statistics improve with larger systems [11], we focused on larger systems in our study. The curves in figure 13 are not saturated with respect to N_total, so for very large systems even higher speed-ups than 20 can be expected.

Increasing the pressure from 1 Pa to 20 Pa slows down the GPU algorithm for high values of N_total. This is not obvious, since higher collision rates result in fewer branches and thereby advantages for the GPU. At a pressure of 1 Pa, Monte-Carlo collisions only take a negligible fraction of a cycle's execution time, so they do not influence the speed-up. For increasing pressure, the collision module becomes more and more important. For a pressure of 20 Pa the speed-up of just the Monte-Carlo collisions is about 11.5 for a system of 1,000,000 electrons and ions, respectively. Now the total speed-up for 1 Pa (about 15.2) is higher than this value, so if collisions become more important, the whole speed-up decreases. An increase in pressure from 20 Pa to 100 Pa makes the simulation slightly faster. Again, the collisions become more important regarding the run-time of a cycle, but this time the collision module is more than 19 times faster than the 1 Pa case. This results in an overall increase in speed.

As on the CPU, most of the time is spent for all particle related tasks: pusher, charge density assignment and, for high pressure, also the Monte-Carlo module. Fine-sorting of particles of course also takes a considerable amount of time.
5. Summary and Conclusion
A fine-sorting particle-in-cell code, running on a single graphics processing unit, was implemented using Nvidia's CUDA environment. For testing we used a newest generation GPU and CPU. All in all, for a normal range of input parameters, speed-ups of about 10-20 compared to a classical CPU code could be observed. A major advantage of the fine-sorting algorithm is that it allows for a straightforward implementation of binary collisions.

The code is not bound to any special (small) grid size. Thus, the algorithms can be used for two-dimensional PIC codes, except for the field solving. Due to large numbers of particles and an efficient two-dimensional field solver, much higher speed-ups than in the one-dimensional case can be expected.

6. Acknowledgment
Financial support for P M from the Research School of the Ruhr-University Bochum and for D E from Deutsche Forschungsgemeinschaft within the SFB-TR 87 and FOR 1123 is gratefully acknowledged. We thank the Plasma Theory and Simulation Group in Berkeley for providing access to their code. We particularly thank Prof. A Z Panagiotopoulos as well as Dr. S Barr from Princeton University chemical engineering for sharing their GPU system.
References

[1] Joshua A. Anderson, Chris D. Lorenz, and A. Travesset. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics, 227(10):5342-5359, 2008.
[2] Andreu Badal and Aldo Badano. Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit. Medical Physics, 36(11):4878-4880, 2009.
[3] R. Kersevan and J.-L. Pons. Introduction to MOLFLOW+: New graphical processing unit-based Monte Carlo code for simulating molecular flows and for calculating angular coefficients in the compute unified device architecture environment. J. Vac. Sci. Technol., 27(4):1017-1023, 2009.
[4] Amos G. Anderson, William A. Goddard III, and Peter Schröder. Quantum Monte Carlo on graphical processing units. Computer Physics Communications, 177(3):298-306, 2007.
[5] Priyadarshini Rajasekaran, Philipp Mertmann, Nikita Bibinov, Dirk Wandke, Wolfgang Viöl, and Peter Awakowicz. DBD plasma source operated in single-filamentary mode for therapeutic use in dermatology. Journal of Physics D: Applied Physics, 42(22):225201 (8pp), 2009.
[6] M. J. Harvey and G. De Fabritiis. An implementation of the smooth particle mesh Ewald method on GPU hardware. Journal of Chemical Theory and Computation, 5(9):2371-2377, 2009.
[7] Andrey Asadchev, Veerendra Allada, Jacob Felder, Brett M. Bode, Mark S. Gordon, and Theresa L. Windus. Uncontracted Rys quadrature implementation of up to g functions on graphical processing units. Journal of Chemical Theory and Computation, 6(3):696-704, 2010.
[8] C. K. Birdsall and A. B. Langdon. Plasma Physics via Computer Simulation. Taylor and Francis Group, LLC, 2005.
[9] R. W. Hockney and J. W. Eastwood. Computer Simulation Using Particles. IOP Publishing Ltd, 1988.
[10] T. Tajima. Computational Plasma Physics. Westview Press, 2004.
[11] M. M. Turner. Kinetic properties of particle-in-cell simulations compromised by Monte Carlo collisions. Physics of Plasmas, 13(3):033506, 2006.
[12] NVidia. CUDA Programming Guide, 3.1 edition, July 2010.
[13] George Stantchev, William Dorland, and Nail Gumerov. Fast parallel particle-to-grid interpolation for plasma PIC simulations on the GPU. Journal of Parallel and Distributed Computing, 68(10):1339-1349, 2008. General-Purpose Processing using Graphics Processing Units.
[14] D. Tskhakaya and R. Schneider. Optimization of PIC codes by improved memory management. Journal of Computational Physics, 225(1):829-839, 2007.
[15] J. P. Verboncoeur, M. V. Alves, V. Vahedi, and C. K. Birdsall. Simultaneous potential and circuit solution for 1d bounded plasma particle simulation codes. J. Comput. Phys., 104(2):321-328, 1993.
[16] S. D. Conte and C. de Boor. Elementary Numerical Analysis; an Algorithmic Approach. McGraw-Hill, New York, 1972.
[17] J. P. Verboncoeur. Particle simulation of plasmas: review and advances. Plasma Physics and Controlled Fusion, 47(5A):A231-A260, 2005.
[18] C. K. Birdsall. Particle-in-cell charged-particle simulations, plus Monte Carlo collisions with neutral atoms, PIC-MCC. IEEE Transactions on Plasma Science, 19(2):65-85, April 1991.
[19] G. Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8(14):1-6, July 2003.
[20] A. V. Phelps. Cross sections and swarm coefficients for nitrogen ions and neutrals in N2 and argon ions and neutrals in Ar for energies from 0.1 eV to 10 keV. Journal of Physical and Chemical Reference Data, 20:557-573, May 1991.
[21] A. V. Phelps. The application of scattering cross sections to ion flux models in discharge sheaths. Journal of Applied Physics, 76(2):747-753, 1994.
[22] A. Yanguas-Gil, J. Cotrino, and L. L. Alves. An update of argon inelastic cross sections for plasma discharges.