High-performance parallel computing in the classroom using the public goods game as an example
Matjaž Perc
Faculty of Natural Sciences and Mathematics, University of Maribor, Koroška cesta 160, SI-2000 Maribor, Slovenia
CAMTP – Center for Applied Mathematics and Theoretical Physics, University of Maribor, Mladinska 3, SI-2000 Maribor, Slovenia
Electronic address: [email protected]
The use of computers in statistical physics is common because the sheer number of equations that describe the behavior of an entire system particle by particle often makes it impossible to solve them exactly. Monte Carlo methods form a particularly important class of numerical methods for solving problems in statistical physics. Although these methods are simple in principle, their proper use requires a good command of statistical mechanics, as well as considerable computational resources. The aim of this paper is to demonstrate how the usage of widely accessible graphics cards on personal computers can elevate the computing power in Monte Carlo simulations by orders of magnitude, thus allowing live classroom demonstration of phenomena that would otherwise be out of reach. As an example, we use the public goods game on a square lattice where two strategies compete for common resources in a social dilemma situation. We show that the second-order phase transition to an absorbing phase in the system belongs to the directed percolation universality class, and we compare the time needed to arrive at this result by means of the main processor and by means of a suitable graphics card. Parallel computing on graphics processing units has been developed actively during the last decade, to the point where today the learning curve for entry is anything but steep for those familiar with programming. The subject is thus ripe for inclusion in graduate and advanced undergraduate curricula, and we hope that this paper will facilitate this process in the realm of physics education. To that end, we provide a documented source code for an easy reproduction of presented results and for further development of Monte Carlo simulations of similar systems.
Keywords: Monte Carlo method, parallel computing, public goods game, graphics processing unit
I. INTRODUCTION
The use of computers to solve problems in statistical physics has a long and fruitful history, dating as far back as the Manhattan Project, where analog computers were used so frequently they often broke down. Digital computers, such as the ENIAC (Electronic Numerical Integrator and Computer), were intertwined with nuclear science from its onset onwards. In fact, one of the first real uses of ENIAC was by Edward Teller, who used the machine in his early work on nuclear fusion reactions [1]. Today, computers are used in practically all areas of physics, and it is indeed difficult to imagine scientific progress without them. As rightfully pointed out by Newman and Barkema [2], Monte Carlo methods form the largest and the most important class of numerical methods for solving statistical physics problems. Not surprisingly, in addition to fascinating original research dating back more than three decades [3–5], the subject is covered in reviews [6–8] and many books [2, 9–11] in varying depth.

While the famous Ising model [12, 13] is the workhorse behind many introductory as well as less introductory texts on the subject [2, 14], and is likely the most thoroughly researched model in the whole of statistical physics [15], we here use another example for two reasons. In the first place, how to adapt the classical Monte Carlo algorithm for the Ising model to become fit for parallel computing on a graphics processing unit has already been demonstrated by Preis et al. [16]. In fact, parallel computing has already been applied to a number of other statistical physics problems, such as to the (1+1)-dimensional surface growth model [17],
to the Kardar-Parisi-Zhang model [18–20], to simulate stochastic differential equations [21], Brownian motors [22], stochastic processes [23], and Arnold diffusion [24], as well as to study ferrofluids [25], photon migration [26], anomalous coarsening in the disordered exclusion process [27], and probability-based simulations in general [28]. Secondly, it might be welcome to add a little color to the curriculum by expanding on the classical subjects and thus to increase the popularity of physics with the students, although the public goods game has been a fixture in statistical physics research for at least a decade [29, 30].

The public goods game that we use here as an example is often studied in the realm of evolutionary game theory as the paradigmatic case of a social dilemma [31–35]. The blueprint of the game is simple. The public goods game is played in groups, wherein each member of the group can choose between two strategies. If a member chooses to cooperate, it contributes a fixed amount to the common pool (c = 1). Conversely, if a player chooses to defect, it contributes nothing to the common pool (c = 0). The contributions from all the cooperators within a group are summed together and multiplied by a so-called synergy factor r > 1. The latter takes into account the added value of a group effort. Lastly, the sum total of public goods after the multiplication is divided equally among all group members, regardless of their strategies. It is thus straightforward to see that an individual member is best off if it chooses to defect, because it can enjoy the same benefits as cooperators whilst contributing nothing. However, if everybody chooses to defect, the factor r multiplies zero and the public goods are lost to all. The latter scenario is often referred to as the tragedy of the commons [36]. Hence the classical social dilemma is given, where what is best for an individual is at odds with what is best for the group or the society as a whole. The question is under which conditions cooperation can nevertheless evolve.

Methods of statistical physics have recently been applied to subjects that, in the traditional sense, could be considered as out of scope. Statistical physics of social dynamics [37], of evolutionary games in structured populations [29, 38, 39], of crime [40], and of epidemic processes and vaccination [41, 42], are all recent examples of this exciting development. And the evolution of cooperation in the realm of the public goods game is no exception [30, 43]. Especially the consideration of the public goods game in a structured population is within the domain of statistical physics [44–47]. In the simplest case, a structured population is described by the square lattice, whereon cooperators can form compact clusters and can thus avoid, at least those in the interior of such clusters, being exploited by defectors [48].

When studying the public goods game on a square lattice, Monte Carlo simulations are used for random sequential strategy updating. This ensures that the treatment is aligned with fundamental principles of statistical mechanics, and it enables a comparison of obtained results with generalized mean-field approximations [49–52], as well as a proper determination of phase transitions between different stable strategy configurations. However, such Monte Carlo simulations require significant computational resources, especially if the size of the lattice is large, and if the system is close to a phase transition where fluctuations are strong.
It is thus of interest to utilize parallel computing that is nowadays possible on many graphics cards installed in personal computers. Here we use the NVIDIA GeForce GTX 1080 graphics card and the CUDA programming environment [53]. The latter is designed to work with all main programming languages, including C, which we use in what follows, as well as C++ and other major languages.

II. CUDA PROGRAMMING ENVIRONMENT
The CUDA programming environment is freely available and upon installation integrates itself seamlessly with existing compilers available in the operating system. It includes the CUDA compiler, math libraries, as well as tools for debugging and performance optimization. The documentation available online at [53] is comprehensive and clear, so the reader is advised to look for details there.

The CUDA extension to the C programming language introduces new keywords and expressions that enable the user to distinguish between variables and functions that are stored in RAM and executed on the CPU (the host), and variables and functions (typically called kernels) that are stored and executed on the graphics processing unit (the device). Likewise, special keywords and expressions are available to transfer data between the host and the device.

The graphics processing unit that supports CUDA is composed of streaming multiprocessors, each of which is further composed of several scalar processors. Each streaming multiprocessor has different memory types available to it, namely a set of 32-bit registers, a shared memory block, as well as global memory. The 32-bit registers are fastest and smallest, while the global memory is largest and slowest. A scalar processor can access only its own registers, each scalar processor in a particular streaming multiprocessor can access its own shared memory block, while the global memory is accessible to all scalar processors in all streaming multiprocessors (hence the whole device) as well as to the host. An important part of efficient parallelization of the problem involves minimizing the need to access global memory, and making full use of the registers and the shared memory block. Although respecting the memory hierarchy is key to a fully optimized solution, significant improvements in computing power are attainable even if the memory is handled less than optimally.

From the programming point of view, each problem needs to be split into parts that can then be processed in parallel by threads. Threads form blocks of threads, and blocks further form grids of blocks. The total number of threads is thus the number of threads in each block times the number of blocks in the grid. Each thread is executed by a scalar processor, and each block of threads is assigned to a particular streaming multiprocessor. All threads within a block can thus access the previously mentioned shared memory block. The code that is executed within each thread is called a kernel, and each thread has a unique ID that is determined based on the block and grid structure. These are the basic concepts of parallel computing that quickly become familiar to those who decide to implement it.
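Before turning to the public goods game, the following self-contained sketch illustrates these concepts on a toy problem. It is our own illustration, not part of the accompanying source code: a kernel in which each thread doubles one element of an array, launched on a grid of blocks after the data have been copied from the host to the device. The array size and launch configuration are arbitrary example values.

#include <stdio.h>
#include <cuda_runtime.h>

#define N 1000

/* Kernel (device code): each thread doubles one element of the array.
   The unique thread ID is computed from the block and grid structure. */
__global__ void double_elements(int *data, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)                 /* discard surplus threads of the last block */
        data[id] *= 2;
}

int main(void)
{
    int host_data[N];
    int *device_data;
    int i;

    for (i = 0; i < N; i++) host_data[i] = i;

    /* allocate global memory on the device and copy the data to it */
    cudaMalloc((void **)&device_data, N * sizeof(int));
    cudaMemcpy(device_data, host_data, N * sizeof(int), cudaMemcpyHostToDevice);

    /* launch the kernel: enough blocks of 256 threads to cover all N elements */
    double_elements<<<(N + 255) / 256, 256>>>(device_data, N);

    /* copy the results back to the host and release the device memory */
    cudaMemcpy(host_data, device_data, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(device_data);

    printf("host_data[10] = %d\n", host_data[10]);
    return 0;
}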
III. THE SPATIAL PUBLIC GOODS GAME
We use the spatial public goods game [30, 44, 54] to demonstrate the parallelization of the Monte Carlo algorithm within the CUDA programming environment. The game is staged on an L × L square lattice with periodic boundary conditions, where L² players are arranged into overlapping groups of size G = 5, such that everyone is connected to its four nearest neighbors. As schematically illustrated in Fig. 1, each player x = 1, ..., L² is therefore a member in g = 1, ..., G different groups. Players that cooperate (s_x = C) contribute c = 1 into the common pool, while players that defect (s_x = D) contribute nothing. The sum of all contributions within a group is multiplied by r > 1 and then divided equally amongst all group members regardless of their strategies. In a group g containing G players, of which N_C cooperate, the resulting payoffs for cooperators and defectors are thus

\Pi_C^g = r c N_C / G - c   (1)

and

\Pi_D^g = r c N_C / G ,   (2)

respectively. Evidently, the payoff of a defector is always larger than the payoff of a cooperator, if only r < G. With a single parameter, the public goods game hence captures the essence of a social dilemma in that defection yields the highest short-term individual payoffs, while cooperation is optimal for the group, and in fact for the society as a whole. The overall payoff Π_{s_x} of a player x from all the g = 1, ..., G groups is simply the sum Π_{s_x} = Σ_g Π_{s_x}^g.

FIG. 1: Schematic illustration of the overlapping groups of size G = 5 around the player s_x, which is denoted with an open circle in the middle of the figure. The group where player s_x is central and surrounded by its four nearest neighbors on the square lattice is marked with a thick blue line. The same player s_x is also a member in four other groups, which are denoted by thinner green lines. Due to the overlap of the groups the green lines denoting each particular group overlap as well. At a particular instance of the public goods game, the player s_x obtains a payoff Π_{s_x}^g from each of the depicted g = 1, ..., G groups. The overall payoff of player s_x obtained at a particular instance of the game is thus Π_{s_x} = Σ_g Π_{s_x}^g.

Monte Carlo simulations of the described public goods game are carried out as follows. Initially each player on site x of the L × L square lattice is designated either as a cooperator (s_x = C) or a defector (s_x = D) with equal probability. The following elementary steps are subsequently repeated in a random sequential manner. A randomly selected player x plays the public goods game as a member of all the g = 1, ..., G groups, thereby obtaining the payoff Π_{s_x}. Next, one of the four nearest neighbors of player x is chosen uniformly at random, and this player y acquires its payoff Π_{s_y} in the same way. Finally, player x copies the strategy s_y of its randomly chosen nearest neighbor with the probability determined by the Fermi function

W(s_y \to s_x) = 1 / \{ 1 + \exp[ (\Pi_{s_x} - \Pi_{s_y}) / K ] \} ,   (3)

where K quantifies the uncertainty related to strategy adoptions [44, 55]. In the K → 0 limit, player x copies the strategy of player y if and only if Π_{s_y} > Π_{s_x}. Conversely, in the K → ∞ limit, payoffs cease to matter and strategies change as per a flip of a coin. Between these two extremes players with a higher payoff will be readily imitated, although the strategy of under-performing players may also be occasionally adopted, for example due to errors in the decision making, imperfect information, and external influences that may adversely affect the evaluation of an opponent.
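For orientation, one such elementary step can be sketched in plain serial C as follows. This is an illustration written for this description rather than the accompanying source code; the lattice size, the synergy factor, and the use of the standard rand() generator are placeholder choices (a better random number generator should be used in actual simulations).

#include <stdlib.h>
#include <math.h>

#define L 100                       /* linear lattice size (example value) */
#define G 5                         /* group size */
#define K 0.5                       /* uncertainty in strategy adoptions (assumed) */

static int s[L][L];                 /* strategies: 1 = cooperator, 0 = defector */
static double r = 3.8;              /* synergy factor (example value) */

/* payoff of the player at (i, j) from the single group centered at (ci, cj) */
static double group_payoff(int i, int j, int ci, int cj)
{
    /* group members: the central player and its four nearest neighbors */
    int di[G] = { 0, -1, 1, 0, 0 };
    int dj[G] = { 0, 0, 0, -1, 1 };
    int g, nc = 0;
    for (g = 0; g < G; g++)
        nc += s[(ci + di[g] + L) % L][(cj + dj[g] + L) % L];
    return r * nc / G - (s[i][j] ? 1.0 : 0.0);      /* Eqs. (1) and (2) */
}

/* total payoff of the player at (i, j) from all G groups it belongs to */
static double payoff(int i, int j)
{
    int di[G] = { 0, -1, 1, 0, 0 };
    int dj[G] = { 0, 0, 0, -1, 1 };
    int g;
    double sum = 0.0;
    for (g = 0; g < G; g++)
        sum += group_payoff(i, j, (i + di[g] + L) % L, (j + dj[g] + L) % L);
    return sum;
}

/* one elementary step of the random sequential update */
static void elementary_step(void)
{
    int i = rand() % L, j = rand() % L;              /* randomly selected player x */
    int di[4] = { -1, 1, 0, 0 }, dj[4] = { 0, 0, -1, 1 };
    int n = rand() % 4;                              /* randomly chosen neighbor y */
    int yi = (i + di[n] + L) % L, yj = (j + dj[n] + L) % L;
    double px = payoff(i, j), py = payoff(yi, yj);
    double w = 1.0 / (1.0 + exp((px - py) / K));     /* Fermi function, Eq. (3) */
    if ((double)rand() / RAND_MAX < w)
        s[i][j] = s[yi][yj];                         /* player x adopts strategy of y */
}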
Repeating all the described elementary steps L² times constitutes one full Monte Carlo step (MCS), thus giving a chance to every player to change its strategy once on average.

As the main observable, we determine the average fraction of cooperators as

\rho_C = \frac{1}{L^2} \Big\langle \sum_{x=1}^{L^2} d_x \Big\rangle ,   (4)

where d_x = 1 if s_x = C and d_x = 0 otherwise, and ⟨...⟩ indicates the average over time in the stationary state. A sufficiently long relaxation time needs to be discarded prior to this. In general, the stationary state is reached once ρ_C becomes independent of the time interval over which it is determined.
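Building on the serial sketch above (and hence on the same assumed lattice array s and size L), the instantaneous fraction of cooperators entering Eq. (4) can be obtained as below; the stationary value is then the average of this quantity over many full MCS after the relaxation time has been discarded.

/* Fraction of cooperators on the lattice at a given time; the stationary
   value of Eq. (4) follows from averaging this quantity over time. */
static double fraction_of_cooperators(void)
{
    int i, j, count = 0;
    for (i = 0; i < L; i++)
        for (j = 0; j < L; j++)
            count += s[i][j];          /* d_x = 1 for cooperators, 0 otherwise */
    return (double)count / ((double)L * L);
}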
IV. PARALLELIZATION OF THE MONTE CARLO SIMULATION METHOD

While the implementation of the Monte Carlo simulation method described in Section III is straightforward, its parallelization requires care in that we have to make certain that threads that will process different parts of the lattice do not simultaneously change strategies of the same players, and that the strategies of players do not change whilst the determination of the payoffs takes place. The remedy lies in partitioning the lattice as shown in Fig. 2, which was originally proposed in [56] for parallel kinetic Monte Carlo simulations of thin film growth, and subsequently used also in [18] for simulating the Kardar-Parisi-Zhang model and the kinetic Monte Carlo model. The approach is known as the double tiling decomposition scheme, effectively partitioning the lattice into equally sized and independent domains of type 1, 2, 3, and 4 (see Fig. 2), which during a full Monte Carlo step should be updated in turn. Accordingly, the number of domains of a particular type, T = L²/(4S), where S is the number of players within each domain, gives us the number of threads needed in total within the graphics processing unit, with each thread being assigned to one domain. Note that a full Monte Carlo step is split into four parts, the first part updating players within all domains of type 1, the second part updating players within all domains of type 2, the third part updating players within all domains of type 3, and finally the fourth part updating players within all domains of type 4. But since all the domains of a given type are independent in that they do not share a player or even a border, they can be updated in parallel by T = L²/(4S) threads.

FIG. 2: Schematic illustration of the double tiling decomposition scheme on a small square lattice, where each domain comprises a square block of players. These equally sized and independent domains are marked with squares of different line style and color, such that the domain types {1, 2, 3, 4} correspond to the colors {C, M, Y, K}. Each full Monte Carlo step is effectively split into four parts, the first part updating players within all cyan domains, the second part updating players within all magenta domains, the third part updating players within all yellow domains, and finally the fourth part updating players within all key domains. Importantly, for this decomposition scheme to work out, the linear size of the whole lattice L needs to be exactly divisible by two times the linear size of each domain. There are exactly T = L²/(4S) domains of a particular color on the whole lattice, and the strategies of the players within these domains can be updated in parallel by T threads. How the number of these threads is distributed within blocks, and how many such blocks are needed to form the grid of the graphics processing unit, depends on the architecture of the graphics card. We have chosen each block to contain B = 64 threads, which is related to the warp size on current graphics cards and the size of the shared memory block per streaming multiprocessor.
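The update schedule just described can be sketched in CUDA as follows. This is a simplified illustration rather than the accompanying source code: it assumes domains of linear size 4 (S = 16 players) and T = L²/(4S) threads as above, while random number generation, payoff evaluation, and the shared-memory staging discussed below are omitted, so that only the mapping from thread ID to domain origin and the host-side loop over the four domain types are shown.

#include <cuda_runtime.h>

#define L  800                              /* linear lattice size (example value) */
#define D  4                                /* linear size of one domain */
#define T  ((L / (2 * D)) * (L / (2 * D)))  /* domains (threads) of one type */
#define B  64                               /* threads per block */

/* Update all domains of one type (0..3). Each thread is assigned to one
   domain; the type selects which quadrant of a 2D x 2D super-tile is
   updated in this sub-step, so that concurrently updated domains never
   touch each other. */
__global__ void update_domains(unsigned char *lattice, int type, unsigned int seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= T) return;                        /* discard surplus threads */

    int tiles = L / (2 * D);                    /* super-tiles per row */
    int ox = (id % tiles) * 2 * D + (type % 2) * D;   /* domain origin, x */
    int oy = (id / tiles) * 2 * D + (type / 2) * D;   /* domain origin, y */

    /* ... load the domain plus a three-site halo into shared memory,
       perform random sequential elementary steps within the domain,
       and write the updated strategies back to global memory ... */
    (void)lattice; (void)ox; (void)oy; (void)seed;
}

/* One full Monte Carlo step: the four domain types are updated in turn. */
void monte_carlo_step(unsigned char *lattice_device, unsigned int seed)
{
    int blocks = (T + B - 1) / B;               /* smallest integer >= T / B */
    for (int type = 0; type < 4; type++) {
        update_domains<<<blocks, B>>>(lattice_device, type, seed + type);
        /* kernels in the default stream execute in order; the explicit
           synchronization is optional but makes the sequential sub-steps evident */
        cudaDeviceSynchronize();
    }
}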
However, by looking at the schematic display of groups in Fig. 1, it becomes clear that a randomly selected player x at the border of a particular domain will need some information on player strategies also from adjacent domains. Even more so if the potential source of the new strategy, player y, is selected on the other side of the border. We remind the reader that player y can be any of the four nearest neighbors, and a player x at the border of the domain will have one nearest neighbor in the adjacent domain (players in the corners will in fact have two nearest neighbors in two other domains). The memory that needs to be passed to a particular thread must thus contain not only the strategies within a domain, but also the strategies of players three lines outward in each direction. This is schematically illustrated in Fig. 3. The reader can convince herself that this is in fact the case by following the composition of groups depicted in Fig. 1 along the border of a particular domain whilst assuming that player y is chosen from the other side of the border. Thankfully, this requirement does not interfere with independent parallel random sequential updating in the other domains of the same color in Fig. 2, if only the size of the domains is 4 × 4 or larger.

FIG. 3: Schematic illustration of the memory that needs to be available to a thread for the strategies of players within the 4 × 4 red domain to be updated correctly. The memory block needed is marked with a thinner orange line. If the strategies of all the players within the orange domain are available to the thread, then whichever player x within the 4 × 4 red domain is randomly selected to potentially copy the strategy from one of its randomly chosen nearest neighbors y, both payoffs Π_{s_x} and Π_{s_y} entering the Fermi function given by Eq. (3) are determined in full. To be precise, the strategies of the players encircled with dashed violet triangles in each corner of the orange domain are actually not needed, which can be verified if the overlapping groups schematically illustrated in Fig. 1 are superimposed on y players lying outside of the 4 × 4 red domain (one such example is shown for clarity). For the simplicity and efficiency of the source code, it is, however, not worth filtering these players out of the memory block.

This is essentially all there is to the parallelization. Technically, we thus split the lattice into adjacent domains of four different types, update all domains of the same type in parallel, and do so consecutively over all four types to complete one full Monte Carlo step. The only technical hiccup is to make sure that information on player strategies is available also three player lines outward in each direction. In our particular case, we have chosen the domain size to be 4 × 4 (S = 16 players, exactly as displayed in Fig. 3), which together with the auxiliary memory still allows all strategies to be stored in the fast shared memory block if the number of threads within a block is B = 64. The latter choice, on the other hand, is conditioned on the fact that one streaming multiprocessor simultaneously executes a so-called warp of 32 threads in the large majority of today's graphics cards. Given these choices, the number of blocks within the grid can be determined according to G = T/B. We note that, depending on L, G might not be an integer. In fact, it will be an integer only when T is exactly divisible by B. In all other cases G should simply be taken as the smallest integer not less than T/B. In this case one ends up with more threads than needed, but it is easy to discard those with an if statement. For example, for L = 800 and the domain size S = 16, the actual number of threads needed is T = L²/(4S) = 10000. But with the number of threads within a block constrained to B = 64, the number of blocks within the grid will be G = 157, in turn yielding BG = 10048 > 10000 available threads.

Lastly, we note that the selection of the 4 × 4 domain size and the linear partitioning of the lattice into alternating sequences of domain types obviously requires that the linear size of the lattice L be divisible by 8, lest the decomposition will not be perfect.
We again refer to Fig. 2 for details. The source code that implements the above-described parallelization is available as supplementary material to this paper as well as at github.com/matjazperc/pgg.
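To complete the picture, the following fragment sketches how the kernel from the previous listing might stage the strategies of its domain plus the three-line halo into shared memory, under the same assumptions (4 × 4 domains, B = 64 threads per block, strategies stored as one byte per player in row-major order). It is again only an illustration: the actual update loop and the write-back are indicated by comments, and the accompanying source code may differ in detail.

#include <cuda_runtime.h>

#define L  800                              /* linear lattice size (example value) */
#define D  4                                /* linear size of one domain */
#define H  3                                /* halo: three lines outward in each direction */
#define W  (D + 2 * H)                      /* linear size of the memory block per domain */
#define T  ((L / (2 * D)) * (L / (2 * D)))  /* domains (threads) of one type */
#define B  64                               /* threads per block */

__global__ void update_domains(unsigned char *lattice, int type, unsigned int seed)
{
    /* one W x W slice of shared memory per thread of the block */
    __shared__ unsigned char local[B][W][W];

    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= T) return;                        /* discard surplus threads */

    int tiles = L / (2 * D);
    int ox = (id % tiles) * 2 * D + (type % 2) * D;   /* domain origin, x */
    int oy = (id / tiles) * 2 * D + (type / 2) * D;   /* domain origin, y */

    /* copy the domain and its halo from global to shared memory,
       respecting the periodic boundaries of the L x L lattice */
    for (int j = 0; j < W; j++)
        for (int i = 0; i < W; i++) {
            int gx = (ox - H + i + L) % L;
            int gy = (oy - H + j + L) % L;
            local[threadIdx.x][j][i] = lattice[gy * L + gx];
        }

    /* ... perform the random sequential updates of the D x D interior using
       local[threadIdx.x], then write the D x D interior back to lattice ... */
    (void)seed;
}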
V. RESULTS

As explained in Section III when introducing the payoffs of the public goods game, if r < G the payoff of a defector is always larger than the payoff of a cooperator. Accordingly, r = G is the threshold that marks the transition between defection and cooperation in well-mixed populations, where groups are formed by selecting players uniformly at random. In structured populations, however, due to the so-called network reciprocity [57], cooperators are able to survive at multiplication factors that are well below the r = G limit that applies to well-mixed populations. The manifestation of network reciprocity relies on pattern formation, such that cooperators form compact clusters and can thus avoid exploitation by defectors [48]. In short, cooperators do better if they are surrounded by other cooperators.

Previous research has shown that for the spatial public goods game on the square lattice with K = 0.5 the threshold for cooperators to survive is r > 3.74 [44]. Also importantly, it was shown that the public goods game on the square lattice exhibits continuous phase transitions that belong to the directed percolation universality class [58], such that

\rho_C \propto (p - p_c)^{\beta} ,   (5)

where p_c is the critical parameter value at which the absorbing phase is reached and β ≈ 0.58 is the critical exponent [7]. We remind the reader that the directed percolation universality class conjecture requires that [59, 60] (i) the model displays a continuous phase transition from a fluctuating active phase into a unique absorbing phase, (ii) the transition is characterized by a positive one-component order parameter (in our case ρ_C), (iii) the dynamic rules involve only short-range processes (which holds due to the square lattice), and (iv) the system has no special attributes such as additional symmetries or quenched randomness.

We can use this conjecture, which fully applies to the presently studied public goods game, as validation for the parallelization of the Monte Carlo method presented in Section IV. As shown in Fig. 4, the phase transition leading from the mixed C + D phase at r > r_c to the absorbing D phase as r decreases is indeed continuous (see inset), and it belongs to the directed percolation universality class with r_c ≈ 3.74 and an exponent β compatible with the directed percolation value (main panel). This confirms the previously published results in [44, 58] obtained with traditional CPU-based computing, and it also confirms the applicability of the directed percolation universality class conjecture to the public goods game on the square lattice.

FIG. 4: Second-order C + D → D phase transition and the confirmation of the directed percolation universality class conjecture in the public goods game on the square lattice. The main panel shows the fraction of cooperators ρ_C in the stationary state in dependence on the distance to the critical value of the multiplication factor r_c in log-log scale. The black line indicates the slope characteristic for directed percolation, with which the linear fit to the data agrees. The inset shows the fraction of cooperators ρ_C in the stationary state in dependence on the value of the multiplication factor r. It can be observed that the C + D → D phase transition from right to left as r decreases is continuous. Blue circles show that the nature of both phase transitions is distorted if staying at a small lattice size across the whole span of r values, thus corroborating the need for larger lattices near phase transition points.

For the results presented in Fig. 4, we have simulated the public goods game on large lattices and over long simulation times in the immediate proximity of the phase transition point, subsequently using smaller lattice sizes and shorter simulation times when being further away from r_c. With CPU-based computing, such simulations typically require weeks to complete on clusters with many cores. As the inset of Fig. 4 shows, without using larger lattices, for example staying at a lattice size small enough that CPU-based computing is not significantly slower than parallel computing, the nature of the phase transition becomes distorted. We refer to [15] for a classical statistical mechanics example of qualitatively the same phenomenon.
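As a side note for readers reproducing Fig. 4, the exponent β can be estimated from the measured ρ_C values by a linear least-squares fit in log-log scale. A minimal C sketch of such a fit, with hypothetical input arrays and without error analysis, reads as follows.

#include <math.h>

/* Least-squares estimate of the critical exponent beta from stationary
   fractions of cooperators rho[i] measured at synergy factors r[i] > rc.
   This is a bare-bones illustration without error estimation. */
double estimate_beta(const double *r, const double *rho, int n, double rc)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        double x = log(r[i] - rc);   /* distance to the critical point */
        double y = log(rho[i]);      /* order parameter */
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);   /* slope = beta */
}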
It remains of interest to quantitatively assess the advantage in time obtained by switching from CPU-based computing to parallel computing on the graphics processing unit. Results presented in Fig. 5 show that for CPU-based computing the simulation times increase roughly with the square of the increase of the linear size of the lattice L. For example, if using the same number of full MCS, doubling the linear size of the lattice takes approximately four times as much time to finish the simulation. As the chosen linear size of the lattice grows, this factor of four grows ever so slightly towards five and above due to increasing memory demands. Note that at the largest lattice sizes considered it takes the CPU in excess of five days, and at the 4096 × 4096 lattice size nearly a month, to complete the benchmark number of full MCS. Conversely, with parallel computing the simulation times remain nearly constant as L increases, up to the point where the number of threads does not exceed the total number of scalar processors in the streaming multiprocessors of the graphics card that we have used (NVIDIA GeForce GTX 1080). With the 4 × 4 domain size of the double tiling decomposition scheme, this yields L = 400, beyond which the simulation times start increasing by a factor between three and five for each doubling of the linear lattice size L. Accordingly, the larger the lattice size, the greater the benefits of parallel computing.

FIG. 5: Quantitative comparison of Monte Carlo simulation times between CPU-based computing and parallel computing (see legend). As the benchmark, we have simulated the public goods game on the L × L square lattice for a fixed number of full MCS, using K = 0.5 and a value of r at which cooperators and defectors coexist in the stationary state. CPU-based computing was done using an Intel Xeon processor with 2.80 GHz clock speed, while parallel computing was done using an NVIDIA GeForce GTX 1080 graphics card with 20 streaming multiprocessors, each running at 1607 MHz clock speed. The main panel shows the increase in the length of the simulation time for CPU-based computing (red squares) and parallel computing (blue circles) while going from L = 32 to L = 4096. The inset shows the ratio between the CPU-based and the parallel computing time for the same span of L values. The dashed-dotted gray line in the inset marks the lattice size L_e = 80 at which CPU-based computing is equally fast as parallel computing. Below L_e CPU-based computing is faster, while above this lattice size parallel computing is faster. Lines are just to guide the eye.

This is confirmed in the inset of Fig. 5, where the ratio between the CPU-based computing time and the parallel computing time in dependence on L is shown. We note that the ratio can also be smaller than one, in particular if the lattice is so small that the benefits of parallel computing cannot be taken advantage of with the constraint of a 4 × 4 domain size (we remind the reader that 4 × 4 is the smallest permissible domain size to avoid prohibited memory overlaps between the domains in Fig. 2, as explained with Fig. 3). In that case the number of threads does not offset the slower clock speed of the streaming multiprocessors in comparison to the CPU clock speed (Intel Xeon processor with 2.80 GHz clock speed). The dashed-dotted gray line in the inset at L_e = 80 marks the lattice size at which CPU-based computing is equally fast as parallel computing. When L > L_e, however, the acceleration of the simulations is impressive. At L = 4096 the ratio is ≈ 500, thus cutting the simulation time for the benchmark number of full MCS from a month to less than an hour and a half at this lattice size. This is of course still somewhat too long for classroom demonstrations, but such large lattice sizes are rarely needed (see [45] for an example).
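For completeness, the device-side timings entering such comparisons can be obtained with CUDA events. Below is a minimal sketch that times a given number of full MCS, assuming a monte_carlo_step function like the one sketched in Section IV; the seed handling is again only a placeholder.

#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical driver from the sketches above; any sequence of kernel
   launches that constitutes one full Monte Carlo step can be timed here. */
void monte_carlo_step(unsigned char *lattice_device, unsigned int seed);

void benchmark(unsigned char *lattice_device, int steps)
{
    cudaEvent_t start, stop;
    float elapsed_ms;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int t = 0; t < steps; t++)
        monte_carlo_step(lattice_device, 12345u + t);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);          /* wait for all kernels to finish */

    cudaEventElapsedTime(&elapsed_ms, start, stop);
    printf("%d full MCS took %.1f ms\n", steps, elapsed_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}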
Taken together, the results presented in Fig. 5 confirm that the effort invested in successfully parallelizing the Monte Carlo simulation method is rewarded with simulation times that can be orders of magnitude shorter than those attainable with conventional CPU-based computing.

VI. DISCUSSION
We have advocated for the use of graphics processing units to bring high-performance parallel computing into the physics classroom. We have used the public goods game on the square lattice as an example to demonstrate the parallelization of the Monte Carlo simulation method by means of the double tiling decomposition scheme, and we have explained in detail the subtleties of memory sharing and conflict avoidance when several threads running in parallel simultaneously update different parts of the lattice. We have shown that the parallelization preserves all the most important properties of the public goods game related to statistical physics, in particular the continuous character of the C + D → D phase transition and the directed percolation universality class to which the phase transition belongs. While the calculation of these results would require weeks on many-core clusters if done with traditional CPU-based computing, a single capable graphics card decreases this time by a factor of 500, thus making these phenomena viable for presentation in the classroom.

We note that the described parallelization scheme can be easily adapted so that Monte Carlo simulations of other evolutionary games on the square lattice can be performed, such as for example the well-known prisoner's dilemma game [61, 62] or other social dilemmas [29, 38]. An important difference with regards to the public goods game is that the prisoner's dilemma game is played in a pairwise manner, not in groups. As a consequence, the memory needed to update strategies within a given domain is one line outward less in each direction (see Fig. 3), which can serve as good practice when adapting the source code. Of course, and as already demonstrated in original research published in the past [16–25, 27], the same approach can be used to simulate a broad variety of classical statistical physics systems.

With Moore's law, stating that the number of transistors in a dense integrated circuit doubles approximately every two years, arriving at its limits due to fundamental technical constraints in CPU design, today's hardware has to be designed in a multi-core manner to keep up. This in turn means that if we want to benefit from faster simulation times, the software has to be written in a multi-threaded manner to take full advantage of the hardware. The graphics card industry has invested admirable effort for this to materialize, one product of which is the friendly and thoroughly documented CUDA programming environment. The time is thus ripe for these programming techniques to be introduced into graduate and advanced undergraduate curricula, giving students the chance to learn the benefits of parallel computing from the onset of their physics education. We hope that this paper will be useful to that effect.
[1] Atomic Heritage Foundation, Computing and the Manhattan Project (atomicheritage.org/history, 2016).
[2] M. E. J. Newman and G. T. Barkema, Monte Carlo Methods in Statistical Physics (Oxford University Press, Oxford, 1999).
[3] K. Binder and H. Müller-Krumbhaar, Phys. Rev. B, 2328 (1974).
[4] K. Binder and D. P. Landau, Phys. Rev. B, 1941 (1980).
[5] K. Binder, Z. Phys. B Condensed Matter, 119 (1981).
[6] K. Binder, Rep. Prog. Phys., 487 (1997).
[7] H. Hinrichsen, Adv. Phys., 815 (2000).
[8] G. Ódor, Rev. Mod. Phys., 663 (2004).
[9] K. Binder and D. W. Heermann, Monte Carlo Simulations in Statistical Physics (Springer, Heidelberg, 1988).
[10] J. Marro and R. Dickman, Nonequilibrium Phase Transitions in Lattice Models (Cambridge University Press, Cambridge, U.K., 1999).
[11] D. Landau and K. Binder, A Guide to Monte Carlo Simulations in Statistical Physics (Cambridge University Press, Cambridge, 2000).
[12] E. Ising, Z. Phys., 253 (1925).
[13] L. Onsager, Phys. Rev., 117 (1944).
[14] E. Ibarra-García-Padilla, C. G. Malanche-Flores, and F. J. Poveda-Cuevas, Eur. J. Phys., 065103 (2016).
[15] W. Krauth, Statistical Mechanics: Algorithms and Computations (Oxford University Press, Oxford, 2006).
[16] T. Preis, P. Virnau, W. Paul, and J. J. Schneider, J. Comput. Phys., 4468 (2009).
[17] H. Schulz, G. Ódor, G. Ódor, and M. F. Nagy, Comp. Phys. Commun., 1467 (2011).
[18] J. Kelling, G. Ódor, M. F. Nagy, H. Schulz, and K.-H. Heinig, Eur. Phys. J. Special Topics, 175 (2012).
[19] J. Kelling and G. Ódor, Phys. Rev. E, 061150 (2011).
[20] G. Ódor, J. Kelling, and S. Gemming, Phys. Rev. E, 032146 (2014).
[21] M. Januszewski and M. Kostur, Comp. Phys. Commun., 183 (2010).
[22] J. Spiechowicz, M. Kostur, and L. Machura, Comp. Phys. Commun., 140 (2015).
[23] A. Barros, E. de Vilhena Garcia, and R. M. Silva, Comp. Phys. Commun., 989 (2011).
[24] A. Seibert, S. Denisov, A. Ponomarev, and P. Hänggi, Chaos, 043123 (2011).
[25] A. Y. Polyakov, T. Lyutyy, S. Denisov, V. Reva, and P. Hänggi, Comp. Phys. Commun., 1483 (2013).
[26] E. Alerstam, T. Svensson, and S. Andersson-Engels, J. Biomed. Optics, 060504 (2008).
[27] R. Juhász and G. Ódor, J. Stat. Mech., P08004 (2012).
[28] S. Tomov, M. McGuigan, R. Bennett, G. Smith, and J. Spiletic, Computers & Graphics, 71 (2005).
[29] G. Szabó and G. Fáth, Phys. Rep., 97 (2007).
[30] M. Perc, J. Gómez-Gardeñes, A. Szolnoki, L. M. Floría, and Y. Moreno, J. R. Soc. Interface, 20120997 (2013).
[31] K. Sigmund, Games of Life: Exploration in Ecology, Evolution and Behavior (Oxford University Press, Oxford, UK, 1993).
[32] J. W. Weibull, Evolutionary Game Theory (MIT Press, Cambridge, MA, 1995).
[33] J. Hofbauer and K. Sigmund, Evolutionary Games and Population Dynamics (Cambridge University Press, Cambridge, U.K., 1998).
[34] M. A. Nowak, Evolutionary Dynamics (Harvard University Press, Cambridge, MA, 2006).
[35] K. Sigmund, The Calculus of Selfishness (Princeton University Press, Princeton, NJ, 2010).
[36] G. Hardin, Science, 1243 (1968).
[37] C. Castellano, S. Fortunato, and V. Loreto, Rev. Mod. Phys., 591 (2009).
[38] M. Perc and A. Szolnoki, BioSystems, 109 (2010).
[39] Z. Wang, L. Wang, A. Szolnoki, and M. Perc, Eur. Phys. J. B, 124 (2015).
[40] M. R. D'Orsogna and M. Perc, Phys. Life Rev., 1 (2015).
[41] R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A. Vespignani, Rev. Mod. Phys., 925 (2015).
[42] Z. Wang, C. T. Bauch, S. Bhattacharyya, A. d'Onofrio, P. Manfredi, M. Perc, N. Perra, M. Salathé, and D. Zhao, Phys. Rep., 1 (2016).
[43] M. Perc, Phys. Lett. A, 2803 (2016).
[44] A. Szolnoki, M. Perc, and G. Szabó, Phys. Rev. E, 056109 (2009).
[45] A. Szolnoki and M. Perc, Phys. Rev. X, 041021 (2013).
[46] M. A. Javarone, A. Antonioni, and F. Caravelli, EPL, 38001 (2016).
[47] M. A. Javarone and F. Battiston, J. Stat. Mech., 073404 (2016).
[48] M. A. Nowak and R. M. May, Nature, 826 (1992).
[49] R. Dickman, Phys. Rev. E, 016124 (2001).
[50] A. Szolnoki, Phys. Rev. E, 057102 (2002).
[51] R. Dickman, Phys. Rev. E, 036122 (2002).
[52] A. Szolnoki, G. Szabó, and M. Ravasz, Phys. Rev. E, 027102 (2005).
[53] NVIDIA Corporation, CUDA Programming Guide (docs.nvidia.com/cuda/cuda-c-programming-guide, 2017).
[54] G. Szabó and C. Hauert, Phys. Rev. Lett., 118101 (2002).
[55] G. Szabó and C. Tőke, Phys. Rev. E, 69 (1998).
[56] Y. Shim and J. G. Amar, Phys. Rev. B, 125432 (2005).
[57] M. A. Nowak, Science, 1560 (2006).
[58] D. Helbing, A. Szolnoki, M. Perc, and G. Szabó, New J. Phys., 083005 (2010).
[59] H.-K. Janssen, Z. Phys. B Condensed Matter, 151 (1981).
[60] P. Grassberger, Z. Phys. B Condensed Matter, 365 (1982).
[61] C. Hauert and G. Szabó, Am. J. Phys., 405 (2005).
[62] M. A. Javarone, Eur. Phys. J. B 89.