Connected component identification and cluster update on GPU
Martin Weigel
Institut für Physik, Johannes Gutenberg-Universität Mainz, Staudinger Weg 7, D-55099 Mainz, Germany
(Dated: October 24, 2018)

Cluster identification tasks occur in a multitude of contexts in physics and engineering such as, for instance, cluster algorithms for simulating spin models, percolation simulations, segmentation problems in image processing, or network analysis. While it has been shown that graphics processing units (GPUs) can result in speedups of two to three orders of magnitude as compared to serial codes on CPUs for the case of local and thus naturally parallelized problems such as single-spin flip update simulations of spin models, the situation is considerably more complicated for the non-local problem of cluster or connected component identification. I discuss the suitability of different approaches of parallelization of cluster labeling and cluster update algorithms for calculations on GPU and compare to the performance of serial implementations.

PACS numbers: 05.10.Ln, 75.10.Hk, 05.10.-a
I. INTRODUCTION
Due to their manifold applications in statistical and condensed matter physics, ranging from the description of magnetic systems over models for the gas–liquid transition to biological problems, classical spin models have been very widely studied in the past decades. Since exact solutions are only available for a few exceptional cases [1], with the steady increase in available computer power and the advancement of simulational techniques, in many cases computer simulations have become the tool of choice even above the more traditional variational and perturbative techniques [2]. The workhorse of Monte Carlo simulations in the form of the Metropolis algorithm [3] is extremely general and robust, but suffers from problems of slowed-down dynamics in the vicinity of phase transitions, or for systems with complex free-energy landscapes. For the case of continuous phase transitions, critical slowing down is observed, with autocorrelation times increasing as τ ∼ L^z. Cluster-update algorithms [4, 5] substantially reduce the dynamical critical exponent z below the local-update value z ≈ 2 for practically considered system sizes.

Incidentally, the practical task of cluster identification resulting from the probabilistic description of the problem as a bond-correlated percolation model is identical to that encountered in image segmentation or computer vision, where neighboring pixels should be lumped together according to their colors, a problem which can be mapped to the Potts model [8, 9]. Also, numerical simulations of percolation problems, with their wide range of realizations from fluids in porous media to epidemic spreading [10], must deal with a very similar problem of cluster identification (see, e.g., Ref. [11]). Further applications occur in network analysis, particle tracking or the identification of structures such as droplets in condensed matter. Efficient implementations of cluster labeling algorithms are, therefore, of significant interest for a number of different applications in scientific computing.

In parallel to the invention of new simulation algorithms, the need for strong computing power for tackling hard problems has prompted scientists to always make the best use of the available computer resources of the time, be it regular PCs, vector computers or Beowulf clusters. For the case of simulations of spin models, for instance, a number of special purpose computers has been devised, including machines for local updates such as JANUS for spin glasses [12] and variants such as the "cluster processor" using cluster-update algorithms [13]. While these were (and are) highly successful in their specific application fields, their design and realization is a rather challenging endeavor, costly in terms of monetary as well as human resources. It is therefore desirable to search for a less application specific, but still highly performant platform for massively parallel scientific computing that is less expensive in terms of its acquisition as well as its power consumption and cooling requirements than traditional cluster computers. An architecture meeting those standards has become available in recent years with the advent of general purpose computing on graphics processing units (GPUs) [14, 15].
With the availability of convenient application programming interfaces (APIs) for GPU computing, most notably NVIDIA CUDA and OpenCL [15], the programming effort no longer dramatically exceeds that of CPU based parallel machines. Still, for efficient implementations architectural peculiarities of these devices need to be taken into account, in particular the organization of compute units (cores) in groups (multiprocessors) with reduced synchronization capabilities between multiprocessors, and the pyramid of memories with latencies, sizes and access scopes decreasing from base to tip. For the case of spin models, a wide range of simulation algorithms with local updates has been previously implemented on GPU [16–19], where for the implementations reported in Refs. [18, 20, 21] significant speedups of two to three orders of magnitude as compared to serial CPU codes have been reported. An efficient parallelization of non-local algorithms and cluster labeling is significantly more challenging, however, in particular for the case of cluster updates for spin models close to criticality, where the relevant clusters undergo a percolation transition and are therefore spanning the whole system [22–29].

The implementations discussed here have been realized within the NVIDIA CUDA [30] framework with benchmarks performed on the GTX 480, GTX 580 and Tesla M2070 GPUs. While some of the details are specific to this setup, the algorithmic approaches discussed are fairly general and could easily be applied to other GPU devices or realized with different APIs such as OpenCL. For an introduction into the details of the GPU hardware and the corresponding programming models, the reader is referred to the available textbooks (see, e.g., Ref. [15]) and previous articles by the present author [18, 20].

The rest of the paper is organized as follows. In Section II, GPU implementations of the cluster algorithm of Swendsen and Wang [4] are discussed. The cluster decomposition of the complete spin lattice necessary here is identical to that of a corresponding image segmentation problem or percolation simulation. Section III is devoted to the case of the single-cluster variant suggested by Wolff [5]. Finally, Section IV contains my conclusions.

II. SWENDSEN-WANG ALGORITHM
In this paper, I focus on the ferromagnetic q-state Potts model with Hamiltonian

  H = -J \sum_{\langle ij \rangle} \delta_{s_i s_j},   (1)

where s_i ∈ {1, . . . , q} denote the spin variables, J > 0 is the coupling constant, and the sum runs over pairs of nearest-neighbor sites. In dimensions d > 1, the model undergoes a transition from a disordered phase at high temperatures to an ordered phase where one of the q states prevails at low temperatures [31]. For d = 2, the transition is continuous for q ≤ 4 and of first order for q > 4, while in d = 3 it is first order for any q ≥ 3. The special case q = 2 is equivalent to the celebrated Ising model. A local Monte Carlo simulation of the Potts model proceeds by iteratively changing the orientation of randomly chosen spin variables in accordance with the detailed balance condition [2]. In contrast, the cluster algorithm of Swendsen and Wang [4] updates connected components of (usually) more than one spin and is based on the following transformation of the partition function due to Fortuin and Kasteleyn [32],

  Z = \sum_{\{s_i\}} \exp\Big( \beta J \sum_{\langle ij \rangle} \delta_{s_i s_j} \Big)   (2)
    = \sum_{\{s_i\}} \prod_{\langle ij \rangle} e^{\beta J} \big[ (1-p) + p\,\delta_{s_i s_j} \big]   (3)
    = \sum_{\{n_{ij}\}} \sum_{\{s_i\}} \prod_{\langle ij \rangle} e^{\beta J} \big[ (1-p)\,\delta_{n_{ij},0} + p\,\delta_{s_i s_j}\,\delta_{n_{ij},1} \big],   (4)

where β denotes the inverse temperature and p = 1 - e^{-βJ}. In Eq. (4), a set of auxiliary bond variables n_{ij} ∈ {0, 1} is introduced, where n_{ij} = 0 whenever s_i ≠ s_j and n_{ij} = 1 with probability p for s_i = s_j. The resulting stochastically defined clusters are therefore subsets of the geometric clusters of parallel spins. Using a graphical expansion of the term in square brackets in Eq. (4) and summing over the spin configurations {s_i}, it can be shown that the model is equivalent to a generalized percolation model with partition function [32, 33]

  Z = e^{\beta J E} \sum_{\{n_{ij}\}} p^{b(\{n_{ij}\})} (1-p)^{E - b(\{n_{ij}\})} q^{n(\{n_{ij}\})},   (5)

known as the random-cluster model. Here, b({n_{ij}}) denotes the number of activated edges resulting from the bond variables n_{ij}, n({n_{ij}}) is the number of connected components of the induced subgraph, and E is the total number of edges in the underlying graph or lattice. From the percolation representation (5) it is clear [34] that the stochastic clusters induced by the bond variables n_{ij} (and not the geometric clusters of like spins) undergo a percolation transition at the thermal transition point, and hence it is these structures that should be updated to efficiently decorrelate the system close to criticality.

Utilizing the representation (4), the algorithm by Swendsen and Wang alternatingly updates spins s_i and bond variables n_{ij} as follows:

1. For a given spin configuration, set n_{ij} = 0 for each bond with s_i ≠ s_j. Set n_{ij} = 1 and n_{ij} = 0 with probabilities p and 1 - p, respectively, for each bond with s_i = s_j.

2. Identify the connected components of the subgraph of the lattice induced by the bond variables n_{ij}.

3. Choose a new spin orientation randomly in {1, . . . , q} for each connected component and update the spin variables s_i accordingly.

Since clusters of single spins are possible, this update is trivially ergodic. It is straightforward to show that detailed balance is fulfilled [4, 7]. Hence, the Swendsen-Wang (SW) dynamics forms a valid Markov chain Monte Carlo algorithm for the Potts model. Autocorrelations are dramatically reduced as compared to local spin flips. A rigorous bound for the dynamical critical exponent is z_int ≥ α/ν [35], where z_int is the exponent of the scaling of the integrated autocorrelation time and α and ν are the (static) critical exponents of the specific heat and the correlation length, respectively.
This bound is close to sharp in two dimensions [36], but not in d = 3 where, nevertheless, significant reductions in autocorrelations and the dynamical critical exponent z are observed.

Attempting a highly parallel GPU implementation of the SW algorithm, it is clear that the bond activation in step 1 as well as the cluster flipping in step 3 can be rather easily parallelized, as they are perfectly local operations. In contrast, the cluster identification in step 2 must deal with structures spanning the whole system, in particular for simulations close to criticality, which are the main strength of cluster updates. This is also the crucial step for further applications of cluster identification such as the image segmentation problem mentioned above. The total run time for a single update of the spin lattice with the SW algorithm on a single GPU therefore decomposes as

  T^p_SW = T^p_activate + T^p_identify + T^p_flip.   (6)

We distinguish these times from the corresponding serial run times T^s_SW, T^s_activate, etc. for single-threaded calculations. For definiteness, the implementation is discussed in some detail for the specific example of the Potts model on the square lattice of edge length L with periodic boundary conditions. Generalizations to three dimensions or other lattice types are straightforward.
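As a point of reference for the GPU variants discussed below, the following minimal serial sketch implements one SW sweep along steps 1–3 for the square-lattice model; it uses union-and-find (see Sec. II B) for the identification step. The function names are illustrative, and std::rand() merely stands in for a proper RNG.

```cuda
#include <cstdlib>
#include <vector>

// Walk up the label tree (with path halving) to find the cluster root of site i.
static int find_root(std::vector<int> &parent, int i)
{
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

// One Swendsen-Wang sweep for the q-state Potts model on an L x L periodic
// lattice, with bond probability p = 1 - exp(-beta*J).
void sw_sweep(std::vector<int> &s, int L, int q, double p)
{
    const int N = L * L;
    std::vector<int> parent(N);
    for (int i = 0; i < N; ++i) parent[i] = i;

    // steps 1 and 2: activate bonds between parallel spins and merge clusters
    for (int i = 0; i < N; ++i) {
        const int nbr[2] = { i - i % L + (i + 1) % L,   // right neighbor
                             (i + L) % N };             // lower neighbor
        for (int k = 0; k < 2; ++k)
            if (s[i] == s[nbr[k]] && std::rand() / (RAND_MAX + 1.0) < p) {
                int ri = find_root(parent, i), rj = find_root(parent, nbr[k]);
                if (ri != rj) parent[rj] = ri;
            }
    }

    // step 3: draw a new orientation for each cluster and update all spins
    std::vector<int> neworient(N, -1);
    for (int i = 0; i < N; ++i) {
        const int r = find_root(parent, i);
        if (neworient[r] < 0) neworient[r] = std::rand() % q;
        s[i] = neworient[r];
    }
}
```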
A. Bond activation

We use an array of 2L² char variables to represent the bond activation states n_{ij}. For the GPU implementation using CUDA [15], bond activation is performed by a first kernel, prepare_bonds(). Given a configuration of the spins s_i, for each bond an expression of the form

  n_{ij} = 1 if s_i = s_j and r < p, and n_{ij} = 0 otherwise,   (7)

is evaluated, where r ∈ (0, 1) is a uniform (pseudo) random number, and p = 1 - e^{-βJ}. To enable parallelism, the system is broken up into tiles of B² = B_x × B_y spins, and each tile is assigned to an independent thread block. If we denote ℓ_x = L/B_x, ℓ_y = L/B_y and ℓ² = ℓ_x ℓ_y the number of tiles, the expected parallel run time behaves as

  T^p_activate ∼ [ℓ² / min(ℓ², n)] [B² / min(B²/k, m)],   (8)

where n denotes the number of multiprocessors (n = 14 for the Tesla M2070, n = 15 for the GTX 480, n = 16 for the GTX 580), m is the number of cores per multiprocessor (m = 32 for all three cards), and k is the number of sites assigned to each thread [37]. For large systems, ℓ² > n and B²/k > m, Eq. (8) reduces to T^p_activate ∼ ℓ²B² = L². As discussed in detail in Refs. [18, 20], each thread requires its own instance of a random number generator (RNG) to prevent the formation of a performance bottleneck. Due to the resulting large number of RNG instances (for the case of large systems), one requires a generator with a small state comprising, ideally, not more than a few bytes. This precludes the use of high-quality, but large-state generators such as the Mersenne twister [38] in applications of the type considered here. Additionally, one needs to ensure that the thus created streams of random numbers are sufficiently uncorrelated with each other. Suitable generator types for this purpose are, for instance, arrays of linear congruential generators with random seeds, which are fast but might not produce random numbers of sufficient quality [16, 18, 39], generalized lagged Fibonacci generators [18], or the Marsaglia generator as suggested in Ref. [19]. As the cluster identification step, which does not require random numbers, dominates the parallel runtime of the algorithm, RNG speed is not as important as in local update simulations on GPUs. For the benchmarks reported below, I used an array of 32-bit linear congruential generators. Statistically significant deviations from the exact results [40] for the q = 2 Potts model at criticality have not been observed.

An analysis of the kernel with CUDA's Compute Visual Profiler [30] shows that its performance is compute bound. Still, memory performance can be improved by using an appropriate memory layout, ensuring that reads of subsequent threads in the spin and bond arrays map to consecutive locations in global memory to ensure coalescence of memory requests [15]. With a linear memory arrangement these requirements are best met when using tiles with B_x ≫ B_y. Best results for systems with L > 256 are found here for B_x = 256, B_y = 4 (considering only lattice sizes L = 2^n, n ∈ ℕ). Evaluating the acceptance criterion leads to unavoidable thread divergence, but the effects are not very dramatic here. The asymptotic performance of the kernel with one spin per thread, k = 1, is T^p_activate/L² = 0.66 ns on the GTX 480 (assuming full loading of the multiprocessors, which is reached for sufficiently large systems). An alleviating effect on thread divergence and memory limitations is reached by assigning several spin pairs (bonds) to each thread. Two versions have been considered here, either assigning a (square) sub-block of four spins to each thread (k = 4), which brings down the updating time to T^p_activate/L² = 0.46 ns, or assigning only B_x threads per tile, each of which has to update k = B_y spins, leading to the same asymptotic performance of T^p_activate/L² = 0.46 ns on the GTX 480. The better performance of these variant kernels has to do with the possibility of pre-fetching of data into registers while arithmetic operations are being performed. The same kernel is used to also initialize the cluster labels (see below). Note that the relatively lower performance of this kernel as compared to the Metropolis update of the Ising model reported in Ref. [18] of about 0.13 ns per spin (without multi-hit updates) on the same hardware is explained by the 6-fold increase in memory writes (two chars and one int versus one char) and the use of two random numbers (instead of one) per spin.
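A sketch of such a bond-activation kernel with one spin per thread (k = 1) is given below. The per-thread 32-bit linear congruential generator, the bond layout (two bonds per site, in the +x and +y directions), and the index arithmetic are illustrative assumptions; the production kernel additionally initializes the cluster labels and uses the tiled memory layout described above.

```cuda
// One thread per spin: activate the +x and +y bonds of site (x, y) according
// to Eq. (7). d_rng holds one 32-bit LCG state per site; the LCG constants
// are the standard Numerical Recipes values.
__global__ void prepare_bonds(const char *d_spin, char *d_bond,
                              unsigned int *d_rng, int L, float p)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const int i = y * L + x;
    unsigned int s = d_rng[i];

    const int right = y * L + (x + 1) % L;        // periodic +x neighbor
    const int up    = ((y + 1) % L) * L + x;      // periodic +y neighbor

    s = 1664525u * s + 1013904223u;               // one LCG step per bond
    d_bond[2 * i]     = (d_spin[i] == d_spin[right] && s * 2.3283064e-10f < p);
    s = 1664525u * s + 1013904223u;
    d_bond[2 * i + 1] = (d_spin[i] == d_spin[up]   && s * 2.3283064e-10f < p);

    d_rng[i] = s;
}
```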
B. Cluster labeling on tiles
To allow for an efficient use of parallel hardware, cluster labeling is first performed on (square) tiles of B × B spins, and cluster labels are consolidated over the whole lattice in a second step (see Sec. II C below) [22–24, 26–28]. Hence the time for cluster identification naturally breaks up into two contributions,

  T^p_identify = T^p_local + T^p_global.   (9)

In the field of simulations of spin systems (and percolation), the standard technique for cluster identification is due to Hoshen and Kopelman [11]. Although originally not formulated in this context, it turns out to be a special version of the class of union-and-find algorithms well known in theoretical computer science [41]. Time and storage requirements for this approach scale linearly with the number N = L² of sites. A somewhat more "natural" approach consists of a set of breadth-first searches on the graph of bonds, growing the clusters in a layered fashion. While storage requirements are super-linear in N (and might be as large as N² depending on the structure of the underlying graph), computing time still scales linearly in N and implementations are typically very straightforward and efficient. A third approach considered here, dubbed self-labeling [23], is very inefficient regarding (serial) computing time, but very well suited for parallelization.
1. Breadth-first search
In breadth-first search (or "ants in the labyrinth") the unvisited neighbors of a starting vertex or seed that are connected by activated bonds are examined and stored in a first-in-first-out (FIFO) data structure (a queue). Subsequently, nodes are removed from the queue in the order they have been stored and examined in the same fashion as the initial vertex. This leads to a layered growth of the identified part of a cluster, as illustrated in Fig. 1. The complete set of clusters is identified by seeding a new cluster at each node of the lattice that is not yet part of a previously identified cluster. Information about the cluster structure is stored in an array of cluster labels, where originally each cluster label is initialized with the site number on the lattice, and cluster labels are set to that of the seed site on growing the cluster, cf. Fig. 1.

FIG. 1. (Color online) Cluster identification on an 8 × 8 tile using a breadth-first search. The already labeled sites are indicated in blue (dark squares), while the current wave front of unvisited neighbors is shaded in red (light squares).

While this approach is very general (it can be applied without changes to any graph) and well suited for serial calculations, it is not very suitable for parallelization [42]. Parallelism can be implemented in the examination of different neighbors of a site and in processing the sites on the wave front of the growing cluster. To avoid race conditions and achieve consistency of the data structures, however, locks or atomic operations are required, considerably complicating the code. Additionally, the number of (quasi) independent tasks is highly variable, as the length of the wave fronts fluctuates quite strongly. For the case of a parallel identification of all clusters, as necessary for the SW algorithm and image segmentation, this approach is hence not very well suited for a GPU implementation. A parallel implementation will be discussed below in the context of the single-cluster (or Wolff) variant of the algorithm in Sec. III, however. The parallel run time of this kernel, local_BFS(), employing one thread per block performing cluster identification in a tile of edge length B, is therefore expected to scale as

  T^p_local ∼ [ℓ² / min(ℓ², n)] B².   (10)

The measured run times for ℓ² > n follow this expectation, resulting in perfectly linear scaling of the time T^p_local/ℓ² per tile with the number B² of tile sites, cf. Fig. 2. Since only a maximum of 8 thread blocks can be simultaneously active on each multiprocessor on current generation NVIDIA GPUs [15], however, 24 of the 32 cores of each multiprocessor are idling, leading to rather low performance. The asymptotic maximum performance for large system sizes (leading to an optimum effect of latency hiding through the scheduler) on the GTX 480 is at around T^p_local/L² ≈ 13 ns for local_BFS().

FIG. 2. (Color online) Parallel average run time per tile for local cluster labeling on a 4096 × 4096 system as a function of the tile size B. Data are for the q = 2 states Potts model at the critical point. Breadth-first search and tree-based union-and-find are (up to logarithmic corrections) proportional to the number B² of sites, while self-labeling exhibits scaling proportional to B^{2+d_min} ≈ B^{3.08}. The weaker scaling proportional to B^{d_min} ≈ B^{1.08} of self-labeling for small B is due to under-utilization of GPU cores (see main text). The lines are fits of the power law T^p_local/ℓ² = A B^κ with the indicated fixed exponents to the data.
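The following sketch shows the tile-local search as it could be executed by the single thread of local_BFS(); the bond layout (bond[2i], bond[2i+1] for the +x and +y bonds of site i) and the externally provided queue array are assumptions of this sketch, and bonds crossing the tile boundary are simply ignored at this stage.

```cuda
// Label all clusters on a B x B tile by repeated breadth-first searches,
// seeding a new cluster at every not-yet-visited site. label[] and queue[]
// must hold B*B entries each.
__device__ void bfs_label_tile(const char *bond, int *label, int *queue, int B)
{
    const int N = B * B;
    for (int i = 0; i < N; ++i) label[i] = i;     // each site its own label

    for (int seed = 0; seed < N; ++seed) {
        if (label[seed] != seed) continue;        // already labeled earlier
        int head = 0, tail = 0;
        queue[tail++] = seed;
        while (head < tail) {                     // layered growth, cf. Fig. 1
            const int i = queue[head++];
            const int x = i % B, y = i / B;
            const int  nbr[4]  = { i + 1, i - 1, i + B, i - B };
            const char open[4] = { x < B - 1 && bond[2 * i],
                                   x > 0     && bond[2 * (i - 1)],
                                   y < B - 1 && bond[2 * i + 1],
                                   y > 0     && bond[2 * (i - B) + 1] };
            for (int k = 0; k < 4; ++k)
                if (open[k] && label[nbr[k]] != seed) {
                    label[nbr[k]] = seed;         // adopt the seed's label
                    queue[tail++] = nbr[k];
                }
        }
    }
}
```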
2. Union-and-find algorithms
It is a well known problem in computer science to partition a set of elements into disjoint subsets according to some connectedness or identity criterion. A number of efficient algorithms for this so-called union-and-find problem have been developed [41]. Consider a set of N elements denoted as vertices in a graph which, initially, has no edges. Now, a number of edges is sequentially inserted into the graph, and the task is to successively update a data structure that contains information about the connected components resulting from the edge insertion. Obviously, our cluster identification task is a special case of this problem. In a straightforward implementation one maintains a forest of spanning trees where each node carries a pointer to its parent in the tree, unless it is the tree root, which points to itself. On insertion of an edge one determines the roots of the two adjacent vertices by successively walking up the respective tree structures (find). If the two roots found are the same, the inserted edge was internal and no further action is required. If two different roots were found, the edge was external and one of the trees is attached to the root of the other as a new branch (union), thus amalgamating two previously disjoint subsets or connected components of the graph. The forest structure can be realized with an array of node labels, where each node is initialized to point to itself (i.e., it is its own root). This process is illustrated for the present application in Fig. 3.

FIG. 3. (Color online) Cluster labeling using union-and-find with balanced trees and (partial) path compression on an 8 × 8 tile. Under insertion of the edge between sites 30 and 41, the smaller cluster with root No. 32 is attached to the root of the larger cluster at No. 13.

(Worst case) time complexity is trivially constant or O(1) for union steps, while find steps can be extensive, O(N), if edges connecting macroscopic clusters are considered. (Storage requirements are clearly just linear in N.) The complexity of the find step can be reduced by two tricks, tree balancing and path compression. Balancing can be achieved by making sure that always the smaller tree (in terms of the number of nodes) is attached to the larger. To this end, the current number of nodes is stored in the tree root. Balancing reduces the time to find the root to O(log N) steps [41]. Similarly path compression, which redirects the "up" pointer of each node to point directly to the tree root in a backtracking operation after each completed find task, reduces the find complexity to O(log N). The combination of both techniques can be shown to result in an essentially constant find complexity for all practically relevant system sizes [43]. An implementation of the full algorithm geared towards cluster identification is described in Ref. [44].

Like the breadth-first search, the tree-based union-and-find approach is intrinsically serial, as all operations work on the same forest structure, whose consistency could not be easily maintained under parallel operations. Moderate parallelism is possible in the union step, where the two find operations for the vertices connected by the new edge can be performed in parallel. Due to the resultant thread divergence, however, using two threads per block is found to actually decrease performance. Similarly, the extra effect of path compression (keeping the stack for backtracking in fast shared memory) is found to actually increase run times, at least in the range of tile sizes 4 ≤ B ≤ 64 considered. The parallel scaling of the algorithm is thus the same (up to logarithmic corrections) as that of breadth-first search given in Eq. (10). In fact, T^p_local/ℓ² is found to be almost perfectly linear in B² in the considered regime, cf. the data presented in Fig. 2. The asymptotic performance (neglecting logarithmic terms due to the find step) of the kernel local_unionfind() on the GTX 480 is found to be T^p_local/L² ≈ 8 ns, somewhat better than that of local_BFS(). Note that for the tree-based algorithms of the union-and-find type, memory accesses are inherently non-local, leading to a certain performance penalty which can hardly be avoided.
3. Self-labeling
While breadth-first search and tree-based union-and-find are elegant and very efficient for serial implementations, they appear not very suitable for parallelization, especially on GPUs, where groups of threads are executed in perfect synchrony or lockstep and extensive thread divergence is expensive. An antipodal type of algorithm is given by the simple approach of self-labeling [23]: cluster labels are initialized with the site numbers. Each site then compares its label to that of its eastward neighbor and sets its own and this neighbor's labels to the minimum of the two, provided that an activated bond connects the two sites. The same is subsequently done with respect to the northward neighbor, cf. Fig. 4. (On a simple cubic lattice, the analogous procedure would involve three out of six neighbors.)

FIG. 4. (Color online) Cluster identification on an 8 × 8 tile using the self-labeling algorithm with one thread per 2 × 2 block of spins.

Clearly, the outcome of a whole sweep of re-labeling events will depend on the order of operations, and several passes through the tile will be necessary until the final cluster labels have propagated through the whole system. Eventually, however, no label will have changed in a whole pass through the tile and the procedure can be stopped, leading to a correct labeling of clusters inside of each tile. Let us first concentrate on the spin model at criticality. Then, clusters typically span the tile, such that at least of the order of B sweeps will be required to pass information about correct cluster labels from one end of the tile to the other. In fact, even more passes are necessary, as information about cluster labels needs to be diffusively transmitted between each pair of sites in the same cluster. Since under the chosen dynamics this information will be transmitted along the shortest path connecting the two sites, the required number of sweeps will scale as

  n_B ∼ B^{d_min},   (11)

where d_min ≥ 1 denotes the fractal dimension of the shortest paths on the clusters. For percolation (q → 1), d_min ≈ 1.13 in d = 2 and d_min ≈ 1.34 in d = 3 [45], whereas for the (Fortuin-Kasteleyn clusters of the) q = 2 and q = 3 Potts models in two dimensions it has been estimated as d_min = 1.08 for q = 2, with a similar value for q = 3 [46]. For k sites assigned to each thread, the expected parallel run time is therefore

  T^p_local = C_local [ℓ² / min(ℓ², n)] [B² / min(B²/k, m)] B^{d_min}   (12)

at or close to the percolation transition, which asymptotically appears to be rather unflattering as compared to the breadth-first search and union-and-find techniques. Due to the parallelization on the tile level, however, the total run time can still be quite low for intermediate tile sizes. Off criticality, the scaling becomes somewhat more favorable. Below the transition, where clusters span the lattice but are no longer fractal, d_min should be replaced by one. Above the transition, on the other hand, with a finite correlation length ξ, B^{d_min} in Eq. (12) is replaced by min(ξ, B). While this somewhat better behavior is probably not very relevant for the spin models, as simulations close to criticality are the main purpose of cluster-update algorithms, it is of importance for percolation simulations or image segmentation problems for the (typical) case of a finite characteristic length scale ξ.

Figure 2 shows the scaling of parallel run times for the kernel local_selflabel() on tiles of sizes 4 ≤ B ≤ 64 for the q = 2 Potts model at the critical point β_c = ln(1 + √q). One observes T^p_local/ℓ² ∼ B^{d_min} ≈ B^{1.08} for B²/k < m and T^p_local/ℓ² ∼ B^{2+d_min} ≈ B^{3.08} for B²/k > m. (The data in Fig. 2 are for k = 4 on the GTX 480 with m = 32, such that the crossover occurs at B² = km = 128, i.e., around B ≈ 11.) In this range of tile sizes B ≤ 64, self-labeling is clearly superior in parallel performance on GPU as compared to breadth-first search or union-and-find, although it becomes less efficient than the latter two approaches for still larger tiles. Three variants of the kernel have been implemented: (a) a kernel local_selflabel_small() that assigns one spin per thread, restricting the tile size to B ≤ 32 on current NVIDIA GPUs with a limitation of 1024 threads per block, (b) a kernel local_selflabel() which assigns a 2 × 2 block of spins to each thread, allowing for tile sizes up to B = 64, and (c) a looped version, local_selflabel_block(), that assigns one column of height B to each thread, such that the lines are worked on through a loop. In all cases, the relevant portions of the bond activation variables and cluster labels are copied to shared memory, such that memory fetches in the re-labeling steps are (almost) instantaneous. Bank conflicts are avoided through an appropriate layout of the data in shared memory. Depending on the number of spins per thread, a different order of operations can lead to different results for each single self-labeling pass. Consistency could be enforced via atomic operations, but these slow down the code and are found to be not necessary here. Therefore, while the number of necessary self-labeling passes might vary from run to run (or device to device) depending on scheduling specificities, the final result is deterministic and does not depend on the order of operations. The decision about the end of self-labeling is taken using the vote function __syncthreads_or() [30], which evaluates to true as long as any of the threads has seen a re-labeling event in the last pass. Performance differences between the mentioned three kernels are found to be relatively small. The best asymptotic performance is observed for the kernel local_selflabel() with 2 × 2 blocks of spins per thread; at B = 16 on the GTX 480 the run time per spin is T^p_local/L² = 1.08 ns for all labeling passes. While the total number of operations is larger for self-labeling than for breadth-first search or union-and-find, the former is 13 and 8 times faster than the latter at B = 16, respectively, due to the easily exploited inherent parallelism.
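As an illustration, the following sketch shows a one-spin-per-thread variant in the spirit of local_selflabel_small() (so B ≤ 32, with blockDim.x = blockDim.y = B); the translation between local and global labels is simplified, and the benign label races discussed above are deliberately left unprotected.

```cuda
// One B x B tile per thread block, one spin per thread. Local labels are
// iteratively replaced by the minimum over bond-connected east/north pairs
// until a pass produces no change, detected with __syncthreads_or().
__global__ void local_selflabel_small(const char *d_bond, int *d_label, int L)
{
    const int B = blockDim.x;                       // tile edge length (<= 32)
    const int tx = threadIdx.x, ty = threadIdx.y;
    const int i = (blockIdx.y * B + ty) * L + blockIdx.x * B + tx;  // global site
    const int t = ty * B + tx;                      // tile-local site index

    __shared__ int label[32 * 32];
    label[t] = t;
    __syncthreads();

    int changed = 1;
    while (__syncthreads_or(changed)) {
        changed = 0;
        if (tx < B - 1 && d_bond[2 * i]) {          // active bond eastwards
            const int m = min(label[t], label[t + 1]);
            if (label[t] != m || label[t + 1] != m) {
                label[t] = m; label[t + 1] = m;     // races here are benign
                changed = 1;
            }
        }
        if (ty < B - 1 && d_bond[2 * i + 1]) {      // active bond northwards
            const int m = min(label[t], label[t + B]);
            if (label[t] != m || label[t + B] != m) {
                label[t] = m; label[t + B] = m;
                changed = 1;
            }
        }
    }
    // translate the tile-local label back to a global site label
    d_label[i] = (blockIdx.y * B + label[t] / B) * L
               + blockIdx.x * B + label[t] % B;
}
```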
C. Tile consolidation

Each of the three cluster labeling algorithms on tiles discussed above results in correct cluster labels inside of tiles, however ignoring the information of any active bonds crossing tile boundaries. To reach unified labels for clusters spanning several tiles, an additional consolidation phase is necessary. Two alternatives, an iterative relaxation procedure and a hierarchical sewing scheme, have been considered to this end.
1. Label relaxation
Cluster labels can be consolidated across tile boundaries using a relaxation procedure similar to the self-labeling employed above inside of tiles [28]. In a preparation step, for each edge crossing a tile boundary, the indices of the cluster roots of the two sites connected by the boundary-crossing bond are stored in an array, cf. Fig. 5 (kernel prepare_relax()).

FIG. 5. (Color online) Tile consolidation via label relaxation. For each spin on the boundary of a tile (squares) with an off-tile active bond, the local root nodes (circles) are stored in an array. The corresponding local root labels are transmitted to neighboring tiles, which change their local labels to the minimum of their own and the received labels.

In the relaxation phase each tile sets the root labels of its own active boundary sites to the minimum of its own current label and that of the corresponding neighboring tile. Relaxation steps are repeated until local cluster labels do not change any further. Similar to self-labeling, the number of relaxation steps scales as the shortest path between two points on the largest cluster(s); however, the relevant length scale for the relaxation procedure is now ℓ = L/B, leading to the following scaling behavior at the percolation threshold,

  n_relax ∼ ℓ^{d_min}.   (13)

For systems below the transition temperature or more general cluster identification tasks with extensive, but non-fractal clusters, d_min is replaced by 1, whereas above the critical point and for other problems with finite characteristic length scales, n_relax ∼ ξ/B. The number of iterations n_relax is shown for a simulation of the q = 2 Potts model at criticality in Fig. 6. The expected scaling with d_min = 1.08 [46] is well observed for sufficiently large system sizes across all tile sizes B: a fit of the functional form (13) results in an estimate compatible with d_min = 1.08. The systematic decrease of n_relax with increasing B results from concurrency effects: for a small total number of tiles, many of them are treated at the same time on different multiprocessors, resulting in the possibility of a label propagating to tiles more than one step away in one pass if (as is likely) several of the boundary sites belong to the same clusters.

The number of operations per relaxation iteration is proportional to the length of the tile boundary times the number of tiles, i.e.,

  t_relax ∼ B ℓ².   (14)

FIG. 6. (Color online) Required number n_relax of iterations for the label relaxation technique for tile consolidation as a function of the renormalized system size ℓ = L/B, for tile sizes B = 4, 8, 16, 32, and 64. The line is a fit of the form (13) to the data for sufficiently large ℓ, consistent with d_min = 1.08.

The relaxation routine (kernel relax()) appears intrinsically serial in nature, as different boundary spins can point to the same roots, such that concurrent operations could lead to inconsistencies unless appropriate locks are used. Nevertheless, an alternative implementation (kernel relax_multithread()) using B threads to update a number of boundary spin pairs concurrently in a thread block is perfectly valid: similar to the self-labeling approach, only the number of necessary iterations is affected by the order of operations, while the final result is not changed. As different blocks can essentially only be synchronized between kernel calls, the stopping criterion is checked on the CPU in between kernel invocations. The parallel run time for this kernel is then given by

  T^p_global = C_relax [ℓ² / min(ℓ², n)] [B / min(B, m)] ℓ^{d_min}.   (15)

Note that the asymptotic effort per spin from the relaxation phase, T^p_global/L² ∼ B^{-1} ℓ^{d_min} ∝ L^{d_min}, does not become constant as the system size is increased, unless the tile size B is scaled proportionally to L.

For root finding in the spin-flipping phase, it is of some relevance that the relaxation process effectively attaches all sub-clusters in tiles belonging to the same global cluster directly to the root of the sub-cluster with the smallest cluster label. Therefore, the algorithm involves path compression on the level of the coarse-grained lattice.
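The corresponding host-side control flow can be sketched as follows; the changed-flag protocol and the kernel signature are assumptions of this sketch, with relax_multithread() (the kernel named in the text) setting *d_changed whenever it lowers a label.

```cuda
// Iterate relaxation passes until no boundary label changes; since blocks can
// only be synchronized between kernel calls, the stopping criterion is
// evaluated on the CPU after every pass.
void consolidate_by_relaxation(int *d_label, const int *d_rootpairs,
                               int npairs, int *d_changed)
{
    int h_changed = 1;
    const int threads = 256;
    const int blocks = (npairs + threads - 1) / threads;
    while (h_changed) {
        cudaMemset(d_changed, 0, sizeof(int));
        relax_multithread<<<blocks, threads>>>(d_label, d_rootpairs,
                                               npairs, d_changed);
        cudaMemcpy(&h_changed, d_changed, sizeof(int),
                   cudaMemcpyDeviceToHost);
    }
}
```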
2. Hierarchical sewing
An alternative technique of label consolidation across tiles uses a hierarchical or divide-and-conquer approach as schematically depicted in Fig. 7 [23].

FIG. 7. (Color online) Hierarchical sewing of 64 tiles for label consolidation. On level 1, 2 × 2 groups of neighboring tiles are sewn together; the procedure is repeated on successively coarser levels until the whole lattice is covered.

On the first level, 2 × 2 groups of tiles of B × B spins are sewn together by inserting the missing bonds crossing tile boundaries, resulting in ℓ/2 × ℓ/2 consolidated tiles of edge length 2B. Finding the cluster roots is fast for tiles of edge length B, either because tile labeling was performed with the breadth-first or self-labeling algorithms, which produce labelings with complete path compression (i.e., each node label points directly to the root), or since it was done using union-and-find with (at least) one of the ingredients of tree balancing or path compression, leading to (at most) logarithmic time complexity of finds. Then, using tree balancing in the hierarchical sewing step ensures that find times remain logarithmically small as tiles are combined. Time complexity could be further improved by adding path compression, but (as for union-and-find inside of tiles) it is found here that this rather slows down the code in the range of lattice sizes considered. Note that the self-labeling approach does not naturally provide the information about cluster sizes in the tree roots. It is found, however, that it has no adverse effect on the performance of the tile consolidation step if cluster sizes are simply assumed to be identical (and, for simplicity, equal to one) for partial clusters inside of tiles.

One thread block is assigned to a configuration of 2 × 2 tiles to be sewn together. Since the tile edge length at the k-th generation is B_k = 2^k B and the length of the seam is 2 × 2^k B, the serial computational effort for level k of the sewing is

  T^s_k = C_sew (L / 2^k B)² (2 × 2^k B) = 2 C_sew (L²/B) 2^{-k},   (16)

where I have neglected logarithmic terms due to the find operations. The total number of levels is k_max = log₂(L/B) (assuming, for simplicity, that L and B are powers of two). Hence, the total serial effort is

  T^s_global = Σ_{k=1}^{k_max} T^s_k = (C_sew L²/B) Σ_{k=1}^{k_max} (1/2)^{k-1} = (2 C_sew L²/B) (1 - B/L).   (17)

On the GPU device with n multiprocessors mapped to independent blocks available for the sewing procedure, the parallel run time for generation k is

  T^p_k = T^s_k / min[(ℓ 2^{-k})², n].   (18)

For a sufficiently large system, at the beginning of the process the number of tiles (ℓ 2^{-k})² to sew will always exceed n. As the number of remaining tiles is reduced, the number of sewing jobs will drop to reach the number of multiprocessors at (ℓ 2^{-k*})² = n, or

  k* = log₂(ℓ/√n),   (19)

where another approximation is made by allowing for non-integer level numbers k. Due to the variable number of multiprocessors actually involved in the calculation, the total parallel effort has two contributions,

  T^p_global = Σ_{k=1}^{k*} T^s_k / n + Σ_{k=k*+1}^{k_max} T^s_k / (ℓ 2^{-k})²
             = (2 C_sew L² / nB) (1 - 2^{-k*}) + 4 C_sew B (2^{k_max} - 2^{k*})
             = C_sew L² [2/(nB) + (4 - 6/√n)(1/L)].   (20)

Therefore, the effort T^p_global/L² per site becomes asymptotically independent of L, but this limit is approached rather slowly with a 1/L correction, whereas effects of incomplete loading of the device decay as 1/L² (in two dimensions). This is illustrated by the numerical results shown in Fig. 8.

FIG. 8. (Color online) Total parallel run times T^p_global/L² for label consolidation via hierarchical sewing as a function of system size L for different tile sizes B for the q = 2 critical Potts model on the GTX 480. The lines are fits of the form T^p_global/L² = a/L + b/B to the data.
The data are well described by the form

  T^p_global/L² = a/L + b/B,   (21)

expected from Eq. (20). Comparing Eqs. (20) and (21), from the ratio a/b of the fit parameters one can deduce the effective number n of processing units as

  n = [9 + 4(a/b) + 3√(9 + 8(a/b))] / 8,   (22)

and, for instance, the fit at constant B = 16 yields an estimate of n close to the actual value of n = 15 for the GTX 480, while a fit at constant L = 8192 results in a similar estimate. The remaining deviations of the estimated n are attributed to effects of thread divergence and the neglect of logarithmic terms in the find step. For tile size B = 16, the asymptotic performance of this kernel is found to be T^p_global/L² = 0.78 ns.
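One generation of the sewing procedure can be sketched as follows, re-using uf_find()/uf_union() from the union-and-find sketch above. The single working thread per block, the seam indexing, and the assumption that the tile labelings have already been converted to the parent encoding are all simplifications of this sketch.

```cuda
// Sew 2 x 2 groups of tiles of edge Bk: one thread block per group walks the
// cross-shaped internal seam and inserts the boundary-crossing bonds.
__global__ void sew_level(const char *d_bond, int *d_parent, int L, int Bk)
{
    if (threadIdx.x != 0) return;                 // one working thread per block
    const int x0 = blockIdx.x * 2 * Bk;           // origin of the 2 x 2 group
    const int y0 = blockIdx.y * 2 * Bk;

    for (int s = 0; s < 2 * Bk; ++s) {
        // +x bonds across the vertical seam at x = x0 + Bk - 1
        const int i = (y0 + s) * L + x0 + Bk - 1;
        if (d_bond[2 * i]) uf_union(d_parent, i, i + 1);
        // +y bonds across the horizontal seam at y = y0 + Bk - 1
        const int j = (y0 + Bk - 1) * L + x0 + s;
        if (d_bond[2 * j + 1]) uf_union(d_parent, j, j + L);
    }
}
```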
D. Cluster flipping
Having globally identified the connected components resulting from the bond configuration, step 3 of the SW algorithm described at the beginning of Sec. II consists of assigning new, random spin orientations to each cluster and adapting the orientation of each spin in the cluster to the new orientation prescribed. Since it is inconvenient to keep a separate list of global cluster roots, it is easiest to generate a new random spin orientation for each lattice site, while only using this information at the cluster roots.

FIG. 9. (Color online) Optimal tile size B_opt in cluster identification with self-labeling and label relaxation for the q = 2 states critical Potts model as a function of L. The solid line shows the result of Eq. (26) and the dashed line represents the optimum actually observed under the constraints B ≤ 64 and B = 2^n, n = 1, 2, . . .

To this end, the array of now superfluous bond activation variables is re-used. In a first kernel, prepare_flip(), a random orientation is drawn and stored in the bond array for each site. This is done in tiles of B_x × B_y sites as for the bond activation, using the same array of random-number generators. In a second step (kernel flip()), each site performs a find operation to identify its root and applies the new spin orientation found there to the local spin. Since cluster labels are effectively stored in a tree structure, this step involves non-local memory accesses for each site. In principle, locality could be improved here by employing full path compression in the union steps before, but in practice this is not found to improve performance for the system sizes up to 16 384 × 16 384 considered here. Another possible improvement would eliminate the wasteful operation of drawing new proposed orientations for all spins, while only the new orientations of the cluster roots are required. This can be achieved by carrying the flipping information piggy-back on the cluster labels, at least for the q = 2 or Ising model, where the flipping information is only one bit wide. Again, however, in practice it is found that due to the incurred complications in the arithmetics regarding cluster labels in find and union operations, overall performance is actually decreased by this "optimization". Due to the necessary tree traversal, the performance of the cluster flipping procedure depends weakly on the degree of path compression performed previously in cluster labeling on tiles and label consolidation, as well as on the tile size B. For the combination of self-labeling on tiles and hierarchical sewing, it is found to be T^p_flip/L² = 0.201 ns for L = 8192 and B = 16, while it is somewhat smaller at 0.133 ns if label relaxation is used instead of hierarchical sewing.

FIG. 10. (Color online) Total run times T^p_SW per spin in nanoseconds for the Swendsen-Wang update of the q = 2 critical Potts model on the GTX 480 from self-labeling on tiles plus label consolidation with label relaxation and hierarchical sewing, respectively. Lines are guides to the eye.
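The two flipping kernels can be sketched as follows; the re-use of the bond array for the proposed orientations follows the text, while the simple label-chasing loop in flip() and the LCG details are assumptions of this sketch.

```cuda
// Draw a proposed new orientation for every site and store it in the (now
// superfluous) bond array; only the entries at cluster roots will be used.
__global__ void prepare_flip(char *d_bond, unsigned int *d_rng, int N, int q)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    const unsigned int s = 1664525u * d_rng[i] + 1013904223u;
    d_bond[i] = s % q;
    d_rng[i] = s;
}

// Each site finds its cluster root and adopts the orientation proposed there.
__global__ void flip(char *d_spin, const char *d_bond,
                     const int *d_label, int N)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int r = i;
    while (d_label[r] != r) r = d_label[r];       // find the cluster root
    d_spin[i] = d_bond[r];
}
```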
E. Performance and benchmarks
As a number of options for the cluster identification task have been discussed, the question arises which of them is the most efficient for a given set of parameters and a given GPU device. For the bond activation and cluster flipping steps, the situation is simpler, as no important variants have been discussed there, such that these steps are always performed with the outlined local approaches and tiles with B_x = 256 and B_y = 4, apart from the smallest systems with L < B_x. For the combination of self-labeling on tiles with hierarchical sewing, the total cluster identification time per spin is

  T^p_identify/L² = (C_local/mn) B^{d_min} + (a/L + b/B),   (23)

assuming that B²/k ≥ m in Eq. (12). Here, a and b are the parameters from Eq. (21). On minimizing, the optimal tile size is then found to be

  B_opt = [b m n / (d_min C_local)]^{1/(d_min+1)}.   (24)

FIG. 11. (Color online) Break down of parallel run times for one SW update per spin into the components of bond activation, local labeling, label consolidation via the sewing approach, and spin flipping. The CPU curve shows reference data for a serial implementation running on an Intel Core 2 Quad Q6700 at 2.66 GHz.

The fit parameters for the runs on the GTX 480 and d_min = 1.08 then yield an optimum for which B = 16 is the closest of the considered sizes, since, for simplicity, runs were restricted to L and B being powers of two. Similar fits for the data on the GTX 580 and the Tesla M2070 also used for test runs yield the same optimum in the power-of-two step sizes. The pre-asymptotic branch with B²/k ≤ m in Eq. (12) does not yield an optimum, but total run times are monotonically decreasing with B. In other words, as long as idle cores in the multiprocessors are available, the tile size should be increased. B = 16 hence is also the global optimum for this setup.

For the combination of self-labeling and label relaxation, the total run time for an update is

  T^p_identify/L² = (C_local/mn) B^{d_min} + (C_relax/mn) L^{d_min} / B^{d_min+1},   (25)

such that the optimal tile size becomes

  B_opt = [C_relax (d_min + 1) L^{d_min} / (C_local d_min)]^{1/(2 d_min+1)},   (26)

which (with d_min ≈ 1.08) is approximately proportional to L^{1/3} for the critical q = 2 Potts model in two dimensions. Figure 9 shows the resulting optimal tile size as a function of L. Due to the limitation of shared memory to 48 kB on current NVIDIA GPUs, self-labeling on tiles is limited to block sizes B ≤ 64 (assuming B = 2^n, n = 1, 2, . . .), such that the optimal tile sizes cannot be used for the largest systems considered. Performing self-labeling in global memory instead is no option, as it slows down the code dramatically. Using breadth-first search or union-and-find on larger tiles is feasible, but does significantly increase the total run time, even though the relaxation phase is slightly more efficient. I therefore did not increase the tile size beyond B = 64, as indicated in Fig. 9.

FIG. 12. (Color online) Run times for the SW update on the GTX 480 as a function of the inverse temperature β for the relaxation and sewing approaches and different tile sizes.

The resulting total run times on the GTX 480 are shown in Fig. 10. The two consolidation approaches lead to quite different size dependence. Tile relaxation results in a rather fast decay of run times per site in the under-utilized regime and is faster than the sewing approach for intermediate system sizes. Eventually, however, the scaling implied by Eqs. (25) and (26) kicks in, which amounts to T^p_identify/L² ∼ L^{d_min²/(2d_min+1)} ≈ L^{0.37} for d_min = 1.08, and results in the upturn seen in Fig. 10. For the hierarchical approach, on the other hand, as implied by Eq. (23) the best performance is reached only rather slowly as L is increased, but T^p_identify/L² ultimately becomes constant as (theoretically) L → ∞.
At L = 8192 and β = β_c for the q = 2 Potts model, SW with sewing performs at 3.18 ns per spin and per sweep on the GTX 480, while relaxation results in 7.15 ns per sweep. For the pure cluster identification problem, i.e., without the bond activation and spin flipping steps, these times are reduced to 2.52 ns and 6.56 ns, respectively. Figure 11 shows the break down of run times into the algorithmic components of bond activation, labeling on tiles, tile consolidation and spin flipping when using hierarchical sewing. Label consolidation is the dominant contribution up to intermediate system sizes, and only for L ≥ 16 384 does its run time drop below that of local labeling on tiles. For smaller systems, the fraction of time spent on bond activation and spin flipping is negligible, while (due to the decrease in time spent for label consolidation) it rises to about 20% for L = 8192. As a reference, Fig. 11 also shows the run time of an optimized, serial CPU implementation using breadth-first search and on-line flipping of spins as the clusters are grown, running on an Intel Core 2 Quad Q6700 at 2.66 GHz.

The incipient percolating clusters for the Potts model simulations at β_c are typical for a critical model. For other applications, for instance in image segmentation, it is interesting to investigate the performance for more general situations. Figure 12 displays the run times for an SW update as a function of inverse temperature β, comparing the setups with relaxation and hierarchical sewing for tile consolidation. There is a natural increase in run time with the concentration p = 1 - exp(-βJ) of bonds. While for the sewing procedure run times increase monotonically with β, for the relaxation approach there is a pronounced peak of run times near β = β_c, where the number of necessary iterations n_relax shoots up, since now information about the incipient percolating cluster needs to be transmitted across the whole system. Run times become somewhat smaller again for β > β_c, as most bonds crossing tile boundaries belong to the same (percolating) cluster, such that, due to concurrency, cluster labels can travel several steps in one iteration. This concurrency effect strongly increases as more tiles are treated simultaneously which, for a fixed number of multiprocessors, is the case for larger tile sizes. Figure 12 shows that the preference of the sewing procedure over relaxation for large systems is robust with respect to variations in temperature and should also be justified for more general structures not resulting from a percolation transition.

FIG. 13. (Color online) Speed-up of the Swendsen-Wang update for the q = 2 critical Potts model on GPU as compared to CPU (Intel Q6700 at 2.66 GHz) as a function of system size L. The circles show results from using the relaxation procedure, while the remaining data sets are for the hierarchical sewing process on the GTX 480, GTX 580 and Tesla M2070 GPUs, respectively. The latter is shown in variants with and without error-correcting code (ECC).

Figure 13 shows the speed-up of the GPU implementation with respect to the CPU code on the Q6700 processor. For large systems, speed-ups in excess of 30 are observed. Comparing different GPU devices, a clear scaling with the number of multiprocessors and global memory bandwidth is observed, with the best performance seen for the GTX 580 (n = 16, 192 GB/s), followed by the GTX 480 (n = 15, 177 GB/s) and the Tesla M2070 (n = 14, 144 GB/s). Naturally, effects of the higher double-precision floating-point performance of the latter are not seen for the present code, which almost exclusively uses integer and a few single-precision floating point arithmetic instructions. The penalty for activating error correction (ECC) on the memory is minute. Some benchmark results, also including different processors, are collected in Tab. I.

III. WOLFF ALGORITHM
TABLE I. Benchmark results for the Swendsen-Wang update of the q = 2, q = 3 and q = 4 Potts models on GPU vs. CPU. Entries marked "…" could not be recovered from the source.

 q  β/β_c  L       method                     CPU                GPU                  speedup
 2  1      512     self-labeling, relaxation  Q6700   61.63 ns   GTX 480   6.533 ns    9
 2  1      512     self-labeling, sewing      Q6700   61.63 ns   GTX 480  12.17 ns     5
 2  1      8192    self-labeling, sewing      Q6700   77.39 ns   GTX 480   3.179 ns   24
 2  1      8192    self-labeling, sewing      Q6700   77.39 ns   GTX 580   2.697 ns   29
 2  1      8192    self-labeling, sewing      i7     105.… ns    GTX 580   2.697 ns   39
 2  1      8192    self-labeling, sewing      Q6700   77.39 ns   M2070     3.934 ns   20
 2  1      8192    self-labeling, sewing      E5620  149.… ns    M2070     3.934 ns   38
 2  1      16 384  self-labeling, sewing      E5620  152.… ns    M2070     3.573 ns   43
 2  1      8192    self-labeling, relaxation  Q6700   77.39 ns   GTX 480   7.154 ns   11
 2  0.…    …       …                          …        ….12 ns   GTX 480   1.863 ns   31
 2  1.…    …       …                          …        …         …         ….164 ns   33
 3  1      8192    self-labeling, sewing      Q6700   70.73 ns   GTX 480   3.059 ns   23
 4  1      8192    self-labeling, sewing      Q6700   65.51 ns   GTX 480   2.887 ns   23

For simulations of spin models, Wolff [5] suggested a variant of the Swendsen-Wang algorithm where only a single cluster, seeded at a randomly chosen site, is grown at a time, which is then always flipped. Empirically, it is found that this leads to somewhat smaller autocorrelation times than SW [48, 49] but, most likely, no change in the dynamical critical exponent (at least for integer q) [50]. Conceptually, one can imagine the single-cluster algorithm as a variant of the SW dynamics where, after a full decomposition of the lattice according to the SW prescription, a site is picked at random and the cluster of spins it belongs to is flipped. Since the probability of picking a specific cluster in this way is proportional to its size, in this approach larger clusters are flipped on average than in the original SW algorithm. This explains the somewhat reduced autocorrelation times.

While this approach is easily coded in a serial program and, in addition to the smaller autocorrelation times, in a suitable implementation performs at even somewhat less effort per spin than the SW algorithm, it is not straightforwardly parallelized [42, 51–53]. The only obvious parallelism lies in the sites at the wave front of the growing cluster, cf. the sketch in Fig. 1. A number of approaches for parallel calculations come to mind:

(a) A full parallel cluster labeling as in SW, followed by picking out and flipping a single cluster. Although many operations are wasteful here, there might still be a speed-up as compared to the serial code. If using a relaxation procedure for label consolidation, this approach can be somewhat improved by modifying the stopping criterion to only focus on the labels belonging to the cluster to be flipped.
887 ns 23 (b) Restriction to wave-front parallelism [51]. Due tothe rather variable number of sites at the front,however, this generically leads to poor load balanc-ing between the processing units. Load balancingcan be improved by a de-localization of the wavefront with a “randomized” rearrangement of thelattice. This can be reached, for instance, with ascattered strip partitioning, where strips of the lat-tice are assigned to available processing units in around-robin fashion, leading to a more even distri-bution of sites at the wave front to different pro-cessors [52].(c) Suitable modifications of the single-cluster algo-rithm to make it more appropriate for parallel com-putation.The approach (a) can be easily realized with the tech-niques outlined in Sec. II. As discussed in Ref. [52], addi-tional load balancing can result in significant improve-ments on MIMD (multiple instruction, multiple data)machines. It appears less suitable for the mixed architec-ture of GPUs. In contrast to the more general case of SWdynamics discussed above in Sec. II, I refrain here from acomprehensive evaluation of options, and only give somepreliminary results for a modification (c) of the Wolffalgorithm appearing suitable for GPU computing.In this approach, the lattice is again decomposed intotiles of edge length B . A single cluster per tile is thengrown using a number of threads per tile to operate onthe wave front. Unlike the case of the SW implementa-tion, the clusters are not confined to the tiles, but cangrow to span the whole lattice. One can easily convinceoneself, that the underlying decomposition remains to bethe SW one. If seeds in different tiles turn out to belongto the same cluster, different parts of that cluster receivedifferent labels, but since all clusters are flipped the ef-fect is the same as if a single cluster (for that two seeds)had been grown (this is for the case of the q = 2 model).Logically, this algorithm is identical to performing thefull SW decomposition and then selecting (cid:96) points onthe lattice, followed by flipping all distinct clusters thesepoints belong to. While this approach satisfies detailed balance (the SW decomposition remains the same andthe cluster flipping probability is symmetric), it is notergodic as it stands since, for instance, it becomes im-possible to flip only a single spin. This deficiency can beeasily repaired, however, by assigning a flipping probabil-ity p flip < p per block is chosen. If there are enough pend-ing sites in the queue, each thread is assigned one ofthese spins which are then examined in parallel. Thequeue is here realized as a simple linear array of size N = L . This appears inefficient as the size of the wavefront will at most be of the order of L d H , where d H is thefractal dimension of the cluster boundary. In contrastto the use of a ring buffer of length ∝ L d H , however,storing in and retrieving from the queue can be realizedwith atomic operations only [30], i.e., without the use oflocks. Unfortunately, this setup severely limits the rangeof realizable tile sizes for larger systems as memory re-quirements for this queue scale as (cid:96) N = L B − . Incontrast to the SW algorithm, bond activation and spinflipping can be done online with the labeling pass. Con-sequently, the “few-cluster” implementation needs onlytwo kernels, cluster tile() for the labeling and flip-ping and reset inclus() for resetting the cluster labelsafter each pass. The number p of threads per block isadapted to maximize occupancy of the device. 
In general, it is found that good results are obtained on setting

p = min[1024, 1536/min(max(ℓ/n, 1), 8)],   (27)

as 1024 is the maximum number of threads per block, 1536 the maximum number of active threads, and 8 the maximum number of resident blocks per multiprocessor. (Here, n denotes the number of multiprocessors of the device.)
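Read as a recipe, Eq. (27) simply fills each multiprocessor's active-thread budget given the number of resident blocks. A host-side helper along the following lines could set the launch configuration; the function name is illustrative, and the limits are the device-generation values quoted above.

/* Host-side evaluation of Eq. (27): choose the number p of threads per
 * block for l tiles (= thread blocks) on a device with n multiprocessors. */
static int threads_per_block(int l, int n)
{
    int blocks_per_mp = l / n;                 /* blocks resident per multiprocessor */
    if (blocks_per_mp < 1) blocks_per_mp = 1;  /* at least one block */
    if (blocks_per_mp > 8) blocks_per_mp = 8;  /* at most 8 resident blocks */

    int p = 1536 / blocks_per_mp;              /* fill the active-thread budget */
    return p < 1024 ? p : 1024;                /* hard cap per block */
}

In practice, n could be queried via cudaGetDeviceProperties() (field multiProcessorCount), and one would presumably also round p to a multiple of the warp size.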
The resulting speed-ups as compared to a serial code on the Intel Core 2 Quad Q6700 are shown in Fig. 14.

FIG. 14. (Color online) Speed-up of the "few-cluster" update described in Sec. III implemented on GPU as compared to a single-cluster update on CPU, for system sizes L = 32 to 4096 on the GTX 480 and the Tesla M2070. For each system size, the optimal tile size B has been selected from the range of allowable tile sizes determined by memory constraints.

The performance for large system sizes is limited by the memory consumption of the queues, limiting the number ℓ of tiles. Speed-ups by a factor of up to about five are achieved, significantly lower than for the SW dynamics. It is expected that further optimizations (such as the use of ring buffers instead of the simple linear queues) could approximately double this speed-up. Nevertheless, for cluster-update simulations on GPUs it might be more efficient to stick with the SW algorithm.

CONCLUSIONS
Cluster identification is a pivotal task in scientific computing, with applications in the simulation of spin models and percolation, in image processing, and in network analysis. While the underlying problem is inherently non-local in nature, the choice of appropriate algorithms for implementations on GPU allows for significant performance gains as compared to serial codes on CPU. The overall speed-up is seen to be lowest for spin models at criticality, where clusters are fractal and span the system. In all cases, however, speed-ups of up to about 30 can be achieved on current GPU devices. This is to be contrasted with the case of purely local algorithms, such as Metropolis simulations of spin models, where speed-ups are seen to be larger by a factor of three to five [18, 20, 21]. Even with this caveat, it seems clear that GPU computing is not limited to the case of purely local problems, as significant performance gains can be achieved for highly non-local problems also. Generalizations within the realm of spin-model simulations, such as variants on different lattices or embedded clusters for O(n) spin models [5], are straightforward.

While the considerations presented here have been restricted to calculations on a single GPU, it should be clear that the approach outlined for the Swendsen-Wang dynamics or the pure cluster identification problem is easily parallelized across several GPUs. For the case of spin-model simulations, the combination of self-labeling and label relaxation appears better suited for this task, since for the final spin-flipping step only information local to each GPU is required, whereas for the hierarchical scheme cluster roots (and therefore spin-flipping states) are scattered throughout the whole system. The most effective setup for simulating large systems therefore appears to be the combination of self-labeling and hierarchical sewing inside each GPU and label relaxation between GPUs, which can easily be realized using MPI with rather low communication overheads.
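As a rough illustration of the last point, the label relaxation between GPUs could be organized as in the following host-side skeleton for a lattice split into horizontal strips of rows x L sites per MPI rank. All names are hypothetical: relax_strip_on_gpu() stands in for the on-GPU relaxation pass, and bond_up/bond_down are assumed precomputed flags marking which bonds across the strip boundaries are active; only the MPI exchange pattern itself is the point of the sketch.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical stand-in for the on-GPU pass that propagates newly lowered
 * boundary labels through the local strip. */
int relax_strip_on_gpu(int *labels, int rows, int L);

/* Reconcile cluster labels between neighboring strips; labels are assumed
 * globally unique initially (e.g., global site indices), so that taking
 * minima across boundaries is well defined. */
void relax_labels_across_gpus(int *labels, int rows, int L,
                              const unsigned char *bond_up,
                              const unsigned char *bond_down, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    int *halo_up   = (int *) malloc(L * sizeof(int));
    int *halo_down = (int *) malloc(L * sizeof(int));

    for (;;) {
        /* Send own top row up and receive the lower neighbor's top row;
         * then the same for the bottom row in the opposite direction. */
        MPI_Sendrecv(labels, L, MPI_INT, up, 0,
                     halo_down, L, MPI_INT, down, 0, comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(labels + (long) (rows - 1) * L, L, MPI_INT, down, 1,
                     halo_up, L, MPI_INT, up, 1, comm, MPI_STATUS_IGNORE);

        /* Adopt the smaller label across every active boundary bond. */
        int changed = 0;
        for (int x = 0; x < L; ++x) {
            if (up != MPI_PROC_NULL && bond_up[x] && halo_up[x] < labels[x]) {
                labels[x] = halo_up[x];
                changed = 1;
            }
            long b = (long) (rows - 1) * L + x;
            if (down != MPI_PROC_NULL && bond_down[x]
                && halo_down[x] < labels[b]) {
                labels[b] = halo_down[x];
                changed = 1;
            }
        }
        if (changed)
            relax_strip_on_gpu(labels, rows, L);

        /* Iterate until no rank lowered any label in this round. */
        int any;
        MPI_Allreduce(&changed, &any, 1, MPI_INT, MPI_LOR, comm);
        if (!any)
            break;
    }
    free(halo_up);
    free(halo_down);
}

Since labels only ever decrease and are bounded from below, the loop terminates, and only boundary rows cross the network in each round, which is the source of the low communication overhead mentioned above.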
ACKNOWLEDGMENTS

Support by the DFG through the Emmy Noether Programme under contract No. WE4425/1-1 and by the Schwerpunkt für rechnergestützte Forschungsmethoden in den Naturwissenschaften (SRFN Mainz) is gratefully acknowledged.

[1] R. J. Baxter, Exactly Solved Models in Statistical Mechanics (Academic Press, London, 1982).
[2] K. Binder and D. P. Landau, A Guide to Monte Carlo Simulations in Statistical Physics, 3rd ed. (Cambridge University Press, Cambridge, 2009).
[3] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, J. Chem. Phys. 21, 1087 (1953).
[4] R. H. Swendsen and J. S. Wang, Phys. Rev. Lett. 58, 86 (1987).
[5] U. Wolff, Phys. Rev. Lett. 62, 361 (1989).
[6] D. Kandel and E. Domany, Phys. Rev. B 43, 8539 (1991).
[7] N. Kawashima and J. E. Gubernatis, Phys. Rev. E 51, 1547 (1995).
[8] R. Opara and F. Wörgötter, Neural Comput. 10, 1547 (1998).
[9] X. Wang and J. Zhao, International Conference on Natural Computation, 352 (2008).
[10] D. Stauffer and A. Aharony, Introduction to Percolation Theory, 2nd ed. (Taylor & Francis, London, 1994).
[11] J. Hoshen and R. Kopelman, Phys. Rev. B 14, 3438 (1976).
[12] F. Belletti, M. Cotallo, A. Cruz, L. A. Fernández, A. G. Guerrero, M. Guidetti, A. Maiorano, F. Mantovani, E. Marinari, V. Martín-Mayor, A. Muñoz Sudupe, D. Navarro, G. Parisi, S. P. Gaviro, M. Rossi, J. J. Ruiz-Lorenzo, S. F. Schifano, D. Sciretti, A. Tarancón, and R. L. Tripiccione, Comput. Sci. Eng. 11, 48 (2009).
[13] H. W. J. Blöte, L. N. Shchur, and A. L. Talapov, Int. J. Mod. Phys. C 10, 1137 (1999).
[14] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, Proceedings of the IEEE 96, 879 (2008).
[15] D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors (Elsevier, Amsterdam, 2010).
[16] T. Preis, P. Virnau, W. Paul, and J. J. Schneider, J. Comput. Phys. 228, 4468 (2009).
[17] M. Bernaschi, G. Parisi, and L. Parisi, Comput. Phys. Commun. 182, 1265 (2011).
[18] M. Weigel, "Performance potential for simulating spin models on GPU," preprint arXiv:1101.1427 (2011).
[19] E. E. Ferrero, J. P. De Francesco, N. Wolovick, and S. A. Cannas, "q-state Potts model metastability study using optimized GPU-based Monte Carlo algorithms," preprint arXiv:1101.0876 (2011).
[20] M. Weigel, Comput. Phys. Commun. 182, 1833 (2011).
[21] M. Weigel and T. Yavors'kii, Physics Procedia (2011), in print.
[22] D. Heermann and A. N. Burkitt, Parallel Comput., 345 (1990).
[23] C. F. Baillie and P. D. Coddington, Concurrency: Pract. Exper. 3, 129 (1991).
[24] H. Mino, Comput. Phys. Commun., 25 (1991).
[25] J. Apostolakis, P. Coddington, and E. Marinari, Europhys. Lett., 189 (1992).
[26] G. T. Barkema and T. MacFarland, Phys. Rev. E 50, 1623 (1994).
[27] M. Bauernfeind, R. Hackl, H. G. Matuttis, J. Singer, T. Husslein, and I. Morgenstern, Physica A, 277 (1994).
[28] M. Flanigan and P. Tamayo, Physica A, 461 (1995).
[29] J. Martín-Herrero, J. Phys. A 37, 9377 (2004).
[30] "CUDA zone — Resource for C developers of applications that solve computing problems."
[31] F. Y. Wu, Rev. Mod. Phys. 54, 235 (1982).
[32] C. M. Fortuin and P. W. Kasteleyn, Physica 57, 536 (1972).
[33] C. K. Hu, Phys. Rev. B 29, 5103 (1984).
[34] A. Coniglio and W. Klein, J. Phys. A 13, 2775 (1980).
[35] X. J. Li and A. D. Sokal, Phys. Rev. Lett. 63, 827 (1989).
[36] Y. Deng, T. M. Garoni, J. Machta, G. Ossola, M. Polin, and A. D. Sokal, Phys. Rev. Lett. 99, 055701 (2007).
[37] I do not take the effects of latency hiding and other scheduling specificities into account in the scaling formulae, but discuss them in some places in connection with observed deviations from these simplified laws. It is also assumed that the number of threads per block is at least 4, since due to the limitation to eight active blocks per multiprocessor on current NVIDIA GPUs there would otherwise be idle cores.
[38] M. Matsumoto and T. Nishimura, ACM Trans. Model. Comput. Simul. 8, 3 (1998).
[39] J. E. Gentle, Random Number Generation and Monte Carlo Methods, 2nd ed. (Springer, Berlin, 2003).
[40] A. E. Ferdinand and M. E. Fisher, Phys. Rev. 185, 832 (1969).
[41] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed. (MIT Press, 2009), ISBN 9780262533058.
[42] D. A. Bader and K. Madduri, in Proc. 35th International Conference on Parallel Processing (ICPP) (Columbus, Ohio, 2006), pp. 523–530.
[43] R. E. Tarjan, J. ACM 22, 215 (1975).
[44] M. E. J. Newman and R. M. Ziff, Phys. Rev. E 64, 016706 (2001).
[45] H. J. Herrmann and H. E. Stanley, J. Phys. A 21, L829 (1988).
[46] E. Miranda, Physica A, 229 (1991).
[47] Note that this still leads to some under-utilization of the device due to the limit of eight active blocks per multiprocessor, which requires at least four threads per block for full occupancy with blocks.
[48] U. Wolff, Phys. Lett. B 228, 379 (1989).
[49] C. F. Baillie and P. D. Coddington, Phys. Rev. B 43, 10617 (1991).
[50] Y. Deng, X. Qian, and H. W. J. Blöte, Phys. Rev. E 80, 036707 (2009).
[51] H. G. Evertz, J. Stat. Phys. 70, 1075 (1993).
[52] S. Bae, S. H. Ko, and P. D. Coddington, Int. J. Mod. Phys. C 6, 197 (1995).
[53] J. Kaupužs, J. Rimšāns, and R. V. N. Melnik, Phys. Rev. E 81, 026701 (2010).