Generating SU(Nc) pure gauge lattice QCD configurations on GPUs with CUDA
Nuno Cardoso*, Pedro Bicudo
CFTP, Departamento de Física, Instituto Superior Técnico, Universidade Técnica de Lisboa, Av. Rovisco Pais, 1049-001 Lisbon, Portugal
Abstract
The starting point of any lattice QCD computation is the generation of a Markov chain of gauge field configurations. Due to the large number of lattice links and due to the matrix multiplications, generating SU(Nc) lattice QCD configurations is a highly demanding computational task, requiring advanced parallel computer architectures such as clusters of several Central Processing Units (CPUs) or Graphics Processing Units (GPUs). In this paper we present and explore the performance of CUDA codes for NVIDIA GPUs to generate SU(Nc) lattice QCD pure gauge configurations. Our implementation uses CUDA on one GPU and OpenMP together with CUDA on multiple GPUs. We present optimized CUDA codes for SU(2), SU(3) and SU(4). We also show a generic SU(Nc) code for Nc ≥ 4 and compare it with the optimized version of SU(4). Our codes are publicly available for free use by the lattice QCD community.

Keywords:
CUDA, GPU, Fermi, Lattice Gauge Theory, SU(2), SU(3), SU(4), SU(Nc)
1. Introduction
Generating SU(Nc) lattice configurations is a highly demanding computational task and requires advanced computer architectures such as clusters of several Central Processing Units (CPUs) or Graphics Processing Units (GPUs). Compared with CPU clusters, GPUs are easier to access and maintain, as they can run on a local desktop computer.

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing architecture, which enables dramatic increases in computing performance using GPUs. Since 2007, the year when NVIDIA released CUDA for GPU computing as a language extension to C, CUDA has become a standard tool in the scientific community. The CUDA architecture also supports standard languages, such as C and Fortran, and APIs for GPU computing, such as OpenCL and DirectCompute. With GPUs, many scientific problems can now be addressed without the need for a cluster of CPUs, or by using clusters of GPUs in which the computation time can be reduced significantly, for example [1, 2, 3, 4, 5, 6, 7, 8].

The most successful theories of elementary particle physics are the gauge theories. They are renormalizable [9] and have Lie groups as internal gauge symmetries. The special unitary groups SU(Nc) are cornerstones of the Standard Model of particle physics, especially SU(2) together with U(1) in the electroweak interaction and SU(3) in quantum chromodynamics (QCD), the theory of the strong interaction. Moreover, since QCD is not yet fully understood, it is relevant to study the effect of changing the number of colors, Nc, by studying other SU(Nc) groups. In particular, since the seminal works of 't Hooft [10], Witten [11] and Creutz [12, 13, 14], the large Nc limit of QCD has been explored in great detail [15, 16, 17, 18, 19, 20, 21].

* Corresponding author
Email addresses: [email protected] (Nuno Cardoso), [email protected] (Pedro Bicudo)

However, the only existing approach to study SU(Nc) gauge theories, namely in strong interactions and beyond, in a non-perturbative way is lattice field theory. Based on the path integral formalism and on statistical mechanics methods, the observables are evaluated numerically using Monte Carlo simulations. The starting point of any lattice QCD computation is the generation of a Markov chain of gauge field configurations. The configuration generation, due to the large number of lattice links and to the matrix multiplications, is computationally expensive. Thus, in the lattice community many groups [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34] are already using GPUs to generate lattice QCD configurations. They are specialized in SU(3) and in using the GPUs for the full Lagrangian description, i.e., gluons together with dynamical quarks.

Here we describe and study the performance of our configuration generation codes in pure gauge lattice QCD. Pure lattice gauge theory does not include the full Lagrangian description, i.e., the quarks are fixed, and therefore this approach is also denominated the quenched approximation. In the quenched lattice approach, the computational power required to generate pure gauge configurations, although quite intensive, is one or two orders of magnitude less demanding than the full lattice approach. Although quenched QCD is a simplification of QCD, there are still many problems that remain to be solved in that approach. In particular, it is computationally feasible to study quenched QCD with a larger Nc.

Recently [7], we addressed the GPU computational power necessary to generate pure gauge SU(2) configurations.
We showed that a server with a single CPU commanding a few GPUs is quite efficient to generate gauge SU(2) configurations with codes including the Open Multi-Processing (OpenMP) and CUDA libraries. Here we extend our previous code for SU(2) to SU(3), SU(4) and to a generic SU(Nc) code.

This paper is divided into 5 sections. In section 2, we present a brief description of how to generate lattice SU(Nc) configurations with Nc ≥ 2, and in section 3 we show the implementation on one GPU using CUDA and on multiple GPUs using OpenMP and CUDA. In section 4, we present the GPU performance in single and double precision using one, two and three GPUs for different lattice volumes. Finally, in section 5, we conclude.
2. Lattice Gauge Theory
Gauge theories can be addressed by lattice field theory in a non-perturbative approximation scheme, based on the path integral formalism, in which space-time is discretized on a 4-dimensional hypercubic lattice. On the lattice, the fields representing the fermions are defined on lattice sites, while the gauge fields are represented by link variables U_μ(x) connecting neighboring sites. In this work, we employ the pure gauge quenched approximation.

In quenched lattice QCD, quantities in the form of a path integral are transformed to Euclidean space-time and are evaluated numerically using Monte Carlo simulations, allowing us to use statistical mechanics methods. The partition function in Euclidean space-time, in the quenched approximation, is given by

    Z = ∫ D[U] exp(−S_G[U]),    (1)

where the integration measure for the link variables is the product measure

    ∫ D[U] = ∏_s ∏_{μ=0..3} dU_μ(s),    (2)

and S_G[U] is the gauge field action. Here we use the Wilson gauge action, defined by

    S_G[U] = (β/Nc) ∑_s ∑_{μ<ν} Re Tr [1 − P_{μν}(s)],    (3)

where β = 2Nc/g² is the inverse of the gauge coupling and P_{μν}(s) is the plaquette, Fig. 1, defined as

    P_{μν}(s) = U_μ(s) U_ν(s + μ̂) U†_μ(s + ν̂) U†_ν(s).    (4)

Figure 1: Plaquette P_{μν}(s).

Physical observables are obtained by calculating the expectation value,

    ⟨O⟩ = (1/Z) ∫ D[U] exp(−S_G[U]) O[U],    (5)

where O is given as a combination of operators expressed in terms of time-ordered products of fields.

We set up our pure gauge lattice on a 4-dimensional hypercubic lattice with spacing a, time-like extent T = N_t and spatial volume N_σ = N_x × N_y × N_z, with periodic boundary conditions.
The lattice QCD Monte Carlo simulations approximate the integral by an average of the observable evaluated on N sample gauge field configurations with probability distribution ∝ exp(−S_G[U]). The sequence of configurations generated by Monte Carlo algorithms produces a Markov chain. Each Monte Carlo step consists in visiting and updating all gauge links in the lattice. There are different algorithms to update a gauge link, such as Metropolis and heatbath. We will use the heatbath method, since it is more efficient than the Metropolis algorithm.

The gauge field variables of an SU(Nc) gauge group are represented on the lattice links by complex Nc × Nc matrices. Using the unitarity of the group elements, we may use a minimal set of parameters equal to the number of generators of the group. However, in practical calculations, it is more convenient to use a redundant parameterization of the gauge group. For example, the SU(2) group can be represented by four real numbers instead of 2 × 2 complex matrices, since a 2 × 2 matrix parameterized as

    U = a₀ 1 + i a · σ    (6)

with

    a₀² + a · a = 1    (7)

is an SU(2) matrix, and vice versa,

    Tr U = 2a₀,   U U† = U† U = 1,   det U = 1.    (8)

Generating SU(Nc) lattice configurations is a highly demanding computational task. Most of the computational time is spent in updating the gauge links. We start by describing the link update method for the SU(2) group, since this is the basis for the groups with Nc ≥ 3. In order to update a particular link in SU(2) [35, 36], we need only to consider the contribution to the action from the six plaquettes containing that link, the staple Σ,

    Σ = ∑_{μ≠ν} ( U_{x,ν} U_{x+ν̂,μ} U†_{x+μ̂,ν} + U†_{x−ν̂,ν} U_{x−ν̂,μ} U_{x−ν̂+μ̂,ν} ).    (9)

The distribution to be generated for every single link is given by

    dP(U) ∝ exp[ (β/2) Tr(U Σ) ].    (10)

Applying a useful property of SU(2) elements, that any sum of them is proportional to another SU(2) element Ũ,

    Ũ = Σ / √(det Σ) = Σ / k,    (11)
and using the invariance of the group measure, we obtain

    dP(U Ũ⁻¹) ∝ exp[ (βk/2) Tr U ] dU = exp(βk a₀) (1/2π²) δ(a² − 1) d⁴a.    (12)

Thus, we need to generate a₀ ∈ [−1, 1] with distribution

    P(a₀) ∝ √(1 − a₀²) exp(βk a₀),    (13)

and the components of a are generated randomly on the 3D unit sphere in a 4-dimensional space, with exponential weighting along the a₀ direction. Once a₀ and a are obtained in this way, the new link is updated,

    U′ = U Ũ⁻¹.    (14)

Although for SU(Nc) with Nc ≥ 3 there is no heatbath algorithm which directly produces SU(Nc) link variables, we can apply a pseudo-heatbath method, also known as the Cabibbo-Marinari algorithm [37], to the SU(2) subgroups of SU(Nc). The procedure to update a link for Nc ≥ 3 is:

1. calculate the staple, Σ;
2. calculate U Σ†;
3. select a set of SU(2) subgroups of SU(Nc) from the previous result, such that there is no subset of SU(Nc) left invariant under left multiplication, except the whole group;
4. although the minimal set may involve only Nc − 1 subgroups, here we decide to use the complete set, i.e., Nc(Nc − 1)/2 subgroups;
5. the update of a given link is done in k steps, k = 1, ..., Nc(Nc − 1)/2. In each step a member A_k ∈ SU(2)_k is generated; the current link at that step is then obtained by multiplying the link obtained in the previous step by A_k,

    U_k = A_k U_{k−1}.    (15)

For example, although in Nc = 3 two subgroups would be sufficient to cover the whole group space, for symmetry reasons we will use all subgroups. In Nc = 3, the three SU(2) subgroups of SU(3), S_ij with 1 ≤ i < j ≤ 3, are constructed as
    S12 = ( a11 a12  0  )    S13 = ( a11  0  a12 )    S23 = ( 1   0   0  )
          ( a21 a22  0  )          (  0   1   0  )          ( 0  a11 a12 )
          (  0   0   1  )          ( a21  0  a22 )          ( 0  a21 a22 )    (16)

with each a_ij block an SU(2) matrix. Each of the Nc(Nc − 1)/2 subgroups is determined using the heatbath for SU(2) as already discussed.

In order to accelerate the decorrelation of subsequent lattice configurations, we can employ the over-relaxation algorithm, which in SU(2) is defined by

    U_new = (Σ†/|Σ|) U†_old (Σ†/|Σ|),    (17)

with |Σ| = √(det Σ). Although for the SU(2) group this is straightforward, for SU(Nc) with Nc ≥ 3 this is not the case. However, the method is similar to the pseudo-heatbath method and we can use the above equation for each of the SU(2) subgroups of SU(Nc).

Because of the accumulation of rounding errors in the multiplications of the group elements, the matrices have to be regularly projected back to unitarity. This step in the algorithm is called re-unitarization. Re-unitarization for Nc = 2 is done by normalizing the first row and then reconstructing the second row from the first. For Nc ≥ 3, this is done using the well-known Gram-Schmidt method for building an orthonormal basis in vector spaces. For Nc > 3, after the Gram-Schmidt method, we need to multiply the last row by a phase to make the determinant equal to one.

3. Implementation

In this section, we discuss the parallelization scheme for generating pure gauge SU(Nc) lattice configurations on GPUs using CUDA, with optimized codes for Nc = 2, 3, 4 and a generic code for Nc ≥ 4. We implement and test our codes for generating pure gauge lattice configurations in CUDA version 3.2. For our 4-dimensional lattice, we address one thread per lattice site. Version 3.2 only supports thread blocks up to 3D and grids up to 2D, and the lattice needs four indexes.
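To make the four-index addressing concrete, here is a minimal host-side C++ sketch (the actual code is CUDA; the extents here are illustrative, the real code keeps them in GPU constant memory) of how a flat 1D site index maps to the four lattice coordinates, assuming x runs fastest:

```cpp
// Illustrative lattice extents (assumptions for this sketch).
const int Nx = 8, Ny = 8, Nz = 8, Nt = 16;

// Flatten a site (x, y, z, t) into the 1D index of a linear lattice array.
int siteIndex(int x, int y, int z, int t) {
    return x + Nx * (y + Ny * (z + Nz * t));
}

// Reconstruct (x, y, z, t) "on the fly" from a 1D thread index,
// as needed when 1D thread blocks are used.
void siteCoords(int idx, int &x, int &y, int &z, int &t) {
    x = idx % Nx; idx /= Nx;
    y = idx % Ny; idx /= Ny;
    z = idx % Nz; idx /= Nz;
    t = idx;
}
```

The same arithmetic works for any ordering of the four directions; only the neighbor-offset computations change.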
Therefore, we compare 1D thread blocks, where we reconstruct all the indexes on the fly, with 3D thread blocks, one dimension for t, one for z and one for both x and y, reconstructing the remaining index inside the kernel.

Since the grid can have only 65535 thread blocks per dimension, for a large lattice volume this number is insufficient, i.e., using one thread per site, we can only have (lattice volume)/(number of threads per block) ≤ 65535. We put most of the constants needed by the GPU, like the number of points in the lattice, in the GPU constant memory using cudaMemcpyToSymbol. For the lattice array, we cannot use a 4-dimensional array in CUDA. Therefore, we use a 1D array with size N_x × N_y × N_z × N_t × Dim, with Dim = 4.

For Nc = 2, we only need four floating point numbers per link instead of a 2 × 2 complex matrix, therefore we use an array of structures. Each array position is a float4/double4 structure to store the generators of SU(2) (a0, a1, a2 and a3). When accessing the global memory, 128-bit words give fully coalesced memory transactions. Although in single precision we can use a float4 (128-bit word) array to store all the generators of SU(2), this is not the case in double precision. Using a double4 format does not give fully coalesced memory transactions since it is a 256-bit word, whereas the double2 format is a 128-bit word and gives fully coalesced memory transactions. Therefore, we also implemented an array of double2: first we store the first two generators, a0 and a1, for the whole lattice, and then the last two generators, a2 and a3.

For Nc = 3 and 4, we tested an array of structures (AOS) and a structure of arrays (SOA). Since each element of SU(3) and SU(4) is a complex number, we use the float2/double2 CUDA vector types. Therefore, in an AOS each array index is a structure with Nc × Nc float2/double2 elements. The SOA structure is composed of Nc × Nc arrays of type float2/double2. To select single or double precision, we have implemented templates in the code.

In the SU(3) case, we have also implemented an SOA with 12 arrays of type float2/double2. As discussed in section 2, we can use a minimal set of parameters equal to the number of generators of the group. In SU(3), the minimum is 8 parameters; however, this is not numerically stable [23, 28]. But if we use 12 parameters instead of 8, storing only the first two rows and reconstructing the third row on the fly, the truncation errors are smaller and we can reduce memory traffic.

We now detail the structures used for the SU(Nc) configuration codes with different Nc.
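The 12-real-parameter idea above can be sketched in a few lines of host-side C++ (illustrative code, not taken from the paper's kernels): the missing third row of a special unitary 3 × 3 matrix is the complex conjugate of the cross product of the first two rows, which restores unitarity and det = 1.

```cpp
#include <array>
#include <complex>

using cd  = std::complex<double>;
using Row = std::array<cd, 3>;

// Rebuild the third row of an SU(3) link from the stored first two rows:
// row3 = (row1 x row2)^*.
Row thirdRow(const Row &u, const Row &v) {
    return { std::conj(u[1] * v[2] - u[2] * v[1]),
             std::conj(u[2] * v[0] - u[0] * v[2]),
             std::conj(u[0] * v[1] - u[1] * v[0]) };
}
```

On the GPU this trades 6 real loads per link for a handful of multiplications, which is usually a good exchange given that the kernels are memory-bandwidth bound.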
Since the Nc ≥ 4 implementation is similar to that of the SU(3) code, we show the SU(3) implementation in more detail. The difference between the structures of these codes is the number of elements, namely 18/12 real numbers for the SU(3) implementation with the full and the 12 real number matrix parameterization, and 32 real numbers for the SU(4) implementation.

For generating pure gauge lattice configurations, we implemented six kernels. Notice that in the heatbath and over-relaxation methods, since we need to calculate the change of the action for a Monte Carlo step by addressing the nearest neighbor links in the four space-time directions, we employ the chessboard method, calculating the new links separately by direction and by even/odd sites.

• Initialization of the array random number generator. We use the CURAND library of NVIDIA [38].

• Initialization of the lattice array. The initialization can be done with a random SU(Nc) matrix (hot start) or with the identity matrix (cold start).

• Link update by the heatbath algorithm. Note that for Nc ≥ 3 this method is called the pseudo-heatbath algorithm, as discussed in section 2. This kernel must be called eight times, since the update must be done by link direction and by even and odd sites separately, because we need to calculate the staple at each link direction.

Kernel                        PHB and OVR           REU          PLAQ
                              (per thread, link)    (per site)   (per site)
SU(2), 4 reals                76                    none         96
  Single/Double (bytes)       304/608               none         384/768
SU(3), 18 reals               342                   72           432
  Single/Double (bytes)       1368/2736             288/576      1728/3456
SU(3), 12 reals               228                   48           288
  Single/Double (bytes)       912/1824              192/384      1152/2304
SU(4), 32 reals               608                   128          768
  Single/Double (bytes)       2432/4864             512/1024     3072/6144

Table 1: Kernel memory loads per thread for the pseudo-heatbath (PHB), over-relaxation (OVR), re-unitarization (REU) and plaquette (PLAQ) kernels. For each row, the first line gives the number of real numbers loaded and the second line the corresponding bytes in single/double precision.
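The entries of Table 1 follow from simple counting, sketched below (the link counts are inferred from the text: the staple is built from 18 links plus the link being updated, so the (pseudo-)heatbath and over-relaxation kernels read 19 links per thread; the plaquette kernel reads 6 plaquettes × 4 links = 24 links per site; re-unitarization reads the 4 links of a site):

```cpp
// Number of real values loaded: links read times reals stored per link.
int realsLoaded(int links, int realsPerLink) { return links * realsPerLink; }

// Bytes loaded: 4 bytes per real in single precision, 8 in double.
int bytesLoaded(int links, int realsPerLink, int bytesPerReal) {
    return links * realsPerLink * bytesPerReal;
}
```

For example, for the SU(3) 18-real pseudo-heatbath kernel, 19 × 18 = 342 reals, i.e., 1368 bytes in single and 2736 bytes in double precision, matching Table 1.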
• Lattice over-relaxation. This kernel has to be implemented in the same way as the heatbath link update kernel, by link direction and even/odd sites separately.

• Lattice re-unitarization, with the standard Gram-Schmidt technique, implemented only for SU(Nc) with Nc ≥ 3.

• Plaquette at each lattice site. The sum over all the lattice sites is done with the parallel reduction code in the NVIDIA GPU Computing SDK package [39], which is already fully optimized.

In Table 1, we summarize the memory loads per thread by kernel. Lattice SU(Nc) is very memory traffic consuming: in the pseudo-heatbath and over-relaxation methods, to update a single link it is necessary to copy from the lattice array in memory the 18 links which make up the staple, plus the link to be updated.

We now address the multi-GPU approach using CUDA and OpenMP. In order to use and control several GPUs on the same system, we need one CPU thread per GPU. OpenMP allows us to do this when several GPUs are on the same system. Therefore, we split the lattice along the temporal direction among the GPUs, see Fig. 2a. The total lattice size in each GPU is now N_σ × 4 × (N_t/(num. gpus) + 2), with the factor four for the link directions and the two extra time slices for the neighboring sites (ghost cells), see Fig. 2b. After each kernel, the border cells in each GPU need to be exchanged between GPUs. For this reason, the links at the borders of each sublattice have to be transferred from one GPU to the GPU handling the adjacent sublattice.

Figure 2: Schematic view of the lattice array handled by each GPU.
In order to exchange the border cells between GPUs, it is necessary to copy these cells to CPU memory and then synchronize each CPU thread with the command omp barrier before updating the GPU memory (ghost cells).

Since memory transfers between CPU and GPU are very slow compared with other GPU memory accesses, and in order to maximize the GPU performance, we should only use this feature when strictly necessary. Hence, we only use CPU/GPU memory transfers in three cases: at the end of the plaquette kernel, to perform the sum over all lattice sites (copying the final result to CPU memory); when using multiple GPUs (to exchange the border cells between GPUs); and when storing independent pure gauge lattice configurations to files.
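The ghost-cell exchange can be sketched as follows in host-side C++ (an illustration only: here plain vectors stand in for the per-GPU device buffers, whereas the real code moves the slices through CPU memory with cudaMemcpy and synchronizes the CPU threads with an omp barrier between the two phases):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One GPU's time slab: [lower ghost slice | local time slices | upper ghost slice].
struct Slab {
    std::vector<double> data;
};

// Exchange border slices between neighboring slabs, periodic in time.
void exchangeGhosts(std::vector<Slab> &slabs, std::size_t slice) {
    std::size_t n = slabs.size();
    for (std::size_t i = 0; i < n; ++i) {
        Slab &a = slabs[i];
        Slab &b = slabs[(i + 1) % n];
        // b's lower ghost slice <- a's last interior slice
        std::copy(a.data.end() - 2 * slice, a.data.end() - slice, b.data.begin());
        // a's upper ghost slice <- b's first interior slice
        std::copy(b.data.begin() + slice, b.data.begin() + 2 * slice,
                  a.data.end() - slice);
    }
}
```

Only the ghost regions are written and only the interiors are read, so the loop order does not matter and the exchange is race-free.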
4. Results
Here we present the benchmark results, using two different GPU architectures (GT200 and Fermi), Table 2, in generating pure gauge lattice configurations. We also compare the performance with two and three Fermi GPUs working in parallel on the same motherboard, using CUDA and OpenMP.

We compare the performance using the texture memory (tex) and using the L1 and L2 cache memories (cache). Since the GT200 architecture does not have L1 and L2 caches, the cache label in the figures corresponds, for this architecture, to accessing the global memory directly. In the figures we use the notation "tex" when using the texture memory and "cache" when not using the texture memory.

To test the performance of each kernel implemented for the SU(2), SU(3), SU(4) and generic SU(Nc) codes, we use the CUDA Profiler to measure the time spent in each kernel. For the SU(2) code, we perform 300 iterations, where each iteration consists of one heatbath step and one over-relaxation step. At the end of each step the plaquette is calculated. For the SU(3), SU(4) and generic SU(Nc) codes, the procedure is the same; however, after the pseudo-heatbath and over-relaxation steps, a matrix re-unitarization is done. For the multi-GPU part, we measure the total time to make a cold start of the system (all the links are initialized with the identity matrix) and to perform 300/100/100 iterations of the SU(2)/SU(3)/SU(4) codes with one (pseudo-)heatbath and over-relaxation step, followed by link re-unitarization; at the end of each iteration the plaquette is calculated. Note that we do not take into account the time for the initialization of the CURAND random number generator.

In Table 1, we show the memory loads for each kernel (heatbath/pseudo-heatbath, over-relaxation, plaquette and re-unitarization) in the SU(2), SU(3) and SU(4) codes. Our code can be downloaded from the Portuguese Lattice QCD collaboration homepage [40].
                                 GeForce GTX 295       GeForce GTX 580
Number of GPUs                   2                     1
CUDA Capability                  1.3                   2.0
Multiprocessors (MP)             30                    16
Cores per MP                     8                     32
Number of cores                  2 × 240               512
Global memory                    1792 MB GDDR3         3072 MB GDDR5
                                 (896 MB per GPU)
Number of threads per block      512                   1024
Registers per block              16384                 32768
Shared memory (per SM)           16 KB                 48 KB or 16 KB
L1 cache (per SM)                none                  16 KB or 48 KB
L2 cache (chip wide)             none                  768 KB
Clock rate (GHz)                 1.37                  1.57
Memory bandwidth (GB/s)          223.8                 192.4

Table 2: NVIDIA graphics card specifications used in this work. Using OpenMP we also work with two GTX 295 boards (4 GPUs in total) and three GTX 580 boards (3 GPUs in total).

4.1. SU(2) CUDA performance

When accessing the global memory, copying 128-bit words gives fully coalesced memory transactions. Although in single precision we can use a float4 (128-bit word) array to store all the SU(2) elements, this is not the case in double precision. Using a double4 format does not give fully coalesced memory transactions since it is a 256-bit word, whereas the double2 format is a 128-bit word and gives fully coalesced memory transactions.

In Fig. 3, we show the performance in double precision using a double4 array and a double2 array with one and two GPUs and 3D thread blocks. The best performance is obtained when using a double4 array and texture memory, with either one or two GPUs. However, when using the L1 and L2 caches, the maximum gain in performance of a double2 array over a double4 array is 25%/10% using one/two GPUs. In the following performance results, we will use the double4 array.

Figure 3: SU(2) CUDA performance in GFlops using one and two GTX 580 GPUs with CUDA and OpenMP in double precision using 3D thread blocks. d2 corresponds to a double2 array and d4 to a double4 array. "n" corresponds to the number of points in each lattice dimension, i.e., n = Nx = Ny = Nz = Nt. "tex" means using texture memory and "cache" using cache memory.

In Fig. 4, we show the performance in GFlops as a function of the lattice volume, for each kernel in single GPU mode. The heatbath and over-relaxation kernels are much slower than the plaquette kernel; updating a new link is the heavy part of the code. The performance of the over-relaxation kernel is higher than that of the heatbath kernel. Therefore, link generation with some steps of over-relaxation increases the overall performance of the code, as it decreases the number of steps between configurations by decreasing the correlation time. In SU(2), it is common to use one step of heatbath followed by four steps of over-relaxation. Since the heatbath and over-relaxation kernels have to update a new link using the staple, these kernels have to be called in eight steps, i.e., to perform a full lattice update and avoid updating neighboring links simultaneously, these kernels perform an independent update by link direction and by even/odd sites.

Figure 4: SU(2) CUDA performance. Left to right: heatbath, over-relaxation and plaquette kernels. The top row of graphics corresponds to performance in single precision and the bottom to double precision. "n" corresponds to the number of points in each lattice dimension, i.e., n = Nx = Ny = Nz = Nt.

Note that the performance in double precision is almost one half of the performance in single precision. The best performance is obtained for 1D thread blocks, as expected. When the lattice volume is not divisible by the number of threads, we have to add one more thread block which is not fully occupied; when using 3D thread blocks this number can be higher. On the other hand, using 1D thread blocks, the total number of threads allowed is smaller than with 3D thread blocks, which limits the lattice size. To overcome this limitation, we could call the kernel several times in order to visit all the remaining sites, but this was not implemented in the code. As mentioned in the previous section, a 1D grid can only accommodate 65535 thread blocks, which can be insufficient for large lattice volumes. For example, using 256 threads per block, we can have up to 65535 thread blocks in a 1D grid, and therefore the lattice volume must be less than 65535 × 256 sites. Moreover, the performance with 1D thread blocks is practically the same as with 3D thread blocks in double precision, and higher only in single precision.

In Fig. 5, we show the overall performance of the SU(2) code using one, two and three NVIDIA GTX 580 GPUs. The performance in multi-GPU mode depends on the memory traffic between GPUs and CPU. The memory bandwidth between GPU and CPU is lower than that of the global memory access. Therefore, the performance with more than one GPU depends on the lattice size. In Figs. 10a and 10d we plot the scaling performance from one to three GPUs for a fixed lattice size in single and double precision, respectively. Using three GPUs, we obtain up to 210 and 122 GFlops in single and double precision, respectively.
Note that we have not yet implemented an overlap between computing and memory transfers, and therefore we can still improve the performance with a full overlap in a future implementation. The implementation with 1D thread blocks has an overall performance higher than the implementation with 3D thread blocks in single precision, but in double precision the performance is practically the same. Therefore, in the next two subsections, on SU(3) and SU(4) CUDA performance, we test the performance using 3D thread blocks.

Figure 5: SU(2) CUDA performance in GFlops using one, two and three GTX 580 GPUs with CUDA and OpenMP. Figs. (a) and (b) correspond to single and double precision, respectively. "n" corresponds to the number of points in each lattice dimension, i.e., n = Nx = Ny = Nz = Nt.

4.2. SU(3) CUDA performance

We test the performance using three different implementations: a lattice array with 18 real numbers in an array of structures and in a structure of arrays, and a lattice array with 12 real numbers in a structure of arrays. We also tested the performance in single and double precision for each kernel, Fig. 6. In the SU(3) code, the performance is directly affected by the memory transfers and the matrix-matrix multiplications. Compared with SU(2), the memory size transferred increases by a factor of 3 and 4.5 for the 12 and 18 real number representations of a lattice array, and the process to update an SU(3) link consists of three SU(2) steps.

In Fig. 7, we show the SU(3) performance in GFlops using one, two and three GPUs. As expected, reducing the thread memory access from 18 to 12 real numbers per link increases the performance in both single and double precision, using three GPUs. In single GPU mode the best way to store the full lattice is an SOA. However, in multi-GPU mode the AOS implementation is better: in the AOS implementation the number of copies is smaller than in the SOA implementation, while the amount of memory transferred is the same. In Figs. 10b and 10e we plot the scaling performance from one to three GPUs for a fixed lattice size in single and double precision, respectively. Reducing the size of memory transfers, in this case from 18 to 12 real numbers, increases the performance.
Another important aspect of using 12 real numbers is that we can fit bigger lattice sizes in memory. Therefore, the benefit of using one or more GPUs depends intrinsically on how the data is organized in memory. Using three GPUs, we obtain up to 292 and 96 GFlops in single and double precision, respectively.

Figure 6: SU(3) CUDA performance in GFlops. Left to right: heatbath, over-relaxation, re-unitarization and plaquette kernels. The top row of graphics corresponds to performance in single precision and the bottom to double precision. "n" corresponds to the number of points in each lattice dimension, i.e., n = Nx = Ny = Nz = Nt.

Figure 7: SU(3) CUDA performance in GFlops using one, two and three GTX 580 GPUs with CUDA and OpenMP. Figs. (a) and (b) correspond to single and double precision, respectively. "n" corresponds to the number of points in each lattice dimension, i.e., n = Nx = Ny = Nz = Nt.

4.3. SU(4) CUDA performance

In this subsection, we test the SU(4) CUDA code performance using an array of structures and a structure of arrays, in single and double precision. As in the previous subsection, we only implemented and tested 3D thread blocks.

In Fig. 8, we show the performance in GFlops as a function of the lattice volume, for each kernel in single GPU mode. Compared with SU(2), the process to update an SU(4) link consists of six SU(2) steps, three steps more than SU(3). In SU(4), the memory size increases eight times compared with the SU(2) code.

In Fig. 9 we show the performance in GFlops for one, two and three GPUs, in single and double precision. The AOS implementation gives better results in single precision; however, in double precision there is almost no difference between the AOS and SOA implementations. In Figs. 10c and 10f we plot the scaling performance from one to three GPUs for a fixed lattice size in single and double precision, respectively. Again, the benefit of using one or more GPUs depends intrinsically on how the data is organized in memory. Using three GPUs, we obtain up to 169 and 86 GFlops in single and double precision, respectively.

4.4. Generic SU(Nc) CUDA performance

In Fig. 11 we compare the time to perform one iteration with the SU(2), SU(3) and SU(4) optimized codes and with the generic code valid for Nc ≥ 4. As expected, the generic code at Nc = 4 is less efficient than the optimized SU(4) code. Thus the application of our code to very large Nc is significantly slower than the small Nc codes, due both to the optimization and to the scaling with Nc of the generic code. For the same lattice size, generating configurations in SU(32) is six orders of magnitude slower than in SU(2).

In what concerns our generic code, as Nc increases the memory accesses also increase, and the number of SU(2) subgroups grows as Nc(Nc − 1)/2, as does the number of floating point operations, since each link is made of Nc × Nc × 2 floating point numbers.

As seen previously in the SU(2), SU(3) and SU(4) performance results, the highest performance depends intrinsically on the group number Nc, on how the data is organized in GPU memory and on the number of GPUs. Thus, getting the highest performance for generic Nc, especially for large Nc, is a highly demanding task, due to the limited GPU resources compared with the CPU, namely memory and registers. In single precision we can go up to Nc = 32 and in double precision up to Nc = 16 for the lattice volumes tested on a GTX 580 GPU.

In our generic code, the time to perform one iteration grows with the third power of Nc, Fig. 12. This agrees with the fact that to perform a pseudo-heatbath/over-relaxation step there are Nc(Nc − 1)/2 SU(2) subgroups to be generated and then multiplied into the current link.
Each multiplication is of the order of Nc and therefore, combined with the Nc(Nc − 1)/2 subgroups, this gives the Nc^3 factor. Note that the calculation of the staple, needed in the pseudo-heatbath/overrelaxation steps, involves Nc × Nc matrix products and thus also scales as Nc^3.

Figure 8: SU(4) CUDA performance in GFlops. Left to right: heatbath, over-relaxation, re-unitarization and plaquette kernels. The top row of plots corresponds to performance in single precision and the bottom row to double precision. "n" corresponds to the number of points in each lattice dimension, i.e., n = Nx = Ny = Nz = Nt.

Figure 9: SU(4) CUDA performance in GFlops using one, two and three GTX 580 GPUs with CUDA and OpenMP. Figs. (a) and (b) correspond to single and double precision, respectively. "n" corresponds to the number of points in each lattice dimension, i.e., n = Nx = Ny = Nz = Nt.
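The O(Nc) cost of one subgroup multiplication can be seen explicitly: an SU(2) element embedded in rows (p, q) only mixes two rows of the link matrix when multiplying from the left. A minimal host-side C++ sketch (our own, assuming row-major storage; not the paper's kernel code):

```cpp
#include <complex>
#include <vector>

using cplx = std::complex<float>;

// Left-multiply an Nc x Nc link U by an SU(2) element a embedded in the
// (p, q) subgroup: only rows p and q of U change, so one subgroup update
// costs O(Nc), and all Nc(Nc-1)/2 subgroups together cost O(Nc^3).
// a = {a00, a01, a10, a11} holds the 2x2 SU(2) matrix in row-major order.
void su2_times_link(const cplx a[4], std::vector<cplx>& U,
                    int nc, int p, int q) {
    for (int c = 0; c < nc; ++c) {
        cplx up = U[p * nc + c];
        cplx uq = U[q * nc + c];
        U[p * nc + c] = a[0] * up + a[1] * uq;
        U[q * nc + c] = a[2] * up + a[3] * uq;
    }
}
```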
Figure 10: SU(2), SU(3) and SU(4) CUDA performance in GFlops as a function of the number of GPUs, in single and double precision. The lattice volume is kept constant (one fixed volume for SU(2) and SU(3) and another for SU(4)).
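The "aos"/"soa" legends in Figs. 7–10 refer to how the link elements are laid out in global memory. The two index maps can be sketched as follows for SU(3) with full 18-real storage (the helper functions are our own illustration, not the paper's code):

```cpp
#include <cstddef>

// AoS ("aos18"): all 18 reals of one link are contiguous. Consecutive
// threads (consecutive lattice sites) then read addresses 18 elements
// apart, which prevents coalesced global memory accesses.
inline std::size_t aos18(std::size_t site, int component,
                         std::size_t /*nsites*/) {
    return site * 18 + component;
}

// SoA ("soa18"): component k of every link is stored contiguously.
// Consecutive threads read consecutive addresses, so accesses coalesce.
inline std::size_t soa18(std::size_t site, int component,
                         std::size_t nsites) {
    return std::size_t(component) * nsites + site;
}
```

The "soa12" variant in the SU(3) plots stores only 12 reals per link, presumably the first two rows, and reconstructs the third row on the fly from unitarity, a standard trick that trades memory bandwidth for extra arithmetic.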
5. Conclusion
We developed CUDA codes to generate pure gauge SU(Nc) configurations for lattice QCD simulations. The technique used to store the SU(Nc) elements in global memory is important to achieve the best performance. We produced specific codes for SU(2), SU(3) and SU(4), in order to optimize the group parameterizations. We also implemented a generic code valid for any SU(Nc), taking into account the best way to store the elements in global memory.

Due to the limited amount of GPU resources per thread, and because the matrix size is Nc × Nc, the generic SU(Nc) implementation cannot operate for an arbitrarily large Nc. As Nc increases, the memory accesses also increase, and the number of SU(2) subgroups grows as Nc(Nc − 1)/2, as does the number of floating-point operations, since each link is made of Nc × Nc × 2 floating-point numbers. Thus the codes with very large Nc are significantly slower than those with moderate Nc.

Nevertheless, as shown in Fig. 11, up to SU(6) our generic SU(Nc) configuration generation code is only one order of magnitude slower than the optimized SU(3) code, and it is possible, to some extent, to study the large-Nc limit.

Figure 11: SU(Nc) CUDA time per iteration in single (sp) and double (dp) precision for a fixed lattice volume and β, running on a GTX 580 GPU. One iteration consists of one pseudo-heatbath step and one over-relaxation step, followed by link re-unitarization. Comparison between the generic SU(Nc) code (black circles and red diamonds) and the optimized SU(2), SU(3) and SU(4) codes.
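The cubic growth of the iteration time can be checked by evaluating the fitted polynomial of Fig. 12 directly. Below is a small C++ sketch (the function name is ours) that applies Horner's rule to the quoted single-precision fit coefficients; note the fit is only meaningful inside the fitted range, and in particular goes negative below Nc ≈ 4.

```cpp
// Evaluate the cubic fit t(Nc) = a0 + a1 Nc + a2 Nc^2 + a3 Nc^3 from
// Fig. 12 (single precision) via Horner's rule; illustrates the
// asymptotic Nc^3 growth of the time per iteration.
double fit_time(double nc) {
    const double a0 = -1.0597, a1 = 5.3768e-01,
                 a2 = -7.6273e-02, a3 = 3.7229e-03;
    return a0 + nc * (a1 + nc * (a2 + nc * a3));
}
```

For example, the fitted time per iteration grows from a few seconds at Nc = 16 to roughly twenty times that at Nc = 32, steeper than the asymptotic factor of 8 because the lower-order terms still matter at these Nc.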
Figure 12: Fit of the SU(Nc) CUDA time (s) per iteration in single precision, shown in Fig. 11, with fit function t(Nc) = a0 + a1 Nc + a2 Nc^2 + a3 Nc^3: χ²/dof = 6.7275e-03, a0 = −1.0597 ± 5.9957e-01, a1 = 5.3768e-01 ± 1.0942e-01, a2 = −7.6273e-02 ± 5.9715e-03, a3 = 3.7229e-03 ± 9.9324e-05. The polynomial fit, up to Nc^3, agrees with the scaling of the pseudo-heatbath/overrelaxation steps.

Acknowledgments
This work was partly funded by the FCT contracts POCI/FP/81933/2007, CERN/FP/83582/2008, PTDC/FIS/100968/2008, CERN/FP/109327/2009 and CERN/FP/116383/2010. Nuno Cardoso is also supported by FCT under the contract SFRH/BD/44416/2008. We thank Marco Cardoso for useful discussions.
References

[1] J. Gross, W. Janke, M. Bachmann, Massively parallelized replica-exchange simulations of polymers on GPUs, Computer Physics Communications 182 (8) (2011) 1638–1644. doi:10.1016/j.cpc.2011.04.012.
[2] A. H. Zonoozi, A. H. Kuepper, H. Baumgardt, H. Haghi, P. Kroupa, et al., Direct N-body simulations of globular clusters: (I) Palomar 14. arXiv:1010.2210.
[3] A. Hassan, C. J. Fluke, Scientific visualization in astronomy: Towards the petascale astronomy era. arXiv:1102.5123.
[4] J. Kanzaki, Monte Carlo integration on GPU, The European Physical Journal C 71 (2011) 1–7. doi:10.1140/epjc/s10052-011-1559-8.
[5] J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, K. Schulten, Accelerating molecular modeling applications with graphics processors, Journal of Computational Chemistry 28 (16) (2007) 2618–2640. doi:10.1002/jcc.20829.
[6] S. Tomov, J. Dongarra, M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing 36 (5-6) (2010) 232–240. doi:10.1016/j.parco.2009.12.005.
[7] N. Cardoso, P. Bicudo, SU(2) lattice gauge theory simulations on Fermi GPUs, J. Comput. Phys. 230 (2011) 3998–4010. arXiv:1010.4834, doi:10.1016/j.jcp.2011.02.023.
[8] N. Cardoso, P. J. Silva, P. Bicudo, O. Oliveira, Landau gauge fixing on GPUs, Computer Physics Communications 184 (1) (2013) 124–129. arXiv:1206.0675, doi:10.1016/j.cpc.2012.09.007.
[9] G. 't Hooft, Renormalizable Lagrangians for massive Yang-Mills fields, Nucl. Phys. B35 (1971) 167–188. doi:10.1016/0550-3213(71)90139-8.
[10] G. 't Hooft, A planar diagram theory for strong interactions, Nucl. Phys. B72 (1974) 461. doi:10.1016/0550-3213(74)90154-0.
[11] E. Witten, Large N chiral dynamics, Ann. Phys. 128 (1980) 363. doi:10.1016/0003-4916(80)90325-5.
[12] M. Creutz, Monte Carlo study of renormalization in lattice gauge theory, Phys. Rev. D23 (1981) 1815. doi:10.1103/PhysRevD.23.1815.
[13] M. Creutz, Monte Carlo studies of nonabelian gauge theories.
[14] M. Creutz, Numerical techniques for lattice gauge theories.
[15] M. Teper, Large N and confining flux tubes as strings - a view from the lattice, Acta Phys. Polon. B40 (2009) 3249–3320. arXiv:0912.3339.
[16] M. Panero, Thermodynamics of the QCD plasma and the large-N limit, Phys. Rev. Lett. 103 (2009) 232001. arXiv:0907.3719, doi:10.1103/PhysRevLett.103.232001.
[17] R. Lohmayer, H. Neuberger, Numerical study of the large-N phase transition of smeared Wilson loops in 4D pure YM theory, PoS LATTICE2011 (2011) 247. arXiv:1110.4543.
[18] R. Lohmayer, H. Neuberger, Non-analyticity in scale in the planar limit of QCD, Phys. Rev. Lett. 108 (2012) 061602. arXiv:1109.6683, doi:10.1103/PhysRevLett.108.061602.
[19] F. Bursa, R. Lau, M. Teper, SO(2N) and SU(N) gauge theories in 2+1 dimensions. arXiv:1208.4547.
[20] R. Lohmayer, H. Neuberger, Rectangular Wilson loops at large N, JHEP 1208 (2012) 102. arXiv:1206.4015, doi:10.1007/JHEP08(2012)102.
[21] A. Mykkanen, M. Panero, K. Rummukainen, Casimir scaling and renormalization of Polyakov loops in large-N gauge theories, JHEP 1205 (2012) 069. arXiv:1202.2762, doi:10.1007/JHEP05(2012)069.
[22] G. I. Egri, Z. Fodor, C. Hoelbling, S. D. Katz, D. Nogradi, et al., Lattice QCD as a video game, Comput. Phys. Commun. 177 (2007) 631–639. arXiv:hep-lat/0611022, doi:10.1016/j.cpc.2007.06.005.
[23] M. Clark, R. Babich, K. Barros, R. Brower, C. Rebbi, Solving lattice QCD systems of equations using mixed precision solvers on GPUs, Comput. Phys. Commun. 181 (2010) 1517–1528. arXiv:0911.3191, doi:10.1016/j.cpc.2010.05.002.
[24] M. Hayakawa, K.-I. Ishikawa, Y. Osaki, S. Takeda, S. Uno, et al., Improving many flavor QCD simulations using multiple GPUs, PoS LATTICE2010 (2010) 325. arXiv:1009.5169.
[25] H.-J. Kim, W. Lee, Multi GPU performance of conjugate gradient algorithm with staggered fermions, PoS LATTICE2010 (2010) 028. arXiv:1010.4782.
[26] C. Bonati, G. Cossu, M. D'Elia, A. Di Giacomo, Staggered fermions simulations on GPUs, PoS LATTICE2010 (2010) 324. arXiv:1010.5433.
[27] B. Walk, H. Wittig, E. Dranischnikow, E. Schomer, Implementation of the Neuberger-Dirac operator on GPUs, PoS LATTICE2010 (2010) 044. arXiv:1010.5636.
[28] R. Babich, M. A. Clark, B. Joó, Parallelizing the QUDA library for multi-GPU calculations in lattice quantum chromodynamics (2010) 1–11. doi:10.1109/SC.2010.40.
[29] T.-W. Chiu, T.-H. Hsieh, Y.-Y. Mao, K. Ogawa, GPU-based conjugate gradient solver for lattice QCD with domain-wall fermions, PoS LATTICE2010 (2010) 030. arXiv:1101.0423.
[30] A. Alexandru, C. Pelissier, B. Gamari, F. Lee, Multi-mass solvers for lattice QCD on GPUs, J. Comput. Phys. 231 (2012) 1866–1878. arXiv:1103.5103.
[31] F. Winter, Accelerating QDP++ using GPUs. arXiv:1105.2279.
[32] A. Alexandru, M. Lujan, C. Pelissier, B. Gamari, F. Lee, Efficient implementation of the overlap operator on multi-GPUs (2011) 123–130. doi:10.1109/SAAHPC.2011.13.
[33] C. Bonati, G. Cossu, M. D'Elia, P. Incardona, QCD simulations with staggered fermions on GPUs, Comput. Phys. Commun. 183 (2012) 853–863. arXiv:1106.5673.
[34] R. Babich, M. A. Clark, B. Joó, G. Shi, R. C. Brower, S. Gottlieb, Scaling lattice QCD beyond 100 GPUs (2011) 70:1–70:11. doi:10.1145/2063384.2063478.
[35] M. Creutz, Monte Carlo study of quantized SU(2) gauge theory, Phys. Rev. D21 (1980) 2308–2315. doi:10.1103/PhysRevD.21.2308.
[36] M. Creutz, Confinement and lattice gauge theory, Phys. Scripta 23 (1981) 973. doi:10.1088/0031-8949/23/5B/011.
[37] N. Cabibbo, E. Marinari, A new method for updating SU(N) matrices in computer simulations of gauge theories, Phys. Lett. B119 (1982) 387–390. doi:10.1016/0370-2693(82)90696-7.
[38] NVIDIA, CUDA CURAND Library (2010).
[39] M. Harris, Optimizing Parallel Reduction in CUDA, NVIDIA Developer Technology, NVIDIA GPU Computing SDK, 3rd Edition (2010).
[40] Portuguese Lattice QCD Collaboration, http://nemea.ist.utl.pt/~ptqcd.