HUPD-1006

Domain Decomposition method on GPU cluster
Yusuke Osaki∗ and Ken-Ichi Ishikawa
Graduate School of Science, Hiroshima University, Kagamiyama 1-3-1, Higashi-Hiroshima 739-8526, Japan
E-mail: [email protected], [email protected]
Parallel GPGPU computing for lattice QCD simulations has a bottleneck in the GPU-to-GPU data communication due to the lack of a direct data-exchange facility. In this work we investigate the performance of a quark solver using the restricted additive Schwarz (RAS) preconditioner on a low-cost GPU cluster. We expect that the RAS preconditioner with an appropriate domain decomposition and task distribution reduces the communication bottleneck. The GPU cluster we constructed is composed of four PC boxes, with two GPU cards attached to each box, so that we have eight GPU cards in total. The compute nodes are connected with rather slow but low-cost Gigabit Ethernet. We include the RAS preconditioner in the single-precision part of the mixed-precision nested-BiCGStab algorithm, and the single-precision task is distributed to the multiple GPUs. The benchmarking is done with the O(a)-improved Wilson quark on a randomly generated gauge configuration with the size of 32^4. We observe a factor-two improvement in the solver performance with the RAS preconditioner compared to that without the preconditioner, and find that the improvement mainly comes from the reduction of the communication bottleneck, as we expected.

The XXVIII International Symposium on Lattice Field Theory, Lattice 2010
June 14-19, 2010
Villasimius, Italy

∗ Speaker.

© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. http://pos.sissa.it/
1. Introduction
The application of General-Purpose GPU (GPGPU) computing to lattice QCD simulations is very attractive, and there have been several studies in the literature [1, 2]. Most of the previous GPGPU works for lattice QCD simulations have focused on the acceleration of the quark solver using a single GPU card. However, a single GPU is not sufficient to simulate QCD with more realistic lattice parameters, such as lattices larger than 32^4 with physical quark masses, due to the lack of memory size and of the required sustained speed. Thus we need parallel GPGPU computing platforms with multiple GPU cards. This year several studies of lattice QCD simulations with single or multiple GPU cards are reported at this conference [3]. In this paper we report our trial and benchmarking study of the quark solver on a GPU cluster we developed.

One of the bottlenecks of parallel computing generally lies in the data communication among the compute processing units. Any multiple-GPU system, such as a PC cluster with GPUs (a GPU cluster), has the same bottleneck. The situation is worse for the GPU cluster since there is no on-board facility to directly exchange data between GPU memories on distinct nodes. Typically the data path consists of the PCI-Express path between host CPU memory and GPU memory, and the LAN connection path between two host CPU memories. This causes a long latency and a slow data throughput.

In lattice QCD simulations the most compute- and communication-intensive part is the multiplication of the lattice Dirac operator on fermion fields in the linear equation (quark) solver:

D f = h ,   (1.1)

where D is a lattice Dirac operator, and f and h are fermion fields. To solve Eq. (1.1), Krylov subspace solver algorithms, such as CG, BiCGStab, etc., have been used. The naive use of a Krylov solver requires many multiplications of D on a vector v, such as w = Dv, to obtain the solution vector f. The GPU acceleration can be applied to this operation. The bottleneck explained above, however, degrades the performance of the w = Dv operation. Thus an algorithmic reconsideration is required to remove the bottleneck.

In this paper we study the additive Schwarz domain-decomposition preconditioner with/without domain overlapping [4]. The additive Schwarz preconditioner is a kind of domain-decomposition preconditioner for elliptic partial differential equations. Lüscher has introduced the Schwarz alternating method to the Wilson-Dirac quark solver as the preconditioner and obtained an enormous speedup combined with the single-precision acceleration technique [5]. The Schwarz alternating method corresponds to the multiplicative Schwarz preconditioner, and we expect a similar improvement for the additive Schwarz method. We study this possibility with the O(a)-improved Wilson quark on a moderate-size lattice. The performance is measured on a GPU cluster we developed.

In the next section we explain the details of the restricted additive Schwarz (RAS) domain-decomposition iteration [6]. To accelerate the solver using multiple GPUs we employ the mixed-precision nested-BiCGStab solver [2, 7, 8], and we apply the RAS method to the GPU-side BiCGStab solver as the preconditioner. The acceleration of the solver with the GPU and the RAS is explained in section 3. We show the details of our GPU machine and the programming environment in section 4. We test the effect of the RAS preconditioner varying the parameters of the RAS preconditioner and study the bottleneck by investigating the timing chart of the algorithm. The results are shown in section 5, and we give a brief summary of the results in the last section.
Figure 1: Lattice domain-decomposition and its relation to the RAS iteration.
2. The Restricted Additive Schwarz domain-decomposition iteration
The restricted additive Schwarz iteration [6] is a kind of fixed iteration solver for elliptic differential equations. This solver makes use of the geometrical structure of a latticized partial difference equation. In lattice QCD the discretized space-time can be split into several domains, and we show a schematic picture of the decomposition in Fig. 1. W_i represents the lattice sites in the i-th domain without overlapping. W'_i denotes the domain extended from W_i. The extended domains are overlapped in general, and the data in the overlapped region are replicated on the neighbouring domains.

To solve Eq. (1.1) without domain overlapping, we expect that the solution f can be approximated by combining the partial solutions x_{W_i} derived from D_{W_i} x_{W_i} = h_{W_i} on each domain, where D_{W_i} is the restriction of D to W_i with the Dirichlet boundary condition. The additive Schwarz (AS) iteration simply approximates it as f ∼ Σ_i x_{W_i}, and the approximation is refined by the Richardson iteration. A problem arises when we overlap the decomposition, since the approximate solutions derived from the extended equations D_{W'_i} x_{W'_i} = h_{W'_i} become inconsistent in the overlapped region. The restricted additive Schwarz (RAS) iteration gives a simple solution to this inconsistency. In Fig. 1 we denote the restriction operation by the R_{W_i} arrow, which simply extracts the data on the bulk sites (W_i ⊂ W'_i) to avoid the inconsistency. Thus the approximation to f can be constructed as f ∼ Σ_i R_{W_i} x_{W'_i}. We show the RAS iteration in Alg. 1. The fourth line picks up the data on W'_i from the whole field vector, the fifth line solves the target problem restricted to the overlapped domain W'_i with the Dirichlet boundary condition, and the next line represents the restriction process described above.

The RAS iteration itself is not sufficient as a complete solver, and is usually used as the preconditioner for Krylov subspace iterative solvers. We employ the BiCGStab solver as the Krylov subspace solver. The RAS preconditioner K_RAS corresponds to the following operator:

K_RAS = S Σ_{j=0}^{N_RAS−1} (1 − DS)^j ,  with  S = Σ_{i=1}^{N} R_{W_i} (D_{W'_i})^{−1} P_{W'_i} .   (2.1)

This is applied to the following preconditioned equation:

D K_RAS c = h ,   f = K_RAS c ,   (2.2)

where c is to be solved with the Krylov subspace iterative solvers.
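To see how Algorithm 1 below realizes Eq. (2.1), one can expand the Richardson recursion explicitly (a short check, assuming the initial values f = 0 and r = h used in Algorithm 1). Each sweep updates

f_{j+1} = f_j + S r_j ,   r_{j+1} = h − D f_{j+1} = (1 − DS) r_j ,

so that after N_RAS sweeps

f_{N_RAS} = S Σ_{j=0}^{N_RAS−1} (1 − DS)^j h = K_RAS h .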
Algorithm 1: The RAS iteration. This calculates f = K_RAS h ≈ D^{−1} h with input h and output f.

 1: set the initial solution and residual to f = 0 and r = h.
 2: for j = 0, 1, ..., N_RAS − 1 do
 3:   for i = 1, ..., N (domain block index loop) do
 4:     r_{W'_i} = P_{W'_i} r. (Projection to domain W'_i.)
 5:     solve D_{W'_i} v_{W'_i} = r_{W'_i} for v_{W'_i} with the Dirichlet boundary condition.
 6:     v_{W_i} = R_{W_i} v_{W'_i}. (Restriction to domain W_i.)
 7:   end for
 8:   v = Σ_{i=1}^{N} v_{W_i}.
 9:   update f = f + v; r = r − Dv.
10: end for

Note that in the additive Schwarz case the domain equations can be solved independently of the other domains, so the loop over the domain index i in Alg. 1 can be completely parallelized. We assign a single domain to a single GPU in this paper.

To obtain the best performance with the RAS preconditioner we should appropriately optimize the following three parameters. The first is the depth of the overlapped region d, illustrated in Fig. 1. The second is the number of iterations N_dominv of the approximate solver that is usually used for (D_{W'_i})^{−1} in place of an exact inversion. The last parameter is N_RAS. These parameters are surveyed in the benchmarking test.

The RAS preconditioner is a kind of generalized (blocked) Jacobi preconditioner. One can expect that the performance of the RAS becomes better as the overlap depth d increases, since the equation in a domain approaches the original equation. However, a larger overlap requires extra work in the overlapped region and degrades the total performance, so there should be an optimal choice for d. The domain size also affects the performance: a larger domain size is better as the preconditioner, but it reduces the domain parallelism. Overlapping domains can pay off when used with GPU acceleration, because a GPU keeps a high performance for larger domain sizes. In the next section we explain the details of the GPU implementation of the BiCGStab solver and the RAS preconditioner.
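To make the structure of Alg. 1 concrete, the following is a minimal self-contained sketch in C++ in which a one-dimensional Laplacian stands in for the Wilson-Dirac operator D, a fixed number of Jacobi sweeps stands in for the approximate domain solver, and a lattice of n sites is split into N overlapping blocks. The toy operator, all names, and the parameter values are illustrative assumptions and are not the code used in this work.

Listing 1: A toy realization of the RAS iteration (Alg. 1).

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;
const int n = 64, N = 4, d = 2, N_RAS = 3, N_dominv = 50;

// Toy stand-in for D: (Dv)_x = 2 v_x - v_{x-1} - v_{x+1} with Dirichlet ends.
Vec applyD(const Vec& v) {
    Vec w(n);
    for (int x = 0; x < n; ++x)
        w[x] = 2.0*v[x] - (x > 0 ? v[x-1] : 0.0) - (x+1 < n ? v[x+1] : 0.0);
    return w;
}

int main() {
    Vec h(n, 1.0), f(n, 0.0), r = h;          // line 1: f = 0, r = h
    const int blk = n / N;                    // bulk size of W_i
    for (int j = 0; j < N_RAS; ++j) {         // line 2: RAS sweeps
        Vec v(n, 0.0);
        for (int i = 0; i < N; ++i) {         // line 3: domain loop (parallelizable)
            int lo = std::max(0, i*blk - d);  // extended domain W'_i
            int hi = std::min(n, (i+1)*blk + d);
            int m  = hi - lo;
            Vec rl(r.begin() + lo, r.begin() + hi);   // line 4: P_{W'_i} r
            // line 5: approximate Dirichlet solve of D_{W'_i} vl = rl
            //         by N_dominv Jacobi sweeps (stand-in for the GPU solver)
            Vec vl(m, 0.0), tmp(m);
            for (int it = 0; it < N_dominv; ++it) {
                for (int x = 0; x < m; ++x)
                    tmp[x] = 0.5*(rl[x] + (x > 0 ? vl[x-1] : 0.0)
                                        + (x+1 < m ? vl[x+1] : 0.0));
                vl.swap(tmp);
            }
            // line 6: restriction R_{W_i}: keep only the bulk sites of W_i
            // (line 8 is implicit: the bulk regions are disjoint, so v = Σ_i v_{W_i})
            for (int x = i*blk; x < (i+1)*blk; ++x) v[x] = vl[x - lo];
        }
        Vec Dv = applyD(v);                   // line 9: f += v, r -= Dv
        double rn = 0.0;
        for (int x = 0; x < n; ++x) { f[x] += v[x]; r[x] -= Dv[x]; rn += r[x]*r[x]; }
        std::printf("RAS sweep %d  |r| = %.3e\n", j + 1, std::sqrt(rn));
    }
    return 0;
}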
3. Accelerating Krylov solver with the Schwarz method and GPUs
The GPU architecture was originally dedicated to computer graphics applications and shows a great performance for single-precision arithmetic. We employ the mixed-precision nested-BiCGStab (flexible BiCGStab) algorithm [7, 8] to extract the single-precision performance efficiently. The mixed-precision nested-BiCGStab consists of an inner and an outer BiCGStab solver, where the outer solver runs solely in double precision while the inner solver works in single precision. The inner solver corresponds to the preconditioner for the outer solver, and most of the arithmetic is done within the inner solver. Thus we can extract the best performance of the GPUs by assigning the inner-solver task to the GPUs.

As described in the introduction, the data communication is a bottleneck of GPU-accelerated parallel computing. There is a possibility for the RAS preconditioner to reduce the bottleneck by appropriately matching the domain decomposition and the node allocation. In this paper we split the lattice so that a single GPU is responsible for a single domain (block) and assign the RAS preconditioner to the inner single-precision BiCGStab solver. In this manner we can extract the true performance of the GPUs, since the domain equations of the RAS preconditioner are solved completely in parallel without data communication.

The data communication arises in the 4th and 9th lines of Alg. 1. The Dv operation of the 9th line is always required for any iterative solver, while the 4th line (projection to W'_i) is the extra communication arising from the RAS preconditioner. A performance gain from the RAS preconditioner is expected when the total iteration count of the inner solver is reduced by the RAS preconditioner sufficiently to beat the increase of the communication overhead from the projection operation.
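The nesting described above can be sketched as follows: a double-precision, right-preconditioned (flexible) BiCGStab outer loop in which every preconditioner application is an inexact single-precision solve. In the actual solver the inner step is a single-precision BiCGStab with the RAS preconditioner running on the GPUs; in this sketch a toy 1-D operator and single-precision Jacobi sweeps stand in for it, so the operator, the inner solver, and all names and parameters are illustrative assumptions only.

Listing 2: Structure of the mixed-precision nested BiCGStab (outer double, inner single precision).

#include <cmath>
#include <cstdio>
#include <vector>

const int n = 64;
using VecD = std::vector<double>;

VecD applyD(const VecD& v) {                        // toy stand-in for D
    VecD w(n);
    for (int x = 0; x < n; ++x)
        w[x] = 2.0*v[x] - (x > 0 ? v[x-1] : 0.0) - (x+1 < n ? v[x+1] : 0.0);
    return w;
}

double dot(const VecD& a, const VecD& b) {
    double s = 0.0;
    for (int x = 0; x < n; ++x) s += a[x]*b[x];
    return s;
}

// Inner "GPU" task in single precision: returns an approximate D^{-1} r.
VecD innerSolveSingle(const VecD& r) {
    std::vector<float> rf(n), vf(n, 0.f), tf(n);
    for (int x = 0; x < n; ++x) rf[x] = (float)r[x];
    for (int it = 0; it < 100; ++it) {              // Jacobi sweeps in float
        for (int x = 0; x < n; ++x)
            tf[x] = 0.5f*(rf[x] + (x > 0 ? vf[x-1] : 0.f) + (x+1 < n ? vf[x+1] : 0.f));
        vf.swap(tf);
    }
    VecD v(n);
    for (int x = 0; x < n; ++x) v[x] = (double)vf[x];
    return v;
}

int main() {
    VecD b(n, 1.0), x(n, 0.0);
    VecD r = b, rhat = r, p(n, 0.0), v(n, 0.0);
    double rho = 1.0, alpha = 1.0, omega = 1.0, bnorm = std::sqrt(dot(b, b));

    for (int k = 0; k < 100; ++k) {                 // outer loop, all in double
        double rho1 = dot(rhat, r);
        double beta = (rho1/rho)*(alpha/omega);
        rho = rho1;
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta*(p[i] - omega*v[i]);
        VecD phat = innerSolveSingle(p);            // preconditioner = inner solve
        v = applyD(phat);
        alpha = rho / dot(rhat, v);
        VecD s(n);
        for (int i = 0; i < n; ++i) s[i] = r[i] - alpha*v[i];
        if (std::sqrt(dot(s, s))/bnorm < 1e-14) {   // early convergence check
            for (int i = 0; i < n; ++i) x[i] += alpha*phat[i];
            std::printf("converged at outer iteration %d\n", k + 1);
            break;
        }
        VecD shat = innerSolveSingle(s);            // preconditioner = inner solve
        VecD t = applyD(shat);
        omega = dot(t, s) / dot(t, t);
        double rn = 0.0;
        for (int i = 0; i < n; ++i) {
            x[i] += alpha*phat[i] + omega*shat[i];
            r[i]  = s[i] - omega*t[i];
            rn   += r[i]*r[i];
        }
        std::printf("outer iteration %2d  |r|/|b| = %.3e\n", k + 1, std::sqrt(rn)/bnorm);
    }
    return 0;
}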
4. Machine and programming
We constructed a GPU cluster consisting of four PC boxes. Each PC has an Intel Core i7 920 running at 2.67 GHz, 6 GBytes of DDR3 memory, two GeForce GTX 285 GPU cards, and a single Intel Gigabit ET Quad Port Server adapter. The Gigabit Ethernet connection is rather slow, but we make use of the four ports via the network trunking facility of the Open MPI library. The operating system is CentOS 5.3. The programming languages we employed are Intel Fortran for the outer double-precision BiCGStab solver (CPU host code) and NVIDIA CUDA 2.3 for the inner single-precision BiCGStab solver and the RAS preconditioner (GPU code). To further improve the Gigabit Ethernet performance, we use the Open-MX protocol, which is freely available from [9], instead of TCP/IP.

We split the whole lattice of size N_x × N_y × N_z × N_t into N_GPU = 8 domains along the x-direction only. We extend the domain size by adding an extra ghost/overlap region in both the upward and downward x-directions to construct the overlapped domain decomposition. The resulting domain size becomes (N_x/N_GPU + s) × N_y × N_z × N_t, where s is the extension size and d the corresponding depth of the overlap. This one-dimensional splitting is preferable to a multi-dimensional splitting in view of the communication overhead. The data structure is important to achieve the best performance of NVIDIA's cards. As described in Refs. [2, 10], we have to carefully arrange the data ordering to extract the best performance. We assign a single CUDA thread to a single site. For the details of the CUDA threading/blocking in lattice QCD simulations, see Ref. [2].

To clarify the bottleneck of the solver with the RAS preconditioner accelerated by multiple GPUs, we investigate the timing chart of the whole algorithm until the solver yields the double-precision solution. In the next section we show the computing and communication times for the following regions: T^comm_proj for the communication time in the projection (4th line of Alg. 1), T^comm_Dv for the communication time in the Wilson-Dirac operator multiplication (9th line of Alg. 1 and Eq. (2.2)), T^calc_Dv for the computation time in the Wilson-Dirac operator multiplication, and T^calc_dominv for the computation time in the approximate inversion of the domain-restricted Wilson-Dirac operator.
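The one-dimensional splitting with ghost/overlap slices and the exchange of those slices between neighbouring ranks (the communication behind the projection step of Alg. 1) can be sketched as below. The field layout, the variable names, and the use of MPI_Sendrecv are illustrative assumptions; the actual code trunks the four Gigabit-Ethernet ports through Open MPI with the Open-MX protocol.

Listing 3: A sketch of the one-dimensional splitting and the exchange of the overlap region.

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int Nx = 32, d = 2;            // global x extent and overlap depth
    const int nx = Nx / nprocs;          // bulk x-slices owned by this rank
                                         // (nprocs is assumed to divide Nx)
    const int vol3 = 1;                  // stand-in for the per-slice data volume
    // Local field: [d ghost][nx bulk][d ghost] x-slices, each of size vol3.
    std::vector<float> f((nx + 2*d) * vol3, (float)rank);

    int up   = (rank + 1) % nprocs;      // periodic neighbours in x
    int down = (rank - 1 + nprocs) % nprocs;

    // Fill the ghost slices: send the uppermost bulk slices upward and receive
    // the lower ghost region from below, then the reverse direction.
    MPI_Sendrecv(&f[nx * vol3],          d * vol3, MPI_FLOAT, up,   0,
                 &f[0],                  d * vol3, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&f[d * vol3],           d * vol3, MPI_FLOAT, down, 1,
                 &f[(nx + d) * vol3],    d * vol3, MPI_FLOAT, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    std::printf("rank %d: lower ghost = %g, upper ghost = %g\n",
                rank, f[0], f[(nx + d) * vol3]);
    MPI_Finalize();
    return 0;
}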
5. Results
We measure the solver time using a random gauge configuration on a 32^4 lattice. The block size for a single GPU then becomes 4 × 32^3 or larger, which is a sufficient size to extract the true performance of the GPU. The parameters of the O(a)-improved Wilson-Dirac fermion are chosen as κ = 0.126 with a fixed c_SW. The solver is stopped when |h − Df|/|h| falls below a fixed tolerance, with a Gaussian noise vector h. We have investigated the performance of the RAS- and GPU-accelerated solver by scanning the RAS parameters N_RAS and N_dominv over several values and the overlap depth over d = 0, 2, 4. In this section we show the results with the best parameters only.

                            No prec.   RAS(d=0)   RAS(d=2)   RAS(d=4)
  Dv operation count         1,328       484        496        416
  Dv time  total [sec]      48.243     18.246     18.441     15.546
           T^calc_Dv [sec]   3.900      1.424      1.436      1.205
           T^comm_Dv [sec]  47.358     17.917     18.111     15.269
  T^comm_proj [sec]             -        0.0        6.772     10.611
  T^calc_dominv [sec]           -        5.954      6.245      6.993

Table 1: Timings for performance comparison.
Table 1 shows the results from our benchmarking tests. The first column is the result without preconditioning, and the others are with the RAS preconditioner with d = 0, 2, 4, respectively. The data communication occurs in the Dv multiplication (T^comm_Dv) and in the projection (T^comm_proj). The communication time dominates the total time, and the bi-directional bandwidth is observed to be ∼300 MByte/sec.

The "Dv operation count" row counts the Dv operations in the GPU solver. With the RAS preconditioner the Dv operation count is much reduced compared to that without the preconditioner, as expected in section 3. However, overlapping the domains does not reduce the Dv operation count from d = 0 to d = 2, and only a slight reduction is observed from d = 0 to d = 4. The timings involved in the Dv operation are shown in the rows labeled "Dv time". The Dv operation is dominated by the communication time, although we hide the communication behind the bulk computation of Dv. Therefore the reduction of the Dv operation count translates almost directly into the reduction of the communication time on our GPU cluster.

The rows labeled T^comm_proj and T^calc_dominv show the timings for the projection communication and for the approximate inversion of the domain-restricted equation, respectively. These timings are the extra cost of the RAS preconditioner. From these results we observe that the gain from the Dv operation count reduction in the d = 4 case does not compensate for this extra cost. The same statement would hold even if there were a slight reduction of the Dv operation count in the d = 2 case.
6. Summary
We have investigated the acceleration of the quark solver using multiple GPUs in parallel with the combination of the RAS preconditioner and the mixed-precision nested-BiCGStab algorithm. Parallel GPU benchmarking tests have been done on a GPU cluster constructed for low-cost lattice QCD simulations. The network device is slow compared to the speed of the GPU cards. Using the RAS preconditioner with an appropriate domain decomposition and GPU task assignment, we can reduce the data communication overhead, and we have observed a factor-two improvement with the RAS preconditioner without domain overlapping. However, the overlapped domain-decomposition method does not work well on our GPU cluster due to the extra overhead arising from the projection operation and the inversion of the domain-restricted equation. The results with the RAS preconditioner are still dominated by the communication time.

A part of the program development has been done on the INSAM (Institute for Numerical Simulations and Applied Mathematics) GPU cluster at Hiroshima University. This work was supported in part by the Grant-in-Aid for Scientific Research of the Japan Society for the Promotion of Science (JSPS) (No. 20740139).
References

[1] G.I. Egri et al., Comput. Phys. Commun. 177, 631 (2007) [arXiv:hep-lat/0611022]; F. Di Renzo et al., PoS(LATTICE 2008)024; K. Barros et al., PoS(LATTICE 2008)045.

[2] M.A. Clark et al., Comput. Phys. Commun. 181, 1517 (2010); M.A. Clark, PoS(LATTICE 2009)003.

[3] R. Babich et al., LATTICE 2010 (2010), in these proceedings; S. Gottlieb et al., ibid.; H.-J. Kim et al., ibid.; Y.-Y. Mao et al., ibid.; B. Walk, ibid.; K.-I. Ishikawa et al., ibid.; C. Bonati et al., ibid.; N. Cardoso et al., ibid.

[4] B.F. Smith, P.E. Bjorstad and W.D. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, 2004.

[5] M. Lüscher, JHEP 0305 (2003) 052; Comput. Phys. Commun. 156, 209 (2004).

[6] X.-C. Cai and M. Sarkis, SIAM J. Sci. Comput. 21, 792 (1999).

[7] S. Aoki et al. [PACS-CS Collaboration], Phys. Rev. D 79 (2009) 034503 [arXiv:0807.1661 [hep-lat]].

[8] A. Buttari et al., ACM Trans. Math. Softw., 4:1-22 (2008).

[9] Open-MX, Myrinet Express over Generic Ethernet Hardware, http://open-mx.gforge.inria.fr/, and references therein.

[10] NVIDIA Corp., NVIDIA CUDA Programming Guide.