The role of multigrid algorithms for LQCD
Ronald Babich, James Brannick, Richard C. Brower, Michael A. Clark, Saul D. Cohen, James C. Osborn, Claudio Rebbi
Ronald Babich
Center for Computational Science, Boston University, Boston, MA 02215
E-mail: [email protected]
James Brannick
Department of Mathematics, The Pennsylvania State University, University Park, PA 16802
E-mail: [email protected]
Richard C. Brower ∗
Department of Physics, Boston University, Boston, MA 02215
E-mail: [email protected]
Michael A. Clark
Harvard-Smithsonian Center for Astrophysics and Initiative in Innovative Computing, Harvard University, Cambridge, MA 02138
E-mail: [email protected]
Saul D. Cohen
Center for Computational Science, Boston University, Boston, MA 02215
E-mail: [email protected]
James C. Osborn
Argonne Leadership Computing Facility, Argonne, IL 60439
E-mail: [email protected]
Claudio Rebbi
Department of Physics, Boston University, Boston, MA 02215
E-mail: [email protected]
We report on the first successful QCD multigrid algorithm which demonstrates constant convergence rates independent of quark mass and lattice volume for the Wilson Dirac operator. The new ingredient is the adaptive method for constructing the near null space on which the coarse-grid multigrid Dirac operator acts. In addition we speculate on future prospects for extending this algorithm to the Domain Wall and Staggered discretizations, its exceptional suitability for high-performance GPU code and its potential impact on simulations at the physical pion mass.
The XXVII International Symposium on Lattice Field Theory, July 26-31, 2009, Peking University, Beijing, China

∗ Speaker.

© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. http://pos.sissa.it/
1. Introduction
Perhaps the most severe computational challenge facing lattice Quantum Chromodynamics is the divergent cost as one approaches the chiral limit required for the experimental value of the pion mass. Similar difficulties confront other strongly-coupled field theories conjectured for physics beyond the standard model. The cause is well known. As the quark mass $m_q$ approaches zero, the lattice Dirac operator becomes singular, $|\lambda_{\min}| \to 0$, causing "critical slowing down" of the iterative solvers typically used to find the propagators. This is unavoidable for all local "unigrid" solvers.

It has been almost 20 years since the first attempts [1] were made to apply recursive multigrid preconditioning to the Dirac operator in lattice QCD. The basic idea of using a coarse representation of the Dirac operator on lattices of increasing lattice spacing appears at first to be an obvious extension of the basic principles central to the renormalization group itself. Indeed early attempts, generally inspired by this observation, did succeed in formulating a variety of gauge-invariant coarsening schemes, but they all failed to improve convergence at the length scale $l_\sigma$ where the underlying lattice gauge field becomes rough. Indeed, as Brower, Edwards, Rebbi and Vicari [2] demonstrated, this failure occurred uniformly when the product of the mass gap $m_q$ and the coherence length $l_\sigma$ is of order one: $m_q\, l_\sigma = O(1)$. Apparently this failure occurs when the "renormalization" is highly non-perturbative.

Recently the application of a new adaptive procedure [3] has decisively broken this barrier, eliminating "critical slowing down" as the mass gap goes to zero [4]. Here we give a heuristic introduction to this breakthrough and speculate on further developments.
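To make "critical slowing down" concrete, the following toy sketch (our own illustration, not code from any production lattice solver; the helper names `laplacian_1d` and `cg_iterations` are ours) counts conjugate gradient iterations for a 1-d lattice Laplacian plus mass term, $A = -\Delta + m^2$, and shows the count growing as $m^2 \to 0$, i.e. as $|\lambda_{\min}| \to 0$:

```python
import numpy as np

def laplacian_1d(n):
    """1-d lattice Laplacian with Dirichlet ends: 2 on the diagonal, -1 to neighbors."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def cg_iterations(A, b, tol=1e-8, max_iter=10000):
    """Plain conjugate gradient; returns the iteration count needed to reach tol."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r
    bnorm = np.linalg.norm(b)
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol * bnorm:
            return k
        p = r + (rr_new / rr) * p
        rr = rr_new
    return max_iter

n = 400
rng = np.random.default_rng(7)
b = rng.standard_normal(n)
# As the mass term shrinks, the smallest eigenvalue of A = -Laplacian + m^2
# approaches zero and the CG iteration count "critically slows down".
masses2 = (1.0, 0.1, 0.01, 0.001)
counts = [cg_iterations(laplacian_1d(n) + m2 * np.eye(n), b) for m2 in masses2]
print(counts)
```

The iteration count grows roughly like the square root of the condition number, which diverges as the mass gap closes.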
2. Adaptive Multigrid
While the detailed design of an appropriate adaptive multigrid algorithm for QCD requires considerable effort, as reported in Ref. [4], the underlying concept is rather straightforward. We seek to accelerate the solver for a differential operator discretized on a hypercubic lattice with spacing $a$,
$$D_{xy}\, \psi_y = b_x \,, \qquad (2.1)$$
by preconditioning it with a coarse operator $\hat D$ at a larger lattice spacing $\hat a > a$. To be concrete, our example is the Wilson lattice Dirac operator,
$$D_{xy}(U) = (4 + m)\, \delta_{xy} - \frac{1}{2} \sum_{\mu=1}^{4} \Big[ (1 - \gamma_\mu)\, U_\mu(x)\, \delta_{x+\hat\mu,\,y} + (1 + \gamma_\mu)\, U^\dagger_\mu(x-\hat\mu)\, \delta_{x-\hat\mu,\,y} \Big] \,, \qquad (2.2)$$
where we have suppressed the indices of the $3\times 3$ (dense) SU(3) color matrices $U^{ab}_\mu(x)$ and the $4\times 4$ (sparse) spinor matrices $\gamma^{ij}_\mu$ in the tensor product. The fine Dirac matrix $D$ operates on a complex vector space $V$ of dimension $12\, L^3 \times T$.

Critical slowing down is caused by eigenvectors with small eigenvalues. This offending subspace is the near null space of our operator: $D\, S \simeq 0$. Multigrid methods require us to split the fine vector space $V$ into this near null space $S$ and its orthogonal complement $S^\perp$: $V = S + S^\perp$; in the language of the renormalization group, this splits the IR (near null) from the UV (rough) modes.
One may view this splitting as a generalization of red/black or Schwarz block decompositions, and the resultant preconditioning matrix as akin to using the Schur complement. The splitting is achieved by a non-square prolongation matrix $P$ which maps the coarse space into the near null space $S$,
$$P : \hat V \to S \,, \qquad (2.3)$$
as illustrated in Fig. 1. Then the multigrid cycle constructs a coarse matrix, $\hat D = R D P$, as the product of the prolongator ($P$) to the near null space on the fine lattice, the fine operator ($D$) and a restriction operator ($R$) back to the coarse lattice. We use the Galerkin form by setting $R = P^\dagger$.

Figure 1: The non-square prolongation matrix $P : \hat V \to S$, with $\ker(P) = 0$, maps the coarse vector space $\hat V$ into the near null subspace $S$ of the fine vector space $V = S + S^\perp$. The fundamental theorem of linear algebra gives $S = \mathrm{span}(P)$, $\hat V = \mathrm{span}(P^\dagger)$ and $\mathrm{rank}(P) = \mathrm{rank}(P^\dagger) = \dim(S)$.

To understand intuitively how one constructs this mapping, consider multigrid for the classic example of a $d$-dimensional discretized Laplace operator. The near null eigenvectors are literally smooth, dominated by low Fourier components. An obvious interpolation consists of piecewise constant functions on regular blocks to define the coarse degrees of freedom. For example, on each $4^d$ block labeled by $\hat x$ we may introduce the prolongator (or interpolating matrix),
$$P_{x \hat x} = \frac{\theta_{\hat x}(x)}{2^d} \,, \qquad \theta_{\hat x}(x) = \begin{cases} 1 & x \in \hat x \text{ block} \\ 0 & x \notin \hat x \text{ block} \end{cases} \,, \qquad (2.4)$$
where the blocking "theta function" $\theta_{\hat x}(x)$ is 1 (true) for $x$ inside and 0 (false) outside the block $\hat x$. The normalization is chosen so that $[P^\dagger P]_{\hat x \hat y} = \delta_{\hat x, \hat y}$. The span of this space consists of all linear combinations of these basis vectors: $\psi_x = \sum_{\hat x} c_{\hat x} P_{x \hat x}$. Solving the coarse problem exactly for the error would reduce the residual to
$$r' = \mathbb{P}\, r = \big(1 - D P (P^\dagger D P)^{-1} P^\dagger\big)\, r \,, \quad \text{where} \quad \mathbb{P} = 1 - D P (P^\dagger D P)^{-1} P^\dagger \,, \quad \mathbb{P}^2 = \mathbb{P} \qquad (2.5)$$
is the Petrov-Galerkin (oblique) projection operator with eigenvalues 0 and 1. This projector completely removes the near null space from the residual, $\mathbb{P}\, D P = 0$, while the components in $S^\perp$ of rough modes are left intact.
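The piecewise constant construction of Eq. (2.4) and the oblique projector of Eq. (2.5) can be checked numerically in the $d = 1$ Laplace example. The sketch below (an illustration we add here, with our own variable names, not production code from Ref. [4]) builds the block-4 prolongator, verifies $P^\dagger P = 1$, forms the Galerkin coarse operator $\hat D = P^\dagger D P$, and confirms the projector identities:

```python
import numpy as np

n, block = 64, 4                  # fine sites, 4^d block with d = 1
nc = n // block                   # number of coarse sites

# Fine operator: 1-d lattice Laplacian with a small mass term (Dirichlet ends).
D = 2.05 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# Eq. (2.4): piecewise-constant prolongator, normalized so that P† P = 1.
# theta_xhat(x) = 1 inside block xhat; normalization 1/2^d = 1/sqrt(block).
P = np.zeros((n, nc))
for xhat in range(nc):
    P[xhat * block:(xhat + 1) * block, xhat] = 1.0 / np.sqrt(block)

assert np.allclose(P.T @ P, np.eye(nc))        # [P† P]_{xhat,yhat} = delta

# Galerkin coarse operator D_hat = P† D P (i.e. R = P†).
D_hat = P.T @ D @ P

# Eq. (2.5): oblique (Petrov-Galerkin) projector acting on the residual.
Proj = np.eye(n) - D @ P @ np.linalg.inv(D_hat) @ P.T
assert np.allclose(Proj @ Proj, Proj)          # eigenvalues 0 and 1
assert np.allclose(Proj @ D @ P, 0.0)          # near null space removed

# The coarse operator is again a nearest-neighbor stencil on the coarse grid.
print(np.round(D_hat[:3, :3], 3))
```

Note that $\hat D$ comes out as another tridiagonal stencil, which is what lets the construction recurse to still coarser levels.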
A smoother on the fine lattice must also be applied to damp out these remaining rough components. Fortunately this basic construction carries over to the non-trivial example of lattice QCD. However, to construct a parameterization for the coarse lattice Dirac operator, a piecewise constant interpolation is entirely inappropriate because of the almost random background gauge matrices $U$ connecting nearest-neighbor sites. The insight of the adaptive approach is to use the slow convergence of near null components itself to define, through the Galerkin scheme, the coarse operator. One starts with a random fine vector and attempts to solve the homogeneous equation,
$$D(U)\, \psi^{(\alpha)} \simeq 0 \,, \qquad (2.6)$$
for an element at critical mass. After a few iterations this yields a global near null vector, which is subsequently broken into blocks as in Eq. 2.4 and used to construct a trial multigrid scheme. Then, if this putative scheme is slow to converge, one uses it to solve again for a new near null vector, and repeats until a set of near null vectors, $(\psi^{(1)}, \psi^{(2)}, \cdots, \psi^{(N_\nu)})$, is found that eliminates critical slowing down. The prolongator is therefore given by restricting each global vector to blocks by $\theta_{\hat x}(x)\, \psi^{(\alpha)}_x$ and orthonormalizing the basis on each block to define the near null subspace $S$,
$$P_{x; \hat x, \alpha} = \text{orthonormal basis for } \{\theta_{\hat x}(x)\, \psi^{(\alpha)}_x\} \,. \qquad (2.7)$$
Again the near null space is spanned by this basis: $\psi_x = \sum_{\hat x, \alpha} c_{\hat x, \alpha} P_{x; \hat x, \alpha}$. The precise form of the adaptive iteration, the minimum number of global near null vectors $N_\nu$ and the blocking configuration are all devised to find an efficient multigrid preconditioner with minimal complexity. The contrast with earlier attempts to construct multigrid algorithms for QCD appears to be rather small. In the projective multigrid scheme [2], near null vectors were found block by block, imposing Dirichlet boundary conditions, more like a Schwarz method.
Basically, by reversing the procedure, first finding global near null vectors and only then restricting them to blocks, we arrive at the adaptive multigrid approach. It is typical of multigrid methods that simple changes have profound consequences. The devil is in the details.
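The adaptive setup of Eqs. (2.6) and (2.7) can be sketched in miniature for a 1-d "gauged" Laplacian with random U(1) link phases, where smooth interpolation fails just as in QCD. This is our own hedged illustration: we use plain Richardson relaxation in place of the production smoother of Ref. [4], and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, block, n_vec = 64, 4, 2
nc = n // block

# 1-d "gauged" Laplacian: random U(1) phases on the links (periodic).
u = np.exp(2j * np.pi * rng.random(n))            # U_mu(x) analogue
D = (2.0 + 0.01) * np.eye(n, dtype=complex)
for x in range(n):
    D[x, (x + 1) % n] -= u[x]                     # forward hop
    D[(x + 1) % n, x] -= np.conj(u[x])            # backward hop

# Eq. (2.6): relax D psi ~ 0 from a random start; Richardson iteration
# damps the rough (UV) components and leaves the near null (IR) content.
def near_null_vector(D, iters=200, omega=0.2):
    psi = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    for _ in range(iters):
        psi = psi - omega * (D @ psi)
        psi /= np.linalg.norm(psi)
    return psi

vecs = [near_null_vector(D) for _ in range(n_vec)]

# Eq. (2.7): restrict each global vector to blocks, orthonormalize per block.
P = np.zeros((n, nc * n_vec), dtype=complex)
for xhat in range(nc):
    sl = slice(xhat * block, (xhat + 1) * block)
    chunks = np.stack([v[sl] for v in vecs], axis=1)   # theta_xhat(x) psi^(a)
    Q, _ = np.linalg.qr(chunks)                        # orthonormal block basis
    P[sl, xhat * n_vec:(xhat + 1) * n_vec] = Q

assert np.allclose(P.conj().T @ P, np.eye(nc * n_vec))
# The relaxed vectors are far "more null" than a random vector:
rand = rng.standard_normal(n); rand /= np.linalg.norm(rand)
print(np.linalg.norm(D @ vecs[0]), "<", np.linalg.norm(D @ rand))
```

The key point survives the toy reduction: the near null vectors are found globally first, against the actual gauge background, and only then chopped into blocks.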
3. Performance of MG for Wilson Dirac Operator
There are many technical details that are critical to an efficient adaptive multigrid algorithm for the Wilson Dirac matrix. Experience first guided us to coarsen all color and Dirac degrees of freedom on $4^4$ space-time blocks. However, for the Wilson Dirac operator, which is neither Hermitian nor normal, it proved to be important to preserve the special property of $\gamma_5$-Hermiticity, $D^\dagger = \gamma_5 D \gamma_5$, on the coarse level by splitting each block into two sub-blocks of definite chirality, $\gamma_5 = \pm 1$, so that $\gamma_5 P = P \sigma_3$, where $\sigma_3 = \pm 1$ labels the coarse chirality. Finally, we implemented a 3-level W-cycle MG algorithm with 4 post-smoothing iterations, using a GCR(8) outer Krylov solver on the finest level and a complete CG solve on the normal equations on the coarsest. We have clearly achieved a successful MG algorithm for the Wilson operator which shows little or no sign of critical slowing down as a function of the quark mass or lattice size. Already it is competitive with EigCG deflation [5] on rather modest lattice sizes (see Fig. 2), and it will become increasingly superior as the lattices become larger, since the complexity of exact deflation scales like $O(|V|^2)$ whereas multigrid scales no worse than $O(|V| \log |V|)$, where $|V|$ is the volume of the lattice, or size of the fine vector space.
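The role of $\gamma_5$-Hermiticity on the coarse level can be illustrated in a toy 1-d Wilson model with two-component spinors and U(1) links (our own reduced model, with $\gamma_1 = \sigma_1$ and $\gamma_5 = \sigma_3$ standing in for the 4-d Dirac algebra): splitting each block's near null content by chirality makes the prolongator satisfy $\gamma_5 P = P \sigma_3$, and the Galerkin coarse operator then inherits $\hat D^\dagger = \sigma_3 \hat D \sigma_3$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 16, 0.1
s1 = np.array([[0, 1], [1, 0]], dtype=complex)      # gamma_1 in 1-d
s3 = np.array([[1, 0], [0, -1]], dtype=complex)     # gamma_5 analogue
I2 = np.eye(2, dtype=complex)
u = np.exp(2j * np.pi * rng.random(n))              # U(1) links, periodic

# 1-d Wilson operator on 2-component spinors (standard form, our toy model):
# D = (m + 1) - (1/2) [ (1 - g1) U(x) d_{x+1,y} + (1 + g1) U*(x-1) d_{x-1,y} ]
D = np.zeros((2 * n, 2 * n), dtype=complex)
for x in range(n):
    D[2*x:2*x+2, 2*x:2*x+2] += (m + 1.0) * I2
    y = (x + 1) % n
    D[2*x:2*x+2, 2*y:2*y+2] += -0.5 * (I2 - s1) * u[x]
    y = (x - 1) % n
    D[2*x:2*x+2, 2*y:2*y+2] += -0.5 * (I2 + s1) * np.conj(u[y])

G5 = np.kron(np.eye(n), s3)                          # site-diagonal gamma_5
assert np.allclose(D.conj().T, G5 @ D @ G5)          # gamma_5-Hermiticity

# Chirality-preserving prolongator: block a test vector and split it into
# its gamma_5 = +1 and gamma_5 = -1 parts, one coarse dof for each.
block, nc = 4, n // 4
psi = rng.standard_normal(2 * n) + 1j * rng.standard_normal(2 * n)
P = np.zeros((2 * n, 2 * nc), dtype=complex)
for xh in range(nc):
    sl = slice(2 * block * xh, 2 * block * (xh + 1))
    chunk = psi[sl]
    plus  = chunk.copy(); plus[1::2]  = 0            # upper spin: chirality +
    minus = chunk.copy(); minus[0::2] = 0            # lower spin: chirality -
    P[sl, 2*xh]     = plus  / np.linalg.norm(plus)
    P[sl, 2*xh + 1] = minus / np.linalg.norm(minus)

Sigma3 = np.kron(np.eye(nc), s3)                     # coarse gamma_5
assert np.allclose(G5 @ P, P @ Sigma3)               # gamma_5 P = P sigma_3
D_hat = P.conj().T @ D @ P
assert np.allclose(D_hat.conj().T, Sigma3 @ D_hat @ Sigma3)
```

The last assertion is the point: because the prolongator commutes with the chirality structure, the coarse operator satisfies the same $\gamma_5$-Hermiticity relation as the fine one.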
Figure 2: Comparison of CG, deflated CG [5] and MG-GCR: total number of Wilson matrix-vector operations until convergence (point sources; $V = 16^3 \times 64$ (left plot) and $V = 24^3 \times 64$ (right plot); $N_v = 20$ (MG-GCR), $N_v = 240$ (deflated CG); solver tolerance given relative to $|b|$).

With $N_\nu = 20$ trial near null vectors this is a very successful multigrid method, as illustrated in Fig. 3. The horizontal line is the set-up cost of constructing the multigrid operator. Table 1 shows that the iteration count is nearly independent of lattice size and quark mass, down to the physical pion mass.
Figure 3: Comparison of the adaptive MG algorithm with the conventional red/black preconditioned CG algorithm on a $32^3 \times 96$ lattice.
Table 1: Fine grid iteration count as a function of lattice size ($16^3 \times 64$, $24^3 \times 64$, $32^3 \times 96$) and quark mass.
4. Future directions
Let us turn to future directions we are pursuing, with the caveat that until we have constructed and benchmarked these extensions, the improvements are speculations based on our current experience. First, we have begun to design algorithms for both Staggered and Domain Wall fermion discretizations. For Staggered fermions the technical barrier appears modest, since the operator is normal and anti-Hermitian in the chiral limit. However, the "species doubling" quadruples the size of the near null space, and the Asqtad or HISQ improvements increase the potential complexity of the coarsening. Still, with the much larger lattices in production, we expect to find a very useful implementation. For Domain Wall fermions the technical issues are much more subtle, but a strategy is emerging. The operator is not only non-Hermitian, but its eigenvalues do not have positive real parts, as was the case for Wilson and Staggered fermions. Indeed it is essentially a 5-d Wilson operator with the wrong-sign mass. However, the potential advantage of Domain Wall multigrid is greater. The 5-dimensional Domain Wall matrix operates in a larger vector space and is less well conditioned because of the heavy flavor modes in the 5th dimension, but its near null space is still four dimensional. Thus the truncation to the coarse lattice is more dramatic, and in principle there is more to be gained in a multigrid algorithm. Similar remarks hold for the overlap formulation of the Dirac operator, but there the outer iteration has the advantage of being a normal matrix with a positive real mass gap.

A complete suite of multigrid algorithms for Staggered, Wilson and chiral fermion actions holds out the promise of a major reduction in the cost of Dirac inverters for the analysis stage of lattice QCD ensembles. As the physics correlators computed by the USQCD collaborations have expanded, the relative number of flops devoted to analysis now exceeds 50%.
In addition, the multigrid kernel can be used in a variance reduction strategy for stochastic estimators of disconnected quark diagrams [6]. We are also beginning to develop inverters compliant with the SciDAC API for general distribution. First we are extending the API to accommodate the multiple lattices and to implement the interpolation and prolongation operators. Also we are optimizing the $N_\nu \times N_\nu$ complex matrix operations needed for the coarse operators. These algorithms will be freely distributed on the SciDAC software webpages.

In principle, MG inverters can be implemented in HMC codes for generating lattice ensembles as well. A critical step in this application is to amortize the set-up cost of constructing the coarse operators as the gauge fields evolve in molecular dynamics time. In this regard Lüscher has demonstrated [7] that the subspace update for his "little Dirac" operator (which is essentially equivalent to our first coarse operator $\hat D$) can be used for several HMC time steps, combined with a chronological procedure for incremental change in the near null space. This strongly suggests that the construction of the multigrid inverter is not a serious overhead. Efficient parallel code is of course another requirement.

Finally, it is worth closing with a comment on GPU computing. At Boston University we have implemented highly efficient Wilson Dirac CG and BiCGstab inverters for the Nvidia GPU, written in the CUDA extension of C [8]. This gives roughly a 5x advantage in cost performance for Dirac inversions (see Fig. 4). Preliminary analysis indicates that in many ways this architecture is well suited for the multigrid inverter discussed above. The full implementation of this in efficient code has begun and, if successful, promises a multiplicative advantage in cost performance of the Dirac inverter as the product of hardware and algorithm advances.
Figure 4: Performance of the single precision even-odd preconditioned Wilson-Dirac matrix-vector product on a GTX 280 [8]. GF denotes temporal gauge fixing (lattice volume $= 24^3 \times$ temporal extent).
Prec.   Floats   Kernel (Gflops)   CG (Gflops)   BiCGstab (Gflops)
Half    12       202.2             170.6         152.5
SP      8        134.1             110.1         105.1
SP      12       122.1             102.4         98.6
DP      12       35.4              33.5          29.3

Table 2: Performance comparison of the matrix-vector kernels with the associated CG and BiCGstab solvers on the GeForce GTX 280 (lattice volume $= 24^3 \times 48$) [8]. Gauge field stored as 12 or 8 floats.
Even without extensions to the generation of lattices, the combined effect of GPU and MG has the potential of dropping the cost of analysis of these lattices by several orders of magnitude relative to current practices.
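For orientation, the coarse operator whose $N_\nu \times N_\nu$ complex matrix operations are mentioned above is essentially a nearest-neighbor stencil whose "links" are dense small matrices, which is what makes it a natural target for batched small-matrix kernels. The sketch below is a hypothetical 1-d illustration of applying such an operator; the array layout and the names `A`, `Uf` and `apply_coarse` are our own assumptions, not the SciDAC API.

```python
import numpy as np

rng = np.random.default_rng(5)
nc, nv = 32, 20                 # coarse sites and near null vectors per site

# Hypothetical layout: a self-coupling term and a forward "coarse link",
# each a dense nv x nv complex matrix per coarse site.
A  = rng.standard_normal((nc, nv, nv)) + 1j * rng.standard_normal((nc, nv, nv))
Uf = rng.standard_normal((nc, nv, nv)) + 1j * rng.standard_normal((nc, nv, nv))

def apply_coarse(A, Uf, v):
    """D_hat v: self term plus nearest-neighbor hops on a periodic 1-d
    coarse grid; the backward hop uses the dagger of the forward link."""
    out = np.einsum('xij,xj->xi', A, v)                          # self term
    out += np.einsum('xij,xj->xi', Uf, np.roll(v, -1, axis=0))   # forward hop
    Ub = np.conj(np.transpose(np.roll(Uf, 1, axis=0), (0, 2, 1)))
    out += np.einsum('xij,xj->xi', Ub, np.roll(v, 1, axis=0))    # backward hop
    return out

v = rng.standard_normal((nc, nv)) + 1j * rng.standard_normal((nc, nv))
w = apply_coarse(A, Uf, v)
print(w.shape)   # (32, 20)
```

The operation is a batch of $N_\nu \times N_\nu$ matrix-vector products per site and direction, a pattern that maps well onto both SIMD CPUs and GPUs.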
Acknowledgments
This work was supported in part by US DOE grants DE-FG02-91ER40676 and DE-FC02-06ER41440 and NSF grants DGE-0221680, PHY-0427646, OCI-0749300 and OCI-0749202.
References

[1] For a review of earlier work see: T. Kalkreuter, "Multigrid methods for propagators in lattice gauge theories," J. Comput. Appl. Math., 57 (1995) [arXiv:hep-lat/9409008].

[2] R. C. Brower, R. G. Edwards, C. Rebbi and E. Vicari, "Projective multigrid for Wilson fermions," Nucl. Phys. B, 689 (1991).

[3] M. Brezina, R. Falgout, S. MacLachlan, T. Manteuffel, S. McCormick and J. Ruge, "Adaptive smoothed aggregation (αSA)," SIAM J. Sci. Comput., 25:1896-1920, 2004.

[4] J. Brannick, R. C. Brower, M. A. Clark, J. C. Osborn and C. Rebbi, "Adaptive Multigrid Algorithm for Lattice QCD," Phys. Rev. Lett. (2008) 041601 [arXiv:0707.4018 [hep-lat]].

[5] A. Stathopoulos and K. Orginos, "Computing and deflating eigenvalues while solving multiple right hand side linear systems in Quantum Chromodynamics," arXiv:0707.0131 [hep-lat].

[6] R. Babich, R. Brower, M. Clark, G. Fleming, J. Osborn and C. Rebbi, "Strange quark contribution to nucleon form factors," arXiv:0710.5536 [hep-lat].

[7] M. Luscher, "Deflation acceleration of lattice QCD simulations," JHEP, 011 (2007) [arXiv:0710.5417 [hep-lat]].

[8] M. A. Clark, R. Babich, K. Barros, R. C. Brower and C. Rebbi, "Solving Lattice QCD systems of equations using mixed precision solvers on GPUs," arXiv:0911.3191 [hep-lat].