The role of multigrid algorithms for LQCD
Ronald Babich, James Brannick, Richard C. Brower, Michael A. Clark, Saul D. Cohen, James C. Osborn, Claudio Rebbi
Ronald Babich
Center for Computational Science, Boston University, Boston, MA 02215
E-mail: [email protected]
James Brannick
Department of Mathematics, The Pennsylvania State University, University Park, PA 16802
E-mail: [email protected]
Richard C. Brower ∗
Department of Physics, Boston University, Boston, MA 02215
E-mail: [email protected]
Michael A. Clark
Harvard-Smithsonian Center for Astrophysics and Initiative in Innovative Computing, Harvard University, Cambridge, MA 02138
E-mail: [email protected]
Saul D. Cohen
Center for Computational Science, Boston University, Boston, MA 02215
E-mail: [email protected]
James C. Osborn
Argonne Leadership Computing Facility, Argonne, IL 60439
E-mail: [email protected]
Claudio Rebbi
Department of Physics, Boston University, Boston, MA 02215
E-mail: [email protected]
We report on the first successful QCD multigrid algorithm which demonstrates constant convergence rates independent of quark mass and lattice volume for the Wilson Dirac operator. The new ingredient is the adaptive method for constructing the near null space on which the coarse-grid multigrid Dirac operator acts. In addition we speculate on future prospects for extending this algorithm to the Domain Wall and Staggered discretizations, its exceptional suitability for high-performance GPU code and its potential impact on simulations at the physical pion mass.
The XXVII International Symposium on Lattice Field Theory, July 26-31, 2009, Peking University, Beijing, China

∗ Speaker.

© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. http://pos.sissa.it/
1. Introduction
Perhaps the most severe computational challenge facing lattice Quantum Chromodynamics is the divergent cost as one approaches the chiral limit required for the experimental value of the pion mass. Similar difficulties confront other strongly-coupled field theories conjectured for physics beyond the standard model. The cause is well known. As the quark mass $m_q$ approaches zero, the lattice Dirac operator becomes singular, $|\lambda_{\min}| \to 0$, causing "critical slowing down" of the iterative solvers typically used to find the propagators. This is unavoidable for all local "unigrid" solvers.

It has been almost 20 years since the first attempts [1] were made to apply recursive multigrid preconditioning to the Dirac operator in lattice QCD. The basic idea of using a coarse representation of the Dirac operator on lattices of increasing lattice spacing appears at first to be an obvious extension of the basic principles central to the renormalization group itself. Indeed early attempts, generally inspired by this observation, did succeed in formulating a variety of gauge-invariant coarsening schemes, but they all failed to improve convergence at the length scale $l_\sigma$ where the underlying lattice gauge field becomes rough. Indeed, as Brower, Edwards, Rebbi and Vicari [2] demonstrated, this failure occurred uniformly when the product of the mass gap $m_q$ and the coherence length $l_\sigma$ is of order one: $m_q\, l_\sigma = O(1)$. Apparently this failure occurs when the "renormalization" is highly non-perturbative.

Recently the application of a new adaptive procedure [3] has decisively broken this barrier, eliminating "critical slowing down" as the mass gap goes to zero [4]. Here we give a heuristic introduction to this breakthrough and speculate on further developments.
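To make "critical slowing down" concrete, the following toy sketch (our own illustration, not code from any production lattice solver; the helper names `laplacian_1d` and `cg_iterations` are ours) counts conjugate gradient iterations for a 1-d lattice Laplacian plus mass term, $A = -\Delta + m^2$, and shows the count growing as $m^2 \to 0$, i.e. as $|\lambda_{\min}| \to 0$:

```python
import numpy as np

def laplacian_1d(n):
    """1-d lattice Laplacian with Dirichlet ends: 2 on the diagonal, -1 to neighbors."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def cg_iterations(A, b, tol=1e-8, max_iter=10000):
    """Plain conjugate gradient; returns the iteration count needed to reach tol."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r
    bnorm = np.linalg.norm(b)
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol * bnorm:
            return k
        p = r + (rr_new / rr) * p
        rr = rr_new
    return max_iter

n = 400
rng = np.random.default_rng(7)
b = rng.standard_normal(n)
# As the mass term shrinks, the smallest eigenvalue of A = -Laplacian + m^2
# approaches zero and the CG iteration count "critically slows down".
masses2 = (1.0, 0.1, 0.01, 0.001)
counts = [cg_iterations(laplacian_1d(n) + m2 * np.eye(n), b) for m2 in masses2]
print(counts)
```

The iteration count grows roughly like the square root of the condition number, which diverges as the mass gap closes.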
2. Adaptive Multigrid
While the detailed design of an appropriate adaptive multigrid algorithm for QCD requires considerable effort, as reported in Ref. [4], the underlying concept is rather straightforward. We seek to accelerate the solver for a differential operator discretized on a hypercubic lattice with spacing $a$,
$$D_{xy}\, \psi_y = b_x \,, \qquad (2.1)$$
by preconditioning it with a coarse operator $\hat D$ at a larger lattice spacing $\hat a > a$. To be concrete, our example is the Wilson lattice Dirac operator,
$$D_{xy}(U) = (4 + m)\, \delta_{xy} - \frac{1}{2} \sum_{\mu=1}^{4} \Big[ (1 - \gamma_\mu)\, U_\mu(x)\, \delta_{x+\hat\mu,\,y} + (1 + \gamma_\mu)\, U^\dagger_\mu(x-\hat\mu)\, \delta_{x-\hat\mu,\,y} \Big] \,, \qquad (2.2)$$
where we have suppressed the indices of the $3\times 3$ (dense) SU(3) color matrices $U^{ab}_\mu(x)$ and the $4\times 4$ (sparse) spinor matrices $\gamma^{ij}_\mu$ in the tensor product. The fine Dirac matrix $D$ operates on a complex vector space $V$ of dimension $12\, L^3 \times T$.

Critical slowing down is caused by eigenvectors with small eigenvalues. This offending subspace is the near null space of our operator: $D\, S \simeq 0$. Multigrid methods require us to split the fine vector space $V$ into this near null space $S$ and its orthogonal complement $S^\perp$: $V = S + S^\perp$; in the language of the renormalization group, this splits the IR (near null) from the UV (rough) modes.
One may view this splitting as a generalization of red/black or Schwarz block decompositions, and the resultant preconditioning matrix as akin to using the Schur complement. The splitting is achieved by a non-square prolongation matrix $P$ which maps the coarse space into the near null space $S$,
$$P : \hat V \to S \,, \qquad (2.3)$$
as illustrated in Fig. 1. Then the multigrid cycle constructs a coarse matrix, $\hat D = R D P$, as the product of the prolongator ($P$) to the near null space on the fine lattice, the fine operator ($D$) and a restriction operator ($R$) back to the coarse lattice. We use the Galerkin form by setting $R = P^\dagger$.

Figure 1: The non-square prolongation matrix $P : \hat V \to S$, with $\ker(P) = 0$, maps the coarse vector space $\hat V$ into the near null subspace $S$ of the fine vector space $V = S + S^\perp$. The fundamental theorem of linear algebra gives $S = \mathrm{span}(P)$, $\hat V = \mathrm{span}(P^\dagger)$ and $\mathrm{rank}(P) = \mathrm{rank}(P^\dagger) = \dim(S)$.

To understand intuitively how one constructs this mapping, consider multigrid for the classic example of a $d$-dimensional discretized Laplace operator. The near null eigenvectors are literally smooth, dominated by low Fourier components. An obvious interpolation consists of piecewise constant functions on regular blocks to define the coarse degrees of freedom. For example, on each $4^d$ block labeled by $\hat x$ we may introduce the prolongator (or interpolating matrix),
$$P_{x \hat x} = \frac{\theta_{\hat x}(x)}{2^d} \,, \qquad \theta_{\hat x}(x) = \begin{cases} 1 & x \in \hat x \text{ block} \\ 0 & x \notin \hat x \text{ block} \end{cases} \,, \qquad (2.4)$$
where the blocking "theta function" $\theta_{\hat x}(x)$ is 1 (true) for $x$ inside and 0 (false) outside the block $\hat x$. The normalization is chosen so that $[P^\dagger P]_{\hat x \hat y} = \delta_{\hat x, \hat y}$. The span of this space consists of all linear combinations of these basis vectors: $\psi_x = \sum_{\hat x} c_{\hat x} P_{x \hat x}$. Solving the coarse problem exactly for the error would reduce the residual to
$$r' = \mathbb{P}\, r = \big(1 - D P (P^\dagger D P)^{-1} P^\dagger\big)\, r \,, \quad \text{where} \quad \mathbb{P} = 1 - D P (P^\dagger D P)^{-1} P^\dagger \,, \quad \mathbb{P}^2 = \mathbb{P} \qquad (2.5)$$
is the Petrov-Galerkin (oblique) projection operator with eigenvalues 0 and 1. This projector completely removes the near null space from the residual, $\mathbb{P}\, D P = 0$, while the components in $S^\perp$ of rough modes are left intact.
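The piecewise constant construction of Eq. (2.4) and the oblique projector of Eq. (2.5) can be checked numerically in the $d = 1$ Laplace example. The sketch below (an illustration we add here, with our own variable names, not production code from Ref. [4]) builds the block-4 prolongator, verifies $P^\dagger P = 1$, forms the Galerkin coarse operator $\hat D = P^\dagger D P$, and confirms the projector identities:

```python
import numpy as np

n, block = 64, 4                  # fine sites, 4^d block with d = 1
nc = n // block                   # number of coarse sites

# Fine operator: 1-d lattice Laplacian with a small mass term (Dirichlet ends).
D = 2.05 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# Eq. (2.4): piecewise-constant prolongator, normalized so that P† P = 1.
# theta_xhat(x) = 1 inside block xhat; normalization 1/2^d = 1/sqrt(block).
P = np.zeros((n, nc))
for xhat in range(nc):
    P[xhat * block:(xhat + 1) * block, xhat] = 1.0 / np.sqrt(block)

assert np.allclose(P.T @ P, np.eye(nc))        # [P† P]_{xhat,yhat} = delta

# Galerkin coarse operator D_hat = P† D P (i.e. R = P†).
D_hat = P.T @ D @ P

# Eq. (2.5): oblique (Petrov-Galerkin) projector acting on the residual.
Proj = np.eye(n) - D @ P @ np.linalg.inv(D_hat) @ P.T
assert np.allclose(Proj @ Proj, Proj)          # eigenvalues 0 and 1
assert np.allclose(Proj @ D @ P, 0.0)          # near null space removed

# The coarse operator is again a nearest-neighbor stencil on the coarse grid.
print(np.round(D_hat[:3, :3], 3))
```

Note that $\hat D$ comes out as another tridiagonal stencil, which is what lets the construction recurse to still coarser levels.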
A smoother on the fine lattice must also be applied to damp out these remaining rough components. Fortunately this basic construction carries over to the non-trivial example of lattice QCD. However, to construct a parameterization for the coarse lattice Dirac operator, a piecewise constant interpolation is entirely inappropriate because of the almost random background gauge matrices $U$ connecting nearest-neighbor sites. The insight of the adaptive approach is to use the slow convergence of near null components itself to define, through the Galerkin scheme, the coarse operator. One starts with a random fine vector and attempts to solve the homogeneous equation,
$$D(U)\, \psi^{(\alpha)} \simeq 0 \,, \qquad (2.6)$$
for an element at critical mass. After a few iterations this yields a global near null vector, which is subsequently broken into blocks as in Eq. 2.4 and used to construct a trial multigrid scheme. Then, if this putative scheme is slow to converge, one uses it to solve again for a new near null vector, and repeats until a set of near null vectors, $(\psi^{(1)}, \psi^{(2)}, \cdots, \psi^{(N_\nu)})$, is found that eliminates critical slowing down. The prolongator is therefore given by restricting each global vector to blocks by $\theta_{\hat x}(x)\, \psi^{(\alpha)}_x$ and orthonormalizing the basis on each block to define the near null subspace $S$,
$$P_{x; \hat x, \alpha} = \text{orthonormal basis for } \{\theta_{\hat x}(x)\, \psi^{(\alpha)}_x\} \,. \qquad (2.7)$$
Again the near null space is spanned by this basis: $\psi_x = \sum_{\hat x, \alpha} c_{\hat x, \alpha} P_{x; \hat x, \alpha}$. The precise form of the adaptive iteration, the minimum number of global near null vectors $N_\nu$ and the blocking configuration are all devised to find an efficient multigrid preconditioner with minimal complexity. The contrast with earlier attempts to construct multigrid algorithms for QCD appears to be rather small. In the projective multigrid scheme [2], near null vectors were found block by block, imposing Dirichlet boundary conditions, more like a Schwarz method.
Basically, by reversing the procedure, first finding global near null vectors and only then restricting them to blocks, we arrive at the adaptive multigrid approach. It is typical of multigrid methods that simple changes have profound consequences. The devil is in the details.
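The adaptive setup of Eqs. (2.6) and (2.7) can be sketched in miniature for a 1-d "gauged" Laplacian with random U(1) link phases, where smooth interpolation fails just as in QCD. This is our own hedged illustration: we use plain Richardson relaxation in place of the production smoother of Ref. [4], and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, block, n_vec = 64, 4, 2
nc = n // block

# 1-d "gauged" Laplacian: random U(1) phases on the links (periodic).
u = np.exp(2j * np.pi * rng.random(n))            # U_mu(x) analogue
D = (2.0 + 0.01) * np.eye(n, dtype=complex)
for x in range(n):
    D[x, (x + 1) % n] -= u[x]                     # forward hop
    D[(x + 1) % n, x] -= np.conj(u[x])            # backward hop

# Eq. (2.6): relax D psi ~ 0 from a random start; Richardson iteration
# damps the rough (UV) components and leaves the near null (IR) content.
def near_null_vector(D, iters=200, omega=0.2):
    psi = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    for _ in range(iters):
        psi = psi - omega * (D @ psi)
        psi /= np.linalg.norm(psi)
    return psi

vecs = [near_null_vector(D) for _ in range(n_vec)]

# Eq. (2.7): restrict each global vector to blocks, orthonormalize per block.
P = np.zeros((n, nc * n_vec), dtype=complex)
for xhat in range(nc):
    sl = slice(xhat * block, (xhat + 1) * block)
    chunks = np.stack([v[sl] for v in vecs], axis=1)   # theta_xhat(x) psi^(a)
    Q, _ = np.linalg.qr(chunks)                        # orthonormal block basis
    P[sl, xhat * n_vec:(xhat + 1) * n_vec] = Q

assert np.allclose(P.conj().T @ P, np.eye(nc * n_vec))
# The relaxed vectors are far "more null" than a random vector:
rand = rng.standard_normal(n); rand /= np.linalg.norm(rand)
print(np.linalg.norm(D @ vecs[0]), "<", np.linalg.norm(D @ rand))
```

The key point survives the toy reduction: the near null vectors are found globally first, against the actual gauge background, and only then chopped into blocks.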
3. Performance of MG for Wilson Dirac Operator
There are many technical details that are critical to an efficient adaptive multigrid algorithm for the Wilson Dirac matrix. Experience first guided us to coarsen all color and Dirac degrees of freedom on $4^4$ space-time blocks. However, for the Wilson Dirac operator, which is neither Hermitian nor normal, it proved to be important to preserve the special property of $\gamma_5$-Hermiticity, $D^\dagger = \gamma_5 D \gamma_5$, on the coarse level by splitting each block into two sub-blocks of definite chirality, $\gamma_5 = \pm 1$, so that $\gamma_5 P = P \sigma_3$, where $\sigma_3 = \pm 1$ labels the coarse chirality. Finally, we implemented a 3-level W-cycle MG algorithm with 4 post-smoothing iterations, using a GCR(8) outer Krylov solver on the finest level and a complete CG solve on the normal equations on the coarsest. We have clearly achieved a successful MG algorithm for the Wilson operator which shows little or no sign of critical slowing down as a function of the quark mass or lattice size. Already it is competitive with EigCG deflation [5] on rather modest lattice sizes (see Fig. 2), and it will become increasingly superior as the lattices become larger, since the complexity of exact deflation scales like $O(|V|^2)$ whereas multigrid scales no worse than $O(|V| \log |V|)$, where $|V|$ is the volume of the lattice, or size of the fine vector space.
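The role of $\gamma_5$-Hermiticity on the coarse level can be illustrated in a toy 1-d Wilson model with two-component spinors and U(1) links (our own reduced model, with $\gamma_1 = \sigma_1$ and $\gamma_5 = \sigma_3$ standing in for the 4-d Dirac algebra): splitting each block's near null content by chirality makes the prolongator satisfy $\gamma_5 P = P \sigma_3$, and the Galerkin coarse operator then inherits $\hat D^\dagger = \sigma_3 \hat D \sigma_3$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 16, 0.1
s1 = np.array([[0, 1], [1, 0]], dtype=complex)      # gamma_1 in 1-d
s3 = np.array([[1, 0], [0, -1]], dtype=complex)     # gamma_5 analogue
I2 = np.eye(2, dtype=complex)
u = np.exp(2j * np.pi * rng.random(n))              # U(1) links, periodic

# 1-d Wilson operator on 2-component spinors (standard form, our toy model):
# D = (m + 1) - (1/2) [ (1 - g1) U(x) d_{x+1,y} + (1 + g1) U*(x-1) d_{x-1,y} ]
D = np.zeros((2 * n, 2 * n), dtype=complex)
for x in range(n):
    D[2*x:2*x+2, 2*x:2*x+2] += (m + 1.0) * I2
    y = (x + 1) % n
    D[2*x:2*x+2, 2*y:2*y+2] += -0.5 * (I2 - s1) * u[x]
    y = (x - 1) % n
    D[2*x:2*x+2, 2*y:2*y+2] += -0.5 * (I2 + s1) * np.conj(u[y])

G5 = np.kron(np.eye(n), s3)                          # site-diagonal gamma_5
assert np.allclose(D.conj().T, G5 @ D @ G5)          # gamma_5-Hermiticity

# Chirality-preserving prolongator: block a test vector and split it into
# its gamma_5 = +1 and gamma_5 = -1 parts, one coarse dof for each.
block, nc = 4, n // 4
psi = rng.standard_normal(2 * n) + 1j * rng.standard_normal(2 * n)
P = np.zeros((2 * n, 2 * nc), dtype=complex)
for xh in range(nc):
    sl = slice(2 * block * xh, 2 * block * (xh + 1))
    chunk = psi[sl]
    plus  = chunk.copy(); plus[1::2]  = 0            # upper spin: chirality +
    minus = chunk.copy(); minus[0::2] = 0            # lower spin: chirality -
    P[sl, 2*xh]     = plus  / np.linalg.norm(plus)
    P[sl, 2*xh + 1] = minus / np.linalg.norm(minus)

Sigma3 = np.kron(np.eye(nc), s3)                     # coarse gamma_5
assert np.allclose(G5 @ P, P @ Sigma3)               # gamma_5 P = P sigma_3
D_hat = P.conj().T @ D @ P
assert np.allclose(D_hat.conj().T, Sigma3 @ D_hat @ Sigma3)
```

The last assertion is the point: because the prolongator commutes with the chirality structure, the coarse operator satisfies the same $\gamma_5$-Hermiticity relation as the fine one.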
Figure 2: Comparison of CG, deflated CG [5] and MG-GCR: total number of Wilson matrix-vector operations until convergence (point sources; $V = 16^3 \times 64$ (left plot) and $V = 24^3 \times 64$ (right plot); $N_v = 20$ (MG-GCR), $N_v = 240$ (deflated CG); solver tolerance given relative to $|b|$).

With $N_\nu = 20$ trial near null vectors this is a very successful multigrid method, as illustrated in Fig. 3. The horizontal line is the set-up cost of constructing the multigrid operator. Table 1 shows that the iteration count is nearly independent of lattice size and quark mass, down to the physical pion mass.
Figure 3: Comparison of the adaptive MG algorithm with the conventional red/black preconditioned CG algorithm on a $32^3 \times 96$ lattice.
Table 1: Fine grid iteration count as a function of lattice size ($16^3 \times 64$, $24^3 \times 64$, $32^3 \times 96$) and quark mass.
4. Future directions
Let us turn to future directions we are pursuing, with the caveat that until we have constructed and benchmarked these extensions, the improvements are speculations based on our current experience. First, we have begun to design algorithms for both Staggered and Domain Wall fermion discretizations. For Staggered fermions the technical barrier appears modest, since the operator is normal and anti-Hermitian in the chiral limit. However, the "species doubling" quadruples the size of the near null space, and the Asqtad or HISQ improvements increase the potential complexity of the coarsening. Still, with the much larger lattices in production, we expect to find a very useful implementation. For Domain Wall fermions the technical issues are much more subtle, but a strategy is emerging. The operator is not only non-Hermitian, but its eigenvalues do not have positive real parts, as was the case for Wilson and Staggered fermions. Indeed it is essentially a 5-d Wilson operator with the wrong-sign mass. However, the potential advantage of Domain Wall multigrid is greater. The 5-dimensional Domain Wall matrix operates in a larger vector space and is less well conditioned because of the heavy flavor modes in the 5th dimension, but its near null space is still four dimensional. Thus the truncation to the coarse lattice is more dramatic, and in principle there is more to be gained in a multigrid algorithm. Similar remarks hold for the overlap formulation of the Dirac operator, but there the outer iteration has the advantage of being a normal matrix with a positive real mass gap.

A complete suite of multigrid algorithms for Staggered, Wilson and chiral fermion actions holds out the promise of a major reduction in the cost of Dirac inverters for the analysis stage of lattice QCD ensembles. As the physics correlators computed by the USQCD collaborations have expanded, the relative number of flops devoted to analysis now exceeds 50%.
In addition, the multigrid kernel can be used in a variance reduction strategy for stochastic estimators of disconnected quark diagrams [6]. We are also beginning to develop inverters compliant with the SciDAC API for general distribution. First we are extending the API to accommodate the multiple lattices and to implement the interpolation and prolongation operators. Also we are optimizing the $N_\nu \times N_\nu$ complex matrix operations needed for the coarse operators. These algorithms will be freely distributed on the SciDAC software webpages.

In principle, MG inverters can be implemented in HMC codes for generating lattice ensembles as well. A critical step in this application is to amortize the set-up cost of constructing the coarse operators as the gauge fields evolve in molecular dynamics time. In this regard Lüscher has demonstrated [7] that the subspace update for his "little Dirac" operator (which is essentially equivalent to our first coarse operator $\hat D$) can be used for several HMC time steps, combined with a chronological procedure for incremental change in the near null space. This strongly suggests that the construction of the multigrid inverter is not a serious overhead. Efficient parallel code is of course another requirement.

Finally, it is worth closing with a comment on GPU computing. At Boston University we have implemented highly efficient Wilson Dirac CG and BiCGstab inverters for the Nvidia GPU, written in the CUDA extension of C [8]. This gives roughly a 5x advantage in cost performance for Dirac inversions (see Fig. 4). Preliminary analysis indicates that in many ways this architecture is well suited for the multigrid inverter discussed above. The full implementation of this in efficient code has begun and, if successful, promises a multiplicative advantage in cost performance of the Dirac inverter as the product of hardware and algorithm advances.
Figure 4: Performance of the single precision even-odd preconditioned Wilson-Dirac matrix-vector product on a GTX 280 [8]. GF denotes temporal gauge fixing (lattice volume $= 24^3 \times$ temporal extent).
Prec.   Floats   Kernel (Gflops)   CG (Gflops)   BiCGstab (Gflops)
Half    12       202.2             170.6         152.5
SP      8        134.1             110.1         105.1
SP      12       122.1             102.4         98.6
DP      12       35.4              33.5          29.3

Table 2: Performance comparison of the matrix-vector kernels with the associated CG and BiCGstab solvers on the GeForce GTX 280 (lattice volume $= 24^3 \times 48$) [8]. Gauge field stored as 12 or 8 floats.
Even without extensions to the generation of lattices, the combined effect of GPU and MG has the potential of dropping the cost of analysis of these lattices by several orders of magnitude relative to current practices.
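For orientation, the coarse operator whose $N_\nu \times N_\nu$ complex matrix operations are mentioned above is essentially a nearest-neighbor stencil whose "links" are dense small matrices, which is what makes it a natural target for batched small-matrix kernels. The sketch below is a hypothetical 1-d illustration of applying such an operator; the array layout and the names `A`, `Uf` and `apply_coarse` are our own assumptions, not the SciDAC API.

```python
import numpy as np

rng = np.random.default_rng(5)
nc, nv = 32, 20                 # coarse sites and near null vectors per site

# Hypothetical layout: a self-coupling term and a forward "coarse link",
# each a dense nv x nv complex matrix per coarse site.
A  = rng.standard_normal((nc, nv, nv)) + 1j * rng.standard_normal((nc, nv, nv))
Uf = rng.standard_normal((nc, nv, nv)) + 1j * rng.standard_normal((nc, nv, nv))

def apply_coarse(A, Uf, v):
    """D_hat v: self term plus nearest-neighbor hops on a periodic 1-d
    coarse grid; the backward hop uses the dagger of the forward link."""
    out = np.einsum('xij,xj->xi', A, v)                          # self term
    out += np.einsum('xij,xj->xi', Uf, np.roll(v, -1, axis=0))   # forward hop
    Ub = np.conj(np.transpose(np.roll(Uf, 1, axis=0), (0, 2, 1)))
    out += np.einsum('xij,xj->xi', Ub, np.roll(v, 1, axis=0))    # backward hop
    return out

v = rng.standard_normal((nc, nv)) + 1j * rng.standard_normal((nc, nv))
w = apply_coarse(A, Uf, v)
print(w.shape)   # (32, 20)
```

The operation is a batch of $N_\nu \times N_\nu$ matrix-vector products per site and direction, a pattern that maps well onto both SIMD CPUs and GPUs.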
Acknowledgments
This work was supported in part by US DOE grants DE-FG02-91ER40676 and DE-FC02-06ER41440 and NSF grants DGE-0221680, PHY-0427646, OCI-0749300 and OCI-0749202.
References

[1] For a review of earlier work see: T. Kalkreuter, "Multigrid methods for propagators in lattice gauge theories," J. Comput. Appl. Math., 57 (1995) [arXiv:hep-lat/9409008].

[2] R. C. Brower, R. G. Edwards, C. Rebbi and E. Vicari, "Projective multigrid for Wilson fermions," Nucl. Phys. B, 689 (1991).

[3] M. Brezina, R. Falgout, S. MacLachlan, T. Manteuffel, S. McCormick and J. Ruge, "Adaptive smoothed aggregation (αSA)," SIAM J. Sci. Comput., 25:1896-1920, 2004.

[4] J. Brannick, R. C. Brower, M. A. Clark, J. C. Osborn and C. Rebbi, "Adaptive Multigrid Algorithm for Lattice QCD," Phys. Rev. Lett. (2008) 041601 [arXiv:0707.4018 [hep-lat]].

[5] A. Stathopoulos and K. Orginos, "Computing and deflating eigenvalues while solving multiple right hand side linear systems in Quantum Chromodynamics," arXiv:0707.0131 [hep-lat].

[6] R. Babich, R. Brower, M. Clark, G. Fleming, J. Osborn and C. Rebbi, "Strange quark contribution to nucleon form factors," arXiv:0710.5536 [hep-lat].

[7] M. Luscher, "Deflation acceleration of lattice QCD simulations," JHEP, 011 (2007) [arXiv:0710.5417 [hep-lat]].

[8] M. A. Clark, R. Babich, K. Barros, R. C. Brower and C. Rebbi, "Solving Lattice QCD systems of equations using mixed precision solvers on GPUs," arXiv:0911.3191 [hep-lat].