tmLQCD: a program suite to simulate Wilson Twisted mass Lattice QCD
Karl Jansen (a) and Carsten Urbach (b)

(a) DESY, Platanenallee 6, Zeuthen, Germany
(b) Humboldt-Universität zu Berlin, Institut für Physik, Newtonstr. 15, 12489 Berlin, Germany
Abstract
We discuss a program suite for simulating Quantum Chromodynamics on a 4-dimensional space-time lattice. The basic Hybrid Monte Carlo algorithm is introduced and a number of algorithmic improvements are explained. We then discuss the implementation of these concepts, as well as our parallelisation strategy, in the actual simulation code. Finally, we provide a user guide to compile and run the program.

PACS: 11.15.-q, 11.15.Ha, 11.38.-t, 11.38.Gc
Preprint-Numbers: DESY 09-073, HU-EP-09/23, SFB/CPP-09-43
Key words:
Hybrid Monte Carlo algorithm; Lattice QCD;
PROGRAM SUMMARY
Manuscript Title: tmLQCD: a program suite to simulate Wilson Twisted mass Lattice QCD
Authors: K. Jansen and C. Urbach
Program Title: tmLQCD
Journal Reference:
Catalogue identifier:
Licensing provisions:
GNU General Public License (GPL)
Programming language: C
Computer: any
Operating system: any with a standard C compiler
RAM: no typical values available
Number of processors used: one, or optionally any even number of processors. Tested with up to 32768 processors.

Preprint submitted to Elsevier 29 August 2018
Keywords:
Hybrid Monte Carlo algorithm; Lattice QCD;
PACS:
Classification:
External routines/libraries:
LAPACK [1] and LIME [2] libraries.
Nature of problem:
Quantum Chromodynamics.
Solution method:
Markov Chain Monte Carlo using the Hybrid Monte Carlo algorithm with mass preconditioning and multiple time scales [3]. Iterative solvers for large systems of linear equations.
Restrictions:
Restricted to an even number of (not necessarily mass degenerate) quark flavours in the Wilson or Wilson twisted mass formulation of lattice QCD.
Additional comments: none.
Running time:
Depending on the problem size, the architecture and the input parameters, from a few minutes to weeks.
References:
[1] .
[2] USQCD, http://usqcd.jlab.org/usqcd-docs/c-lime/
[3] C. Urbach, K. Jansen, A. Shindler and U. Wenger, Comput. Phys. Commun., 87 (2006).

LONG WRITE-UP

1 Introduction
This contribution to the anniversary issue of CPC deals with the strong force in Particle Physics. The strong force is presumably the least well understood fundamental interaction between elementary particles. It is responsible for the existence of protons and neutrons, or more generally all nuclei, as bound states. The constituents of the nuclei are the quarks and gluons as the fundamental particles. It is interesting to observe that the energy (mass) of a proton has a size of about 1 GeV, while the mass of the two constituent up and down quarks is of the order of only a few MeV. Hence, by far the biggest contribution to the proton mass is pure binding energy.

This shows already that a description of the proton in terms of the underlying quark and gluon degrees of freedom must be highly non-trivial. The model that is believed to provide a theoretical framework for the strong interaction, and should give such a description, is Quantum Chromodynamics (QCD). Although this theory can be written in a very compact mathematical form, it is a highly non-linear theory that does not allow for a closed analytical solution.

However, and rather fortunately, QCD can be reformulated in such a way that computational physics methods can be applied to calculate properties of QCD from first principles and without relying on approximations. In this approach, space and time are rendered discrete and a lattice spacing a is introduced. Thus, a 4-dimensional space-time lattice is considered, and the quark and gluon degrees of freedom are placed on the lattice points or on so-called links that connect lattice points. In this way one obtains a model of "quark" spins, which are coupled to nearest neighbours only, very much reminiscent of an Ising model in statistical physics.
Indeed, the methods of statistical physics, namely the evaluation of the partition function by means of numerical simulations using Monte Carlo methods employing importance sampling, are the key to address QCD on the 4-dimensional lattice, which we will refer to as lattice QCD (LQCD).

Although the concepts to treat LQCD numerically are very clear, the problem has an intrinsically extremely high computational demand. The crucial factor is that in the end the introduced discretisation has to be removed again. If we consider a lattice with, say, L = 3 fm linear extent and a lattice spacing of a = 0.1 fm, we arrive at N = (3/0.1)⁴ ≈ 10⁶ lattice points for such a more or less reasonable physical situation. Simulations on such a lattice already require several Teraflops in LQCD.

Fig. 1. Computer resources needed to generate 1000 independent configurations of size 24³ × 40 at a lattice spacing of about 0.08 fm with pure Wilson fermions, in units of Tflops · years, as a function of m_PS/m_V. In (a) we show the prediction of Ref. [1] from the 2001 lattice conference in Berlin (hence titled "Berlin Wall"). In (b) we compare the formula of Ref. [1] (solid line) with the performance of the algorithm described in this paper and first published in Ref. [2]. The dashed line is to guide the eye. Note that the scale of the vertical axis differs between the two panels.

Keeping L = 3 fm fixed and decreasing the lattice spacing by, say, a factor of two to remove the discretisation increases the cost of the simulation already by a factor of 2⁴ from the growth of the number of lattice points alone; the algorithms used contribute yet further factors. Hence, going to really fine discretisations, where the effects of the non-zero lattice spacing can safely be neglected or at least a controlled extrapolation to zero values of the lattice spacing can be performed, is an extremely demanding computational challenge which will finally require at least Petaflops computing, an area of computing power we are realising today.

However, even with the advent of Petaflops computers, the goal of "solving" QCD on a lattice would be completely out of reach without some algorithmic improvements that were invented in recent years. This is shown in fig. 1. In the left panel of the graph, the number of Teraflops · years for a certain typical simulation is shown as a function of the ratio of two meson masses, the pseudo scalar and the vector meson. The graph derives from the known cost of the algorithm used in the year 2001 [1].

It is important to realise that the meson mass ratio used in fig. 1 assumes a value of about 0.2 in nature as measured in experiments, which is indicated by the arrow in both panels of fig. 1. The figure clearly demonstrates the strongly growing cost of the simulations when the real physical situation is to be reached. In fact, simulations directly at the physical value of this mass ratio were impossible in 2001.
The right panel of the graph demonstrates the change of the situation when algorithmic improvements such as the ones described in this article are used. In fact, the simulation costs shown in the right panel of fig. 1 originate from direct performance measurements of the code that is described here. As the figure demonstrates, although the simulations at the physical value of the meson mass ratio are still rather demanding, they become clearly realistic with today's Petaflops systems. There are other approaches to improve the HMC algorithm with similar results [3,4,5,6,7,8]. Very promising is the recent additional improvement using inexact deflation presented in Ref. [9].

The algorithmic improvements therefore provided a tremendous gain, opening a way for simulations in LQCD that were unthinkable a few years ago. It is precisely the goal of this contribution to describe one programme version of the underlying Hybrid Monte Carlo (HMC) algorithm in which a number of such improvements have been incorporated, and to make the corresponding code publicly available. The current version of the code and future updates can be downloaded from the web [10].

Quantum Chromodynamics on a hyper-cubic Euclidean space-time lattice of size L³ × T with lattice spacing a is formally described by the action

S = S_G[U] + a⁴ Σ_x ψ̄ D[U] ψ ,   (1)

with S_G some suitable discretisation of the Yang-Mills action and D a discretisation of the Dirac operator, for which Wilson originally proposed [12] to use the so-called Wilson Dirac operator

D_W[U] = (1/2) [ γ_µ (∇_µ + ∇*_µ) − a ∇*_µ ∇_µ ] ,   (2)

with ∇_µ and ∇*_µ the forward and backward gauge covariant difference operators, respectively:

∇_µ ψ(x) = (1/a) [ U(x, x + aµ̂) ψ(x + aµ̂) − ψ(x) ] ,
∇*_µ ψ(x) = (1/a) [ ψ(x) − U†(x, x − aµ̂) ψ(x − aµ̂) ] ,   (3)

where we denote the SU(3) link variables by U_{x,µ}.
We shall set a ≡ 1 in the following. The Wilson twisted mass Dirac operator reads

D_tm = (D_W[U] + m) 1_f + i µ_q γ_5 τ³   (4)

for a mass degenerate doublet of quarks. We denote by m the bare (Wilson) quark mass, µ_q is the bare twisted mass parameter, τ^i is the i-th Pauli matrix and 1_f the unit matrix acting in flavour space (see appendix A for our conventions). In the framework of Wilson twisted mass QCD only flavour doublets of quarks can be simulated; however, the two quarks do not need to be degenerate in mass. The corresponding mass non-degenerate flavour doublet reads [14]

D_h(µ̄, ε̄) = D_W · 1_f + i µ̄ γ_5 τ¹ + ε̄ τ³ .   (5)

Note that this notation is not unique. Equivalently – as used in Ref. [15] – one may write

D'_h(µ_σ, µ_δ) = D_W · 1_f + i µ_σ γ_5 τ³ + µ_δ τ¹ ,   (6)

which is related to D_h by D'_h = (1 + iτ²) D_h (1 − iτ²)/2 and (µ_σ, µ_δ) → (µ̄, −ε̄).

For the purpose of introducing the Hybrid Monte Carlo (HMC) algorithm we shall consider only the Wilson twisted mass formulation of lattice QCD with one doublet of mass degenerate quarks with bare quark mass m and bare twisted mass µ_q. The extension to more than one flavour doublet of quarks is straightforward. The corresponding polynomial HMC algorithm used for simulating the mass non-degenerate flavour doublet is discussed in the following sub-section.

After integrating out the Grassmann valued fermion fields, in lattice QCD one needs to evaluate the integral

Z = ∫ DU det(Q†Q) e^{−S_G} ,   (7)

by Markov Chain Monte Carlo methods, with some discretisation of the Yang-Mills gauge action S_G and

Q ≡ γ_5 (D_W[U] + m) + i µ_q ,   (8)

with the Wilson-Dirac operator D_W of eq. (2). Note that Q acts now on one flavour only. The determinant can be re-expressed using complex valued, so-called pseudo fermion fields φ and φ†:

det(Q†Q) ∝ ∫ Dφ Dφ† e^{−(Q⁻¹φ, Q⁻¹φ)} ,   (9)

where S_PF ≡ |Q⁻¹φ|² is called the pseudo fermion action. The HMC algorithm [16] is then defined by introducing traceless hermitian momenta P_{x,µ} (conjugate to the fundamental fields U_{x,µ}) and a Hamiltonian

H(U, P) = Σ_{x,µ} (1/2) Tr[P²_{x,µ}] + S_G[U] + S_PF[U] .   (10)

Given H, the algorithm is composed of a molecular dynamics update of the fields (U, P) → (U', P') and a Metropolis accept/reject step with respect to H, using the acceptance probability

P_acc = min(1, exp(−[H(U', P') − H(U, P)])) .   (11)

While the momenta P are generated at the beginning of a trajectory – in the so-called heat-bath step – randomly from a Gaussian distribution, the pseudo fermion fields φ are generated by first generating random fields r and then setting φ = Qr, such that exp{−(Q⁻¹φ, Q⁻¹φ)} = exp{−r†r}. Note that the pseudo fermion fields are not evolved during the molecular dynamics part of the HMC algorithm.

In the molecular dynamics (MD) part of the HMC algorithm the momenta and gauge fields are updated according to the Hamiltonian equations of motion

d/dτ P_{x,µ} = −F(x, µ) ,
d/dτ U_{x,µ} = i P_{x,µ} U_{x,µ} ,   (12)

with respect to a fictitious computer time τ and forces F, which are obtained by differentiating the action with respect to the gauge fields U and take values in the Lie algebra of SU(3). The differentiation D_U of some function f(U) is defined as

D^a_U f(U) = ∂/∂α f(e^{iα t^a} U) |_{α=0} ,

where t^a are the generators of su(3).

Since these equations can in general not be integrated analytically, one uses numerical integration methods, which must be area preserving and reversible. Symmetrised symplectic integrators fulfil these requirements, with the simplest example being the leap-frog algorithm. The basic discrete update steps with integration step size ∆τ of the gauge field and the momenta can be defined as

T_U(∆τ) : U → U' = exp(i ∆τ P) U ,
T_S(∆τ) : P → P' = P − i ∆τ F .   (13)

The leap-frog algorithm is then obtained by sequential application of T = T_S(∆τ/2) T_U(∆τ) T_S(∆τ/2), i.e. for a trajectory of length τ one needs to apply T^{N_MD} with N_MD = τ/∆τ.
Preconditioning is usually performed by factorising

det(Q†Q) = det(R_1†R_1) · det(R_2†R_2) · · · det(R_n†R_n)   (14)

with suitably chosen R_1, R_2, ..., R_n. For every R_i a separate pseudo fermion field φ_i is introduced, such that the Hamiltonian reads

H(U, P) = Σ_{x,µ} (1/2) Tr[P²_{x,µ}] + S_G[U] + Σ_{i=1}^n S_PF_i ,   (15)

and the equations of motion are changed to

d/dτ P_{x,µ} = − Σ_{i=0}^n F_i(x, µ) ,
d/dτ U_{x,µ} = i P_{x,µ} U_{x,µ} ,

where we identify F_0 with the force stemming from the gauge action S_G.

The factorisation in eq. (14) can be achieved in many different ways, see for instance Refs. [3,4,5,6,7,8]. Here we shall only discuss what is known as mass preconditioning or the Hasenbusch trick [17,18,19]. It is obtained by writing the identity

det(Q†Q) = det(W†W) · det(Q†Q)/det(W†W) ,   (16)

where W = D_W + m + i µ_2 γ_5 with µ_2 > µ_q. By adjusting the additional mass parameter µ_2, the condition numbers of W†W and (Q†Q)/(W†W) can both be reduced with respect to that of Q†Q alone. As argued in Ref. [20], the optimal choice leads to a condition number of √k for both W†W and (Q†Q)/(W†W), where k is the condition number of Q†Q. A reduced condition number leads to reduced force contributions in the MD evolution and hence allows for larger values of ∆τ.

It is important to notice that evaluating the force contribution stemming from (Q†Q)/(W†W) is more expensive in terms of computer time than the evaluation of the contribution from W†W, since it involves the iterative solution of φ' = (Q†Q)⁻¹φ with the large condition number k. Thus, the algorithm might be further improved by not tuning the condition numbers equal as explained beforehand, but by introducing a multiple time scale integration scheme as follows.

Considering a Hamiltonian like in eq. (15) we may introduce n + 1 timescales ∆τ_i with

∆τ_i = τ/N_MD_i ,   N_MD_i = N_n · N_{n−1} · · · N_i ,

and basic discrete update steps

T_U(∆τ) : U → U' = exp(i ∆τ P) U ,
T_S_i(∆τ) : P → P' = P − i ∆τ F_i ,   (17)

with 0 ≤ i ≤ n. We have identified S_0 ≡ S_G. The leap frog update on timescale i is then recursively defined as

T_i = T_S_0(∆τ_0/2) T_U(∆τ_0) T_S_0(∆τ_0/2)                   for i = 0 ,
T_i = T_S_i(∆τ_i/2) [T_{i−1}]^{N_{i−1}} T_S_i(∆τ_i/2)         for 0 < i ≤ n ,   (18)

and the full trajectory of length τ is eventually achieved by [T_n]^{N_n}.

As was shown in Ref. [2] – and for other factorisations of the determinant in Refs. [3,7,21] – the combination of multiple time scale integration and a determinant factorisation allows one to set the algorithm up such that the most expensive operator contributes least to the MD forces. It can then be integrated on the outermost timescale and must be inverted less often.

In the last paragraphs we have introduced the simplest reversible and area preserving integration scheme, known as the leap frog integration scheme. There are more involved integration schemes available, partly or completely cancelling higher order discretisation errors. It turns out that completely cancelling higher order effects is not necessary and often not even efficient. Integration schemes with reduced errors are for example the so-called n-th order minimal norm integration schemes; for details see Ref. [22] and references therein. The second order minimal norm (2MN) integration scheme is based on the update step

T = T_S(λ∆τ) T_U(∆τ/2) T_S((1 − 2λ)∆τ) T_U(∆τ/2) T_S(λ∆τ) ,
T_i = T_S_i(λ_i∆τ_i) [T_{i−1}]^{N_{i−1}} T_S_i((1 − 2λ_i)∆τ_i) [T_{i−1}]^{N_{i−1}} T_S_i(λ_i∆τ_i) ,   (19)

where λ_i is a dimensionless parameter; the 2MN scheme coincides with the Sexton-Weingarten scheme [23] in case λ_i = 1/6. The optimal value for λ_i was given in Ref. [22] to be around 0.19, but its value is likely to depend on the mass values and the time scale under consideration. Note that there is a parameter λ_i for each timescale ∆τ_i, which can be tuned separately.

While all the integration schemes introduced so far were based on the order T_S T_U T_S, it is also possible to revert this order. In this case one speaks of the position version of the corresponding integration scheme, while the usual one is called the velocity version. Under certain circumstances the position versions can be more efficient, because one less application of T_S is needed. The corresponding update steps can easily be derived from the formulae provided above.

In the framework of Wilson twisted mass fermions it is only possible to simulate flavour doublets of quarks. Hence, if one wants to include the strange quark in the simulation, one also needs to include the charm. The corresponding mass non-degenerate doublet was defined in equation (5). Simulating such a flavour doublet operator is possible using the polynomial HMC (PHMC) algorithm [24,25,26]. The basic problem that occurs in the mass non-degenerate case is that a single flavour has to be taken into account, or equivalently the determinant of a single operator Q needs to be treated. The PHMC algorithm can solve this problem elegantly.

The idea of the PHMC is based on writing

det(Q) = det(√(Q²)) ≈ det(P⁻¹_{n,ε}(Q²)) ∝ ∫ Dφ Dφ† e^{−φ† P_{n,ε}(Q²) φ} ,

assuming Q is positive. P_{n,ε}(s) is a polynomial approximation of 1/√s of degree n in the interval [ε, 1]:

P_{n,ε}(s) = (1/√s) {1 + R_{n,ε}} ,   s = Q² ,   (20)

where R_{n,ε} is the error term. It can be shown that for the case of Chebysheff polynomials |R| vanishes exponentially fast with the degree n (for large n). For more details regarding this issue we refer the reader for instance to Refs. [27,28] and references therein.

It is worth noticing that representing an inverse operator by a polynomial has conceptual advantages.
It allows one to treat certain regions of the eigenvalue spectrum of the operator in different ways, and therefore to separate the infrared from the bulk and ultraviolet parts of the spectrum. Although this has been the main underlying idea of the PHMC algorithm [24,25,26], we will use it here only as a technical tool to treat single flavours in the simulations.

For our purpose – introducing Q_h = γ_5 D_h – we can rewrite the corresponding determinant as

det(Q_h) ∝ ∫ DΦ† DΦ e^{−Φ† P_{n,ε}(s) Φ} ,

with s = Q_h†Q_h, where the pseudo fermion fields Φ are now two flavour fields. Note that D_h† = τ¹ γ_5 D_h γ_5 τ¹. The application of the polynomial P to a pseudo fermion field Φ can be performed either by using the Clenshaw recursion relation [29], or by using the product representation

P_{n,ε}(s) Φ = [ Π_{i=1}^n c (s − z_i) ] Φ ≡ B(s) · B(s)† Φ ,

with z_i the complex roots of P and c a suitably chosen normalisation constant. The product representation is conveniently used in the MD update. For the choice of polynomials, the determination of their roots, and how to order them to avoid round-off errors, see appendix C.

The HMC algorithm requires an area preserving and reversible MD update procedure; however, there is no need to use the same operator in the MD update as in the heat-bath step. As long as the acceptance rate is sufficiently high, we are free to use any other operator in the update. In order to exploit this possibility we introduce a second, more precise polynomial

P̃_{m,δ}(s) = (1/(P_{n,ε}(s) √s)) {1 + R_{m,δ}} ,   (21)

which is used in the heat-bath step to generate the pseudo fermion fields from a random field R:

Φ = P̃ B† Q_h R .

P is then used only in the MD update. The polynomial degrees n, m and the approximation intervals have to be determined such as to guarantee a good approximation of 1/√s in the range of eigenvalues of Q_h†Q_h. One may also adopt a strategy to choose ε or δ larger than a few lowest eigenvalues of Q_h†Q_h and use re-weighting to correct for this [24,25].

Even/Odd preconditioning
The (P)HMC algorithm is implemented using even/odd preconditioning [30,31], which is discussed briefly in appendix B. We want to stress that although even/odd preconditioning is a rather technical step, it leads to a very important improvement of the algorithm performance and is a cornerstone of all HMC implementations in the field.
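Schematically, even/odd preconditioning rests on the Schur complement identity for an operator written in blocks connecting even (e) and odd (o) sites; the precise operators used in the code are given in appendix B, so the following is only a generic sketch:

```latex
Q \;=\;
\begin{pmatrix}
Q_{ee} & Q_{eo}\\
Q_{oe} & Q_{oo}
\end{pmatrix},
\qquad
\det Q \;=\; \det\!\left(Q_{ee}\right)\,
\det\!\left(Q_{oo} - Q_{oe}\,Q_{ee}^{-1}\,Q_{eo}\right).
```

Since the hopping term of the Wilson Dirac operator connects only even with odd sites, Q_ee is site-diagonal and cheap to invert, and the simulation can work with the better conditioned Schur complement acting on the odd sites only.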
The theory is discretised on a finite, hyper-cubic space-time lattice with extension L³ × T ≡ Π_µ L_µ. The boundary conditions for the gauge fields U_{x,µ} are chosen to be periodic, i.e.

U_{x + L_ν ν̂, µ} = U_{x,µ} ,

where ν̂ is a unit vector in direction ν. For the fermionic fields ψ(x) we allow for more general boundary conditions, namely so-called twisted boundary conditions

ψ(x + L_ν ν̂) = e^{iθ_ν π} ψ(x) .

Periodic boundary conditions correspond to θ_ν = 0, while anti-periodic boundary conditions are achieved by setting θ_ν = 1. More generally, twisted boundary conditions allow one to realise arbitrary values of momentum transfer on the lattice by a convenient re-interpretation of the phases [32]; for instance, the allowed lattice momenta in direction ν are shifted to p_ν = (2π n_ν + π θ_ν)/L_ν.

The general strategy of the tmLQCD package is to provide programs for the main applications used in lattice QCD with Wilson twisted mass fermions. The code and the algorithms are designed to be general enough to compile and run efficiently on any modern computer architecture. This is achieved code-wise by using standard C as the programming language and, for the parallelisation, the Message Passing Interface (MPI) standard version 1.1.

Fig. 2. Flowchart for the hmc_tm executable.

Performance improvements are achieved by providing dedicated code for certain widely used architectures, like PCs or the Blue Gene family. Dedicated code is mainly available for the kernel routine – the application of the Dirac operator, which will be discussed in detail in section 4.1 – and for the communication routines.

The tmLQCD package provides three main applications. The first is an implementation of the (P)HMC algorithm; the second and the third are executables to invert the Wilson twisted mass Dirac operator (4) and the non-degenerate Wilson twisted mass Dirac operator (5), respectively. All three have a wide range of run-time options, which can be set using an input file. The syntax of the input file is explained in the documentation which ships with the source code.
The relevant input parameters will be mentioned in the following where appropriate, to ease usage. We shall first discuss the general layout of the three aforementioned applications, followed by a general discussion of the parallelisation strategy used in all three of them.

3.1 hmc_tm

In figure 2 the programme flow of the hmc_tm executable is depicted. In the first block the input file is parsed and parameters are set accordingly. Then the required memory is allocated and, depending on the input parameters, data is read from disk in order to continue a previous run.

The main part of this application is the molecular dynamics update. For a number of trajectories, which must be specified in the input file, first a heat-bath step is performed, then the integration of the equations of motion using the integrator specified in the input file, and finally the acceptance step.

After each trajectory certain online measurements are performed, such as measuring the plaquette value. Other online measurements are optional, like measuring the pseudo scalar correlation function.
The programme offers command line options as follows:

• -h|? prints a help message and exits.
• -f input file name. The default is hmc.input
• -o the prefix of the output filenames. The default is output. The code will generate or append to two files, output.data and output.para.

The parameters of each run are read from an input file with the default name hmc.input. If it is missing, all parameters will be set to their default values. Any parameter not set in the input file will also be set to its default value. During the run the hmc_tm program will generate two output files, one called per default output.data, the other one output.para. Into the latter, important parameters will be written at the beginning of the run.

The file output.data has several columns with the following meanings:
(1) Plaquette value.
(2) ∆H
(3) exp(−∆H)
(4) a pair of integers for each pseudo fermion monomial. The first integer of each pair is the sum of solver iterations needed in the acceptance and heat-bath steps, the second is the sum of iterations needed for the force computation of the whole trajectory.
(5) Acceptance (0 or 1).
(6) Time in seconds needed for this trajectory.
(7) Value of the rectangle part in the gauge action, if used.

Every new run will append its numbers to an already existing file.

In addition, the program will create a file history_hmc_tm. This file provides a mapping between the configuration number and its plaquette and Polyakov loop values. Moreover, the simulation parameters are stored there, and in case of a reread the time point can be found there.

After every trajectory the program will save the current configuration in the file conf.save.

3.2 invert and invert_doublet

Fig. 3. Flowchart for the main part of the invert and invert_doublet executables.

The two applications invert and invert_doublet are very similar.
The main difference is that in invert the one flavour Wilson twisted mass Dirac operator is inverted, whereas in invert_doublet it is the non-degenerate doublet.

The main part of the two executables is depicted in figure 3. Each measurement corresponds to one gauge configuration that is read from disk into memory. For each of these gauge configurations a number of inversions will be performed. The sources can either be generated or read in from disk. In the former case the programme can currently generate point sources at random locations in space-time. In the latter case the name of the source file can be specified in the input file.

The relevant Dirac operator is then inverted on each source and the result is stored on disk. The inversion can be performed with a number of inversion algorithms, such as conjugate gradient (CG), BiCGstab, and others [33]. Optionally, even/odd preconditioning as described previously can be used.
The two programmes offer command line options as follows:

• -h|? prints a help message and exits.
• -f input file name. The default is hmc.input
• -o the prefix of the output filenames. The default is output. The code will generate or append to one file called output.para.

The program will create a file called output.data with information about the parameters of the run. Of course, the propagators are also stored on disk. The corresponding file names can be influenced via input parameters. The file format is discussed in some detail in sub-section 4.7.

One particularity of the invert_doublet program is that the propagators written to disk correspond to the two flavour Dirac operator of eq. (6), i.e.

D'_h(µ_σ, µ_δ) = D_W · 1_f + i µ_σ γ_5 τ³ + µ_δ τ¹ ,

essentially for compatibility reasons. Of the two flavour components written, the first is the would-be strange component and the second one the would-be charm component.

The whole lattice can be parallelised in up to 4 space-time directions. This is controlled with configure switches, see section 5.2. The Message Passing Interface (MPI, standard version 1.1) is used to implement the parallelisation, so for compiling the parallel executables a working MPI implementation is needed. Depending on the number of parallelised space-time directions, the t-direction, the t- and x-directions, the t-, x- and y-directions, or the t-, x-, y- and z-directions are parallelised.

The number of processors per spatial direction must be specified at run time, i.e. in the input file. The relevant parameters are NrXProcs, NrYProcs and NrZProcs. The number of processors in the time direction is determined automatically by the program. Note that the extension in any direction must be divisible by the number of processors in this direction. In case of even/odd preconditioning further constraints have to be fulfilled: the local L_z and the local product L_t × L_x × L_y must both be even.

Fig. 4. Boundary exchange in a two dimensional parallel setup. One can see that the internal boundary is sent while the external one is received. The corners – needed for implementing improved gauge actions like the tree-level Symanzik improved gauge action [34] – need a two step procedure.
The communication is organised using boundary buffers, as sketched in figure 4. The MPI setup is contained in the file mpi_init.c. The corresponding function must be called at the beginning of a main program just after the parameters are read in, also in the case of a serial run. In this function the various MPI_Datatypes needed for the exchange of the boundary fields are also constructed. The routines performing the communication for the various data types are located in files starting with xchange.

The communication is implemented using different types of MPI functions. One implementation uses the MPI_Sendrecv function to communicate the data, a second one uses non-blocking MPI functions and a third one persistent MPI calls; see the MPI standard for details [35]. On machines with a network capable of sending in several directions in parallel, the non-blocking version is the most efficient one. The relevant configure switches are --with-nonblockingmpi and --with-persistentmpi, the latter of which is only available for the Dirac operator with halfspinor fields, see section 4.1.
The Dirac operator is the kernel routine of any lattice QCD application, because its inverse is needed for the HMC update procedure and also for computing correlation functions. The inversion is usually performed by means of iterative solvers, like the conjugate gradient algorithm, and hence the repeated application of the Dirac operator to a spinor field is needed. Thus the optimisation of this routine deserves special attention.

At some space-time point x the application of a Wilson type Dirac operator is mainly given by

φ(x) = (m + 4r + i µ_q γ_5) ψ(x)
     − (1/2) Σ_{µ=1}^4 [ U_{x,µ} (r + γ_µ) ψ(x + aµ̂) + U†_{x−aµ̂,µ} (r − γ_µ) ψ(x − aµ̂) ] ,   (22)

where r is the Wilson parameter, which we set to one in the following. The most computer time consuming part is the nearest neighbour interaction part. For this part it is useful to observe that

(1 ± γ_µ) ψ

has only two independent spinor components; the other two follow trivially. So only two of the components need to be computed, then multiplied with the corresponding gauge field U, and then the other two components are reconstructed.

The operation in eq. (22) must be performed for each space-time point x. If the loop over x is performed such that all elements of φ are accessed sequentially (one output stream), it is clear that the elements of ψ and U cannot be accessed sequentially as well. This non-sequential access may lead to serious performance degradations due to too many cache misses, because modern processing units have only a very limited number of input streams available. While the ψ field usually differs from one application of the Dirac operator to the next, the gauge field often stays the same for a large number of applications. This is for instance the case in iterative solvers, where the Dirac operator is applied O(1000) times with fixed gauge fields.
Therefore it is useful to construct a double copy of the original gauge field, sorted such that the elements are accessed exactly in the order needed in the Dirac operator. For the price of additional memory, this simple change can yield large performance improvements, depending on the architecture. The double copy must be updated whenever the gauge field changes. This feature is available in the code at configure time, the relevant switch is --with-gaugecopy.

Above we were assuming that we run sequentially through the resulting spinor field φ. Another possibility is to run sequentially through the source spinor field ψ. Moreover, one could split up the operation (22) following the standard trick of introducing intermediate result vectors ϕ± with only two spinor components per lattice site. Concentrating on the hopping part only, we would have
\[
\varphi^+(x,\mu) = P^{\rightarrow}_{+\mu}\, U_{x,\mu}\,(r+\gamma_\mu)\,\psi(x)\,,\qquad
\varphi^-(x,\mu) = P^{\rightarrow}_{-\mu}\,(r-\gamma_\mu)\,\psi(x)\,. \qquad (23)
\]
From ϕ± we can then reconstruct the resulting spinor field as
\[
\phi(x) = \sum_\mu P^{\leftarrow}_{+\mu}\,\varphi^+(x+a\hat\mu,\mu)
 + \sum_\mu P^{\leftarrow}_{-\mu}\,U^\dagger_{x-a\hat\mu,\mu}\,\varphi^-(x-a\hat\mu,\mu)\,. \qquad (24)
\]
Here we denote with P^→_{±μ} the projection to the two independent spinor components of (1 ± γ_μ)ψ and with P^←_{±μ} the corresponding reconstruction. The half spinor fields ϕ± can be interlaced in memory such that ψ(x) as well as ϕ±(x) are always accessed sequentially in memory. The same is possible for the gauge fields, as explained above; only for φ can strided access not be avoided. So far we have only introduced extra fields ϕ±, which need to be loaded from and stored to main memory, and divided the Dirac operator into the two steps (23) and (24), which are well balanced with regard to memory bandwidth and floating point operations.

The advantage of this implementation of the Dirac operator comes in the parallel case. In step (23) only elements of ψ(x) are needed, which are locally available on each node, so this step can be performed without any communication.
In between steps (23) and (24) one then needs to communicate part of ϕ±, however only half the amount of data compared to a communication of ψ. After the second step no further communication is needed. Hence, one can reduce the amount of data to be sent by a factor of two.

There is yet another performance improvement possible with this form of the Dirac operator, this time for the price of precision. One can store the intermediate fields ϕ± with reduced precision, e.g. in single precision when the regular spinor fields are in double precision. This leads to a result with reduced precision, but in situations where this is not important, as for instance in the MD update procedure, it reduces the data to be communicated by another factor of two; the required memory bandwidth is reduced as well. This version of the hopping matrix (currently it is only implemented for the hopping matrix) is available at configure time with the switch --enable-halfspinor. The reduced precision version (sloppy precision) is available through the input parameter UseSloppyPrecision. It will be used in the MD update where appropriate. Moreover, it is implemented in the CG iterative solver following the ideas outlined in Ref. [36] for the overlap operator.

The various implementations of the Dirac operator can be found in the file
D_psi.c and – as needed for even/odd preconditioning – the hopping matrix in the file Hopping_Matrix.c. There are many different versions of these two routines available, each optimised for a particular architecture, e.g. for the Blue Gene/P double hummer processor or the streaming SIMD extensions of modern PC processors (SSE2 and SSE3), see also Ref. [37]. Martin Lüscher has made available his standard C and SSE/SSE2 Dirac operators [38] under the GNU General Public License, which are partly included in the tmLQCD package.
The IBM PowerPC 450d processor used on the Blue Gene architecture provides a dual FPU, which supports a set of SIMD operations working on 32 special registers useful for lattice QCD. These operations can be accessed using built-in functions of the IBM XLC compiler. The file bgl.h contains all macros relevant for the Blue Gene version of the hopping matrix and the Dirac operator.

A small fraction of the half spinor version (see above) is given in algorithm 1, which represents the operation ϕ+ = κ U P^→(1 + γ_0)ψ. After loading the components of ψ into the special registers and prefetching the gauge field for the next direction, P^→(1 + γ_0)ψ is performed. It is then important to load the gauge field U only once from memory to registers and to multiply both spinor components in parallel. Finally the result is multiplied with κ (which also carries a phase factor due to the way we implement the boundary conditions, see the next sub-section) and stored in memory.

Algorithm 1 ϕ+ = κ U P^→(1 + γ_0)ψ
  // load components of ψ into registers
  _bgl_load_rs0((*s).s0);
  _bgl_load_rs1((*s).s1);
  _bgl_load_rs2((*s).s2);
  _bgl_load_rs3((*s).s3);
  // prefetch gauge field for the next direction
  _prefetch_su3(U+1);
  // do now first P^→(1 + γ_0)ψ
  _bgl_vector_add_rs2_to_rs0_reg0();
  _bgl_vector_add_rs3_to_rs1_reg1();
  // now multiply both components at once with gauge field U and κ
  _bgl_su3_multiply_double((*U));
  _bgl_vector_cmplx_mul_double(ka0);
  // store the result
  _bgl_store_reg0_up((*phi[ix]).s0);
  _bgl_store_reg1_up((*phi[ix]).s1);

As discussed previously, we allow for arbitrary phase factors in the boundary conditions of the fermion fields. This is conveniently implemented in the Dirac operator as a phase factor in the hopping term
\[
\sum_\mu \Bigl[ e^{i\theta_\mu\pi/L_\mu}\, U_{x,\mu}\,(r+\gamma_\mu)\,\psi(x+a\hat\mu)
 + e^{-i\theta_\mu\pi/L_\mu}\, U^\dagger_{x-a\hat\mu,\mu}\,(r-\gamma_\mu)\,\psi(x-a\hat\mu)\Bigr]\,.
\]
The relevant input parameters are
ThetaT, ThetaX, ThetaY, ThetaZ.

We assume in the following that the action to be simulated can be written as
\[
S = S_G + \sum_{i=1}^{N_{\rm monomials}} S_{{\rm PF}_i}\,,
\]
and we call – following the CHROMA notation [39] – each term in this sum a monomial. We require that there is exactly one gauge monomial S_G and an arbitrary number of pseudo fermion monomials S_{PF_i}.

As a data type, every monomial must know how to compute its contribution to the initial Hamiltonian H at the beginning of each trajectory in the heat-bath step. Then it must know how to compute the derivative with respect to the gauge fields for given gauge field and pseudo fermion field, as needed for the MD update. And finally there must be a function to compute its contribution to the final Hamiltonian H′ as used in the acceptance step. In addition, for each monomial it needs to be known on which timescale it should be integrated. The corresponding data type is sketched in figure 5 (data type monomial and its components). The general definitions for this data type can be found in the file monomial.c. There are several sorts of monomials implemented:

• DET: pseudo fermion representation of the (mass degenerate) simple determinant det(Q²(κ) + μ²)
• DETRATIO: pseudo fermion representation of the determinant ratio det(Q²(κ₁) + μ₁²)/det(Q²(κ₂) + μ₂²)
• NDPOLY: polynomial representation of the (possibly non-degenerate) doublet [det(Q_nd(ε̄, μ̄)²)]^{1/2}
• GAUGE:
\[
S_G = \beta \sum_x \Bigl( c_0 \sum_{\substack{\mu,\nu=1\\ 1\le\mu<\nu}}^{4}\bigl\{1 - \tfrac{1}{3}\,{\rm Re\,Tr}(U^{1\times1}_{x,\mu,\nu})\bigr\}
 + c_1 \sum_{\substack{\mu,\nu=1\\ \mu\neq\nu}}^{4}\bigl\{1 - \tfrac{1}{3}\,{\rm Re\,Tr}(U^{1\times2}_{x,\mu,\nu})\bigr\}\Bigr)\,.
\]
The parameter c₁ can be set in the input file and c₀ = 1 − 8c₁. Note that c₁ = 0 corresponds to the Wilson plaquette gauge action.

The corresponding specific functions are defined in the files det_monomial.c, detratio_monomial.c, ndpoly_monomial.c and gauge_monomial.c.
Additional

Algorithm 2 integrate
Require: 0 < n_ts ≤ N_ts, τ > 0
  Δτ = τ/noSteps[n_ts]
  for i = 0 to noSteps[n_ts] do
    if n_ts == 1 then
      updateGauge(Δτ)
    else
      integrate(n_ts − 1, Δτ)
    end if
    updateMomenta(Δτ, monomialList[n_ts])
  end for

monomials can easily be implemented by providing the corresponding functions as discussed above.

The integration scheme is implemented recursively, as exemplified in algorithm 2 for the leap-frog integration scheme (where we skipped half steps for simplicity). The updateMomenta function simply calls the derivative functions of all monomials that are integrated on timescale n_ts and updates the momenta P according to the time step Δτ.

The recursive scheme for the integration can easily be extended to more involved integration schemes. The details can be found in the file integrator.c. We have implemented the leap-frog and the second order minimal norm [22] integration schemes. These two can be mixed on different timescales. In addition we have implemented a position version of the second order minimal norm integration scheme. The latter must not be mixed with the former two.

The MD update is summarised in algorithm 3. It computes the initial and final Hamiltonians and calls in between the integration function with the total number of timescales N_ts and the total trajectory length τ.

As shortly discussed previously, as long as the integration in the MD update is reversible and area preserving there is large freedom in choosing the integration scheme, but also the operator: it is not necessary to use the Dirac operator here, it can be any approximation to it. This is only useful if the acceptance rate is not strongly affected by such an approximation.

The code provides two possibilities to adapt the precision of the Dirac operator used in the MD update: the first is to reduce the precision in the inversions needed for the force computation. This causes reduced iteration
Algorithm 3 MD update
  H = H′ = 0
  for i = 0 to N_monomials do
    H += monomial[i] → heat-bath-function
  end for
  integrate(N_ts, τ)
  for i = 0 to N_monomials do
    H′ += monomial[i] → acceptance-function
  end for
  accept with probability min{1, exp(−ΔH)}

numbers needed for the integration of one trajectory. The relevant input parameter is ForcePrecision, available for each monomial. The precision needed in the acceptance and/or heatbath step can be adjusted separately using AcceptancePrecision. It is advisable to have the acceptance precision always close to machine precision.

The second possibility for influencing the Dirac operator is given by the reduced precision Dirac operator described in sub-section 4.1, which is switched on with the UseSloppyPrecision input parameter. The two possibilities can also be used in parallel. Note that one should always test for reversibility violations as explained in sub-section 4.3.
The idea of the chronological solver method, or chronological solver guess (CSG) (or similar methods [40]), is to optimise the initial guess for the solution used in the solver. To this end the history of the last N_CSG solutions of the equation M χ = φ is saved, and a linear combination of the fields χ_i with coefficients c_i is used as an initial guess for the next inversion. M stands for the operator to be inverted and has to be replaced by the different ratios of operators used in this paper. The coefficients c_i are determined by solving
\[
\sum_i \chi_j^\dagger M \chi_i\, c_i = \chi_j^\dagger \phi \qquad (25)
\]
with respect to the coefficients c_i. This is equivalent to minimising the functional that is minimised by the CG inverter itself.

The downside of this method is that the reversibility violations in the Hamiltonian increase significantly, by one or two orders of magnitude, when the CSG is switched on and all other parameters are kept fixed. Therefore one has to adjust the residua in the solvers, which increases the number of matrix vector multiplications again. Our experience is that the methods described in the previous sub-section are more effective, in particular in the context of multiple time scale integration, because the CSG is most effective for small values of Δτ.

The relevant input parameter is CSGHistory, available for the relevant monomials. Setting it to zero means no chronological solver; otherwise this parameter specifies the number N_CSG of last solutions to be saved.
The HMC program includes the possibility to perform a certain number of measurements after every trajectory online, whether or not the configuration is stored on disk. Some of those are performed per default, namely all that are written to the output file output.data:

(1) the plaquette expectation value, defined as
\[
\langle P \rangle = \frac{1}{6V} \sum_x \sum_{\substack{\mu,\nu=1\\ 1\le\mu<\nu}}^{4} \tfrac{1}{3}\,{\rm Re\,Tr}(U^{1\times1}_{x,\mu,\nu})\,,
\]
where V is the global lattice volume;

(2) the rectangle expectation value, defined as
\[
\langle R \rangle = \frac{1}{12V} \sum_x \sum_{\substack{\mu,\nu=1\\ \mu\neq\nu}}^{4} \tfrac{1}{3}\,{\rm Re\,Tr}(U^{1\times2}_{x,\mu,\nu})\,;
\]

(3) ΔH = H′ − H and exp(−ΔH).

See the overview section for details about the output.data file. These observables all come with no extra computational cost.

Optionally, other online measurements can be performed, which – however – in general need extra inversions of the Dirac operator. First of all, the computation of certain correlation functions is implemented. They need one extra inversion of the Dirac operator, as discussed in Ref. [41], using the one-end-trick. Define a stochastic source ξ as follows:
\[
\lim_{R\to\infty} [\xi_i^* \xi_j] = \delta_{ij}\,,\qquad \lim_{R\to\infty} [\xi_i \xi_j] = 0\,. \qquad (26)
\]
Here R labels the number of samples and i all other degrees of freedom. Then
\[
[\phi_i^{r*} \phi_j^r]_R = M^{-1*}_{ik}\, M^{-1}_{jk} + {\rm noise}\,, \qquad (27)
\]
if φ was computed from φ^r_j = M^{-1}_{jk} ξ^r_k. Having in mind the γ₅-hermiticity property of the Wilson and Wilson twisted mass Dirac propagator G_{u,d}, i.e.
\[
G_u(x,y) = \gamma_5\, G_d(y,x)^\dagger\, \gamma_5\,,
\]
it is clear that eq. (27) can be used to evaluate
\[
C_\pi(t) = \langle {\rm Tr}[G_u(0,t)\,\gamma_5\, G_d(t,0)\,\gamma_5] \rangle
 = \langle {\rm Tr}[G_u(0,t)\, G_u(0,t)^\dagger] \rangle
\]
with only one inversion. But, even if one gamma structure is fixed to be γ₅ due to the γ₅-hermiticity trick, we are still free to insert any γ-structure Γ at the other end, i.e. we can evaluate any correlation function of the form
\[
C_{P\Gamma}(t) = \langle {\rm Tr}[G_u(0,t)\,\gamma_5\, G_d(t,0)\,\Gamma] \rangle
 = \langle {\rm Tr}[G_u(0,t)\, G_u(0,t)^\dagger\,\gamma_5\,\Gamma] \rangle\,.
\]
Useful combinations of correlation functions are ⟨PP⟩, ⟨PA⟩ and ⟨PV⟩, with
\[
P^\alpha = \bar\chi\,\gamma_5\,\frac{\tau^\alpha}{2}\,\chi\,,\qquad
V^\alpha_\mu = \bar\chi\,\gamma_\mu\,\frac{\tau^\alpha}{2}\,\chi\,,\qquad
A^\alpha_\mu = \bar\chi\,\gamma_5\gamma_\mu\,\frac{\tau^\alpha}{2}\,\chi\,.
\]
From ⟨PP⟩ one can extract the pseudo scalar mass and – in the twisted mass case – the pseudo scalar decay constant. ⟨PA⟩ can be used together with ⟨PP⟩ to extract the so called PCAC quark mass, and ⟨PV⟩ to measure the renormalisation constant Z_V. For details we refer the reader to Ref. [41]. These online measurements are controlled with the two following input parameters: PerformOnlineMeasurements to switch them on or off, and, to specify the frequency,
OnlineMeasurementsFreq. The three correlation functions are saved in files named onlinemeas.n, where n is the trajectory number. Every file contains five columns, specifying the type, the operator type and the Euclidean time t. The last two columns are the values of the correlation function itself, C(t) and C(−t), respectively. The type is equal to 1, 2 or 6 for the ⟨PP⟩, the ⟨PA⟩ and the ⟨PV⟩ correlation functions. The operator type is for online measurements always equal to 1 for local source and sink (no smearing of any kind), and the time runs from 0 to T/2. Hence, C(−t) = C(T − t). C(−0) and C(−T/2) are set to zero for convenience.

In addition to correlation functions, also the minimal and the maximal eigenvalues of the hermitian operator (γ₅D) can be measured.

An online measurement not related to physics, but to the algorithm, are checks of reversibility violations. The HMC algorithm is exact if and only if the integration scheme is reversible, and on a computer with finite precision this is only guaranteed up to machine precision. These violations can be estimated by integrating one trajectory forward and then backward in Monte Carlo time. The difference δΔH between the original Hamiltonian H and the final one H″ after integrating back can serve as one measure for those violations; another one is provided by the difference between the original gauge field U and the final one U″,
\[
\delta\Delta U = \frac{1}{12 V} \sum_{x,\mu} \sum_{i,j} \bigl| (U_{x,\mu} - U''_{x,\mu})_{i,j} \bigr|\,,
\]
where we indicate with δΔ that this quantity is obtained after integrating a trajectory forward and backward in time. The results for δΔH and δΔU are stored in the file return_check.data. The relevant input parameters are ReversibilityCheck and
ReversibilityCheckInterval.

There are several iterative solvers implemented in the tmLQCD package for solving
\[
D\,\chi = \phi
\]
for χ: the minimal residual (MR), the conjugate gradient (CG), the conjugate gradient squared (CGS), the generalised minimal residual (GMRES), the generalised conjugate residual (GCR) and the stabilised bi-conjugate gradient (BiCGstab). For details regarding these algorithms we refer to Refs. [33,42]. For the hmc_tm executable only the CG and the BiCGstab solvers are available, while all the others can be used in the invert executables. Most of them are available both with and without even/odd preconditioning. For a performance comparison we refer to Refs. [43,36].

The stopping criterion is implemented in two ways: the first is an absolute stopping criterion, i.e. the solver is stopped when the squared norm of the residual vector (depending on the solver this might be the iterated or the real residual) fulfils ∥r∥² < ε. The second is relative to the source vector, i.e. ∥r∥²/∥φ∥² < ε. The value of ε and the choice of relative or absolute precision can be influenced via input parameters.

The reduced precision Dirac operator, as discussed in sub-section 4.1, is available for the CG solver. In the CG solver the full precision Dirac operator is only required at the beginning of the CG search, because the relative size of the contribution to the resulting vector decreases with the number of iterations. Thus, as soon as a certain precision is achieved in the CG algorithm, we can switch to the reduced precision Dirac operator without spoiling the precision of the final result. We switch to the lower precision operator at a precision of √ε in the CG search, when aiming for a final precision of ε.

The method implemented for computing the low lying eigenvalues of the hermitian operator (γ₅D) is the so called Jacobi-Davidson method [44,45]. For a discussion of the application of this algorithm to lattice QCD we refer again to Refs. [43,36]. All solver related files can be found in the sub-directory solver. Note that there are a few more solvers implemented which are, however, in an experimental state.
Smearing techniques have become an important tool to reduce ultraviolet fluctuations in the gauge fields. One of those techniques, coming with the advantage of being usable in the MD update, is usually called stout smearing [46]. The (n+1)th level of stout smeared gauge links is obtained iteratively from the nth level by
\[
U^{(n+1)}_\mu(x) = e^{i Q^{(n)}_\mu(x)}\, U^{(n)}_\mu(x)\,.
\]
We refer to the unsmeared ("thin") gauge field as U_μ ≡ U^{(0)}_μ. The matrices Q_μ are defined via the staples C_μ:
\[
Q^{(n)}_\mu(x) = \frac{i}{2}\Bigl[ U^{(n)}_\mu(x)\, C^{(n)\dagger}_\mu(x) - {\rm h.c.} \Bigr]
 - \frac{i}{6}\,{\rm Tr}\Bigl[ U^{(n)}_\mu(x)\, C^{(n)\dagger}_\mu(x) - {\rm h.c.} \Bigr]\,,
\]
\[
C^{(n)}_\mu(x) = \sum_{\nu\neq\mu} \rho_{\mu\nu}\Bigl( U^{(n)}_\nu(x)\, U^{(n)}_\mu(x+\hat\nu)\, U^{(n)\dagger}_\nu(x+\hat\mu)
 + U^{(n)\dagger}_\nu(x-\hat\nu)\, U^{(n)}_\mu(x-\hat\nu)\, U^{(n)}_\nu(x-\hat\nu+\hat\mu) \Bigr)\,,
\]
where in general ρ_{μν} is the smearing matrix. In the tmLQCD package we have only implemented isotropic 4-dimensional smearing, i.e. ρ_{μν} = ρ.

Currently stout smearing is only implemented for the invert executables, i.e. the gauge field can be stout smeared at the beginning of an inversion. The input parameters are UseStoutSmearing, StoutRho and
StoutNoIterations.

The random number generator used in the code is the one proposed by Martin Lüscher and usually known under the name RANLUX [47]. Single and double precision implementations were made available by the author under the GNU General Public License and can be downloaded [48]. For convenience it is also included in the tmLQCD package.
In this final subsection we specify the IO formats used to store gauge config-urations, propagators and sources to disk.
For gauge configurations we use the International Lattice Data Grid (ILDG)standard as specified in Ref. [49,50]. As the lime packaging library [51] andILDG standard allow additional – not required – records to be stored withinthe file, we currently add the following two records for convenience:(1) xlf-info : useful information about the gauge configuration, such as theplaquette value, and about the run and the algorithm and the code versionused to generate it.(2) scidac-checksum : SCIDAC checksum of the gauge configuration. For thespecification see [52].The gauge configurations can be written to disk either in single or doubleprecision. The relevant input parameter is
GaugeConfigWritePrecision. On read-in the precision is determined automatically. Note that the gauge configuration does not depend on the particular choice of the γ-matrices.

We note at the beginning that we do not use different IO formats for source and sink fermion fields; they are both stored using the same lime records. The meta-data stored in the same lime-packed file is supposed to clarify all other things. It is also important to realise that the propagator depends on the γ-matrix convention used in the Dirac operator. For our convention see appendix A.

Here we mainly concentrate on storing propagators (sink). The file can contain only sources, or both source and sink. We (plan to) support four different formats:
(1) (arbitrary number of) sink, no sources
(2) (arbitrary number of) source/sink pairs
(3) one source, 12 sinks
(4) one source, 4 sinks
This is very similar to the formats in use in parts of the US lattice community. We adopt the SCIDAC checksum [52] for the binary data. Source and sink binary data have to be in separate lime records. The order in one file for the four formats mentioned above is supposed to be:
(1) sink, no sources: -
(2) source/sink pairs: first source, then sink
(3) one source, 12 sinks: first source, then 12 sinks
(4) one source, 4 sinks: first source, then 4 sinks
All fermion field files must have a record indicating their type. The record itself is of type propagator-type and has a single entry (ASCII string) which contains one of
• DiracFermion_Sink
• DiracFermion_Source_Sink_Pairs
• DiracFermion_ScalarSource_TwelveSink
• DiracFermion_ScalarSource_FourSink
Those strings are also used in the input files for the input parameter
PropagatorType .The binary data corresponding to one Dirac fermion field (source or sink) isthen stored with at least two (three) records. The first is of type etmc-propagator-format and contains the following information:
This type is stored in a record called source-type in the lime file. Theremight be several sources stored within the same file. We add a format record etmc-source-format looking like
In order to compile the programmes, the LAPACK [53] library (Fortran version) needs to be installed. In addition it must be known which linker options are needed to link against LAPACK, e.g. -Lpath-to-lapack -llapack -lblas. Also a recent version (tested is version 1.2.3) of C-LIME [51] must be available, which is used as a packaging scheme to read and write gauge configurations and propagators to files.
In order to get a simple configuration of the hmc package it is enough to just type

path-to-src-code/configure --with-lime=<dir>

where <dir> points to the C-LIME installation. If CC, F77 and
CFLAGS are not specified, configure will guess them.

The code was successfully compiled and run at least on the following platforms: i686 and compatible, x64 and compatible, IBM Regatta systems, IBM Blue Gene/L, IBM Blue Gene/P, SGI Altix and SGI PC clusters and powerpc clusters.

The configure script accepts certain options to influence the building procedure. One can get an overview over all supported options with configure --help. There are enable|disable options switching optional features on and off, and with|without switches usually related to optional packages. In the following we describe the most important of them (check configure --help for the defaults and more options):

• --enable-mpi: This option switches on the support for MPI. On certain platforms it automatically chooses the correct parallel compiler or searches for a command mpicc in the search path.
• --enable-p4: Enable the use of the special Pentium4 instruction set and cache management.
• --enable-opteron: Enable the use of the special Opteron instruction set and cache management.
• --enable-sse2: Enable the use of the SSE2 instruction set. This is a huge improvement on Pentium4 and equivalent systems.
• --enable-sse3: Enable the use of the SSE3 instruction set. This will give another 20% of speedup compared to only SSE2. However, only a few processors are capable of SSE3.
• --enable-gaugecopy: See section 4.1 for details on this option. It will increase the memory requirement of the code.
• --enable-halfspinor: If this option is enabled, the Dirac operator using half spinor fields is used, see sub-section 4.1 for details. If this feature is switched on, the gauge copy feature is switched on automatically as well.
• --with-mpidimension=n: This option only has an effect if the code is configured for MPI usage. The number of parallel directions can be specified; 1, 2, 3 and 4 dimensional parallelisations are supported.
• --with-lapack="
<linker flags>": Specify the linker flags needed to link against LAPACK (see above).

Table 1: hopping parameter κ and twisted mass parameters (κμ_q, κμ̄, κε̄) together with the expected average plaquette ⟨P⟩ and rectangle ⟨R⟩ values; the numerical values are also quoted in the sample input files.

The configure script will guess at the very beginning on which platform the build is done. In case this fails, or a cross compilation must be performed, please use the option --host=HOST. For instance, in order to compile for the BG/P one needs to specify --host=ppc-ibm-bprts --build=ppc64-ibm-linux. For certain architectures like the Blue Gene systems there are
README.arch files in the top source directory with example configure calls.
After successfully configuring the package, the code can be built by simply typing make in the build directory. This will compile the standard executables. Typing make install will copy these executables into the install directory. The default install directory is $HOME/bin, which can be changed e.g. with the --prefix option to configure.

The source code ships with a number of sample input files. They are located in the sample-input sub-directory. They are small volume V = 4⁴ test runs designed to measure for instance the average plaquette values. Such a test-run can be performed, for instance on a scalar machine, by typing

./hmc_tm -f sample-hmc0.input

Depending on the environment you are running in, you may need to adjust the input parameters to match the maximal run-time and so on. The expected average plaquette values are quoted in table 1 and also in the sample input files.

Another useful test executable is a benchmark code. It can be built by typing make benchmark and it will, when run, measure the performance of the Dirac operator. It can be run in the serial or parallel case. It reads its input from a file benchmark.input, and the relevant input parameters are the following:
L = 4
T = 4
NrXProcs = 2
NrYProcs = 2
NrZProcs = 2
UseEvenOdd = yes
UseSloppyPrecision = no
In case of even/odd preconditioning the performance of the hopping matrix is evaluated; without even/odd, the performance of the full Dirac operator. The important part of the output of the code is as follows:

[...]
(1429 Mflops [64 bit arithmetic])
communication switched off
(2592 Mflops [64 bit arithmetic])
The size of the package is 36864 Byte
The bandwidth is 662.91 + 662.91 MB/sec
The bandwidth is not measured directly, but computed from the performance difference between the runs with and without communication and from the package size. In case of a serial run the output is obviously reduced.
Acknowledgements
This code was started based on an HMC implementation for Wilson fermions with mass preconditioning kindly provided by Martin Hasenbusch. Many discussions with, and contributions by, Rémi Baron, Thomas Chiarappa, Albert Deuzeman, Roberto Frezzotti, Martin Hasenbusch, Gregorio Herdoiza, Marina Marinkovic, Craig McNeile, Istvan Montvay, Andreas Nube, David Palao, Siebren Reker, Andrea Shindler, Jan Volkholz and Urs Wenger are gratefully acknowledged. We thank Peter Boyle for useful discussions on the efficient implementation of the Dirac operator.
References [1]
CP-PACS and JLQCD
Collaboration, A. Ukawa, Nucl. Phys. Proc. Suppl. , 195 (2002).[2] C. Urbach, K. Jansen, A. Shindler and U. Wenger, Comput. Phys. Commun. , 87 (2006), hep-lat/0506011 .[3] M. L¨uscher, Comput. Phys. Commun. , 199 (2005), hep-lat/0409106 .[4] M. A. Clark and A. D. Kennedy, Phys. Rev. Lett. , 051601 (2007), arXiv:hep-lat/0608015 .[5] M. A. Clark and A. D. Kennedy, Phys. Rev. D75 , 011502 (2007), arXiv:hep-lat/0610047 .[6] A. Ali Khan et al. , Nucl. Phys. Proc. Suppl. , 853 (2004), hep-lat/0309078 .[7]
QCDSF
Collaboration, A. Ali Khan et al. , Phys. Lett.
B564 , 235 (2003), hep-lat/0303026 .[8]
TrinLat
Collaboration, W. Kamleh and M. J. Peardon, PoS
LAT2005 , 106(2006).[9] M. L¨uscher, JHEP , 011 (2007), arXiv:0710.5417 [hep-lat] .[10] .[11] C.-N. Yang and R. L. Mills, Phys. Rev. , 191 (1954).[12] K. G. Wilson, Phys. Rev. D10 , 2445 (1974).[13]
ALPHA
Collaboration, R. Frezzotti, P. A. Grassi, S. Sint and P. Weisz, JHEP , 058 (2001), hep-lat/0101001 .[14] R. Frezzotti and G. C. Rossi, Nucl. Phys. Proc. Suppl. , 193 (2004), hep-lat/0311008 .
[15] T. Chiarappa et al., Eur. Phys. J.
C50 , 373 (2007), arXiv:hep-lat/0606011 .[16] S. Duane, A. D. Kennedy, B. J. Pendleton and D. Roweth, Phys. Lett.
B195 ,216 (1987).[17] M. Hasenbusch, Phys. Lett.
B519 , 177 (2001), hep-lat/0107019 .[18] M. Hasenbusch and K. Jansen, Nucl. Phys.
B659 , 299 (2003), hep-lat/0211042 .[19] M. Hasenbusch, Nucl. Phys. Proc. Suppl. , 27 (2004), arXiv:hep-lat/0310029 .[20]
ALPHA
Collaboration, M. Della Morte et al. , Comput. Phys. Commun. ,62 (2003), hep-lat/0307008 .[21]
TrinLat
Collaboration, M. J. Peardon and J. Sexton, Nucl. Phys. Proc. Suppl. , 985 (2003), hep-lat/0209037 .[22] T. Takaishi and P. de Forcrand, hep-lat/0505020 .[23] J. C. Sexton and D. H. Weingarten, Nucl. Phys.
B380 , 665 (1992).[24] R. Frezzotti and K. Jansen, Phys. Lett.
B402 , 328 (1997), hep-lat/9702016 .[25] R. Frezzotti and K. Jansen, Nucl. Phys.
B555 , 395 (1999), hep-lat/9808011 .[26] R. Frezzotti and K. Jansen, Nucl. Phys.
B555 , 432 (1999), hep-lat/9808038 .[27] M. L¨uscher, Nucl. Phys.
B418 , 637 (1994), arXiv:hep-lat/9311007 .[28] B. Bunk, S. Elser, R. Frezzotti and K. Jansen, Comput. Phys. Commun. ,95 (1999), arXiv:hep-lat/9805026 .[29] W. Press, S. Teukolsky, W. Vetterling and B. Flannery,
Numerical Recipes inC , 2nd ed. (Cambridge University Press, Cambridge, UK, 1992).[30] T. A. DeGrand and P. Rossi, Comput. Phys. Commun. , 211 (1990).[31] K. Jansen and C. Liu, Comput. Phys. Commun. , 221 (1997), hep-lat/9603008 .[32] C. T. Sachrajda and G. Villadoro, Phys. Lett. B609 , 73 (2005), arXiv:hep-lat/0411033 .[33] Y. Saad,
Iterative Methods for sparse linear systems , 2nd ed. (SIAM, 2003).[34] P. Weisz, Nucl. Phys.
B212 , 1 (1983).[35] .[36] T. Chiarappa et al. , Comput. Sci. Disc. , 015001 (2008), arXiv:hep-lat/0609023 .[37] M. L¨uscher, Nucl. Phys. Proc. Suppl. , 21 (2002), arXiv:hep-lat/0110007 .
[38] M. Lüscher, http://luscher.web.cern.ch/luscher/QCDpbm/. [39]
SciDAC
Collaboration, R. G. Edwards and B. Joo, Nucl. Phys. Proc. Suppl. , 832 (2005), hep-lat/0409003 .[40] R. C. Brower, A. R. Levi and K. Orginos, Nucl. Phys. Proc. Suppl. , 855(1995), hep-lat/9412004 .[41] ETM
Collaboration, P. Boucaud et al. , Comput. Phys. Commun. , 695(2008), arXiv:0803.0224 [hep-lat] .[42] A. Meister,
Numerik linearer Gleichungssysteme (vieweg, 1999).[43] T. Chiarappa et al. , Nucl. Phys. Proc. Suppl. , 853 (2005), arXiv:hep-lat/0409107 .[44] G. L. G. Sleijpen and H. A. V. der Vorst, SIAM Journal on Matrix Analysisand Applications , 401 (1996).[45] R. Geus, The Jacobi-Davidson algorithm for solving large sparse symmetriceigenvalue problems with application to the design of accelerator cavities , PhDthesis, Swiss Federal Institute Of Technology Z¨urich, 2002.[46] C. Morningstar and M. J. Peardon, Phys. Rev.
D69, 054501 (2004), arXiv:hep-lat/0311018.
[47] M. Lüscher, Comput. Phys. Commun., 100 (1994), arXiv:hep-lat/9309020.
[48] M. Lüscher, http://luscher.web.cern.ch/luscher/ranlux/.
[49] http://cssm.sasr.edu.au/ildg/.
[50] T. Yoshie, PoS LATTICE2008, 019 (2008), arXiv:0812.0849 [hep-lat].
[51] USQCD, http://usqcd.jlab.org/usqcd-docs/c-lime/.
[52] .
[53] .
[54] .

A γ and Pauli Matrices

In the following we specify our conventions for γ- and Pauli-matrices.

A.1 γ-matrices

We use the following convention for the Dirac γ-matrices:
\[
\gamma_0 = \begin{pmatrix} 0 & 0 & -1 & 0\\ 0 & 0 & 0 & -1\\ -1 & 0 & 0 & 0\\ 0 & -1 & 0 & 0 \end{pmatrix},\qquad
\gamma_1 = \begin{pmatrix} 0 & 0 & 0 & -i\\ 0 & 0 & -i & 0\\ 0 & +i & 0 & 0\\ +i & 0 & 0 & 0 \end{pmatrix},
\]
\[
\gamma_2 = \begin{pmatrix} 0 & 0 & 0 & -1\\ 0 & 0 & +1 & 0\\ 0 & +1 & 0 & 0\\ -1 & 0 & 0 & 0 \end{pmatrix},\qquad
\gamma_3 = \begin{pmatrix} 0 & 0 & -i & 0\\ 0 & 0 & 0 & +i\\ +i & 0 & 0 & 0\\ 0 & -i & 0 & 0 \end{pmatrix}.
\]
In this representation γ₅ is diagonal and reads
\[
\gamma_5 = \begin{pmatrix} +1 & 0 & 0 & 0\\ 0 & +1 & 0 & 0\\ 0 & 0 & -1 & 0\\ 0 & 0 & 0 & -1 \end{pmatrix}.
\]

A.2 Pauli-matrices
For the Pauli-matrices acting in flavour space we use the following convention:
\[
1_f = \begin{pmatrix} 1&0\\ 0&1 \end{pmatrix},\qquad
\tau^1 = \begin{pmatrix} 0&1\\ 1&0 \end{pmatrix},\qquad
\tau^2 = \begin{pmatrix} 0&-i\\ i&0 \end{pmatrix},\qquad
\tau^3 = \begin{pmatrix} 1&0\\ 0&-1 \end{pmatrix}.
\]

B Even/Odd Preconditioning
B.1 HMC Update
In this section we describe how even/odd preconditioning [30,31] can be used in the HMC algorithm in the presence of a twisted mass term. Even/odd preconditioning is implemented in the tmLQCD package both in the HMC algorithm and in the inversion of the Dirac operator, and can be used optionally.

We start with the lattice fermion action in the hopping parameter representation in the χ-basis, written as
\[
S[\chi,\bar\chi,U] = \sum_x \Big\{ \bar\chi(x)\big[1 + 2i\kappa\mu\gamma^5\tau^3\big]\chi(x)
 - \kappa\,\bar\chi(x)\sum_{\mu=1}^{4}\Big[ U(x,\mu)(r+\gamma_\mu)\chi(x+a\hat\mu)
 + U^\dagger(x-a\hat\mu,\mu)(r-\gamma_\mu)\chi(x-a\hat\mu)\Big]\Big\}
 \equiv \sum_{x,y} \bar\chi(x)\, M_{xy}\, \chi(y)
\tag{B.1}
\]
similar to eq. (4). For convenience we define $\tilde\mu = 2\kappa\mu$. Using the matrix $M$ one can define the hermitian (two flavour) operator
\[
Q \equiv \gamma^5 M = \begin{pmatrix} Q^+ & 0 \\ 0 & Q^- \end{pmatrix},
\tag{B.2}
\]
where the sub-matrices $Q^\pm$ can be factorised as follows (Schur decomposition):
\[
Q^\pm = \gamma^5 \begin{pmatrix} 1 \pm i\tilde\mu\gamma^5 & M_{eo} \\ M_{oe} & 1 \pm i\tilde\mu\gamma^5 \end{pmatrix}
      = \gamma^5 \begin{pmatrix} M^\pm_{ee} & M_{eo} \\ M_{oe} & M^\pm_{oo} \end{pmatrix}
      = \begin{pmatrix} \gamma^5 M^\pm_{ee} & 0 \\ \gamma^5 M_{oe} & 1 \end{pmatrix}
        \begin{pmatrix} 1 & (M^\pm_{ee})^{-1} M_{eo} \\ 0 & \gamma^5\big(M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo}\big) \end{pmatrix}.
\tag{B.3}
\]
Note that $(M^\pm_{ee})^{-1}$ can be computed to be
\[
(1 \pm i\tilde\mu\gamma^5)^{-1} = \frac{1 \mp i\tilde\mu\gamma^5}{1+\tilde\mu^2}.
\tag{B.4}
\]
Using $\det(Q) = \det(Q^+)\det(Q^-)$ the following relation can be derived:
\[
\det(Q^\pm) \propto \det(\hat Q^\pm),\qquad
\hat Q^\pm = \gamma^5\big(M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo}\big),
\tag{B.5}
\]
where $\hat Q^\pm$ is only defined on the odd sites of the lattice. In the HMC algorithm the determinant is stochastically estimated using pseudo fermion fields $\phi_o$:
\[
\det(\hat Q^+ \hat Q^-) = \int \mathcal{D}\phi_o\, \mathcal{D}\phi^\dagger_o\; \exp(-S_{\mathrm{PF}}),\qquad
S_{\mathrm{PF}} \equiv \phi^\dagger_o\, \big(\hat Q^+ \hat Q^-\big)^{-1} \phi_o,
\tag{B.6}
\]
where the fields $\phi_o$ are defined only on the odd sites of the lattice.
In order to compute the force corresponding to the effective action $S_{\mathrm{PF}}$ we need the variation of $S_{\mathrm{PF}}$ with respect to the gauge fields (using $\delta(A^{-1}) = -A^{-1}\,\delta A\, A^{-1}$):
\[
\begin{aligned}
\delta S_{\mathrm{PF}} &= -\big[ \phi^\dagger_o (\hat Q^+\hat Q^-)^{-1}\, \delta\hat Q^+\, (\hat Q^+)^{-1} \phi_o
 + \phi^\dagger_o (\hat Q^-)^{-1}\, \delta\hat Q^-\, (\hat Q^+\hat Q^-)^{-1} \phi_o \big] \\
 &= -\big[ X^\dagger_o\, \delta\hat Q^+\, Y_o + Y^\dagger_o\, \delta\hat Q^-\, X_o \big]
\end{aligned}
\tag{B.7}
\]
with $X_o$ and $Y_o$ defined on the odd sites as
\[
X_o = (\hat Q^+\hat Q^-)^{-1}\phi_o,\qquad Y_o = (\hat Q^+)^{-1}\phi_o = \hat Q^- X_o,
\tag{B.8}
\]
where $(\hat Q^\pm)^\dagger = \hat Q^\mp$ has been used. The variation of $\hat Q^\pm$ reads
\[
\delta\hat Q^\pm = \gamma^5\big( -\delta M_{oe}\,(M^\pm_{ee})^{-1} M_{eo} - M_{oe}\,(M^\pm_{ee})^{-1}\,\delta M_{eo} \big),
\tag{B.9}
\]
and one finds
\[
\delta S_{\mathrm{PF}} = -\big( X^\dagger\, \delta Q^+\, Y + Y^\dagger\, \delta Q^-\, X \big)
 = -\big( X^\dagger\, \delta Q^+\, Y + (X^\dagger\, \delta Q^+\, Y)^\dagger \big),
\tag{B.10}
\]
where $X$ and $Y$ are now defined over the full lattice as
\[
X = \begin{pmatrix} -(M^-_{ee})^{-1} M_{eo} X_o \\ X_o \end{pmatrix},\qquad
Y = \begin{pmatrix} -(M^+_{ee})^{-1} M_{eo} Y_o \\ Y_o \end{pmatrix}.
\tag{B.11}
\]
In addition, $\delta Q^+ = \delta Q^-$, $M^\dagger_{eo} = \gamma^5 M_{oe}\gamma^5$ and $M^\dagger_{oe} = \gamma^5 M_{eo}\gamma^5$ have been used. Since the bosonic part is quadratic in the $\phi_o$ fields, the $\phi_o$ are generated at the beginning of each molecular dynamics trajectory with
\[
\phi_o = \hat Q^+ r_o,
\tag{B.12}
\]
where $r_o$ is a random spinor field taken from a Gaussian distribution with norm one.

B.1.1 Mass non-degenerate flavour doublet

Even/odd preconditioning can also be implemented for the mass non-degenerate flavour doublet Dirac operator $D_h$, eq. (5). Denoting $Q^h = \gamma^5 D_h$, the even/odd decomposition is as follows:
\[
Q^h = \begin{pmatrix} Q^h_{ee} & Q^h_{eo} \\ Q^h_{oe} & Q^h_{oo} \end{pmatrix}
    = \begin{pmatrix} Q^h_{ee} & 0 \\ Q^h_{oe} & 1 \end{pmatrix}
      \begin{pmatrix} 1 & (Q^h_{ee})^{-1} Q^h_{eo} \\ 0 & \hat Q^h_{oo} \end{pmatrix},\qquad
Q^h_{ee} = Q^h_{oo} = \gamma^5\big(1 + i\bar\mu\gamma^5\tau^3 - \bar\epsilon\tau^1\big),
\tag{B.13}
\]
where $\hat Q^h_{oo}$ is given in flavour space by
\[
\hat Q^h_{oo} = \gamma^5
\begin{pmatrix}
 1 + i\bar\mu\gamma^5 - M_{oe}\dfrac{1 - i\bar\mu\gamma^5}{1+\bar\mu^2-\bar\epsilon^2} M_{eo} &
 -\bar\epsilon\Big(1 + \dfrac{M_{oe}M_{eo}}{1+\bar\mu^2-\bar\epsilon^2}\Big) \\[2mm]
 -\bar\epsilon\Big(1 + \dfrac{M_{oe}M_{eo}}{1+\bar\mu^2-\bar\epsilon^2}\Big) &
 1 - i\bar\mu\gamma^5 - M_{oe}\dfrac{1 + i\bar\mu\gamma^5}{1+\bar\mu^2-\bar\epsilon^2} M_{eo}
\end{pmatrix}
\]
with the previous definitions of $M_{eo}$ etc. The implementation for the PHMC is very similar to the mass degenerate HMC case.

B.2 Inversion
In addition to even/odd preconditioning in the HMC algorithm as described above, it can also be used to speed up the inversion of the fermion matrix. Due to the factorisation (B.3) the full fermion matrix can be inverted by inverting the two matrices appearing in the factorisation:
\[
\begin{pmatrix} M^\pm_{ee} & M_{eo} \\ M_{oe} & M^\pm_{oo} \end{pmatrix}^{-1}
 = \begin{pmatrix} 1 & (M^\pm_{ee})^{-1} M_{eo} \\ 0 & M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo} \end{pmatrix}^{-1}
   \begin{pmatrix} M^\pm_{ee} & 0 \\ M_{oe} & 1 \end{pmatrix}^{-1}.
\]
The two factors can be simplified as follows:
\[
\begin{pmatrix} M^\pm_{ee} & 0 \\ M_{oe} & 1 \end{pmatrix}^{-1}
 = \begin{pmatrix} (M^\pm_{ee})^{-1} & 0 \\ -M_{oe}(M^\pm_{ee})^{-1} & 1 \end{pmatrix},
\]
\[
\begin{pmatrix} 1 & (M^\pm_{ee})^{-1} M_{eo} \\ 0 & M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo} \end{pmatrix}^{-1}
 = \begin{pmatrix} 1 & -(M^\pm_{ee})^{-1}M_{eo}\big(M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo}\big)^{-1} \\ 0 & \big(M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo}\big)^{-1} \end{pmatrix}.
\]
The complete inversion is now performed in two separate steps: first compute, for a given source field $\phi = (\phi_e, \phi_o)$, an intermediate result $\varphi = (\varphi_e, \varphi_o)$ by
\[
\begin{pmatrix} \varphi_e \\ \varphi_o \end{pmatrix}
 = \begin{pmatrix} M^\pm_{ee} & 0 \\ M_{oe} & 1 \end{pmatrix}^{-1}
   \begin{pmatrix} \phi_e \\ \phi_o \end{pmatrix}
 = \begin{pmatrix} (M^\pm_{ee})^{-1}\phi_e \\ -M_{oe}(M^\pm_{ee})^{-1}\phi_e + \phi_o \end{pmatrix}.
\]
This step requires only the application of $M_{oe}$ and $(M^\pm_{ee})^{-1}$, the latter of which is given by eq. (B.4). The final solution $\psi = (\psi_e, \psi_o)$ can then be computed with
\[
\begin{pmatrix} \psi_e \\ \psi_o \end{pmatrix}
 = \begin{pmatrix} 1 & (M^\pm_{ee})^{-1} M_{eo} \\ 0 & M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo} \end{pmatrix}^{-1}
   \begin{pmatrix} \varphi_e \\ \varphi_o \end{pmatrix}
 = \begin{pmatrix} \varphi_e - (M^\pm_{ee})^{-1}M_{eo}\psi_o \\ \psi_o \end{pmatrix},
\]
where we defined
\[
\psi_o = \big(M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo}\big)^{-1} \varphi_o.
\]
Therefore, the only inversion that has to be performed numerically is the one generating $\psi_o$ from $\varphi_o$, and this inversion involves only an operator that is better conditioned than the original fermion operator.

Even/odd preconditioning can also be used for the mass non-degenerate Dirac operator $D_h$, eq. (5). The corresponding equations follow immediately from the previous discussion and the definition in eq. (B.13).

C Initialising the PHMC
The function $1/\sqrt{s}$ in the interval $[\epsilon, 1]$ can be approximated using polynomials or rational functions of various sorts. In the tmLQCD package we use Chebysheff polynomials, which are easy to construct and can be built to provide a desired overall precision in the interval $[\epsilon, 1]$. The roots of the polynomial $P_{n,\epsilon}$ are needed for the evaluation of the force. Even though the roots come in complex conjugate pairs, in our case they cannot be computed analytically, so we need to determine them numerically. Such an evaluation usually requires high precision. This is why these roots need to be determined before a PHMC run using an external program, i.e. they cannot be computed at the beginning of a run in the hmc_tm program.

Such an external program ships with the tmLQCD code; it is located in the util/laguere directory. It is based on Laguerre's method and uses the Class Library for Numbers (CLN) [54], which provides arbitrary precision data types. In order to compute the roots, the CLN library, which is free software, must be available.

Taking for granted that the CLN library is available, the procedure for computing the roots is as follows. Assume the non-degenerate Dirac operator has eigenvalues in the interval $[\tilde s_{\min}, \tilde s_{\max}]$, i.e. $\epsilon = \tilde s_{\min}/\tilde s_{\max}$, and the polynomial degree is $n$. Edit the file chebyRoot.H and set the variable EPSILON to the value of $\epsilon$. Moreover, set the variable MAXPOW to the degree $n$. Adapt the Makefile to your local installation and compile the code by typing make. After running the
ChebyRoot program successfully, you should find two files in the directory:

(1) Square_root_BR_roots.dat, which contains the roots of the polynomial in bit-reverse order [24].

(2) normierungLocal.dat, which contains a normalisation constant.

Copy these two files into the directory where you run the code and adjust the input parameters to match exactly the values used for the root computation, i.e. the input parameters
StildeMin, StildeMax and
DegreeOfMDPolynomial must be set appropriately in the
NDPOLY monomial. The minimal and maximal eigenvalues of the non-degenerate flavour doublet can be computed as an online measurement. The frequency can be specified in the
NDPOLY monomial with the input parameter