tmLQCD: a program suite to simulate Wilson Twisted mass Lattice QCD
Karl Jansen (a) and Carsten Urbach (b)

(a) DESY, Platanenallee 6, Zeuthen, Germany
(b) Humboldt-Universität zu Berlin, Institut für Physik, Newtonstr. 15, 12489 Berlin, Germany
Abstract
We discuss a program suite for simulating Quantum Chromodynamics on a 4-dimensional space-time lattice. The basic Hybrid Monte Carlo algorithm is introduced and a number of algorithmic improvements are explained. We then discuss the implementation of these concepts, as well as our parallelisation strategy, in the actual simulation code. Finally, we provide a user guide to compile and run the program.

PACS: 11.15.-q, 11.15.Ha, 11.38.-t, 11.38.Gc
Preprint-Numbers: DESY 09-073, HU-EP-09/23, SFB/CPP-09-43
Key words:
Hybrid Monte Carlo algorithm; Lattice QCD;
PROGRAM SUMMARY
Manuscript Title: tmLQCD: a program suite to simulate Wilson Twisted mass Lattice QCD
Authors: K. Jansen and C. Urbach
Program Title: tmLQCD
Journal Reference:
Catalogue identifier:
Licensing provisions:
GNU General Public License (GPL)
Programming language: C
Computer: any
Operating system: any with a standard C compiler
RAM: no typical values available
Number of processors used: one, or optionally any even number of processors. Tested with up to 32768 processors.

Preprint submitted to Elsevier 29 August 2018
Keywords:
Hybrid Monte Carlo algorithm; Lattice QCD;
PACS:
Classification:
External routines/libraries:
LAPACK [1] and LIME [2] libraries.
Nature of problem:
Quantum Chromodynamics.
Solution method:
Markov Chain Monte Carlo using the Hybrid Monte Carlo algorithm with mass preconditioning and multiple time scales [3]. Iterative solvers for large systems of linear equations.
Restrictions:
Restricted to an even number of (not necessarily mass degenerate) quark flavours in the Wilson or Wilson twisted mass formulation of lattice QCD.
Additional comments: none.
Running time:
Depending on the problem size, the architecture and the input parameters, from a few minutes to weeks.
References:
[1] .
[2] USQCD, http://usqcd.jlab.org/usqcd-docs/c-lime/
[3] C. Urbach, K. Jansen, A. Shindler and U. Wenger, Comput. Phys. Commun., 87 (2006).

LONG WRITE-UP

1 Introduction
This contribution to the anniversary issue of CPC deals with the strong force in Particle Physics. The strong force is presumably the least well understood fundamental interaction between elementary particles. It is responsible for the existence of protons and neutrons, or more generally all nuclei, as bound states. The constituents of the nuclei are the quarks and gluons as the fundamental particles. It is interesting to observe that the energy (mass) of a proton has a size of about 1 GeV, while the mass of the two constituent up and down quarks is of the order of only a few MeV. Hence, by far the biggest contribution to the proton mass is pure binding energy.

This shows already that a description of the proton in terms of the underlying quark and gluon degrees of freedom must be highly non-trivial. The model that is believed to provide a theoretical framework for the strong interaction, and should give such a description, is Quantum Chromodynamics (QCD). Although this theory can be written in a very compact mathematical form, it is a highly non-linear theory that does not allow for a closed analytical solution.

However, and rather fortunately, QCD can be reformulated in such a way that computational physics methods can be applied to calculate properties of QCD from first principles and without relying on approximations. In this approach, space and time are rendered discrete and a lattice spacing a is introduced. Thus, a 4-dimensional space-time lattice is considered, and the quark and gluon degrees of freedom are placed on the lattice points or on so-called links that connect lattice points. In this way one obtains a model of "quark" spins, which are coupled to nearest neighbours only, very much reminiscent of an Ising model in statistical physics.
Indeed, the methods of statistical physics, namely the evaluation of the partition function by means of numerical simulations using Monte Carlo methods employing importance sampling, are the key to address QCD on the 4-dimensional lattice, which we will refer to as lattice QCD (LQCD).

Although the concepts to treat LQCD numerically are very clear, the problem has an intrinsically extremely high computational demand. The crucial factor is that in the end the introduced discretisation has to be removed again. If we consider a lattice with, say, L = 3 fm linear extent and a lattice spacing of a = 0.1 fm, we arrive at N = (3/0.1)⁴ ≈ 10⁶ lattice points for such a more or less reasonable physical situation. Simulations on such a lattice already require several Teraflops in LQCD.

Fig. 1. Computer resources needed to generate 1000 independent configurations of size 24³ × 40 at a lattice spacing of about 0.08 fm with pure Wilson fermions, in units of Tflops · years, as a function of m_PS/m_V. In (a) we show the prediction of Ref. [1] from the 2001 lattice conference in Berlin (hence titled "Berlin Wall"). In (b) we compare the formula of Ref. [1] (solid line) with the performance of the algorithm described in this paper and first published in Ref. [2]. The dashed line is to guide the eye. Note that the scale of the vertical axis differs between the two panels.

Keeping L = 3 fm fixed and decreasing the lattice spacing by, say, a factor of two to remove the discretisation increases the cost of the simulation already by a factor of 2⁴ from the growth of the number of lattice points alone; the algorithms used contribute yet further factors. Hence, going to really fine discretisations, where the effects of the non-zero lattice spacing can safely be neglected or at least a controlled extrapolation to zero values of the lattice spacing can be performed, is an extremely demanding computational challenge which will finally require at least Petaflops computing, an area of computing power we are realising today.

However, even with the advent of Petaflops computers, the goal of "solving" QCD on a lattice would be completely out of reach without some algorithmic improvements that were invented in recent years. This is shown in fig. 1. In the left panel of the graph, the number of Teraflops · years for a certain typical simulation is shown as a function of the ratio of two meson masses, the pseudo scalar and the vector meson. The graph derives from the known cost of the algorithm used in the year 2001 [1].

It is important to realise that the meson mass ratio used in fig. 1 assumes a value of about 0.2 in nature as measured in experiments, which is indicated by the arrow in both panels of fig. 1. The figure clearly demonstrates the strongly growing cost of the simulations when the real physical situation is to be reached. In fact, simulations directly at the physical value of this mass ratio were impossible in 2001.
The right panel of the graph demonstrates the change of the situation when algorithmic improvements such as the ones described in this article are used. In fact, the simulation costs shown in the right panel of fig. 1 originate from direct performance measurements of the code that is described here. As the figure demonstrates, although the simulations at the physical value of the meson mass ratio are still rather demanding, they become clearly realistic with today's Petaflops systems. There are other approaches to improve the HMC algorithm with similar results [3,4,5,6,7,8]. Very promising is the recent additional improvement using inexact deflation presented in Ref. [9].

The algorithmic improvements therefore provided a tremendous gain, opening a way for simulations in LQCD that were unthinkable a few years ago. It is precisely the goal of this contribution to describe one programme version of the underlying Hybrid Monte Carlo (HMC) algorithm in which a number of such improvements have been incorporated, and to make the corresponding code publicly available. The current version of the code and future updates can be downloaded from the web [10].

Quantum Chromodynamics on a hyper-cubic Euclidean space-time lattice of size L³ × T with lattice spacing a is formally described by the action

S = S_G[U] + a⁴ Σ_x ψ̄ D[U] ψ ,   (1)

with S_G some suitable discretisation of the Yang-Mills action and D a discretisation of the Dirac operator, for which Wilson originally proposed [12] to use the so-called Wilson Dirac operator

D_W[U] = (1/2) [ γ_µ (∇_µ + ∇*_µ) − a ∇*_µ ∇_µ ] ,   (2)

with ∇_µ and ∇*_µ the forward and backward gauge covariant difference operators, respectively:

∇_µ ψ(x) = (1/a) [ U(x, x + aµ̂) ψ(x + aµ̂) − ψ(x) ] ,
∇*_µ ψ(x) = (1/a) [ ψ(x) − U†(x, x − aµ̂) ψ(x − aµ̂) ] ,   (3)

where we denote the SU(3) link variables by U_{x,µ}.
We shall set a ≡ 1 in the following. The Wilson twisted mass Dirac operator reads

D_tm = (D_W[U] + m) 1_f + i µ_q γ_5 τ³   (4)

for a mass degenerate doublet of quarks. We denote by m the bare (Wilson) quark mass, µ_q is the bare twisted mass parameter, τ^i is the i-th Pauli matrix and 1_f the unit matrix acting in flavour space (see appendix A for our conventions). In the framework of Wilson twisted mass QCD only flavour doublets of quarks can be simulated; however, the two quarks do not need to be degenerate in mass. The corresponding mass non-degenerate flavour doublet reads [14]

D_h(µ̄, ε̄) = D_W · 1_f + i µ̄ γ_5 τ¹ + ε̄ τ³ .   (5)

Note that this notation is not unique. Equivalently – as used in Ref. [15] – one may write

D'_h(µ_σ, µ_δ) = D_W · 1_f + i µ_σ γ_5 τ³ + µ_δ τ¹ ,   (6)

which is related to D_h by D'_h = (1 + iτ²) D_h (1 − iτ²)/2 and (µ_σ, µ_δ) → (µ̄, −ε̄).

For the purpose of introducing the Hybrid Monte Carlo (HMC) algorithm we shall consider only the Wilson twisted mass formulation of lattice QCD with one doublet of mass degenerate quarks with bare quark mass m and bare twisted mass µ_q. The extension to more than one flavour doublet of quarks is straightforward. The corresponding polynomial HMC algorithm used for simulating the mass non-degenerate flavour doublet is discussed in the following sub-section.

After integrating out the Grassmann valued fermion fields, in lattice QCD one needs to evaluate the integral

Z = ∫ DU det(Q†Q) e^{−S_G} ,   (7)

by Markov Chain Monte Carlo methods, with some discretisation of the Yang-Mills gauge action S_G and

Q ≡ γ_5 (D_W[U] + m) + i µ_q ,   (8)

with the Wilson-Dirac operator D_W of eq. (2). Note that Q acts now on one flavour only. The determinant can be re-expressed using complex valued, so-called pseudo fermion fields φ and φ†:

det(Q†Q) ∝ ∫ Dφ Dφ† e^{−(Q⁻¹φ, Q⁻¹φ)} ,   (9)

where S_PF ≡ |Q⁻¹φ|² is called the pseudo fermion action. The HMC algorithm [16] is then defined by introducing traceless hermitian momenta P_{x,µ} (conjugate to the fundamental fields U_{x,µ}) and a Hamiltonian

H(U, P) = Σ_{x,µ} (1/2) Tr[P²_{x,µ}] + S_G[U] + S_PF[U] .   (10)

Given H, the algorithm is composed of a molecular dynamics update of the fields (U, P) → (U', P') and a Metropolis accept/reject step with respect to H, using the acceptance probability

P_acc = min(1, exp(−[H(U', P') − H(U, P)])) .   (11)

While the momenta P are generated at the beginning of a trajectory – in the so-called heat-bath step – randomly from a Gaussian distribution, the pseudo fermion fields φ are generated by first generating random fields r and then setting φ = Qr, such that exp{−(Q⁻¹φ, Q⁻¹φ)} = exp{−r†r}. Note that the pseudo fermion fields are not evolved during the molecular dynamics part of the HMC algorithm.

In the molecular dynamics (MD) part of the HMC algorithm the momenta and gauge fields are updated according to the Hamiltonian equations of motion

d/dτ P_{x,µ} = −F(x, µ) ,
d/dτ U_{x,µ} = i P_{x,µ} U_{x,µ} ,   (12)

with respect to a fictitious computer time τ and forces F, which are obtained by differentiating the action with respect to the gauge fields U and take values in the Lie algebra of SU(3). The differentiation D_U of some function f(U) is defined as

D^a_U f(U) = ∂/∂α f(e^{iα t^a} U) |_{α=0} ,

where t^a are the generators of su(3).

Since these equations can in general not be integrated analytically, one uses numerical integration methods, which must be area preserving and reversible. Symmetrised symplectic integrators fulfil these requirements, with the simplest example being the leap-frog algorithm. The basic discrete update steps with integration step size ∆τ of the gauge field and the momenta can be defined as

T_U(∆τ) : U → U' = exp(i ∆τ P) U ,
T_S(∆τ) : P → P' = P − i ∆τ F .   (13)

The leap-frog algorithm is then obtained by sequential application of T = T_S(∆τ/2) T_U(∆τ) T_S(∆τ/2), i.e. for a trajectory of length τ one needs to apply T^{N_MD} with N_MD = τ/∆τ.
Preconditioning is usually performed by factorising

det(Q†Q) = det(R_1†R_1) · det(R_2†R_2) · · · det(R_n†R_n)   (14)

with suitably chosen R_1, R_2, ..., R_n. For every R_i a separate pseudo fermion field φ_i is introduced, such that the Hamiltonian reads

H(U, P) = Σ_{x,µ} (1/2) Tr[P²_{x,µ}] + S_G[U] + Σ_{i=1}^n S_PF_i ,   (15)

and the equations of motion are changed to

d/dτ P_{x,µ} = − Σ_{i=0}^n F_i(x, µ) ,
d/dτ U_{x,µ} = i P_{x,µ} U_{x,µ} ,

where we identify F_0 with the force stemming from the gauge action S_G.

The factorisation in eq. (14) can be achieved in many different ways, see for instance Refs. [3,4,5,6,7,8]. Here we shall only discuss what is known as mass preconditioning or the Hasenbusch trick [17,18,19]. It is obtained by writing the identity

det(Q†Q) = det(W†W) · det(Q†Q)/det(W†W) ,   (16)

where W = D_W + m + i µ_2 γ_5 with µ_2 > µ_q. By adjusting the additional mass parameter µ_2, the condition numbers of W†W and (Q†Q)/(W†W) can both be reduced with respect to that of Q†Q alone. As argued in Ref. [20], the optimal choice leads to a condition number of √k for both W†W and (Q†Q)/(W†W), where k is the condition number of Q†Q. A reduced condition number leads to reduced force contributions in the MD evolution and hence allows for larger values of ∆τ.

It is important to notice that evaluating the force contribution stemming from (Q†Q)/(W†W) is more expensive in terms of computer time than the evaluation of the contribution from W†W, since it involves the iterative solution of φ' = (Q†Q)⁻¹φ with the large condition number k. Thus, the algorithm might be further improved by not tuning the condition numbers equal as explained beforehand, but by introducing a multiple time scale integration scheme as follows.

Considering a Hamiltonian like in eq. (15) we may introduce n + 1 timescales ∆τ_i with

∆τ_i = τ/N_MD_i ,   N_MD_i = N_n · N_{n−1} · · · N_i ,

and basic discrete update steps

T_U(∆τ) : U → U' = exp(i ∆τ P) U ,
T_S_i(∆τ) : P → P' = P − i ∆τ F_i ,   (17)

with 0 ≤ i ≤ n. We have identified S_0 ≡ S_G. The leap frog update on timescale i is then recursively defined as

T_i = T_S_0(∆τ_0/2) T_U(∆τ_0) T_S_0(∆τ_0/2)                   for i = 0 ,
T_i = T_S_i(∆τ_i/2) [T_{i−1}]^{N_{i−1}} T_S_i(∆τ_i/2)         for 0 < i ≤ n ,   (18)

and the full trajectory of length τ is eventually achieved by [T_n]^{N_n}.

As was shown in Ref. [2] – and for other factorisations of the determinant in Refs. [3,7,21] – the combination of multiple time scale integration and a determinant factorisation allows one to set the algorithm up such that the most expensive operator contributes least to the MD forces. It can then be integrated on the outermost timescale and must be inverted less often.

In the last paragraphs we have introduced the simplest reversible and area preserving integration scheme, known as the leap frog integration scheme. There are more involved integration schemes available, partly or completely cancelling higher order discretisation errors. It turns out that completely cancelling higher order effects is not necessary and often not even efficient. Integration schemes with reduced errors are for example the so-called n-th order minimal norm integration schemes; for details see Ref. [22] and references therein. The second order minimal norm (2MN) integration scheme is based on the update step

T = T_S(λ∆τ) T_U(∆τ/2) T_S((1 − 2λ)∆τ) T_U(∆τ/2) T_S(λ∆τ) ,
T_i = T_S_i(λ_i∆τ_i) [T_{i−1}]^{N_{i−1}} T_S_i((1 − 2λ_i)∆τ_i) [T_{i−1}]^{N_{i−1}} T_S_i(λ_i∆τ_i) ,   (19)

where λ_i is a dimensionless parameter; the 2MN scheme coincides with the Sexton-Weingarten scheme [23] in case λ_i = 1/6. The optimal value for λ_i was given in Ref. [22] to be around 0.19, but its value is likely to depend on the mass values and the time scale under consideration. Note that there is a parameter λ_i for each timescale ∆τ_i, which can be tuned separately.

While all the integration schemes introduced so far were based on the order T_S T_U T_S, it is also possible to revert this order. In this case one speaks of the position version of the corresponding integration scheme, while the usual one is called the velocity version. Under certain circumstances the position versions can be more efficient, because one less application of T_S is needed. The corresponding update steps can easily be derived from the formulae provided above.

In the framework of Wilson twisted mass fermions it is only possible to simulate flavour doublets of quarks. Hence, if one wants to include the strange quark in the simulation, one also needs to include the charm. The corresponding mass non-degenerate doublet was defined in equation (5). Simulating such a flavour doublet operator is possible using the polynomial HMC (PHMC) algorithm [24,25,26]. The basic problem that occurs in the mass non-degenerate case is that a single flavour has to be taken into account, or equivalently the determinant of a single operator Q needs to be treated. The PHMC algorithm can solve this problem elegantly.

The idea of the PHMC is based on writing

det(Q) = det(√(Q²)) ≈ det(P⁻¹_{n,ε}(Q²)) ∝ ∫ Dφ Dφ† e^{−φ† P_{n,ε}(Q²) φ} ,

assuming Q is positive. P_{n,ε}(s) is a polynomial approximation of 1/√s of degree n in the interval [ε, 1]:

P_{n,ε}(s) = (1/√s) {1 + R_{n,ε}} ,   s = Q² ,   (20)

where R_{n,ε} is the error term. It can be shown that for the case of Chebysheff polynomials |R| vanishes exponentially fast with the degree n (for large n). For more details regarding this issue we refer the reader for instance to Refs. [27,28] and references therein.

It is worth noticing that representing an inverse operator by a polynomial has conceptual advantages.
It allows one to treat certain regions of the eigenvalue spectrum of the operator in different ways, and therefore to separate the infrared from the bulk and ultraviolet parts of the spectrum. Although this has been the main underlying idea of the PHMC algorithm [24,25,26], we will use it here only as a technical tool to treat single flavours in the simulations.

For our purpose – introducing Q_h = γ_5 D_h – we can rewrite the corresponding determinant as

det(Q_h) ∝ ∫ DΦ† DΦ e^{−Φ† P_{n,ε}(s) Φ} ,

with s = Q_h†Q_h, where the pseudo fermion fields Φ are now two flavour fields. Note that D_h† = τ¹ γ_5 D_h γ_5 τ¹. The application of the polynomial P to a pseudo fermion field Φ can be performed either by using the Clenshaw recursion relation [29], or by using the product representation

P_{n,ε}(s) Φ = [ Π_{i=1}^n c (s − z_i) ] Φ ≡ B(s) · B(s)† Φ ,

with z_i the complex roots of P and c a suitably chosen normalisation constant. The product representation is conveniently used in the MD update. For the choice of polynomials, the determination of their roots, and how to order them to avoid round-off errors, see appendix C.

The HMC algorithm requires an area preserving and reversible MD update procedure; however, there is no need to use the same operator in the MD update as in the heat-bath step. As long as the acceptance rate is sufficiently high, we are free to use any other operator in the update. In order to exploit this possibility we introduce a second, more precise polynomial

P̃_{m,δ}(s) = (1/(P_{n,ε}(s) √s)) {1 + R_{m,δ}} ,   (21)

which is used in the heat-bath step to generate the pseudo fermion fields from a random field R:

Φ = P̃ B† Q_h R .

P is then used only in the MD update. The polynomial degrees n, m and the approximation intervals have to be determined such as to guarantee a good approximation of 1/√s in the range of eigenvalues of Q_h†Q_h. One may also adopt a strategy to choose ε or δ larger than a few lowest eigenvalues of Q_h†Q_h and use re-weighting to correct for this [24,25].

Even/Odd preconditioning
The (P)HMC algorithm is implemented using even/odd preconditioning [30,31], which is discussed briefly in appendix B. We want to stress that although even/odd preconditioning is a rather technical step, it leads to a very important improvement of the algorithm performance and is a cornerstone of all HMC implementations in the field.
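Schematically, even/odd preconditioning rests on the Schur complement identity for an operator written in blocks connecting even (e) and odd (o) sites; the precise operators used in the code are given in appendix B, so the following is only a generic sketch:

```latex
Q \;=\;
\begin{pmatrix}
Q_{ee} & Q_{eo}\\
Q_{oe} & Q_{oo}
\end{pmatrix},
\qquad
\det Q \;=\; \det\!\left(Q_{ee}\right)\,
\det\!\left(Q_{oo} - Q_{oe}\,Q_{ee}^{-1}\,Q_{eo}\right).
```

Since the hopping term of the Wilson Dirac operator connects only even with odd sites, Q_ee is site-diagonal and cheap to invert, and the simulation can work with the better conditioned Schur complement acting on the odd sites only.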
The theory is discretised on a finite, hyper-cubic space-time lattice with extension L³ × T ≡ Π_µ L_µ. The boundary conditions for the gauge fields U_{x,µ} are chosen to be periodic, i.e.

U_{x + L_ν ν̂, µ} = U_{x,µ} ,

where ν̂ is a unit vector in direction ν. For the fermionic fields ψ(x) we allow for more general boundary conditions, namely so-called twisted boundary conditions

ψ(x + L_ν ν̂) = e^{iθ_ν π} ψ(x) .

Periodic boundary conditions correspond to θ_ν = 0, while anti-periodic boundary conditions are achieved by setting θ_ν = 1. More generally, twisted boundary conditions allow one to realise arbitrary values of momentum transfer on the lattice by a convenient re-interpretation of the phases [32]; for instance, the allowed lattice momenta in direction ν are shifted to p_ν = (2π n_ν + π θ_ν)/L_ν.

The general strategy of the tmLQCD package is to provide programs for the main applications used in lattice QCD with Wilson twisted mass fermions. The code and the algorithms are designed to be general enough to compile and run efficiently on any modern computer architecture. This is achieved code-wise by using standard C as the programming language and, for the parallelisation, the Message Passing Interface (MPI) standard version 1.1.

Fig. 2. Flowchart for the hmc_tm executable.

Performance improvements are achieved by providing dedicated code for certain widely used architectures, like PCs or the Blue Gene family. Dedicated code is mainly available for the kernel routine – the application of the Dirac operator, which will be discussed in detail in section 4.1 – and for the communication routines.

The tmLQCD package provides three main applications. The first is an implementation of the (P)HMC algorithm; the second and the third are executables to invert the Wilson twisted mass Dirac operator (4) and the non-degenerate Wilson twisted mass Dirac operator (5), respectively. All three have a wide range of run-time options, which can be set using an input file. The syntax of the input file is explained in the documentation which ships with the source code.
The relevant input parameters will be mentioned in the following where appropriate, to ease usage. We shall first discuss the general layout of the three aforementioned applications, followed by a general discussion of the parallelisation strategy used in all three of them.

3.1 hmc_tm

In figure 2 the programme flow of the hmc_tm executable is depicted. In the first block the input file is parsed and parameters are set accordingly. Then the required memory is allocated and, depending on the input parameters, data is read from disk in order to continue a previous run.

The main part of this application is the molecular dynamics update. For a number of trajectories, which must be specified in the input file, first a heat-bath step is performed, then the integration of the equations of motion using the integrator specified in the input file, and finally the acceptance step.

After each trajectory certain online measurements are performed, such as measuring the plaquette value. Other online measurements are optional, like measuring the pseudo scalar correlation function.
The programme offers command line options as follows:

• -h|? prints a help message and exits.
• -f input file name. The default is hmc.input
• -o the prefix of the output filenames. The default is output. The code will generate or append to two files, output.data and output.para.

The parameters of each run are read from an input file with the default name hmc.input. If it is missing, all parameters will be set to their default values. Any parameter not set in the input file will also be set to its default value. During the run the hmc_tm program will generate two output files, one called per default output.data, the other one output.para. Into the latter, important parameters will be written at the beginning of the run.

The file output.data has several columns with the following meanings:
(1) Plaquette value.
(2) ∆H
(3) exp(−∆H)
(4) a pair of integers for each pseudo fermion monomial. The first integer of each pair is the sum of solver iterations needed in the acceptance and heat-bath steps, the second is the sum of iterations needed for the force computation of the whole trajectory.
(5) Acceptance (0 or 1).
(6) Time in seconds needed for this trajectory.
(7) Value of the rectangle part in the gauge action, if used.

Every new run will append its numbers to an already existing file.

In addition, the program will create a file history_hmc_tm. This file provides a mapping between the configuration number and its plaquette and Polyakov loop values. Moreover, the simulation parameters are stored there, and in case of a reread the time point can be found there.

After every trajectory the program will save the current configuration in the file conf.save.

3.2 invert and invert_doublet

Fig. 3. Flowchart for the main part of the invert and invert_doublet executables.

The two applications invert and invert_doublet are very similar.
The main difference is that in invert the one flavour Wilson twisted mass Dirac operator is inverted, whereas in invert_doublet it is the non-degenerate doublet.

The main part of the two executables is depicted in figure 3. Each measurement corresponds to one gauge configuration that is read from disk into memory. For each of these gauge configurations a number of inversions will be performed. The sources can either be generated or read in from disk. In the former case the programme can currently generate point sources at random locations in space-time. In the latter case the name of the source file can be specified in the input file.

The relevant Dirac operator is then inverted on each source and the result is stored on disk. The inversion can be performed with a number of inversion algorithms, such as conjugate gradient (CG), BiCGstab, and others [33]. Optionally, even/odd preconditioning as described previously can be used.
The two programmes offer command line options as follows:

• -h|? prints a help message and exits.
• -f input file name. The default is hmc.input
• -o the prefix of the output filenames. The default is output. The code will generate or append to one file called output.para.

The program will create a file called output.data with information about the parameters of the run. Of course, the propagators are also stored on disk. The corresponding file names can be influenced via input parameters. The file format is discussed in some detail in sub-section 4.7.

One particularity of the invert_doublet program is that the propagators written to disk correspond to the two flavour Dirac operator of eq. (6), i.e.

D'_h(µ_σ, µ_δ) = D_W · 1_f + i µ_σ γ_5 τ³ + µ_δ τ¹ ,

essentially for compatibility reasons. Of the two flavour components written, the first is the would-be strange component and the second one the would-be charm component.

The whole lattice can be parallelised in up to 4 space-time directions. This is controlled with configure switches, see section 5.2. The Message Passing Interface (MPI, standard version 1.1) is used to implement the parallelisation, so for compiling the parallel executables a working MPI implementation is needed. Depending on the number of parallelised space-time directions, the t-direction, the t- and x-directions, the t-, x- and y-directions, or the t-, x-, y- and z-directions are parallelised.

The number of processors per spatial direction must be specified at run time, i.e. in the input file. The relevant parameters are NrXProcs, NrYProcs and NrZProcs. The number of processors in the time direction is determined automatically by the program. Note that the extension in any direction must be divisible by the number of processors in this direction. In case of even/odd preconditioning further constraints have to be fulfilled: the local L_z and the local product L_t × L_x × L_y must both be even.

Fig. 4. Boundary exchange in a two dimensional parallel setup. One can see that the internal boundary is sent while the external one is received. The corners – needed for implementing improved gauge actions like the tree-level Symanzik improved gauge action [34] – need a two step procedure.
The communication is organised using boundary buffers, as sketched in figure 4. The MPI setup is contained in the file mpi_init.c. The corresponding function must be called at the beginning of a main program just after the parameters are read in, also in the case of a serial run. In this function the various MPI_Datatypes needed for the exchange of the boundary fields are also constructed. The routines performing the communication for the various data types are located in files starting with xchange.

The communication is implemented using different types of MPI functions. One implementation uses the MPI_Sendrecv function to communicate the data, a second one uses non-blocking MPI functions and a third one persistent MPI calls; see the MPI standard for details [35]. On machines with a network capable of sending in several directions in parallel, the non-blocking version is the most efficient one. The relevant configure switches are --with-nonblockingmpi and --with-persistentmpi, the latter of which is only available for the Dirac operator with halfspinor fields, see section 4.1.
The Dirac operator is the kernel routine of any lattice QCD application, because its inverse is needed for the HMC update procedure and also for computing correlation functions. The inversion is usually performed by means of iterative solvers, like the conjugate gradient algorithm, and hence the repeated application of the Dirac operator to a spinor field is needed. Thus the optimisation of this routine deserves special attention.

At some space-time point x the application of a Wilson type Dirac operator is mainly given by

φ(x) = (m + 4r + i µ_q γ_5) ψ(x)
     − (1/2) Σ_{µ=1}^4 [ U_{x,µ} (r + γ_µ) ψ(x + aµ̂) + U†_{x−aµ̂,µ} (r − γ_µ) ψ(x − aµ̂) ] ,   (22)

where r is the Wilson parameter, which we set to one in the following. The most computer time consuming part is the nearest neighbour interaction part. For this part it is useful to observe that

(1 ± γ_µ) ψ

has only two independent spinor components; the other two follow trivially. So only two of the components need to be computed, then multiplied with the corresponding gauge field U, and then the other two components are reconstructed.

The operation in eq. (22) must be performed for each space-time point x. If the loop over x is performed such that all elements of φ are accessed sequentially (one output stream), it is clear that the elements of ψ and U cannot be accessed sequentially as well. This non-sequential access may lead to serious performance degradations due to too many cache misses, because modern processing units have only a very limited number of input streams available. While the ψ field usually differs from one application of the Dirac operator to the next, the gauge field often stays the same for a large number of applications. This is for instance the case in iterative solvers, where the Dirac operator is applied O(1000) times with fixed gauge fields.
Therefore it is useful to construct a double copy of the original gauge field, sorted such that the elements are accessed exactly in the order needed in the Dirac operator. For the price of additional memory, this simple change can yield large performance improvements, depending on the architecture. The double copy must be updated whenever the gauge field changes. This feature is available in the code at configure time, the relevant switch is --with-gaugecopy.

Above we were assuming that we run sequentially through the resulting spinor field φ. Another possibility is to run sequentially through the source spinor field ψ. Moreover, one could split up the operation (22) following the standard trick of introducing intermediate result vectors ϕ± with only two spinor components per lattice site. Concentrating on the hopping part only, we would have
\[
\varphi^+(x,\mu) = P^{\rightarrow}_{+\mu}\, U_{x,\mu}\,(r+\gamma_\mu)\,\psi(x)\,,\qquad
\varphi^-(x,\mu) = P^{\rightarrow}_{-\mu}\,(r-\gamma_\mu)\,\psi(x)\,. \qquad (23)
\]
From ϕ± we can then reconstruct the resulting spinor field as
\[
\phi(x) = \sum_\mu P^{\leftarrow}_{+\mu}\,\varphi^+(x+a\hat\mu,\mu)
 + \sum_\mu P^{\leftarrow}_{-\mu}\,U^\dagger_{x-a\hat\mu,\mu}\,\varphi^-(x-a\hat\mu,\mu)\,. \qquad (24)
\]
Here we denote with P^→_{±μ} the projection to the two independent spinor components of (1 ± γ_μ)ψ and with P^←_{±μ} the corresponding reconstruction. The half spinor fields ϕ± can be interlaced in memory such that ψ(x) as well as ϕ±(x) are always accessed sequentially in memory. The same is possible for the gauge fields, as explained above; only for φ can strided access not be avoided. So far we have only introduced extra fields ϕ±, which need to be loaded from and stored to main memory, and divided the Dirac operator into the two steps (23) and (24), which are well balanced with regard to memory bandwidth and floating point operations.

The advantage of this implementation of the Dirac operator comes in the parallel case. In step (23) only elements of ψ(x) are needed, which are locally available on each node, so this step can be performed without any communication.
In between steps (23) and (24) one then needs to communicate part of ϕ±, however only half the amount of data compared to a communication of ψ. After the second step no further communication is needed. Hence, one can reduce the amount of data to be sent by a factor of two.

There is yet another performance improvement possible with this form of the Dirac operator, this time for the price of precision. One can store the intermediate fields ϕ± with reduced precision, e.g. in single precision when the regular spinor fields are in double precision. This leads to a result with reduced precision, but in situations where this is not important, as for instance in the MD update procedure, it reduces the data to be communicated by another factor of two; the required memory bandwidth is reduced as well. This version of the hopping matrix (currently it is only implemented for the hopping matrix) is available at configure time with the switch --enable-halfspinor. The reduced precision version (sloppy precision) is available through the input parameter UseSloppyPrecision. It will be used in the MD update where appropriate. Moreover, it is implemented in the CG iterative solver following the ideas outlined in Ref. [36] for the overlap operator.

The various implementations of the Dirac operator can be found in the file
D_psi.c and – as needed for even/odd preconditioning – the hopping matrix in the file Hopping_Matrix.c. There are many different versions of these two routines available, each optimised for a particular architecture, e.g. for the Blue Gene/P double hummer processor or the streaming SIMD extensions of modern PC processors (SSE2 and SSE3), see also Ref. [37]. Martin Lüscher has made available his standard C and SSE/SSE2 Dirac operators [38] under the GNU General Public License, which are partly included in the tmLQCD package.
The IBM PowerPC 450d processor used on the Blue Gene architecture provides a dual FPU, which supports a set of SIMD operations working on 32 special registers useful for lattice QCD. These operations can be accessed using built-in functions of the IBM XLC compiler. The file bgl.h contains all macros relevant for the Blue Gene version of the hopping matrix and the Dirac operator.

A small fraction of the half spinor version (see above) is given in algorithm 1, which represents the operation ϕ+ = κ U P^→(1 + γ_0)ψ. After loading the components of ψ into the special registers and prefetching the gauge field for the next direction, P^→(1 + γ_0)ψ is performed. It is then important to load the gauge field U only once from memory to registers and to multiply both spinor components in parallel. Finally the result is multiplied with κ (which also carries a phase factor due to the way we implement the boundary conditions, see the next sub-section) and stored in memory.

Algorithm 1 ϕ+ = κ U P^→(1 + γ_0)ψ
  // load components of ψ into registers
  _bgl_load_rs0((*s).s0);
  _bgl_load_rs1((*s).s1);
  _bgl_load_rs2((*s).s2);
  _bgl_load_rs3((*s).s3);
  // prefetch gauge field for the next direction
  _prefetch_su3(U+1);
  // do now first P^→(1 + γ_0)ψ
  _bgl_vector_add_rs2_to_rs0_reg0();
  _bgl_vector_add_rs3_to_rs1_reg1();
  // now multiply both components at once with gauge field U and κ
  _bgl_su3_multiply_double((*U));
  _bgl_vector_cmplx_mul_double(ka0);
  // store the result
  _bgl_store_reg0_up((*phi[ix]).s0);
  _bgl_store_reg1_up((*phi[ix]).s1);

As discussed previously, we allow for arbitrary phase factors in the boundary conditions of the fermion fields. This is conveniently implemented in the Dirac operator as a phase factor in the hopping term
\[
\sum_\mu \Bigl[ e^{i\theta_\mu\pi/L_\mu}\, U_{x,\mu}\,(r+\gamma_\mu)\,\psi(x+a\hat\mu)
 + e^{-i\theta_\mu\pi/L_\mu}\, U^\dagger_{x-a\hat\mu,\mu}\,(r-\gamma_\mu)\,\psi(x-a\hat\mu)\Bigr]\,.
\]
The relevant input parameters are
ThetaT, ThetaX, ThetaY, ThetaZ.

We assume in the following that the action to be simulated can be written as
\[
S = S_G + \sum_{i=1}^{N_{\rm monomials}} S_{{\rm PF}_i}\,,
\]
and we call – following the CHROMA notation [39] – each term in this sum a monomial. We require that there is exactly one gauge monomial S_G and an arbitrary number of pseudo fermion monomials S_{PF_i}.

As a data type, every monomial must know how to compute its contribution to the initial Hamiltonian H at the beginning of each trajectory in the heat-bath step. Then it must know how to compute the derivative with respect to the gauge fields for given gauge field and pseudo fermion field, as needed for the MD update. And finally there must be a function to compute its contribution to the final Hamiltonian H′ as used in the acceptance step. In addition, for each monomial it needs to be known on which timescale it should be integrated. The corresponding data type is sketched in figure 5 (data type monomial and its components). The general definitions for this data type can be found in the file monomial.c. There are several sorts of monomials implemented:

• DET: pseudo fermion representation of the (mass degenerate) simple determinant det(Q²(κ) + μ²)
• DETRATIO: pseudo fermion representation of the determinant ratio det(Q²(κ₁) + μ₁²)/det(Q²(κ₂) + μ₂²)
• NDPOLY: polynomial representation of the (possibly non-degenerate) doublet [det(Q_nd(ε̄, μ̄)²)]^{1/2}
• GAUGE:
\[
S_G = \beta \sum_x \Bigl( c_0 \sum_{\substack{\mu,\nu=1\\ 1\le\mu<\nu}}^{4}\bigl\{1 - \tfrac{1}{3}\,{\rm Re\,Tr}(U^{1\times1}_{x,\mu,\nu})\bigr\}
 + c_1 \sum_{\substack{\mu,\nu=1\\ \mu\neq\nu}}^{4}\bigl\{1 - \tfrac{1}{3}\,{\rm Re\,Tr}(U^{1\times2}_{x,\mu,\nu})\bigr\}\Bigr)\,.
\]
The parameter c₁ can be set in the input file and c₀ = 1 − 8c₁. Note that c₁ = 0 corresponds to the Wilson plaquette gauge action.

The corresponding specific functions are defined in the files det_monomial.c, detratio_monomial.c, ndpoly_monomial.c and gauge_monomial.c.
Additional

Algorithm 2 integrate
Require: 0 < n_ts ≤ N_ts, τ > 0
  Δτ = τ/noSteps[n_ts]
  for i = 0 to noSteps[n_ts] do
    if n_ts == 1 then
      updateGauge(Δτ)
    else
      integrate(n_ts − 1, Δτ)
    end if
    updateMomenta(Δτ, monomialList[n_ts])
  end for

monomials can easily be implemented by providing the corresponding functions as discussed above.

The integration scheme is implemented recursively, as exemplified in algorithm 2 for the leap-frog integration scheme (where we skipped half steps for simplicity). The updateMomenta function simply calls the derivative functions of all monomials that are integrated on timescale n_ts and updates the momenta P according to the time step Δτ.

The recursive scheme for the integration can easily be extended to more involved integration schemes. The details can be found in the file integrator.c. We have implemented the leap-frog and the second order minimal norm [22] integration schemes. These two can be mixed on different timescales. In addition we have implemented a position version of the second order minimal norm integration scheme. The latter must not be mixed with the former two.

The MD update is summarised in algorithm 3. It computes the initial and final Hamiltonians and calls in between the integration function with the total number of timescales N_ts and the total trajectory length τ.

As shortly discussed previously, as long as the integration in the MD update is reversible and area preserving there is large freedom in choosing the integration scheme, but also the operator: it is not necessary to use the Dirac operator here, it can be any approximation to it. This is only useful if the acceptance rate is not strongly affected by such an approximation.

The code provides two possibilities to adapt the precision of the Dirac operator used in the MD update: the first is to reduce the precision in the inversions needed for the force computation. This causes reduced iteration
Algorithm 3 MD update
  H = H′ = 0
  for i = 0 to N_monomials do
    H += monomial[i] → heat-bath-function
  end for
  integrate(N_ts, τ)
  for i = 0 to N_monomials do
    H′ += monomial[i] → acceptance-function
  end for
  accept with probability min{1, exp(−ΔH)}

numbers needed for the integration of one trajectory. The relevant input parameter is ForcePrecision, available for each monomial. The precision needed in the acceptance and/or heatbath step can be adjusted separately using AcceptancePrecision. It is advisable to have the acceptance precision always close to machine precision.

The second possibility for influencing the Dirac operator is given by the reduced precision Dirac operator described in sub-section 4.1, which is switched on with the UseSloppyPrecision input parameter. The two possibilities can also be used in parallel. Note that one should always test for reversibility violations as explained in sub-section 4.3.
The idea of the chronological solver method, or chronological solver guess (CSG) (or similar methods [40]), is to optimise the initial guess for the solution used in the solver. To this end the history of the last N_CSG solutions of the equation M χ = φ is saved, and a linear combination of the fields χ_i with coefficients c_i is used as an initial guess for the next inversion. M stands for the operator to be inverted and has to be replaced by the different ratios of operators used in this paper. The coefficients c_i are determined by solving
\[
\sum_i \chi_j^\dagger M \chi_i\, c_i = \chi_j^\dagger \phi \qquad (25)
\]
with respect to the coefficients c_i. This is equivalent to minimising the functional that is minimised by the CG inverter itself.

The downside of this method is that the reversibility violations in the Hamiltonian increase significantly, by one or two orders of magnitude, when the CSG is switched on and all other parameters are kept fixed. Therefore one has to adjust the residua in the solvers, which increases the number of matrix vector multiplications again. Our experience is that the methods described in the previous sub-section are more effective, in particular in the context of multiple time scale integration, because the CSG is most effective for small values of Δτ.

The relevant input parameter is CSGHistory, available for the relevant monomials. Setting it to zero means no chronological solver; otherwise this parameter specifies the number N_CSG of last solutions to be saved.
The HMC program includes the possibility to perform a certain number of measurements after every trajectory online, whether or not the configuration is stored on disk. Some of those are performed per default, namely all that are written to the output file output.data:

(1) the plaquette expectation value, defined as
\[
\langle P \rangle = \frac{1}{6V} \sum_x \sum_{\substack{\mu,\nu=1\\ 1\le\mu<\nu}}^{4} \tfrac{1}{3}\,{\rm Re\,Tr}(U^{1\times1}_{x,\mu,\nu})\,,
\]
where V is the global lattice volume;

(2) the rectangle expectation value, defined as
\[
\langle R \rangle = \frac{1}{12V} \sum_x \sum_{\substack{\mu,\nu=1\\ \mu\neq\nu}}^{4} \tfrac{1}{3}\,{\rm Re\,Tr}(U^{1\times2}_{x,\mu,\nu})\,;
\]

(3) ΔH = H′ − H and exp(−ΔH).

See the overview section for details about the output.data file. These observables all come with no extra computational cost.

Optionally, other online measurements can be performed, which – however – in general need extra inversions of the Dirac operator. First of all, the computation of certain correlation functions is implemented. They need one extra inversion of the Dirac operator, as discussed in Ref. [41], using the one-end-trick. Define a stochastic source ξ as follows:
\[
\lim_{R\to\infty} [\xi_i^* \xi_j] = \delta_{ij}\,,\qquad \lim_{R\to\infty} [\xi_i \xi_j] = 0\,. \qquad (26)
\]
Here R labels the number of samples and i all other degrees of freedom. Then
\[
[\phi_i^{r*} \phi_j^r]_R = M^{-1*}_{ik}\, M^{-1}_{jk} + {\rm noise}\,, \qquad (27)
\]
if φ was computed from φ^r_j = M^{-1}_{jk} ξ^r_k. Having in mind the γ₅-hermiticity property of the Wilson and Wilson twisted mass Dirac propagator G_{u,d}, i.e.
\[
G_u(x,y) = \gamma_5\, G_d(y,x)^\dagger\, \gamma_5\,,
\]
it is clear that eq. (27) can be used to evaluate
\[
C_\pi(t) = \langle {\rm Tr}[G_u(0,t)\,\gamma_5\, G_d(t,0)\,\gamma_5] \rangle
 = \langle {\rm Tr}[G_u(0,t)\, G_u(0,t)^\dagger] \rangle
\]
with only one inversion. But, even if one gamma structure is fixed to be γ₅ due to the γ₅-hermiticity trick, we are still free to insert any γ-structure Γ at the other end, i.e. we can evaluate any correlation function of the form
\[
C_{P\Gamma}(t) = \langle {\rm Tr}[G_u(0,t)\,\gamma_5\, G_d(t,0)\,\Gamma] \rangle
 = \langle {\rm Tr}[G_u(0,t)\, G_u(0,t)^\dagger\,\gamma_5\,\Gamma] \rangle\,.
\]
Useful combinations of correlation functions are ⟨PP⟩, ⟨PA⟩ and ⟨PV⟩, with
\[
P^\alpha = \bar\chi\,\gamma_5\,\frac{\tau^\alpha}{2}\,\chi\,,\qquad
V^\alpha_\mu = \bar\chi\,\gamma_\mu\,\frac{\tau^\alpha}{2}\,\chi\,,\qquad
A^\alpha_\mu = \bar\chi\,\gamma_5\gamma_\mu\,\frac{\tau^\alpha}{2}\,\chi\,.
\]
From ⟨PP⟩ one can extract the pseudo scalar mass and – in the twisted mass case – the pseudo scalar decay constant. ⟨PA⟩ can be used together with ⟨PP⟩ to extract the so called PCAC quark mass, and ⟨PV⟩ to measure the renormalisation constant Z_V. For details we refer the reader to Ref. [41]. These online measurements are controlled with the two following input parameters: PerformOnlineMeasurements to switch them on or off, and, to specify the frequency,
OnlineMeasurementsFreq. The three correlation functions are saved in files named onlinemeas.n, where n is the trajectory number. Every file contains five columns, specifying the type, the operator type and the Euclidean time t. The last two columns are the values of the correlation function itself, C(t) and C(−t), respectively. The type is equal to 1, 2 or 6 for the ⟨PP⟩, the ⟨PA⟩ and the ⟨PV⟩ correlation functions. The operator type is for online measurements always equal to 1 for local source and sink (no smearing of any kind), and the time runs from 0 to T/2. Hence, C(−t) = C(T − t). C(−0) and C(−T/2) are set to zero for convenience.

In addition to correlation functions, also the minimal and the maximal eigenvalues of the hermitian operator (γ₅D) can be measured.

An online measurement not related to physics, but to the algorithm, are checks of reversibility violations. The HMC algorithm is exact if and only if the integration scheme is reversible, and on a computer with finite precision this is only guaranteed up to machine precision. These violations can be estimated by integrating one trajectory forward and then backward in Monte Carlo time. The difference δΔH between the original Hamiltonian H and the final one H″ after integrating back can serve as one measure for those violations; another one is provided by the difference between the original gauge field U and the final one U″,
\[
\delta\Delta U = \frac{1}{12 V} \sum_{x,\mu} \sum_{i,j} \bigl| (U_{x,\mu} - U''_{x,\mu})_{i,j} \bigr|\,,
\]
where we indicate with δΔ that this quantity is obtained after integrating a trajectory forward and backward in time. The results for δΔH and δΔU are stored in the file return_check.data. The relevant input parameters are ReversibilityCheck and
ReversibilityCheckInterval.

There are several iterative solvers implemented in the tmLQCD package for solving
\[
D\,\chi = \phi
\]
for χ: the minimal residual (MR), the conjugate gradient (CG), the conjugate gradient squared (CGS), the generalised minimal residual (GMRES), the generalised conjugate residual (GCR) and the stabilised bi-conjugate gradient (BiCGstab). For details regarding these algorithms we refer to Refs. [33,42]. For the hmc_tm executable only the CG and the BiCGstab solvers are available, while all the others can be used in the invert executables. Most of them are available both with and without even/odd preconditioning. For a performance comparison we refer to Refs. [43,36].

The stopping criterion is implemented in two ways: the first is an absolute stopping criterion, i.e. the solver is stopped when the squared norm of the residual vector (depending on the solver this might be the iterated or the real residual) fulfils ∥r∥² < ε. The second is relative to the source vector, i.e. ∥r∥²/∥φ∥² < ε. The value of ε and the choice of relative or absolute precision can be influenced via input parameters.

The reduced precision Dirac operator, as discussed in sub-section 4.1, is available for the CG solver. In the CG solver the full precision Dirac operator is only required at the beginning of the CG search, because the relative size of the contribution to the resulting vector decreases with the number of iterations. Thus, as soon as a certain precision is achieved in the CG algorithm, we can switch to the reduced precision Dirac operator without spoiling the precision of the final result. We switch to the lower precision operator at a precision of √ε in the CG search, when aiming for a final precision of ε.

The method implemented for computing the low lying eigenvalues of the hermitian operator (γ₅D) is the so called Jacobi-Davidson method [44,45]. For a discussion of the application of this algorithm to lattice QCD we refer again to Refs. [43,36]. All solver related files can be found in the sub-directory solver. Note that there are a few more solvers implemented which are, however, in an experimental state.
Smearing techniques have become an important tool to reduce ultraviolet fluctuations in the gauge fields. One of those techniques, coming with the advantage of being usable in the MD update, is usually called stout smearing [46]. The (n+1)th level of stout smeared gauge links is obtained iteratively from the nth level by
\[
U^{(n+1)}_\mu(x) = e^{i Q^{(n)}_\mu(x)}\, U^{(n)}_\mu(x)\,.
\]
We refer to the unsmeared ("thin") gauge field as U_μ ≡ U^{(0)}_μ. The matrices Q_μ are defined via the staples C_μ:
\[
Q^{(n)}_\mu(x) = \frac{i}{2}\Bigl[ U^{(n)}_\mu(x)\, C^{(n)\dagger}_\mu(x) - {\rm h.c.} \Bigr]
 - \frac{i}{6}\,{\rm Tr}\Bigl[ U^{(n)}_\mu(x)\, C^{(n)\dagger}_\mu(x) - {\rm h.c.} \Bigr]\,,
\]
\[
C^{(n)}_\mu(x) = \sum_{\nu\neq\mu} \rho_{\mu\nu}\Bigl( U^{(n)}_\nu(x)\, U^{(n)}_\mu(x+\hat\nu)\, U^{(n)\dagger}_\nu(x+\hat\mu)
 + U^{(n)\dagger}_\nu(x-\hat\nu)\, U^{(n)}_\mu(x-\hat\nu)\, U^{(n)}_\nu(x-\hat\nu+\hat\mu) \Bigr)\,,
\]
where in general ρ_{μν} is the smearing matrix. In the tmLQCD package we have only implemented isotropic 4-dimensional smearing, i.e. ρ_{μν} = ρ.

Currently stout smearing is only implemented for the invert executables, i.e. the gauge field can be stout smeared at the beginning of an inversion. The input parameters are UseStoutSmearing, StoutRho and
StoutNoIterations.

The random number generator used in the code is the one proposed by Martin Lüscher and usually known under the name RANLUX [47]. Single and double precision implementations were made available by the author under the GNU General Public License and can be downloaded [48]. For convenience it is also included in the tmLQCD package.
In this final subsection we specify the IO formats used to store gauge config-urations, propagators and sources to disk.
For gauge configurations we use the International Lattice Data Grid (ILDG)standard as specified in Ref. [49,50]. As the lime packaging library [51] andILDG standard allow additional – not required – records to be stored withinthe file, we currently add the following two records for convenience:(1) xlf-info : useful information about the gauge configuration, such as theplaquette value, and about the run and the algorithm and the code versionused to generate it.(2) scidac-checksum : SCIDAC checksum of the gauge configuration. For thespecification see [52].The gauge configurations can be written to disk either in single or doubleprecision. The relevant input parameter is
GaugeConfigWritePrecision. On read-in the precision is determined automatically. Note that the gauge configuration does not depend on the particular choice of the γ-matrices.

We note at the beginning that we do not use different IO formats for source and sink fermion fields; they are both stored using the same lime records. The meta-data stored in the same lime-packed file is supposed to clarify all other things. It is also important to realise that the propagator depends on the γ-matrix convention used in the Dirac operator. For our convention see appendix A.

Here we mainly concentrate on storing propagators (sink). The file can contain only sources, or both source and sink. We (plan to) support four different formats:
(1) (arbitrary number of) sink, no sources
(2) (arbitrary number of) source/sink pairs
(3) one source, 12 sinks
(4) one source, 4 sinks
This is very similar to the formats in use in parts of the US lattice community. We adopt the SCIDAC checksum [52] for the binary data. Source and sink binary data have to be in separate lime records. The order in one file for the four formats mentioned above is supposed to be:
(1) sink, no sources: -
(2) source/sink pairs: first source, then sink
(3) one source, 12 sinks: first source, then 12 sinks
(4) one source, 4 sinks: first source, then 4 sinks
All fermion field files must have a record indicating their type. The record itself is of type propagator-type and has a single entry (ASCII string) which contains one of
• DiracFermion_Sink
• DiracFermion_Source_Sink_Pairs
• DiracFermion_ScalarSource_TwelveSink
• DiracFermion_ScalarSource_FourSink
Those strings are also used in the input files for the input parameter
PropagatorType .The binary data corresponding to one Dirac fermion field (source or sink) isthen stored with at least two (three) records. The first is of type etmc-propagator-format and contains the following information:
This type is stored in a record called source-type in the lime file. Theremight be several sources stored within the same file. We add a format record etmc-source-format looking like
In order to compile the programmes, the LAPACK [53] library (Fortran version) needs to be installed. In addition it must be known which linker options are needed to link against LAPACK, e.g. -Lpath-to-lapack -llapack -lblas. Also a recent version (tested is version 1.2.3) of C-LIME [51] must be available, which is used as a packaging scheme to read and write gauge configurations and propagators to files.
In order to get a simple configuration of the hmc package it is enough to just type

path-to-src-code/configure --with-lime=<dir>

where <dir> points to the C-LIME installation. If CC, F77 and
CFLAGS are not specified, configure will guess them.

The code was successfully compiled and run at least on the following platforms: i686 and compatible, x64 and compatible, IBM Regatta systems, IBM Blue Gene/L, IBM Blue Gene/P, SGI Altix and SGI PC clusters and powerpc clusters.

The configure script accepts certain options to influence the building procedure. One can get an overview over all supported options with configure --help. There are enable|disable options switching optional features on and off, and with|without switches usually related to optional packages. In the following we describe the most important of them (check configure --help for the defaults and more options):

• --enable-mpi: This option switches on the support for MPI. On certain platforms it automatically chooses the correct parallel compiler or searches for a command mpicc in the search path.
• --enable-p4: Enable the use of the special Pentium4 instruction set and cache management.
• --enable-opteron: Enable the use of the special Opteron instruction set and cache management.
• --enable-sse2: Enable the use of the SSE2 instruction set. This is a huge improvement on Pentium4 and equivalent systems.
• --enable-sse3: Enable the use of the SSE3 instruction set. This will give another 20% of speedup compared to only SSE2. However, only a few processors are capable of SSE3.
• --enable-gaugecopy: See section 4.1 for details on this option. It will increase the memory requirement of the code.
• --enable-halfspinor: If this option is enabled, the Dirac operator using half spinor fields is used, see sub-section 4.1 for details. If this feature is switched on, the gauge copy feature is switched on automatically as well.
• --with-mpidimension=n: This option only has an effect if the code is configured for MPI usage. The number of parallel directions can be specified; 1, 2, 3 and 4 dimensional parallelisations are supported.
• --with-lapack="
<linker flags>": Specify the linker flags needed to link against LAPACK (see above).

Table 1: hopping parameter κ and twisted mass parameters (κμ_q, κμ̄, κε̄) together with the expected average plaquette ⟨P⟩ and rectangle ⟨R⟩ values; the numerical values are also quoted in the sample input files.

The configure script will guess at the very beginning on which platform the build is done. In case this fails, or a cross compilation must be performed, please use the option --host=HOST. For instance, in order to compile for the BG/P one needs to specify --host=ppc-ibm-bprts --build=ppc64-ibm-linux. For certain architectures like the Blue Gene systems there are
README.arch files in the top source directory with example configure calls.
After successfully configuring the package, the code can be built by simply typing make in the build directory. This will compile the standard executables. Typing make install will copy these executables into the install directory. The default install directory is $HOME/bin, which can be changed e.g. with the --prefix option to configure.

The source code ships with a number of sample input files. They are located in the sample-input sub-directory. They are small volume V = 4⁴ test runs designed to measure for instance the average plaquette values. Such a test-run can be performed, for instance on a scalar machine, by typing

./hmc_tm -f sample-hmc0.input

Depending on the environment you are running in, you may need to adjust the input parameters to match the maximal run-time and so on. The expected average plaquette values are quoted in table 1 and also in the sample input files.

Another useful test executable is a benchmark code. It can be built by typing make benchmark and it will, when run, measure the performance of the Dirac operator. It can be run in the serial or parallel case. It reads its input from a file benchmark.input, and the relevant input parameters are the following:
L = 4
T = 4
NrXProcs = 2
NrYProcs = 2
NrZProcs = 2
UseEvenOdd = yes
UseSloppyPrecision = no
In case of even/odd preconditioning the performance of the hopping matrix is evaluated; without even/odd, the performance of the full Dirac operator. The important part of the output of the code is as follows:

[...]
(1429 Mflops [64 bit arithmetic])
communication switched off
(2592 Mflops [64 bit arithmetic])
The size of the package is 36864 Byte
The bandwidth is 662.91 + 662.91 MB/sec
The bandwidth is not measured directly, but computed from the performance difference between the runs with and without communication and from the package size. In case of a serial run the output is obviously reduced.
Acknowledgements
This code was started based on an HMC implementation for Wilson fermions with mass preconditioning kindly provided by Martin Hasenbusch. Many discussions with, and contributions by, Rémi Baron, Thomas Chiarappa, Albert Deuzeman, Roberto Frezzotti, Martin Hasenbusch, Gregorio Herdoiza, Marina Marinkovic, Craig McNeile, Istvan Montvay, Andreas Nube, David Palao, Siebren Reker, Andrea Shindler, Jan Volkholz and Urs Wenger are gratefully acknowledged. We thank Peter Boyle for useful discussions on the efficient implementation of the Dirac operator.
References [1]
CP-PACS and JLQCD
Collaboration, A. Ukawa, Nucl. Phys. Proc. Suppl. , 195 (2002).[2] C. Urbach, K. Jansen, A. Shindler and U. Wenger, Comput. Phys. Commun. , 87 (2006), hep-lat/0506011 .[3] M. L¨uscher, Comput. Phys. Commun. , 199 (2005), hep-lat/0409106 .[4] M. A. Clark and A. D. Kennedy, Phys. Rev. Lett. , 051601 (2007), arXiv:hep-lat/0608015 .[5] M. A. Clark and A. D. Kennedy, Phys. Rev. D75 , 011502 (2007), arXiv:hep-lat/0610047 .[6] A. Ali Khan et al. , Nucl. Phys. Proc. Suppl. , 853 (2004), hep-lat/0309078 .[7]
QCDSF
Collaboration, A. Ali Khan et al. , Phys. Lett.
B564 , 235 (2003), hep-lat/0303026 .[8]
TrinLat
Collaboration, W. Kamleh and M. J. Peardon, PoS
LAT2005 , 106(2006).[9] M. L¨uscher, JHEP , 011 (2007), arXiv:0710.5417 [hep-lat] .[10] .[11] C.-N. Yang and R. L. Mills, Phys. Rev. , 191 (1954).[12] K. G. Wilson, Phys. Rev. D10 , 2445 (1974).[13]
ALPHA
Collaboration, R. Frezzotti, P. A. Grassi, S. Sint and P. Weisz, JHEP , 058 (2001), hep-lat/0101001 .[14] R. Frezzotti and G. C. Rossi, Nucl. Phys. Proc. Suppl. , 193 (2004), hep-lat/0311008 .
[15] T. Chiarappa et al., Eur. Phys. J.
C50 , 373 (2007), arXiv:hep-lat/0606011 .[16] S. Duane, A. D. Kennedy, B. J. Pendleton and D. Roweth, Phys. Lett.
B195 ,216 (1987).[17] M. Hasenbusch, Phys. Lett.
B519 , 177 (2001), hep-lat/0107019 .[18] M. Hasenbusch and K. Jansen, Nucl. Phys.
B659 , 299 (2003), hep-lat/0211042 .[19] M. Hasenbusch, Nucl. Phys. Proc. Suppl. , 27 (2004), arXiv:hep-lat/0310029 .[20]
ALPHA
Collaboration, M. Della Morte et al. , Comput. Phys. Commun. ,62 (2003), hep-lat/0307008 .[21]
TrinLat
Collaboration, M. J. Peardon and J. Sexton, Nucl. Phys. Proc. Suppl. , 985 (2003), hep-lat/0209037 .[22] T. Takaishi and P. de Forcrand, hep-lat/0505020 .[23] J. C. Sexton and D. H. Weingarten, Nucl. Phys.
B380 , 665 (1992).[24] R. Frezzotti and K. Jansen, Phys. Lett.
B402 , 328 (1997), hep-lat/9702016 .[25] R. Frezzotti and K. Jansen, Nucl. Phys.
B555 , 395 (1999), hep-lat/9808011 .[26] R. Frezzotti and K. Jansen, Nucl. Phys.
B555 , 432 (1999), hep-lat/9808038 .[27] M. L¨uscher, Nucl. Phys.
B418 , 637 (1994), arXiv:hep-lat/9311007 .[28] B. Bunk, S. Elser, R. Frezzotti and K. Jansen, Comput. Phys. Commun. ,95 (1999), arXiv:hep-lat/9805026 .[29] W. Press, S. Teukolsky, W. Vetterling and B. Flannery,
Numerical Recipes inC , 2nd ed. (Cambridge University Press, Cambridge, UK, 1992).[30] T. A. DeGrand and P. Rossi, Comput. Phys. Commun. , 211 (1990).[31] K. Jansen and C. Liu, Comput. Phys. Commun. , 221 (1997), hep-lat/9603008 .[32] C. T. Sachrajda and G. Villadoro, Phys. Lett. B609 , 73 (2005), arXiv:hep-lat/0411033 .[33] Y. Saad,
Iterative Methods for sparse linear systems , 2nd ed. (SIAM, 2003).[34] P. Weisz, Nucl. Phys.
B212 , 1 (1983).[35] .[36] T. Chiarappa et al. , Comput. Sci. Disc. , 015001 (2008), arXiv:hep-lat/0609023 .[37] M. L¨uscher, Nucl. Phys. Proc. Suppl. , 21 (2002), arXiv:hep-lat/0110007 .
[38] M. Lüscher, http://luscher.web.cern.ch/luscher/QCDpbm/. [39]
SciDAC
Collaboration, R. G. Edwards and B. Joo, Nucl. Phys. Proc. Suppl. , 832 (2005), hep-lat/0409003 .[40] R. C. Brower, A. R. Levi and K. Orginos, Nucl. Phys. Proc. Suppl. , 855(1995), hep-lat/9412004 .[41] ETM
Collaboration, P. Boucaud et al. , Comput. Phys. Commun. , 695(2008), arXiv:0803.0224 [hep-lat] .[42] A. Meister,
Numerik linearer Gleichungssysteme (vieweg, 1999).[43] T. Chiarappa et al. , Nucl. Phys. Proc. Suppl. , 853 (2005), arXiv:hep-lat/0409107 .[44] G. L. G. Sleijpen and H. A. V. der Vorst, SIAM Journal on Matrix Analysisand Applications , 401 (1996).[45] R. Geus, The Jacobi-Davidson algorithm for solving large sparse symmetriceigenvalue problems with application to the design of accelerator cavities , PhDthesis, Swiss Federal Institute Of Technology Z¨urich, 2002.[46] C. Morningstar and M. J. Peardon, Phys. Rev.
D69, 054501 (2004), arXiv:hep-lat/0311018.
[47] M. Lüscher, Comput. Phys. Commun., 100 (1994), arXiv:hep-lat/9309020.
[48] M. Lüscher, http://luscher.web.cern.ch/luscher/ranlux/.
[49] http://cssm.sasr.edu.au/ildg/.
[50] T. Yoshie, PoS LATTICE2008, 019 (2008), arXiv:0812.0849 [hep-lat].
[51] USQCD, http://usqcd.jlab.org/usqcd-docs/c-lime/.
[52] .
[53] .
[54] .

A γ and Pauli Matrices

In the following we specify our conventions for γ- and Pauli-matrices.

A.1 γ-matrices

We use the following convention for the Dirac γ-matrices:
\[
\gamma_0 = \begin{pmatrix} 0 & 0 & -1 & 0\\ 0 & 0 & 0 & -1\\ -1 & 0 & 0 & 0\\ 0 & -1 & 0 & 0 \end{pmatrix},\qquad
\gamma_1 = \begin{pmatrix} 0 & 0 & 0 & -i\\ 0 & 0 & -i & 0\\ 0 & +i & 0 & 0\\ +i & 0 & 0 & 0 \end{pmatrix},
\]
\[
\gamma_2 = \begin{pmatrix} 0 & 0 & 0 & -1\\ 0 & 0 & +1 & 0\\ 0 & +1 & 0 & 0\\ -1 & 0 & 0 & 0 \end{pmatrix},\qquad
\gamma_3 = \begin{pmatrix} 0 & 0 & -i & 0\\ 0 & 0 & 0 & +i\\ +i & 0 & 0 & 0\\ 0 & -i & 0 & 0 \end{pmatrix}.
\]
In this representation γ₅ is diagonal and reads
\[
\gamma_5 = \begin{pmatrix} +1 & 0 & 0 & 0\\ 0 & +1 & 0 & 0\\ 0 & 0 & -1 & 0\\ 0 & 0 & 0 & -1 \end{pmatrix}.
\]

A.2 Pauli-matrices
For the Pauli-matrices acting in flavour space we use the following convention:
\[
1_f = \begin{pmatrix} 1&0\\ 0&1 \end{pmatrix},\qquad
\tau^1 = \begin{pmatrix} 0&1\\ 1&0 \end{pmatrix},\qquad
\tau^2 = \begin{pmatrix} 0&-i\\ i&0 \end{pmatrix},\qquad
\tau^3 = \begin{pmatrix} 1&0\\ 0&-1 \end{pmatrix}.
\]

B Even/Odd Preconditioning
B.1 HMC Update
In this section we describe how even/odd preconditioning [30,31] can be used in the HMC algorithm in the presence of a twisted mass term. Even/odd preconditioning is implemented in the tmLQCD package both in the HMC algorithm and in the inversion of the Dirac operator, and can be used optionally.

We start with the lattice fermion action in the hopping parameter representation in the χ-basis, written as
\[
S[\chi,\bar\chi,U] = \sum_x \Big\{ \bar\chi(x)\big[1 + 2i\kappa\mu\gamma^5\tau^3\big]\chi(x)
 - \kappa\,\bar\chi(x)\sum_{\mu=1}^{4}\Big[ U(x,\mu)(r+\gamma_\mu)\chi(x+a\hat\mu)
 + U^\dagger(x-a\hat\mu,\mu)(r-\gamma_\mu)\chi(x-a\hat\mu)\Big]\Big\}
 \equiv \sum_{x,y} \bar\chi(x)\, M_{xy}\, \chi(y)
\tag{B.1}
\]
similar to eq. (4). For convenience we define $\tilde\mu = 2\kappa\mu$. Using the matrix $M$ one can define the hermitian (two flavour) operator
\[
Q \equiv \gamma^5 M = \begin{pmatrix} Q^+ & 0 \\ 0 & Q^- \end{pmatrix},
\tag{B.2}
\]
where the sub-matrices $Q^\pm$ can be factorised as follows (Schur decomposition):
\[
Q^\pm = \gamma^5 \begin{pmatrix} 1 \pm i\tilde\mu\gamma^5 & M_{eo} \\ M_{oe} & 1 \pm i\tilde\mu\gamma^5 \end{pmatrix}
      = \gamma^5 \begin{pmatrix} M^\pm_{ee} & M_{eo} \\ M_{oe} & M^\pm_{oo} \end{pmatrix}
      = \begin{pmatrix} \gamma^5 M^\pm_{ee} & 0 \\ \gamma^5 M_{oe} & 1 \end{pmatrix}
        \begin{pmatrix} 1 & (M^\pm_{ee})^{-1} M_{eo} \\ 0 & \gamma^5\big(M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo}\big) \end{pmatrix}.
\tag{B.3}
\]
Note that $(M^\pm_{ee})^{-1}$ can be computed to be
\[
(1 \pm i\tilde\mu\gamma^5)^{-1} = \frac{1 \mp i\tilde\mu\gamma^5}{1+\tilde\mu^2}.
\tag{B.4}
\]
Using $\det(Q) = \det(Q^+)\det(Q^-)$ the following relation can be derived:
\[
\det(Q^\pm) \propto \det(\hat Q^\pm),\qquad
\hat Q^\pm = \gamma^5\big(M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo}\big),
\tag{B.5}
\]
where $\hat Q^\pm$ is only defined on the odd sites of the lattice. In the HMC algorithm the determinant is stochastically estimated using pseudo fermion fields $\phi_o$:
\[
\det(\hat Q^+ \hat Q^-) = \int \mathcal{D}\phi_o\, \mathcal{D}\phi^\dagger_o\; \exp(-S_{\mathrm{PF}}),\qquad
S_{\mathrm{PF}} \equiv \phi^\dagger_o\, \big(\hat Q^+ \hat Q^-\big)^{-1} \phi_o,
\tag{B.6}
\]
where the fields $\phi_o$ are defined only on the odd sites of the lattice.
In order to compute the force corresponding to the effective action $S_{\mathrm{PF}}$ we need the variation of $S_{\mathrm{PF}}$ with respect to the gauge fields (using $\delta(A^{-1}) = -A^{-1}\,\delta A\, A^{-1}$):
\[
\begin{aligned}
\delta S_{\mathrm{PF}} &= -\big[ \phi^\dagger_o (\hat Q^+\hat Q^-)^{-1}\, \delta\hat Q^+\, (\hat Q^+)^{-1} \phi_o
 + \phi^\dagger_o (\hat Q^-)^{-1}\, \delta\hat Q^-\, (\hat Q^+\hat Q^-)^{-1} \phi_o \big] \\
 &= -\big[ X^\dagger_o\, \delta\hat Q^+\, Y_o + Y^\dagger_o\, \delta\hat Q^-\, X_o \big]
\end{aligned}
\tag{B.7}
\]
with $X_o$ and $Y_o$ defined on the odd sites as
\[
X_o = (\hat Q^+\hat Q^-)^{-1}\phi_o,\qquad Y_o = (\hat Q^+)^{-1}\phi_o = \hat Q^- X_o,
\tag{B.8}
\]
where $(\hat Q^\pm)^\dagger = \hat Q^\mp$ has been used. The variation of $\hat Q^\pm$ reads
\[
\delta\hat Q^\pm = \gamma^5\big( -\delta M_{oe}\,(M^\pm_{ee})^{-1} M_{eo} - M_{oe}\,(M^\pm_{ee})^{-1}\,\delta M_{eo} \big),
\tag{B.9}
\]
and one finds
\[
\delta S_{\mathrm{PF}} = -\big( X^\dagger\, \delta Q^+\, Y + Y^\dagger\, \delta Q^-\, X \big)
 = -\big( X^\dagger\, \delta Q^+\, Y + (X^\dagger\, \delta Q^+\, Y)^\dagger \big),
\tag{B.10}
\]
where $X$ and $Y$ are now defined over the full lattice as
\[
X = \begin{pmatrix} -(M^-_{ee})^{-1} M_{eo} X_o \\ X_o \end{pmatrix},\qquad
Y = \begin{pmatrix} -(M^+_{ee})^{-1} M_{eo} Y_o \\ Y_o \end{pmatrix}.
\tag{B.11}
\]
In addition, $\delta Q^+ = \delta Q^-$, $M^\dagger_{eo} = \gamma^5 M_{oe}\gamma^5$ and $M^\dagger_{oe} = \gamma^5 M_{eo}\gamma^5$ have been used. Since the bosonic part is quadratic in the $\phi_o$ fields, the $\phi_o$ are generated at the beginning of each molecular dynamics trajectory with
\[
\phi_o = \hat Q^+ r_o,
\tag{B.12}
\]
where $r_o$ is a random spinor field taken from a Gaussian distribution with norm one.

B.1.1 Mass non-degenerate flavour doublet

Even/odd preconditioning can also be implemented for the mass non-degenerate flavour doublet Dirac operator $D_h$, eq. (5). Denoting $Q^h = \gamma^5 D_h$, the even/odd decomposition is as follows:
\[
Q^h = \begin{pmatrix} Q^h_{ee} & Q^h_{eo} \\ Q^h_{oe} & Q^h_{oo} \end{pmatrix}
    = \begin{pmatrix} Q^h_{ee} & 0 \\ Q^h_{oe} & 1 \end{pmatrix}
      \begin{pmatrix} 1 & (Q^h_{ee})^{-1} Q^h_{eo} \\ 0 & \hat Q^h_{oo} \end{pmatrix},\qquad
Q^h_{ee} = Q^h_{oo} = \gamma^5\big(1 + i\bar\mu\gamma^5\tau^3 - \bar\epsilon\tau^1\big),
\tag{B.13}
\]
where $\hat Q^h_{oo}$ is given in flavour space by
\[
\hat Q^h_{oo} = \gamma^5
\begin{pmatrix}
 1 + i\bar\mu\gamma^5 - M_{oe}\dfrac{1 - i\bar\mu\gamma^5}{1+\bar\mu^2-\bar\epsilon^2} M_{eo} &
 -\bar\epsilon\Big(1 + \dfrac{M_{oe}M_{eo}}{1+\bar\mu^2-\bar\epsilon^2}\Big) \\[2mm]
 -\bar\epsilon\Big(1 + \dfrac{M_{oe}M_{eo}}{1+\bar\mu^2-\bar\epsilon^2}\Big) &
 1 - i\bar\mu\gamma^5 - M_{oe}\dfrac{1 + i\bar\mu\gamma^5}{1+\bar\mu^2-\bar\epsilon^2} M_{eo}
\end{pmatrix}
\]
with the previous definitions of $M_{eo}$ etc. The implementation for the PHMC is very similar to the mass degenerate HMC case.

B.2 Inversion
In addition to even/odd preconditioning in the HMC algorithm as described above, it can also be used to speed up the inversion of the fermion matrix. Due to the factorisation (B.3) the full fermion matrix can be inverted by inverting the two matrices appearing in the factorisation:
\[
\begin{pmatrix} M^\pm_{ee} & M_{eo} \\ M_{oe} & M^\pm_{oo} \end{pmatrix}^{-1}
 = \begin{pmatrix} 1 & (M^\pm_{ee})^{-1} M_{eo} \\ 0 & M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo} \end{pmatrix}^{-1}
   \begin{pmatrix} M^\pm_{ee} & 0 \\ M_{oe} & 1 \end{pmatrix}^{-1}.
\]
The two factors can be simplified as follows:
\[
\begin{pmatrix} M^\pm_{ee} & 0 \\ M_{oe} & 1 \end{pmatrix}^{-1}
 = \begin{pmatrix} (M^\pm_{ee})^{-1} & 0 \\ -M_{oe}(M^\pm_{ee})^{-1} & 1 \end{pmatrix},
\]
\[
\begin{pmatrix} 1 & (M^\pm_{ee})^{-1} M_{eo} \\ 0 & M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo} \end{pmatrix}^{-1}
 = \begin{pmatrix} 1 & -(M^\pm_{ee})^{-1}M_{eo}\big(M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo}\big)^{-1} \\ 0 & \big(M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo}\big)^{-1} \end{pmatrix}.
\]
The complete inversion is now performed in two separate steps: first compute, for a given source field $\phi = (\phi_e, \phi_o)$, an intermediate result $\varphi = (\varphi_e, \varphi_o)$ by
\[
\begin{pmatrix} \varphi_e \\ \varphi_o \end{pmatrix}
 = \begin{pmatrix} M^\pm_{ee} & 0 \\ M_{oe} & 1 \end{pmatrix}^{-1}
   \begin{pmatrix} \phi_e \\ \phi_o \end{pmatrix}
 = \begin{pmatrix} (M^\pm_{ee})^{-1}\phi_e \\ -M_{oe}(M^\pm_{ee})^{-1}\phi_e + \phi_o \end{pmatrix}.
\]
This step requires only the application of $M_{oe}$ and $(M^\pm_{ee})^{-1}$, the latter of which is given by eq. (B.4). The final solution $\psi = (\psi_e, \psi_o)$ can then be computed with
\[
\begin{pmatrix} \psi_e \\ \psi_o \end{pmatrix}
 = \begin{pmatrix} 1 & (M^\pm_{ee})^{-1} M_{eo} \\ 0 & M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo} \end{pmatrix}^{-1}
   \begin{pmatrix} \varphi_e \\ \varphi_o \end{pmatrix}
 = \begin{pmatrix} \varphi_e - (M^\pm_{ee})^{-1}M_{eo}\psi_o \\ \psi_o \end{pmatrix},
\]
where we defined
\[
\psi_o = \big(M^\pm_{oo} - M_{oe}(M^\pm_{ee})^{-1}M_{eo}\big)^{-1} \varphi_o.
\]
Therefore, the only inversion that has to be performed numerically is the one generating $\psi_o$ from $\varphi_o$, and this inversion involves only an operator that is better conditioned than the original fermion operator.

Even/odd preconditioning can also be used for the mass non-degenerate Dirac operator $D_h$, eq. (5). The corresponding equations follow immediately from the previous discussion and the definition in eq. (B.13).

C Initialising the PHMC
The function $1/\sqrt{s}$ in the interval $[\epsilon, 1]$ can be approximated using polynomials or rational functions of various sorts. In the tmLQCD package we use Chebysheff polynomials, which are easy to construct and can be built to provide a desired overall precision in the interval $[\epsilon, 1]$. The roots of the polynomial $P_{n,\epsilon}$ are needed for the evaluation of the force. Even though the roots come in complex conjugate pairs, in our case they cannot be computed analytically, so we need to determine them numerically. Such an evaluation usually requires high precision. This is why these roots need to be determined before a PHMC run using an external program, i.e. they cannot be computed at the beginning of a run in the hmc_tm program.

Such an external program ships with the tmLQCD code; it is located in the util/laguere directory. It is based on Laguerre's method and uses the Class Library for Numbers (CLN) [54], which provides arbitrary precision data types. In order to compute the roots, the CLN library, which is free software, must be available.

Taking for granted that the CLN library is available, the procedure for computing the roots is as follows. Assume the non-degenerate Dirac operator has eigenvalues in the interval $[\tilde s_{\min}, \tilde s_{\max}]$, i.e. $\epsilon = \tilde s_{\min}/\tilde s_{\max}$, and the polynomial degree is $n$. Edit the file chebyRoot.H and set the variable EPSILON to the value of $\epsilon$. Moreover, set the variable MAXPOW to the degree $n$. Adapt the Makefile to your local installation and compile the code by typing make. After running the
ChebyRoot program successfully, you should find two files in the directory:

(1) Square_root_BR_roots.dat, which contains the roots of the polynomial in bit-reverse order [24].

(2) normierungLocal.dat, which contains a normalisation constant.

Copy these two files into the directory where you run the code and adjust the input parameters to match exactly the values used for the root computation, i.e. the input parameters
StildeMin, StildeMax and
DegreeOfMDPolynomial must be set appropriately in the
NDPOLY monomial. The minimal and maximal eigenvalues of the non-degenerate flavour doublet can be computed as an online measurement. The frequency can be specified in the
NDPOLY monomial with the input parameter