Effective noise reduction techniques for disconnected loops in Lattice QCD
arXiv [hep-lat]

Gunnar S. Bali, Sara Collins, Andreas Schäfer
Institut für Theoretische Physik, Universität Regensburg, 93040 Regensburg, Germany
Abstract
Many Lattice QCD observables of phenomenological interest include so-called all-to-all propagators. The computation of these requires prohibitively large computational resources, unless they are estimated stochastically. This is usually done. However, the computational demand can often be further reduced by one order of magnitude by implementing sophisticated unbiased noise reduction techniques. We combine both well known and novel methods that can be applied to a wide range of problems. We concentrate on calculating disconnected contributions to nucleon structure functions, as one realistic benchmark example. In particular we determine the strangeness contribution to the nucleon, ⟨N|s̄s|N⟩, and to the spin of the nucleon, Δs.

Keywords:
Lattice QCD, stochastic estimates, deep inelastic scattering, nucleon structure
PACS:
1. Introduction
Advanced Lattice QCD calculations often require the evaluation of diagrams with disconnected quark lines. Important examples are properties of flavour singlet mesons [1], QCD spectroscopy including multiquark and scattering states [2], the determination of hadronic scattering lengths [3], and of isosinglet contributions to hadronic form factors and structure functions [4, 5]. Moreover, statistical errors of some standard observables like meson masses and electroweak decay constants can be reduced by averaging the sources over the lattice volume. In all these cases the standard point-to-all propagators that are obtained by calculating twelve (colour times spin) columns of the inverse lattice Dirac operator are not sufficient. Instead, all-to-all or timeslice-to-all propagators need to be computed.

Lattices typically contain a very large number of sites V. Inverting a lattice Dirac operator by conventional means would increase the effort, in terms of computer memory and operations spent, by this factor, relative to the cost of obtaining a single point-to-all propagator. An alternative approach to the problem consists of calculating an unbiased stochastic estimate of the propagator, replacing a factor 12V in effort by a number N of random sources, where in certain situations N can be as small as 10. Such estimation is permissible since, provided

Email addresses: [email protected] (Gunnar S. Bali), [email protected] (Sara Collins), [email protected] (Andreas Schäfer)
Preprint submitted to Computer Physics Communications, October 1, 2018

it is unbiased, it will only add to the statistical errors. These are present in any case, due to the path integral evaluation of expectation values of observables by means of a Markov chain Monte Carlo simulation (for an introduction to Lattice QCD simulations see e.g. [6]).

In general, the additional error introduced by this procedure will, for sufficiently large numbers of estimates N, decrease in proportion to 1/√N. Ideally, N should be chosen such that this additional error is smaller than the statistical error induced by the Monte Carlo time series. The optimal number will depend on the observable in question and on the prescription employed to calculate the stochastic estimate. In this article we discuss different improvement methods aimed at decreasing the prefactor of the asymptotic 1/√N behaviour. In particular, we will introduce a new such method which we call the truncated solver method (TSM). We then test combinations of different improved stochastic estimator methods in a realistic Lattice QCD simulation. As our benchmark examples we choose to calculate the strangeness contribution to the spin of the nucleon, Δs, as well as the scalar strangeness content ⟨N|s̄s|N⟩.

The spin of the nucleon can be factorized into a quark spin contribution ΔΣ, a quark angular momentum contribution L_q and a gluonic contribution (spin and angular momentum) ΔG:

  1/2 = (1/2) ΔΣ + L_q + ΔG .    (1)

In the naïve nonrelativistic SU(6) quark model, ΔΣ =
1, with vanishing angular momentum and gluon contributions. In this case sea quark contributions will be absent too and therefore there will be no strangeness contribution Δs in the factorization,

  ΔΣ = Δd + Δu + Δs + ⋯ ,    (2)

where in our notation Δq contains both the spin of the quarks q and of the antiquarks q̄. Experimentally, Δs is usually obtained by integrating the strangeness contribution to the spin structure function g₁ over the momentum fraction x. The integral over the range in which data exist (x ≥ 0.
02) yields [7] a finite value for Δs; extrapolations to x → 0 are model dependent [8, 9]. The standard Hermes analysis [10] yields a negative Δs at a renormalization scale µ in the MS scheme. Our results below suggest a negative Δs as well, subject to systematic effects in the chiral extrapolation from a pseudoscalar mass m_PS ≈
450 MeV to the physical point.

The scalar strangeness density is not directly accessible in experiment but plays a rôle in models of nuclear structure. It is also of phenomenological interest since, assuming that heavy flavours are strongly suppressed, the dominant coupling of the Higgs particle to the nucleon will be accompanied by this scalar matrix element.

The outline of this article is as follows: in Sec. 2 we introduce our notation and explain the stochastic estimator and noise reduction methods applied. In Sec. 3 we detail our lattice setup and employ these techniques to calculate disconnected quark loop contributions. Finally, in Sec. 4 we present our results on the axial and scalar nucleon strangeness matrix elements, before we conclude. The systematic tuning of the parameters used in the truncated solver method is described in detail in Appendix A. The TSM is quite generally applicable. This is also demonstrated in the appendix, where the conjugate gradient (CG) solver is replaced by the popular stabilized biconjugate gradient (BiCGstab2) solver [11]. Preliminary results appeared in [12] and [13].

2. Noise reduction techniques
As outlined above, we require propagators M⁻¹ connecting arbitrary pairs of lattice points, where M denotes a lattice Dirac operator. In our specific case we employ the Wilson quark action [14, 6]. Since the propagators have 12V × 12V components, where V denotes the number of lattice points, direct evaluation would be prohibitively expensive, both in terms of computer time and of memory. However, we encounter statistical errors anyway from the importance sampling of path integral expectation values. Hence it is sufficient to aim at an (unbiased) estimate, which can be obtained by stochastic methods [15]. For this purpose we introduce the following notation,

  Ā = Ā_N := (1/N) Σ_{j=1}^{N} A_j .    (3)

N denotes the number of "stochastic estimates". Let |η_i⟩, i = 1, …, N, be random vectors with the properties,

  |η⟩‾ = O(1/√N) ,    (4)

  |η⟩⟨η|‾ = 𝟙 + O(1/√N) .    (5)

These requirements are for instance met by complex Z₂ noise, where the 12V components are numbers e^{iφ}, with uncorrelated random phases φ ∈ {±π/4, ±3π/4}. In [16, 17, 18] it has been demonstrated that real and complex Z₂ noise reduces the variance, relative to other choices such as Gaussian or double hump noise. In our experience, similarly small variances can also be obtained with Z_N, U(1) or even with SU(3) noisy sources [19], indicating that the only important property is an equal modulus of the components. In the present study we employ complex Z₂ noise, where the components of the random vectors |η_i⟩ run over the spacetime volume, spin and colour.

We define the Hermitian lattice Dirac operator Q = γ₅M. If we solve the linear systems,

  Q|s_i⟩ = |η_i⟩ ,    (6)

for |s_i⟩, then we can substitute, see Eq. (5),

  Q⁻¹ = |s⟩⟨η|‾ + Q⁻¹ (𝟙 − |η⟩⟨η|‾) ≈ E(Q⁻¹) := |s⟩⟨η|‾ .    (7)

The difference between the approximation of Eq. (7) above and the exact result is unbiased and reduces like 1/√N. The sparse linear system of Eq.
(6) can for example be solved by means of the conjugate gradient (CG) or the stabilized biconjugate gradient (BiCGstab2) algorithms. Note that in the CG case we can actually use even/odd preconditioning by employing the operator M†M = Q². We then obtain |s_i⟩ by multiplying the result with Q.

The stochastic estimate approach reduces the problem from O(12V × V) to O(N × V), in terms of memory and computer time. The stochastic error will remain roughly constant if N is scaled like the square root of the physical lattice volume, √(Va⁴), where a denotes the lattice spacing. In order to limit the computational effort, N should not be chosen overly large. However, in the end the stochastic noise should not be the dominant source of statistical error. In general, the optimal balance between the stochastic sampling on single configurations and the Monte Carlo importance sampling of gauge configurations will depend on the observable in question and on the methods used.

Due to the difference between |s⟩⟨η|‾ and Q⁻¹ above, any fermionic observable A can only be estimated up to a stochastic error ΔA_stoch = O(1/√N) on a given configuration. We define the configuration average ⟨·⟩_c over n uncorrelated configurations and normalize this appropriately:

  σ_{A,stoch} := ⟨ΔA_stoch⟩_c / √n .    (8)

For large N and n this will scale like σ_{A,stoch} ∝ (Nn)^{−1/2}. We also define the total error σ_{A,tot} as the variation of the obtained estimates of A over gauge configurations. At fixed N this will be proportional to 1/√n. Obviously, this total error is limited by the stochastic error:

  σ_{A,tot} ≥ σ_{A,stoch} .
(9)

If σ_{A,stoch} ≃ σ_{A,tot} then it is worthwhile to improve the quality of the estimates, while if σ_{A,stoch} ≪ σ_{A,tot} then precision can only be gained by increasing the number of configurations n, possibly reducing N to save computer time, since the n^{−1/2} scaling cancels from the above inequality.

To minimize the stochastic noise at a given computational effort, we combine a multitude of techniques:

1. partitioning (also known as the spin explicit method or as dilution) [18, 21, 22],
2. the hopping parameter expansion (HPE) technique [23, 24, 25],
3. the truncated solver method (TSM) [12] and
4. the truncated eigenmode acceleration (TEA) [26, 25, 22].

These methods are explained below.

2.1. Partitioning

We decompose R = V ⊗ colour ⊗ spin into m subspaces R_j: R = ⊕_{j=1}^{m} R_j. One can then set the source vectors |η_i|_j⟩ to zero outside of the subspace R_j. We label the corresponding solutions as |s_i|_j⟩. The solution for the all-to-all propagator is then given by the sum,

  Q⁻¹ ≈ Σ_{j=1}^{m} |s|_j⟩⟨η|_j|‾ .    (10)

This procedure results in an m-fold increase in the total number of solver applications. The term 𝟙 − |η⟩⟨η|‾ multiplying Q⁻¹ in Eq. (7) only has off-diagonal entries, all of similar sizes. Therefore the off-diagonal entries of Q⁻¹ will determine the stochastic error of the particular observables in question. Q⁻¹ will exponentially decay with the spacetime distance between source and sink: spacetime components in the neighbourhood of the sink position will yield the leading contributions to the stochastic variance. Likewise, the noise for spin components that are strongly coupled to each other by Q⁻¹, multiplied by the relevant Γ and derivative structures, will dominate. Ideally, partitioning will black out the largest contributions to the stochastic noise. If the achieved reduction exceeds the 1/√m factor, then the computational overhead is justified.

For an exact calculation of A (σ_{A,stoch} = 0) a statistical error σ_{A,stat} = σ_{A,tot} ∝ 1/√n can be introduced.
In the limit of large N and n, the factorization σ²_{A,stat} = σ²_{A,tot} − σ²_{A,stoch} holds, see e.g. [20]. For our considerations there is no need to introduce this quantity.

The overhead of the additional solves can be reduced by improved solvers and/or aggressive deflation [27, 28, 29, 30, 31, 32]: the constant set-up costs of such techniques become more affordable with a larger number of right hand sides. We experimented with various patterns and combinations of colour, spin and time partitioning, see e.g. [33]. Partitioning as a stand-alone solution works very well in many situations [21, 34, 20, 22]. However, such gains are mostly eliminated once partitioning is combined with the other tricks. Time partitioning is a notable exception: in situations where only timeslice-to-all propagators are required there is no increase in the number of solves but the variance is still reduced. In the calculation of disconnected contributions to the nucleon structure it also turns out to be useful to generate the current insertion at more than one timeslice, e.g. to average the correlation with a nucleon propagating from a given source in the forward direction and the backward propagating one (which comes for free). In this case one can seed the random sources at two (or more) well separated timeslices and still gain from partitioning, without any associated overhead.

2.2. The hopping parameter expansion

The stochastic noise from terms that are close to the diagonal is accompanied by larger amplitudes than terms that are far off the diagonal, see Eq. (7) and the discussion in Sec. 2.1 above. Hence the cancellation of near-diagonal noise requires a comparatively larger number of estimates. The HPE aims at eliminating some of the near-diagonal noise contributions by exploiting the ultra-locality of the action. Thus it cannot be generalized, for instance, to the Neuberger action [35]. We rewrite the fermionic matrix as,

  2κM = 𝟙 − κD .    (11)

The HPE is based on the observation that one can expand,

  M⁻¹ = 2κ Σ_{i=0}^{∞} (κD)^i = 2κ Σ_{i=0}^{k−1} (κD)^i + (κD)^k M⁻¹ ,    (12)

where k ≥
1. For distances between source and sink consisting of more than k links, the first term on the right hand side does not contribute, since D only connects nearest spacetime neighbours. Therefore, M⁻¹_xy = [(κD)^k M⁻¹]_xy for sufficiently large source and sink separations. However, at finite N, {E[M⁻¹]}_xy ≠ {E[(κD)^k M⁻¹]}_xy ( = {(κD)^k E[M⁻¹]}_xy ), where the variance of the latter estimate of M⁻¹_xy is reduced. This was for instance exploited in [25]. In some cases additional powers of D can be gained due to the Γ structure of the creation and annihilation operators.

Here we study closed loops, i.e. x = y. Obviously, only even powers of D contribute to Tr(M⁻¹Γ), where Γ ∈ {𝟙, γ_µ, σ_µν, γ_µγ₅, γ₅}. We can write Tr(M⁻¹) = 2κ Tr 𝟙 + κ^k Tr(D^k M⁻¹), where the first term can be computed trivially. In the other cases, Tr(M⁻¹Γ) = κ^k Tr(D^k M⁻¹Γ). For the Wilson action, larger values of k are possible for Γ ∈ {γ_µγ₅, γ₅}, while for the remaining Γ structures k =
2. The lowest non-vanishing terms have been calculated analytically [23] and can be computed and corrected for exactly (unbiased noise subtraction) [23, 37, 5]. We limit ourselves to the highest order at which the omitted traces still vanish.

2.3. The truncated solver method

It has been noticed long ago [38] that solvers typically converge to the correct result, within an accuracy of the size of the stochastic error, after a relatively small number of iterations. Practitioners have therefore sometimes relaxed the requirement on the residual when solving for stochastic sources. This is not unproblematic, since it introduces a systematic bias that will be invisible on one configuration but might very well affect the result obtained on a sample of, say, 200 configurations. We label the result obtained after n_t solver iterations by |s_i(n_t)⟩, where omitting the subscript means convergence within numerical accuracy. We can now factorize:

  Q⁻¹ ≈ E(Q⁻¹) := (1/N₁) Σ_{i=1}^{N₁} |s_i(n_t)⟩⟨η_i| + (1/N₂) Σ_{i=N₁+1}^{N₁+N₂} ( |s_i⟩ − |s_i(n_t)⟩ ) ⟨η_i| .    (13)

The above equation is an exact linear decomposition of Q⁻¹ and the algorithm used to calculate both parts is well defined. Due to these properties and the fact that the |η_i⟩ with i ≤ N₁ are uncorrelated with those for i > N₁, the resulting estimate is unbiased. Ideally, one will generate a large number N₁ ≫ N₂ of relatively cheap estimates at small n_t and then remove the bias by correcting with a small number N₂ of expensive solutions to machine precision. This method can easily be combined with all the other methods described here. It is possible to tune the two parameters N₁/N₂ and n_t that enter the algorithm to the point of minimal variance at fixed computer time in a relatively inexpensive and straightforward way. This is discussed in Appendix A.

In some sense the underlying philosophy of the TSM is similar to estimating the cheap first term within the HPE, Eq.
(12), with many random sources and the expensive second term with only a few sources. However, iterative solvers like the CG converge faster than the HPE. Moreover, the TSM is applicable to any fermion action, not only to ultra-local ones, and is efficient for any quark mass. The TSM can be combined with the HPE, see Sec. 2.5 below.

To check our implementation of the method we compared the exact result for (M⁻¹)^{sc,sc}_{x,y}, where s and c denote the spin and colour indices, against TSM estimates for x = (0, 0, 0,
3) and y = (i, 0, 0, 3), varying i, for different n_t, N₁ and N₂. The corrected estimates reproduced the exact propagator components within the expected stochastic fluctuations.

In Appendix A we also demonstrate that the TSM can be generalized to other solvers, by employing BiCGstab2. However, the smooth convergence of CG, which is relatively independent of the random source and gauge configuration, is a clear advantage when it comes to deciding on a TSM parameter set. Moreover, in the CG case the combination with the truncated eigenmode acceleration bears less computational overhead and is compatible with even-odd preconditioning (see below). We emphasize, however, that the converged solutions |s_i⟩ within Eq. (13) can be obtained by using any efficient solver. In particular, there is no need to employ the same solver as that used for obtaining the truncated solutions |s_i(n_t)⟩. The truncated solver of course needs to be the same for i > N₁ as that used for i ≤ N₁.

2.4. The truncated eigenmode acceleration

We define the n_e smallest (real) eigenvalues q_i of Q and the corresponding orthonormal eigenvectors |u_i⟩, i = 1, …, n_e:

  Q|u_i⟩ = q_i|u_i⟩ ,  ⟨u_i|u_j⟩ = δ_ij .    (14)

These can for instance be calculated by means of the parallel implicitly restarted Arnoldi method (IRAM) with Chebyshev acceleration [39] or by Ritz methods [40]. We can now approximate [39],

  Q⁻¹ ≈ Σ_{i=1}^{n_e} |u_i⟩ q_i⁻¹ ⟨u_i| .    (15)

However, this estimate is biased. We define the projection operator P_{n_e} onto the complement of the space spanned by these n_e eigenvectors,

  P_{n_e} = 𝟙 − Σ_{i=1}^{n_e} |u_i⟩⟨u_i| .    (16)

Projecting the stochastic sources onto the orthogonal complement,

  Q|s_i^⊥⟩ = |η_i^⊥⟩ := P_{n_e}|η_i⟩ ,    (17)

yields the new unbiased estimate [25],

  Q⁻¹ ≈ E(Q⁻¹) := Σ_{i=1}^{n_e} |u_i⟩ q_i⁻¹ ⟨u_i| + |s^⊥⟩⟨η^⊥|‾ ,    (18)

where a left projection of |s_i^⊥⟩ is not necessary since [Q, P_{n_e}] = 0 and P_{n_e} Q P_{n_e} = Q P_{n_e}. The above estimate will usually have a reduced variance.
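The TEA estimator of Eqs. (15)–(18) can be sketched in a few lines. The following Python toy is our own illustration, not the production code: a small dense Hermitian positive definite matrix stands in for Q, numpy's dense solver replaces the deflated CG, and all names and sizes are ours.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy stand-in for the Hermitian operator Q = gamma_5 M (the real Q is
# indefinite and huge; here we use a small positive definite matrix).
n = 64
W = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
Q = W @ W.conj().T / n + 0.1 * np.eye(n)

def z2_source(n, rng):
    """Complex Z2 noise: components exp(i*phi), phi in {+-pi/4, +-3pi/4}."""
    phi = rng.choice([np.pi / 4, 3 * np.pi / 4, -np.pi / 4, -3 * np.pi / 4], size=n)
    return np.exp(1j * phi)

# Lowest n_e eigenmodes of Q, Eq. (14); eigh returns ascending eigenvalues.
n_e = 8
q, u = np.linalg.eigh(Q)
q_low, u_low = q[:n_e], u[:, :n_e]

def tea_trace(N):
    """Unbiased TEA estimate of Tr(Q^{-1}), Eq. (18): the n_e lowest modes
    are treated exactly, the projected remainder stochastically."""
    exact_low = np.sum(1.0 / q_low)
    acc = 0.0
    for _ in range(N):
        eta = z2_source(n, rng)
        eta_perp = eta - u_low @ (u_low.conj().T @ eta)   # P_{n_e}|eta>, Eq. (16)
        s_perp = np.linalg.solve(Q, eta_perp)             # Q|s_perp> = |eta_perp>
        acc += np.vdot(eta, s_perp).real                  # <eta|s_perp>
    return exact_low + acc / N

exact = np.trace(np.linalg.inv(Q)).real
est = tea_trace(200)
```

Because the projector commutes with Q, the solutions remain in the orthogonal complement, mirroring the remark below Eq. (18): no left projection is required.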
Note that in the literature exact point-to-all propagators have also been combined with a truncated low eigenmode all-to-all calculation to achieve smaller gauge errors (low mode averaging) [41, 42].

A nice side effect of the TEA lies in the acceleration of the solver, by deflation [43, 27, 29, 31, 44, 32]. The condition number of the projected operator QP_{n_e}, and therefore the number of solver iterations, is reduced. Within some algorithms like the CG, the fact that the projector commutes with the operator Q guarantees that the Krylov subspace remains confined within the orthogonal complement. So in these cases no further intermediate projections are necessary to fully exploit the potential of deflation.

Note that while the number of undeflated solver iterations increases like 1/m at small quark masses m, the efficiency of the eigenvalue calculation remains the same. Unfortunately, at small m_PS, one would like to increase the linear lattice extent L ∝ 1/m_PS. In this case, the rank of the deflation space in the worst case will increase like n_e ∝ L⁴ ∝ Va⁴. Depending on the volume required, this may or may not become a serious problem. The volume scaling of stochastic methods is somewhat more favourable: the number of estimates needs to be adjusted at most like √(Va⁴) (this can be somewhat reduced by partitioning). From these considerations it becomes evident that the optimal choices of n_e and N strongly depend on the volume, the quark mass and the lattice spacing. The same holds for the algorithm, where a deflated CG can outperform a (deflated) BiCGstab2, in particular when combined with the truncated solver method.

Obviously, it is also possible to decompose M, rather than the Hermitian operator Q = γ₅M of Eq. (15), into eigenmodes [45]: M|r_i⟩ = λ_i|r_i⟩. However, in this case we end up with a biorthonormal system ⟨ℓ_i|r_j⟩ = δ_ij, where the left eigenvectors |ℓ_i⟩ will differ from the right eigenvectors: ⟨ℓ_i|M = ⟨ℓ_i|λ_i.
One can now decompose,

  M⁻¹ ≈ Σ_{i=1}^{n_e} |r_i⟩ λ_i⁻¹ ⟨ℓ_i| .    (19)

Figure 1: Connected and quark-line disconnected current insertion into the nucleon.
It can easily be seen that, due to the property M† = γ₅Mγ₅, λ̃_i := λ_i* is an eigenvalue whenever λ_i is an eigenvalue, with a left eigenvector ⟨ℓ̃_i| = ⟨r_i|γ₅ and a right eigenvector |r̃_i⟩ = γ₅|ℓ_i⟩. The advantage of this decomposition is that the eigenvectors are independent of the quark mass and the eigenvalues at different κ values are trivially related to each other. This follows from the structure of Eq. (11). In this article we will not pursue this alternative eigenmode decomposition any further. However, the decomposition of M⁻¹ might converge better in some channels than that of Q⁻¹ and vice versa. While this biorthogonal approach is incompatible with deflating the CG solver, it would be the natural starting point for BiCG solvers.

2.5. Combining the methods

Partitioning can trivially be combined with the other three methods. However, some notes are in place with respect to combinations of the HPE, TSM and TEA. Within the HPE, a left-multiplication of the estimate E[Q⁻¹] with D^k is essential, since we have defined M⁻¹ = Q⁻¹γ₅. A right-multiplication would involve γ₅Dγ₅ = D†. The easiest way of implementing the HPE is to multiply the solution vectors with (κD)^k, prior to any application of smearing functions or contractions but after TEA and TSM.

Within Eq. (18) only the projected source vectors appear. Hence after the projection, Eq. (17),

  |η_i^⊥⟩ = |η_i⟩ − Σ_{j=1}^{n_e} ⟨u_j|η_i⟩ |u_j⟩ ,    (20)

the original noise vectors |η_i⟩ can be discarded.

The overhead from the projection can be significant when combined with the TSM, Eq. (13): the large number of estimates N₁ implies a large number of projections, and the small number of solver iterations n_t indicates that not much computer time will be spent between these projections. The projection overhead can be reduced by restricting the computation of the inner products to the partitioning subspace: many components of |η_i⟩ might have been set to zero if the partitioning method was used.
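The partitioned sources of Eq. (10) can be sketched in the same toy setting (again an illustration with assumed names: a dense Hermitian matrix and a partition of the plain index range stand in for colour, spin and time dilution of the lattice source vectors).

```python
import numpy as np

rng = np.random.default_rng(11)

n, m = 60, 6   # 60 toy "lattice" components, partitioned into m subspaces R_j
W = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
Q = W @ W.conj().T / n + 0.5 * np.eye(n)

def z2_source(n, rng):
    """Complex Z2 noise: components exp(i*phi), phi in {+-pi/4, +-3pi/4}."""
    phi = rng.choice([np.pi / 4, 3 * np.pi / 4, -np.pi / 4, -3 * np.pi / 4], size=n)
    return np.exp(1j * phi)

def partitioned_trace(N, m):
    """Estimate Tr(Q^{-1}) with partitioned (diluted) sources, Eq. (10):
    each noise vector is zeroed outside one subspace, at the price of m solves."""
    blocks = np.array_split(np.arange(n), m)
    acc = 0.0
    for _ in range(N):
        eta = z2_source(n, rng)
        for b in blocks:
            eta_j = np.zeros(n, dtype=complex)
            eta_j[b] = eta[b]                    # source restricted to R_j
            s_j = np.linalg.solve(Q, eta_j)      # Q|s_j> = |eta_j>
            acc += np.vdot(eta_j, s_j).real      # only intra-block noise survives
    return acc / N

exact = np.trace(np.linalg.inv(Q)).real
est = partitioned_trace(50, m)
```

Only intra-block off-diagonal elements of Q⁻¹ now contribute to the variance; on the lattice this is what "blacks out" the dominant near-diagonal noise, at an m-fold cost in solves.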
This saving cannot be obtained at the sink. Fortunately, [P_{n_e}, Q] =
0, such that the solutions |s_i^⊥⟩ remain within the orthogonal complement and no second projection is necessary, provided the residual of the solver is chosen sufficiently small! Obviously the residual will not be small for the truncated solutions, which therefore one would wish to project back into the orthogonal complement (before application of the D operator within the HPE). Fortunately, the (even-odd preconditioned) CG algorithm never leaves the orthogonal subspace: in this case, such an additional projection is never necessary, even if the solver is not run to convergence. Neither is such a left projection strictly required for solvers without this property, as long as the precision of the part that is run to convergence is sufficiently large. In this case, the N₁ biased estimates will pick up some unwanted low mode contributions that will, however, be corrected for by the N₂ estimates of the bias. This might or might not increase the stochastic errors.

For completeness we mention that domain decomposition techniques have been suggested in the literature [46, 47]. These can easily be combined with the TSM and TEA too, as can the so-called one-end-trick [48, 49, 50]. We remark that if a bosonic representation is employed [46, 51], estimating M†M = Q² instead of Q, then the TSM can in principle be substituted by a quasi heatbath strategy [52].
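To close this section, a minimal sketch of the TSM of Eq. (13), assuming a Hermitian positive definite toy operator so that a plain CG applies (solver, sizes and parameters are illustrative; in particular n_conv below is simply a safely large iteration count, not the tuned value of Appendix A):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 80
W = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
Q = W @ W.conj().T / n + 0.5 * np.eye(n)   # Hermitian positive definite toy "Q"

def z2_source(n, rng):
    """Complex Z2 noise: components exp(i*phi), phi in {+-pi/4, +-3pi/4}."""
    phi = rng.choice([np.pi / 4, 3 * np.pi / 4, -np.pi / 4, -3 * np.pi / 4], size=n)
    return np.exp(1j * phi)

def cg(Q, b, n_iter, tol=1e-12):
    """Plain conjugate gradient, stopped after n_iter iterations."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = np.vdot(r, r).real
    for _ in range(n_iter):
        if rr < tol:
            break
        Qp = Q @ p
        alpha = rr / np.vdot(p, Qp).real
        x = x + alpha * p
        r = r - alpha * Qp
        rr_new = np.vdot(r, r).real
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def tsm_trace(N1, N2, n_t, n_conv=500):
    """TSM estimate of Tr(Q^{-1}), Eq. (13): N1 cheap estimates from solves
    truncated at n_t iterations, plus N2 expensive bias corrections."""
    cheap = 0.0
    for _ in range(N1):
        eta = z2_source(n, rng)
        cheap += np.vdot(eta, cg(Q, eta, n_t)).real
    corr = 0.0
    for _ in range(N2):
        eta = z2_source(n, rng)          # independent sources for the correction
        s_conv = cg(Q, eta, n_conv)      # "converged" solution
        s_trunc = cg(Q, eta, n_t)        # same truncated solver as above
        corr += np.vdot(eta, s_conv - s_trunc).real
    return cheap / N1 + corr / N2

exact = np.trace(np.linalg.inv(Q)).real
est = tsm_trace(N1=100, N2=10, n_t=5)
```

Note that the truncated solver is identical for the cheap estimates and for the bias correction, as required below Eq. (13) for the estimate to remain unbiased.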
3. Evaluation of the disconnected loop
We study combinations of the variance reduction techniques outlined above, using configurations provided by the Wuppertal group: these are n_f = 2 + 1 flavour configurations of volume V = 16³ ×
32 lattice points, generated using a Symanzik improved gauge action and a stout-link improved (rooted) staggered fermion action. The lattice spacing is fairly coarse, a⁻¹ ≈ 1.
55 GeV, while the spatial extent is around 2 fm. Further details can be found in [53]. For the valence quarks we use the Wilson action, computing disconnected loops at κ_loop = 0.166, κ_loop = 0.1675 and κ_loop = 0.1684, where κ_loop refers to the κ value of the disconnected loop. Our main results are obtained using the CG solver with even-odd preconditioning. However, results obtained with BiCGstab2 are given in Appendix A. The code used throughout is a modified version of Chroma [54, 55, 56].

We wish to calculate nucleon structure observables. For this purpose a nucleon will be created at some initial time t_i and destroyed at a later time t_f ≫ t_i, with a current inserted at some intermediate time t. This is illustrated in Fig. 1. The disconnected loop Tr(M⁻¹Γ) within the right diagram of the figure will be calculated with stochastic all-to-all techniques and can subsequently be combined with the nucleon two-point function, calculated in the standard point-to-all way, on a configuration by configuration basis.

A preliminary picture of how well the variance reduction techniques work can be obtained by studying the zero-momentum projected disconnected loop Σ_x Tr(M⁻¹_xx Γ) alone, where x = (x, t). With the exception of Γ = 𝟙, expectation values of these loops over many configurations vanish, due to the discrete charge and parity symmetries of QCD. However, the expectation value of the correlation between loop and proton (with a momentum injected or with a specified helicity) can be non-zero. Likewise, with the exception of trivial cases such as Im Tr(M⁻¹γ₅) =
0, the loops will not vanish on single configurations. Using the Euclidean spacetime convention
Figure 2: Truncated estimates of the zero momentum projected Tr(M⁻¹Γ), obtained after n_t CG solver iterations, for
Γ = 𝟙 (left) and for Γ = γ₃γ₅ (right) at κ_loop = 0.166.

{γ_µ, γ_ν} = 2δ_µν and γ₅ = γ₁γ₂γ₃γ₄, we are interested in evaluating Re Tr(M⁻¹), Im Tr(M⁻¹γ_µ), Im Tr(M⁻¹σ_µν) = Re Tr(M⁻¹γ_µγ_ν) for µ ≠ ν, Re Tr(M⁻¹γ_µγ₅) and Im Tr(M⁻¹γ₅).

We investigate the reduction in computer time of optimized stochastic estimates, relative to the situation without the use of any improvement techniques (except for time partitioning: all disconnected loops are calculated on one timeslice only). We state all costs in terms of the average real computer time required on a Pentium 4 PC for one solver application (unimproved estimate), where we account for all overheads of the improvement methods. We also ran the parallelized code on IBM p6 and Opteron clusters as well as on BlueGene/L and BlueGene/P systems. The overall efficiencies of matrix-vector multiplies and global sums depend on the architecture; however, the gain factors defined below are not significantly affected. We define the gain as the ratio,

  gain(Γ) = var{E[Tr(M⁻¹Γ)]} / var{E_imp[Tr(M⁻¹Γ)]} ,    (21)

taken at fixed real computer time. Here var denotes the variance between stochastic estimates on a given gauge configuration, while E and E_imp stand for the unimproved and improved estimates, respectively.

We start by investigating the TSM at the heaviest quark mass, κ_loop = 0.166. The truncated estimate of Tr(M⁻¹Γ) is obtained as an average over N₁ truncated solutions |s_i(n_t)⟩: ⟨η_i|γ₅Γ|s_i(n_t)⟩. This value is then corrected by N₂ ≪ N₁ estimates of the resulting bias, see Eq. (13). The faster the truncated estimate as a function of n_t approaches the estimate obtained at full convergence, the better the method will work. In this study we only consider currents of local quark bilinears, where we use the Γ conventions of [54]. We observe very satisfactory convergence rates for all the 16 possible Γ structures. In Fig.
2 we illustrate this for the even/odd preconditioned CG solver for the worst case (Γ = 𝟙) and for the best case (Γ = γ₃γ₅). Note, however, that the unimproved estimate of Γ = 𝟙 is already very precise to start with. Convergence at this κ value is reached after n_conv ≈
480 CG iterations.

Applying the TSM involves making choices for n_t and for N₁/N₂. We detail our optimization procedure in Appendix A. This systematic tuning results in similar amounts of computer time being spent on the truncated estimate as on estimating the bias. In particular we find N₁/N₂ ≈ n_conv/n_t. Performing this optimization on a single configuration appears sufficient since the variance is not

E_imp[Tr(M⁻¹Γ)]:    TSM                             TSM + HPE
Γ                   𝟙    γ_µ   σ_µν  γ₅   γ_µγ₅    𝟙    γ_µ   σ_µν  γ₅   γ_µγ₅
n_t                 50   27    14    18   18        66   78    50    78   90
N₁/N₂               23   21    32    28   30        26   25    21    26   26

Table 1: Optimized TSM values for n_t and N₁/N₂, see Eq. (13), at κ = 0.166 for a subset of the Γs studied, calculated on one configuration using 300 estimates. The approximate gains obtained at fixed cost, using these values, are also shown. Where our method is combined with the HPE technique, k indicates the order used.

Figure 3: Estimates of the zero momentum projected Tr[(κD)^k M⁻¹Γ] at κ_loop = 0.166 for
Γ = 𝟙 (left) and Γ = γ₃γ₅ (right). The errors are obtained from 300 estimates on one gauge configuration and the zero-order HPE contribution for Γ = 𝟙 was calculated explicitly.

much affected if n_t and N₁/N₂ are changed by 20%. Fluctuations between configurations turn out to be smaller than this value. In Table 1 we display the optimized parameters at κ_loop = 0.166 for a representative choice of Γ structures (scalar, vector, tensor, pseudoscalar and axial vector), together with the approximate fixed cost gains, Eq. (21). These factors vary between 5 and 10. In a real production run one would not wish to generate new sets of optimized estimates for each Γ structure or observable one is interested in. Fortunately, with the still tolerable exception of Γ = 𝟙, n_t and N₁/N₂ exhibit only a mild Γ dependence, such that similar overall gains can still be achieved with just one parameter set.

The TSM can trivially be combined with the hopping parameter expansion. At κ_loop = 0.166 this yields a further substantial error reduction, in particular for Γ = 𝟙, see Fig. 3. Both methods aim at removing noisy short range contributions from the estimates. Clearly, the gain from combining the two methods will be smaller than the product of the two isolated gains. However, the separation of long and short range is organized differently. The HPE only works for ultra-local actions and explicitly removes terms up to a particular lattice hopping radius. It will be less convergent and hence less effective at very light quark masses. The TSM is more generally applicable and separates short range modes from long range ones, but this is done within the Krylov space of the solver. When combining the HPE with the TSM, the reduced variances of the truncated estimates result in larger optimal n_t values. The constant computational overhead per solve of the additional D applications also pushes n_t

                    TSM                             TSM + HPE
m_PS                𝟙    γ_µ   σ_µν  γ₅   γ_µγ₅    𝟙    γ_µ   σ_µν  γ₅   γ_µγ₅
600 MeV             5    5     10    8    8         8    11    19    25   30
450 MeV             5    5     10    8    8         7    11    17    22   25
300 MeV             5    5     10    8    8         6    9     15    17   19
Table 2: The TSM gains without and in combination with the HPE at different quark masses.
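The HPE split of Eq. (12), which enters the combined gains above, can be illustrated for Γ = 𝟙 with a toy hopping matrix (an assumption-laden sketch: a dense random D with zero diagonal replaces the Wilson nearest-neighbour term, so only the algebraic structure is faithful).

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy "hopping" matrix D: zero diagonal, scaled so kappa*D is a contraction.
n, kappa, k = 50, 0.12, 2
D = rng.normal(size=(n, n))
np.fill_diagonal(D, 0.0)
D /= np.abs(np.linalg.eigvals(D)).max()        # spectral radius 1
M = (np.eye(n) - kappa * D) / (2.0 * kappa)    # 2*kappa*M = 1 - kappa*D, Eq. (11)

Minv = np.linalg.inv(M)
kD = kappa * D

# Geometric-series split, Eq. (12): M^{-1} = 2k sum_{i<k}(kD)^i + (kD)^k M^{-1}
low = 2.0 * kappa * sum(np.linalg.matrix_power(kD, i) for i in range(k))
assert np.allclose(Minv, low + np.linalg.matrix_power(kD, k) @ Minv)

def z2_source(n, rng):
    """Complex Z2 noise: components exp(i*phi), phi in {+-pi/4, +-3pi/4}."""
    phi = rng.choice([np.pi / 4, 3 * np.pi / 4, -np.pi / 4, -3 * np.pi / 4], size=n)
    return np.exp(1j * phi)

def hpe_trace(N):
    """Tr(M^{-1}) = exact low-order hopping terms + stochastic estimate of
    the remainder Tr((kD)^k M^{-1}); near-diagonal noise is removed."""
    Dk = np.linalg.matrix_power(kD, k)
    acc = 0.0
    for _ in range(N):
        eta = z2_source(n, rng)
        s = np.linalg.solve(M, eta)        # M|s> = |eta>
        acc += np.vdot(eta, Dk @ s).real   # <eta|(kD)^k|s>
    return np.trace(low) + acc / N

exact = np.trace(Minv)
est = hpe_trace(200)
```

The exact low-order term replaces the noisiest near-diagonal contributions; only the strongly suppressed remainder (κD)^k M⁻¹ is estimated stochastically, which is the mechanism behind the HPE columns of the tables.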
Figure 4: Squared residual |r|² of the solver, as a function of the number of CG iterations. The P_{n_ev} curves denote the outcomes after deflating the n_ev lowest modes.

towards larger values. Table 1 demonstrates that the combined gain of these two methods at κ_loop = 0.
166 can be as large as a factor of 30.Having tested the method at κ loop = . m PS ≈
600 MeV, we also use it to calculatethe disconnected loop at κ loop = . κ loop = . ff ective at lighter masses, in agreement with our expectation. Still, even at the lightest quarkmass, the combined gains range from factors of 6 (for Γ = ) up to 19 (for Γ = γ γ ).At light quark masses where the HPE becomes less e ff ective, the low eigenmode contribu-tions to hadronic observables might become more dominant [39, 41, 42, 25]. To study this e ff ect,we deflate the lowest eigenmodes at the lightest pseudoscalar mass. As expected, this acceleratesthe solver [43, 26, 27, 28, 29, 30, 31, 44, 32]. We display a typical residual, as a function of thenumber of CG iterations, without deflation and deflating the lowest 5, 10, 20 and 50 modes ofthe Hermitian Wilson-Dirac operator in Fig. 4.The optimized TSM parameters at κ loop = . imp [Tr (M − Γ )] TSM + HPE TSM + HPE + TEA( P ) Γ γ γ γ γ γ γ γ γ γ γ γ γ n t
155 220 155 160 240 90 100 60 90 100 N / N
18 17 19 16 16 30 28 27 32 28 k Table 3: The optimized TSM parameters and gains obtained at κ loop = . configuration n c onv κ =0.1675 κ =0.1675 P κ =0.166 configuration n c onv κ =0.1684 κ =0.1684 P Figure 5: The numbers of iterations to convergence. For the heavier two κ values (left) the comparison is with 70 sourcesper configuration, for κ = . for combining TSM with HPE and the TEA of the lowest 20 eigenmodes are shown in Table 3.The faster rate of convergence of the deflated solver leads to smaller n t values and thereforeto larger N / N ratios when TSM is combined with TEA. When normalized to the real cost ofgenerating 300 unimproved stochastic estimates, all gain factors increase by more than a factor of two , and this in spite of one quarter of the time being spent on generating and projecting onto theeigenvectors. However, this factor can fully be attributed to the acceleration of the solver: whilethere exist observables where stochastic errors decrease when applying the TEA [39, 41, 25], forthe quark loops that we investigate here this is not the case, at least not at pseudoscalar massesabove 300 MeV. We observe a break-even between TSM + HPE + TEA( P ) and TSM + HPE whenmatching the cost of approximately 100 estimates, as can be seen in the last row of the table. Inlattice simulations we will encounter gauge errors from the Monte Carlo time series, in additionto the errors from the stochastic estimates on single configurations, discussed here. This interplayis studied below.
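The variance reduction behind the HPE can be made concrete in a small toy model. The sketch below is ours (sizes and the matrix are invented): with M = 1 - κD one has the exact identity M^-1 = Σ_{k<K} (κD)^k + (κD)^K M^-1, so the noisy short-range terms can be traced exactly and only the suppressed remainder is estimated stochastically.

```python
import numpy as np

# Toy illustration of the HPE trick: trace the first K hopping terms of
# M^-1 = sum_{k<K} (kappa D)^k + (kappa D)^K M^-1 exactly and estimate only
# the remainder with Z2 noise. A random symmetric D stands in for the
# hopping matrix; all sizes are invented.

rng = np.random.default_rng(2)
n = 50
D = rng.standard_normal((n, n))
D = (D + D.T) / 2
D /= np.max(np.abs(np.linalg.eigvalsh(D)))            # spectrum in [-1, 1]
kappa = 0.2
Minv = np.linalg.inv(np.eye(n) - kappa * D)

K = 4
hop_K = np.linalg.matrix_power(kappa * D, K)          # (kappa D)^K
exact_part = sum(np.trace(np.linalg.matrix_power(kappa * D, k))
                 for k in range(K))                   # handled exactly

def estimates(N, improved):
    out = []
    for _ in range(N):
        eta = rng.choice([-1.0, 1.0], size=n)
        if improved:
            out.append(exact_part + eta @ (hop_K @ (Minv @ eta)))
        else:
            out.append(eta @ (Minv @ eta))
    return np.array(out)

plain = estimates(2000, improved=False)
hpe = estimates(2000, improved=True)
print(np.std(plain), np.std(hpe))   # same mean, far smaller fluctuations
```

Both estimators have the same expectation value, but the remainder is suppressed by κ^K, which is what shrinks the stochastic noise.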
When calculating averages over configurations, the stochastic errors σ_stoch, defined in Eq. (8), should be smaller than, say, half of the final Monte Carlo errors σ_tot. However, making σ_stoch unnecessarily small can be a waste of computer time that would be more wisely spent on analysing more configurations or realizing additional source positions. We start to study this balance of errors by computing expectation values over configurations for the disconnected loops. In Sec. 4 below we will then calculate the three point functions that are of physical interest. We remark that the (zero momentum projected) gauge average ⟨Re Tr(M^-1)⟩_c will have a non-trivial value, while ⟨Re Tr(M^-1 γ_μ γ_5)⟩_c = 0.

We first assess the impact of low mode deflation. In Fig. 5 we display the number of CG iterations until convergence, n_conv, for several random sources on different gauge configurations. Without deflation, reducing the quark mass from m_PS ≈ 600 MeV (κ = 0.166) to m_PS ≈ 300 MeV (κ = 0.1684) causes n_conv to grow from about 500 to 1800 iterations. This approximately four-fold increase is consistent with the expected 1/m behaviour of the condition number of M (we solve for Q = M†M). Deflating 5 eigenmodes (P_5) at κ = 0.1675 and at κ = 0.1684 already reduces n_conv considerably. There are fluctuations of n_conv between and within configurations that increase with decreasing quark mass, as has been investigated quantitatively in [57]. Deflating the lowest modes vastly reduces the variance in n_conv. Clearly, the TSM parameter tuning benefits from this stabilization.

We now average our improved estimates over n configurations, where we employ TSM, HPE and, at the lightest κ value, TEA with a projection onto the 20 lowest modes. In addition, we calculate up to 100 unimproved estimates on each configuration. The results as functions of the real computer time spent, in units of the cost of one unimproved estimate, i.e. of one solve, are displayed in Table 4, for Γ = 1 and Γ = γ_j γ_5, where we average over the three possible j-directions. The total errors σ_tot are displayed in brackets, followed by their respective lower bounds, as given by the stochastic errors σ_stoch.

For Γ = 1, even at the cost equivalent of six standard estimates, the gauge errors still dominate over the improved stochastic errors. Therefore, there is little point in exceeding this number. On the given ensemble, the same error can be obtained using 25 to 50 standard estimates, suggesting a five-fold saving in computer time. However, this is somewhat misleading since, by just increasing the number of gauge configurations by a factor of 1.6, the same error can be obtained employing six unimproved estimates. It should be noted that the error balance could in principle look different once the loop is correlated with a nucleon propagator within a three point function. Moreover, the stochastic error will become more relevant on large volumes.
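The mechanics of low-mode deflation can be sketched in a short toy program (ours; the spectrum and sizes are invented): the lowest eigenmodes of an SPD matrix are treated exactly, CG runs only on the deflated right-hand side, and the iteration counts with and without deflation are compared.

```python
import numpy as np

# Sketch of low-mode deflation: solve the low-mode part of A^-1 b exactly
# from a few precomputed eigenpairs and run CG on the deflated remainder.
# The toy spectrum has a handful of tiny eigenvalues, as a stand-in for the
# low modes of the (squared) Dirac operator at light quark mass.

rng = np.random.default_rng(3)
n = 200
evals = np.concatenate([[1e-3, 2e-3, 5e-3, 1e-2, 2e-2],     # few tiny modes
                        rng.uniform(0.5, 2.0, n - 5)])
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = (Q * evals) @ Q.T                                        # SPD toy matrix

def cg_iters(A, b, tol=1e-8):
    x = np.zeros_like(b); r = b.copy(); p = r.copy(); rs = r @ r
    for k in range(1, 10 * len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            return x, k
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, 10 * len(b)

b = rng.standard_normal(n)
_, iters_full = cg_iters(A, b)

n_ev = 5
w, V = np.linalg.eigh(A)
Vlow = V[:, :n_ev]                          # the n_ev lowest eigenmodes
x_low = Vlow @ ((Vlow.T @ b) / w[:n_ev])    # exact low-mode part of A^-1 b
x_defl, iters_defl = cg_iters(A, b - Vlow @ (Vlow.T @ b))
x = x_low + x_defl
print(iters_full, iters_defl)               # deflation cuts the iteration count
```

The deflated solve only sees the well-conditioned remainder of the spectrum, which is why the iteration counts stabilize as well as shrink.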
At the lightest κ value, due to the overhead of setting up TEA, the cost of the improved estimates will always exceed the number 6. In fact, as can also be seen from Table 4, TEA only turns out to be a worthwhile enterprise for a cost equivalent bigger than 90. Hence, unless the eigenvectors have been generated anyway, for instance to enable low mode averaging [41] for the two point functions or for the calculation of other current insertions, TEA is best omitted for Γ = 1.

The picture for the axial current containing Γ = γ_j γ_5 is different: here, at the cost equivalent of 100 standard estimates, the stochastic error still accounts for one quarter of the total error and the gain of applying the improved method is in all cases more than four-fold. This advantage should increase further at larger volumes. Note that the reductions of the stochastic errors alone, for the scalar and axial cases, are consistent with the results of Table 2 that were obtained on a single configuration.

Based on these results, we decide to invest the cost equivalent of 100 even/odd preconditioned CG solves into the TSM/HPE/(TEA) estimation of the axial loop and of 6 CG solves into the estimation of the scalar loop. Note that for the calculation of a standard point-to-all baryonic two-point correlation function usually 12 such solves are required, and even more sources if a variational basis is used to optimize the creation operator. We display the resulting stochastic errors ∆_stoch on single configurations at κ_loop = 0.166 in Fig. 6. This graphical representation again makes it evident that even for this low-cost setting the stochastic errors are much smaller than the gauge noise. The larger error bars on the right of the figures correspond to unimproved ∆_stoch values, obtained at the same cost. After application of the improvement methods, the stochastic errors are so small that nothing extra can be gained from increasing the number of stochastic sources (and solves) beyond these moderate values.

Re Tr(M^-1):        cost   E_imp         σ_stoch^imp   E_0           σ_stoch^0
κ_loop = 0.166       300   14702.6 (7)   0.04
m_PS ≈ 600 MeV       200   14702.6 (7)   0.05
n = 326              100   14702.6 (7)   0.06          14702.9 (7)   0.17
                      50   14702.6 (7)   0.09          14703.0 (8)   0.23
                      25   14702.5 (7)   0.13          14703.1 (8)   0.33
                      12   14702.5 (7)   0.18          14703.5 (9)   0.47
                       6   14702.3 (8)   0.23          14703.7(1.0)  0.65
κ_loop = 0.1675      200   14743.0(1.1)  0.08
m_PS ≈ 450 MeV       100   14743.2(1.1)  0.11          14743.3(1.1)  0.25
n = 167               50   14743.2(1.1)  0.16          14743.8(1.1)  0.33
                      25   14743.2(1.1)  0.23          14744.2(1.2)  0.47
                      12   14743.4(1.2)  0.33          14745.0(1.3)  0.69
                       6   14743.5(1.2)  0.42          14744.6(1.5)  0.96
κ_loop = 0.1684      200   14764.9(1.2)  0.05
m_PS ≈ 300 MeV       100   14764.9(1.2)  0.08          14764.6(1.2)  0.27
n = 152               90   14765.0(1.2)  0.10          14764.6(1.2)  0.29
                      50*

Σ_j Re Tr(M^-1 γ_j γ_5):
κ_loop = 0.166       300   -0.008 (50)   0.016
m_PS ≈ 600 MeV       200    0.007 (51)   0.019
n = 326              100   -0.033 (55)   0.027         -0.185(148)   0.135
                      50   -0.054 (64)   0.039         -0.446(201)   0.186
κ_loop = 0.1675      200   -0.096 (91)   0.038
m_PS ≈ 450 MeV       100   -0.040(101)   0.054          0.003(211)   0.198
n = 167               50   -0.038(114)   0.076          0.056(265)   0.271
κ_loop = 0.1684      200   -0.067 (96)   0.020
m_PS ≈ 300 MeV       100   -0.068 (96)   0.036         -0.089(216)   0.212
n = 152               90   -0.072 (99)   0.042         -0.042(227)   0.223
                      50*  -0.141(106)   0.057
                      25*  -0.199(120)   0.082

Table 4: Monte Carlo and stochastic errors for the (TSM + HPE) improved and unimproved estimates of Re Tr(M^-1 Γ). The stochastic errors have been normalized according to Eq. (8). For κ_loop = 0.1684 additionally TEA(P_20) is employed; in the asterisked rows the costs of generating the eigenvectors were neglected.

Figure 6: The zero momentum projected estimates E_imp[Re Tr(M^-1)] (left) and Σ_j E_imp[Re Tr(M^-1 γ_j γ_5)] (right) at κ_loop = 0.166, with stochastic errors ∆_stoch, on different gauge configurations. The scalar loop was generated investing the cost equivalent of 6 CG solves. The axial loop was evaluated at the cost of 100 solves. The large error bars on the right of the figures correspond to unimproved estimates E_0.
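The diminishing returns of extra stochastic sources can be illustrated with a back-of-the-envelope model (the numbers below are invented, not the paper's): each configuration carries gauge noise σ_g, on top of which N stochastic sources add a σ_s/√N contribution.

```python
import numpy as np

# Toy model of the gauge-noise / stochastic-noise interplay: the error of the
# ensemble mean over n_cfg configurations, with N stochastic sources each.
# All parameter values are invented for illustration.

sigma_g, sigma_s, n_cfg = 1.0, 3.0, 300

def total_error(N):
    # per-configuration variance = sigma_g^2 + sigma_s^2 / N
    return np.sqrt((sigma_g**2 + sigma_s**2 / N) / n_cfg)

for N in (1, 4, 10, 40, 100, 1000):
    print(N, total_error(N))
# once sigma_s**2/N << sigma_g**2, the total error saturates at
# sigma_g/sqrt(n_cfg): further solves per configuration are wasted
```

This is the quantitative content of the break-even observation above: beyond the point where the stochastic variance drops below the gauge variance, computer time is better invested in more configurations or source positions.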
4. Application to ∆q_dis and to ⟨N| q̄q |N⟩_dis

We now apply our methods to the calculation of observables of phenomenological interest, namely of ∆q and ⟨N| q̄q |N⟩. The contribution to the nucleon spin ∆q is defined through the matrix element (in Minkowski space notation),

    ⟨N, s| q̄ γ_μ γ_5 q |N, s⟩ = 2 m_N s_μ ∆q ,    (22)

where m_N denotes the mass of the nucleon N and s_μ its spin vector (s_μ s^μ = -1). ∆q and ⟨N| q̄q |N⟩ are extracted from the ratios of three-point functions to two-point functions (at zero momentum):

    R_dis(t, t_f) = - Re [ ⟨ Γ_αβ C_βα(t_f) Σ_x Tr(M^-1(x, t; x, t) Γ_loop) ⟩_c / ⟨ Γ^unpol_αβ C_βα(t_f) ⟩_c ] .    (23)

For the scalar matrix element we use Γ = Γ_unpol := (1 + γ_4)/2 and Γ_loop = 1. For ∆q we calculate the difference between two polarizations: Γ = γ_j γ_5 Γ_unpol and Γ_loop = γ_j γ_5, where we average over all three possible j-orientations. The spin projection operators along the j-axis read P_↑↓ = (1 ± i γ_j γ_5)/2, so that in this case Γ = -i (P_↑ - P_↓) Γ_unpol, where we have traded a factor i against taking the real part, rather than the imaginary part, of the numerator in Eq. (23). The variance of the above expression is reduced by explicitly using the fact that Im Tr(M^-1) = Im Tr(M^-1 γ_j γ_5) = 0.

Figure 7: The effective mass, Eq. (26), of the nucleon smeared at source and sink for κ = 0.166. The time t is displayed in lattice units, a ≈ 0.13 fm. The result of a fit to the time range 3 ≤ t < 16 is shown as horizontal lines.
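The effective mass construction of Eq. (26) is easy to demonstrate on synthetic data. The amplitudes and masses below are invented; the point is only that the log-ratio of neighbouring timeslices falls onto a plateau at the ground state mass once the excited state has died off.

```python
import numpy as np

# Sketch of Eq. (26) on a synthetic two-state correlator (a = 1, invented
# amplitudes and masses): m_eff(t) = ln[C(t)/C(t+1)] decreases towards the
# ground state mass m0 as the excited-state contamination decays.

t = np.arange(16)
m0, m1 = 0.57, 1.1                       # ground / excited state masses
C = np.exp(-m0 * t) + 0.4 * np.exp(-m1 * t)

m_eff = np.log(C[:-1] / C[1:])
print(m_eff)                             # plateau at m0 = 0.57
```

This mirrors the consistency check in the text: once m_eff(t) has flattened, excited-state contributions to matrix elements inserted at later times can be neglected.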
The two-point function of the zero momentum projected proton with sink and source spinor indices α and β is given by,

    C_αβ(t_f) = Σ_x ⟨ N_α(x, t_f) N_β^†(0, 0) ⟩ ,    (24)

where we have set the source time to zero. This can be constructed from standard point-to-all quark propagators [6]. Note that for q = u, d there are additional connected contributions R_con, which we have not calculated. We combine the three κ_loop values with κ = 0.166 and 0.1675.

In the limit of large times, t_f ≫ t ≫ 0, in the axial case,

    R_dis(t, t_f) + R_con(t, t_f) → ⟨N, s| q̄ γ_j γ_5 q |N, s⟩ / m_N = 2 ∆q .    (25)

The prefactor two comes from taking the difference between two opposite polarizations rather than fixing one polarization. We note that this result will be in a lattice scheme and still needs to be multiplied by a renormalization factor of O(1) for a translation into the MS-bar scheme. In the scalar case, ⟨N| q̄q |N⟩ is defined as the connected contribution only and can thus be obtained by subtracting the vacuum expectation value ⟨ q̄q ⟩ = - Σ_x Re ⟨Tr(M^-1(x, t; x, t))⟩_c from Eq. (23). We will label the disconnected contributions to these two matrix elements as ∆q_dis and ⟨N| q̄q |N⟩_dis, respectively. In the case of the strange quark, ∆s = ∆s_dis and ⟨N| s̄s |N⟩ = ⟨N| s̄s |N⟩_dis.

We calculate the disconnected loop on only one timeslice, t = 3 (≈ 0.38 fm). Having the operator inserted close to the source (t = 0) reduces the statistical errors, but care must be taken that contributions from excited states are suppressed so that Eq. (25) still holds. Following [25], we use Wuppertal smearing on top of APE smeared links at the source and sink for both the three point and two point functions. In Fig. 7 we display the effective mass,

    m_N,eff(t) = a^-1 ln[ C(t) / C(t + a) ] ,    (26)

as a function of the time. This shows that with our choice of smearing parameters the excited state contributions are not significant at t = 3. Hence, in the symmetric situation, t_f = 6, such contributions should be negligible. This is even more so since the statistical error will be much bigger than that of the two point function at t = 3. The consistency of this assumption can be checked by varying t_f. We display ⟨N| q̄q |N⟩_dis and ∆q_dis in the lattice scheme, for different final times t_f at κ_loop = κ = 0.166, in Fig. 8. The loop was evaluated at t = 3, at the cost of 6 even/odd preconditioned CG solves for the scalar and at the cost of 100 such solves for the axial current. Our analysis of the two point function above suggests that finite t systematics should be small, relative to the statistical errors, at t_f = 6. Indeed, all t_f ≥ 6 results are compatible with each other, for all κ_loop ∈ {0.166, 0.1675, 0.1684} and κ ∈ {0.166, 0.1675} combinations, where the lightest κ_loop value corresponds to m_PS ≈ 300 MeV. The five combinations that are not shown in Fig. 8 result in the same general picture, with larger statistical errors. For the two point function we do not realize the lightest κ value since, without low mode averaging, this turns out to be very noisy.

Figure 8: The matrix elements ⟨N| q̄q |N⟩_dis (left) and ∆q_dis (right) for different values of t_f at κ_loop = κ = 0.166.

In Table 5 we display improved and conventional stochastic estimates of the scalar matrix elements (obtained at t = 3 and t_f = 6 ≈ 0.76 fm), for all our κ combinations. The fixed cost reductions, due to TSM and HPE, in the relative errors are small and do by far not match the gain factors that we obtained in Sec. 3 for the disconnected loops alone. The noise is dominated by taking the correlation with the two point function. For the precision of the disconnected loop to be matched by that of the two point function, the latter needs to be evaluated for multiple source points, eventually in combination with low mode averaging [41]. We note that one does not encounter any computational overhead in calculating the disconnected loop at more than one timeslice. As long as these are sufficiently separated, the stochastic errors will not increase significantly.

κ_loop = 0.166:              κ = 0.166                         κ = 0.1675
cost    ⟨N|q̄q|N⟩_dis,imp  ⟨N|q̄q|N⟩_dis,0    ⟨N|q̄q|N⟩_dis,imp  ⟨N|q̄q|N⟩_dis,0
300     1.57(43)                             1.90(54)
200     1.58(43)                             1.91(54)
100     1.62(43)          1.54(45)           1.95(53)          1.92(57)
 50     1.65(42)          1.68(46)           2.00(54)          2.00(61)
 25     1.62(42)          1.65(51)           1.99(54)          2.15(71)
 12     1.48(44)          1.85(51)           1.93(59)          2.44(71)
  6     1.36(44)          1.79(62)           1.81(57)          2.19(79)
κ_loop = 0.1675
κ_loop = 0.1684

Table 5: The disconnected contribution to the scalar matrix element in the lattice scheme, evaluated at different cost equivalents. At κ_loop = 0.1684 additionally TEA(P_20) is employed. In the asterisked rows the costs of generating the eigenvectors were neglected.

Figure 9: ⟨N| q̄q |N⟩_dis (left) and ∆q_dis (right) as functions of the quark mass used in the disconnected loop (expressed in terms of (a m_PS)²). The open squares correspond to a proton with κ = 0.166, the other symbols to κ = 0.1675.

In Table 6 we display the same information as in Table 5, for ∆q_dis in the lattice scheme. In this case, at the cost equivalent of 100 estimates (90 estimates at κ_loop = 0.1684), we cannot detect any significant dependence of the result on the sea quark mass, reducing m_PS ≈ 600 MeV down to 450 MeV, or on the loop quark mass, reducing m_PS ≈ 600 MeV (strange quark mass) to m_PS ≈ 300 MeV. We find ∆s ≈ -0.02 in all cases. Also with different sea quark and gluon actions, we would not expect ∆s to change by more than a factor of 0.7 after a translation into the MS-bar scheme.

κ_loop = 0.166:        κ = 0.166                  κ = 0.1675
cost    ∆q_dis,imp    ∆q_dis,0        ∆q_dis,imp    ∆q_dis,0
300     -0.025 (9)                    -0.023(17)
200     -0.026 (9)                    -0.028(18)
100     -0.020(11)    -0.020(30)      -0.017(21)     0.015(54)
 50     -0.021(13)     0.001(39)      -0.024(26)     0.052(73)
κ_loop = 0.1675
κ_loop = 0.1684
 50*    -0.026(18)    -0.008(37)
 25*    -0.016(20)     0.011(38)

Table 6: Disconnected contribution to ∆q in the lattice scheme, evaluated at different cost equivalents. At κ_loop = 0.1684 additionally TEA(P_20) is employed. In the asterisked rows the costs of generating the eigenvectors were neglected.
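The error propagation for a ratio of gauge averages such as Eq. (23) is commonly handled with a jackknife. The sketch below is our own toy version on synthetic data (the injected value -0.03 and all noise levels are invented), not the paper's analysis code.

```python
import numpy as np

# Toy jackknife for a ratio of gauge averages: per-configuration "numerator"
# and "denominator" are correlated, as a three-point and two-point function
# would be. All inputs are synthetic stand-ins.

rng = np.random.default_rng(6)
n_cfg = 300
denom = 1.0 + 0.1 * rng.standard_normal(n_cfg)             # ~ two-point fn
numer = -0.03 * denom + 0.02 * rng.standard_normal(n_cfg)  # ~ correlated 3pt

def jackknife_ratio(num, den):
    full = num.mean() / den.mean()
    # leave-one-out ratios
    jk = np.array([(num.sum() - num[i]) / (den.sum() - den[i])
                   for i in range(len(num))])
    err = np.sqrt((len(num) - 1) * np.mean((jk - jk.mean()) ** 2))
    return full, err

val, err = jackknife_ratio(numer, denom)
print(val, err)   # central value compatible with the injected -0.03
```

Because numerator and denominator fluctuate together, forming the ratio configuration by configuration inside the jackknife automatically accounts for their correlation.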
5. Conclusions and outlook
A growing number of Lattice QCD applications requires all-to-all propagators. These are most efficiently estimated stochastically. We presented the novel truncated solver method (TSM) that typically reduces the computer time needed to achieve a given stochastic variance by an order of magnitude, without introducing a bias. The gain factors of this method for different observables were demonstrated not to vary when changing the quark mass by a factor of four (600 MeV ≥ m_PS ≥ 300 MeV). The TSM is easy to implement and can be used for any fermionic action. The combination of the TSM with different variance reduction techniques is straightforward and reduces the stochastic variance even further. We studied in detail combinations of TSM, partitioning [18], the hopping parameter expansion [23] and low eigenmode deflation [26, 25].

In realistic Lattice QCD simulations there are intrinsic errors from the Monte Carlo time series, in addition to the errors introduced by the stochastic estimation of the inverse of the fermionic matrix on individual configurations. We studied the interplay between these gauge and stochastic noises. After reducing the stochastic variance by a combination of methods, in our calculation of disconnected contributions to the nucleon structure, the gauge errors became the dominant sources of uncertainty. This means that we can afford larger stochastic errors and for instance increase the lattice volume, without having to increase the number of random sources. For instance, on our 2 fm lattices, even at a cost of only 6 CG solves, the stochastic error of the scalar matrix element ⟨N| q̄q |N⟩_dis is completely over-shadowed by its gauge error: in certain situations one can overdo the reduction of the stochastic noise. In this particular case the total error can more efficiently be reduced by increasing the number of nucleon sources on each configuration than by increasing the number of (improved) estimates. Also the determination of ∆s will benefit from this. At present we are pursuing such an approach.

Our result on the strangeness contribution to the nucleon spin, ∆s ≈ -0.02, is smaller in magnitude than suggested by earlier lattice studies [60, 61, 62], which found ∆s ≈ -0.12. In the case of the scalar matrix element our errors are still too large to state a meaningful number, in particular since chiral and infinite volume behaviours need to be studied. The techniques developed here are used by us in an ongoing study [63] at smaller lattice spacing, smaller quark masses and larger volumes with Sheikholeslami-Wohlert sea and valence quarks.
Acknowledgements
We thank Zoltan Fodor and Kalman Szabo for providing us with the gauge configurations and Andreas Frommer for discussions. Our computations were performed on the Regensburg Linux cluster and the IBM p6 cluster (JUMP), BlueGene/L (JUBL) and BlueGene/P (JUGENE) of the Jülich Supercomputer Center. Sara Collins acknowledges support from the Claussen-Simon-Foundation (Stifterverband für die Deutsche Wissenschaft). This work was supported by the DFG Sonderforschungsbereich/Transregio 55.
Figure 10: f_1 and f_2 (see Eq. (A.1)) as functions of n_t for Γ = γ_3γ_5 at κ_loop = 0.166.

Figure 11: Results used for the calculation of the optimal n_t and N_1/N_2 values for Γ = γ_3γ_5 at κ_loop = 0.166.

Appendix A. TSM parameter tuning for the CG and BiCGstab2 solvers
The truncated solver method of Sec. 2.3 depends on two parameters, namely on the number of iterations for the inexact solves, n_t, and on the ratio N_1/N_2 of the number of inexact estimates over the number of estimates of the correction term, see Eq. (13). These parameters need to be fixed, ideally, so as to minimize the variance of the estimate of the disconnected loop, E[Tr(M^-1 Γ)], at fixed cost. The estimates are uncorrelated and for N_1, N_2 ≫ 1,

    var E[Tr(M^-1 Γ)] = var( ⟨η| γ_5 Γ |s(n_t)⟩ ) / N_1 + var( ⟨η| γ_5 Γ (|s⟩ - |s(n_t)⟩) ⟩ ) / N_2 = f_1/N_1 + f_2/N_2 ,    (A.1)

where f_1 and f_2 depend on n_t and Γ. An example of the dependence on n_t is shown in Fig. 10 for Γ = γ_3γ_5, for the CG and the BiCGstab2 solvers. f_1 is roughly independent of n_t and f_2 decreases rapidly with increasing n_t. This behaviour is expected since after a few iterations the first term of Eq. (13) contains most of the signal (and its error) while the second term (and its error) approaches zero. The results were generated on a single configuration at κ_loop = 0.166, where n_conv ≈ 480 CG iterations.

             Γ = 1                       Γ = γ_3γ_5
N_1/N_2      20   24   48   81   93     109  254  305  288  566
gain         6.1  6.2  4.4  5.2  7.2    7.3  6.9  6.7  6.8  4.8

Table 7: Gains obtained for various n_t, setting N_1/N_2 = f_1/f_2, at κ_loop = 0.166.

The approximate cost is given by,

    cost ≈ N_1 n_t + N_2 n_conv .    (A.2)

When TSM is combined with HPE and/or TEA, there are corrections to this formula which the reader can easily work out. Using Lagrange multipliers, we minimize the variance Eq. (A.1) at fixed cost, assuming f_1 to be approximately independent of n_t. This yields the optimal values,

    n_t^opt = f_1 f_2 / [ n_conv (f_2')² ] ,    (A.3)

    N_1/N_2 = sqrt[ f_1 n_conv / ( f_2 n_t^opt ) ] ,    (A.4)

where f_2' = d f_2 / d n_t.

The right hand sides of Eqs. (A.3) and (A.4) can be computed as functions of n_t on one configuration, using a single set of stochastic sources. By finding the intersection between the curve f_1 f_2 / [n_conv (f_2')²] and n_t, one can extract n_t^opt and subsequently determine N_1/N_2 for this n_t. Fig. 11 illustrates this procedure for Γ = γ_3γ_5 for the CG solver. We read off the values n_t^opt ≈ 18 and N_1/N_2 ≈ 30 from this figure, see Table 1.

For our observables we find that when using the optimized TSM parameter values, the two variances within Eq. (A.1) are of similar sizes, f_1/N_1 ≈ f_2/N_2. It also turns out that the gain factor does not critically depend on N_1/N_2. For instance, increasing this ratio from the optimal value of 30 found for Γ = γ_3γ_5 at n_t = 18 to the equal cost value N_1/N_2 = f_1/f_2 ≈ 35 will increase the final variance by just 3%, which is statistically insignificant. This suggests an alternative criterion for fixing the parameters: scanning through n_t, keeping N_1/N_2 = f_1/f_2 fixed, to determine the n_t value with the smallest combined variance. We find that following this strategy reduces no gain factor by more than 2%, relative to the gain achieved using the optimal values, at κ_loop = 0.166. In contrast, f_2 turns out to be a rather erratic function of n_t when using the BiCGstab2 solver, so that f_2' cannot be determined. The situation is even worse for Γ = 1. In the BiCGstab2 case we have to resort to the alternative method discussed above.

Table 7 demonstrates that, using this approach, one can indeed find values n_t and N_1/N_2 for the BiCGstab2 solver that give reasonable gains. The cost was kept fixed to correspond to 300 BiCGstab2 solves to convergence. In this case n_conv ≈
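The flatness of the variance around the optimal ratio can be checked with a few lines of arithmetic. The model numbers below (f_1, f_2, n_t, n_conv, cost) are invented; the qualitative conclusion only depends on the quadratic shape of the variance near the optimum of Eq. (A.4).

```python
import numpy as np

# Toy check of the flatness statement: at fixed cost N1*n_t + N2*n_conv,
# the variance f1/N1 + f2/N2 depends only weakly on the ratio N1/N2 near the
# optimum of Eq. (A.4). All model parameters are invented.

f1, f2, n_t, n_conv, cost = 1.0, 0.01, 20.0, 500.0, 1.0e5

def variance(ratio):
    """Variance at fixed cost for a given split N1/N2 = ratio."""
    N2 = cost / (ratio * n_t + n_conv)
    N1 = ratio * N2
    return f1 / N1 + f2 / N2

r_opt = np.sqrt(f1 * n_conv / (f2 * n_t))       # Eq. (A.4): here r_opt = 50
print(variance(1.2 * r_opt) / variance(r_opt))  # a 20% larger ratio costs <1%
print(variance(2.0 * r_opt) / variance(r_opt))  # even doubling costs ~11%
```

This is why, in practice, a coarse determination of N_1/N_2 (for instance from the equal cost prescription) sacrifices almost none of the gain.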
156 is by a factor three smaller than for the CG; however, each iteration is about twice as expensive. The best gain factors are 5.0 for Γ = 1 and 7.3 for Γ = γ_3γ_5, while for CG, using this method, we are able to achieve very similar factors of 4.9 and 7.9, respectively.

The CG algorithm is more robust than BiCGstab2 and gives nearly optimal gains over a wider range of n_t values. Note for instance the somewhat erratic behaviour in Table 7 of BiCGstab2 at n_t = 10. Moreover, the optimal N_1/N_2 ratios come out rather large, due to tiny f_2 values, which makes BiCGstab2 less optimal when it is combined with HPE or TEA. Therefore, in the context of the TSM, the CG solver is our preferred choice. However, it is possible to combine both solvers, using CG for the truncated solves and BiCGstab2 for running to convergence more efficiently.

References

[1] K. Jansen, C. Michael and C. Urbach [ETM Collaboration], The η′ meson from Lattice QCD, Eur. Phys. J. C 58 (2008) 261, arXiv:0804.3871 [hep-lat].
[2] J. Bulava, R. Edwards, K. J. Juge, C. J. Morningstar and M. J. Peardon [Hadron Spectrum Collaboration], Multi-hadron operators with all-to-all quark propagators, PoS (LATTICE 2008) 124, arXiv:0810.0730 [hep-lat].
[3] S. Aoki, M. Fukugita, K.-I. Ishikawa, N. Ishizuka, K. Kanaya, Y. Kuramashi, Y. Namekawa, M. Okawa, K. Sasaki, A. Ukawa and T. Yoshié [CP-PACS Collaboration], Lattice QCD calculation of the ρ meson decay width, Phys. Rev. D 76 (2007) 094506, arXiv:0708.3705 [hep-lat].
[4] J. M. Zanotti, Investigations of hadron structure on the lattice, PoS (LATTICE 2008) 007, arXiv:0812.3845 [hep-lat].
[5] M. Deka, T. Streuer, T. Doi, S. J. Dong, T. Draper, K.-F. Liu, N. Mathur and A. W. Thomas, Moments of nucleon's parton distribution for the sea and valence quarks from Lattice QCD, Phys. Rev. D 79 (2009) 094502, arXiv:0811.1779 [hep-ph].
[6] T. DeGrand and C. DeTar, Lattice methods for quantum chromodynamics (World Scientific, Singapore, 2006).
[7] A. Airapetian et al. [HERMES Collaboration], Measurement of parton distributions of strange quarks in the nucleon from charged-kaon production in deep-inelastic scattering on the deuteron, Phys. Lett. B 666 (2008) 446, arXiv:0803.2993 [hep-ex].
[8] S. L. Zhu, G. Sacco and M. J. Ramsey-Musolf, Recoil order chiral corrections to baryon octet axial currents and large N_c QCD, Phys. Rev. D 66 (2002) 034021, arXiv:hep-ex/…
… et al. [HERMES Collaboration], Precise determination of the spin structure function g_1 of the proton, deuteron and neutron, Phys. Rev. D 75 (2007) 012007, arXiv:hep-ex/…
…, N. Eicker, S. Güsken, H. Hoeber, P. Lacock, T. Lippert, G. Ritzenhöfer, K. Schilling, A. Spitz and P. Ueberholz [SESAM Collaboration], Improving stochastic estimator techniques for disconnected diagrams, Nucl. Phys. B (Proc. Suppl.) 63 (1998) 269, arXiv:hep-lat/…
[22] A. O'Cais, J. Foley, K. Jimmy Juge, M. Peardon, S. M. Ryan and J. I. Skullerud [TrinLat Collaboration], Improving algorithms to compute all elements of the lattice quark propagator, Nucl. Phys. B (Proc. Suppl.) 140 (2005) 844, arXiv:hep-lat/…
… Z_2 estimator of determinants, Phys. Rev. D 57 (1998) 1642, arXiv:hep-lat/…
…, Z. Prkacin and K. Schilling [SESAM Collaboration], String breaking with dynamical Wilson fermions, Nucl. Phys. B (Proc. Suppl.) 140 (2005) 609, arXiv:hep-lat/…
… and K. Schilling [SESAM Collaboration], Observation of string breaking in QCD, Phys. Rev. D 71 (2005) 114513, arXiv:hep-lat/…
… η′-η_c mixing with improved stochastic estimators, PoS (LATTICE 2008) 114, arXiv:0903.2947 [hep-lat].
[34] S. Güsken, Flavor singlet phenomena in Lattice QCD, arXiv:hep-lat/…
… [SESAM Collaboration], Evaluating sea quark contributions to flavour-singlet operators in lattice QCD, Phys. Lett. B 389 (1996) 720, arXiv:hep-lat/…
…, N. Eicker, T. Lippert, J. W. Negele and K. Schilling, On the low fermionic eigenmode dominance in QCD on the lattice, Phys. Rev. D 64 (2001) 114509, arXiv:hep-lat/…
…, K. Schilling and W. Schroers, Instanton dominance of topological charge fluctuations in QCD?, Phys. Rev. D 65 (2002) 014506, arXiv:hep-lat/…
… [UKQCD Collaboration], The pion's electromagnetic form factor at small momentum transfer in full Lattice QCD, JHEP 0807 (2008) 112, arXiv:0804.3971 [hep-lat].
[51] A. Duncan and E. Eichten, Improved pseudofermion approach for all-point propagators, Phys. Rev. D 65 (2002) 114502, arXiv:hep-lat/…
… http://…/~edwards/qcdapi/reports/dslash p4.pdf
[56] P. A. Boyle, http://…/~paboyle/bagel/Bagel.html (2005).
[57] L. Del Debbio, L. Giusti, M. Lüscher, R. Petronzio and N. Tantalo, Stability of Lattice QCD simulations and the thermodynamic limit, JHEP 0602 (2006) 011, arXiv:hep-lat/…
… g_A from Lattice QCD, Phys. Rev. Lett. 75 (1995) 2096, arXiv:hep-ph/…
…, N. Eicker, T. Lippert, K. Schilling, A. Spitz and T. Struckmann [TχL Collaboration], Flavor singlet axial vector coupling of the proton with dynamical Wilson fermions, Phys. Rev. D 59 (1999) 114502, arXiv:hep-lat/9901009.
[63] G. S. Bali, S. Collins and A. Schäfer [QCDSF Collaboration], Strangeness and charm content of the nucleon, PoS (LAT 2009) 149, arXiv:0911.2407 [hep-lat].