[PDF] Efficient computation of baryon interpolating fields in Lattice QCD

Abstract

In this work we present an efficient construction of baryon interpolating fields for lattice QCD computations of two and three point functions. These are essential building blocks of computations of nucleon parton distribution functions (PDFs), generalized parton distribution functions (GPDs) and transverse momentum dependent distributions functions (TMDs). Lattice QCD computations of these quantities can provide additional input to assist with the global fits on experimental data for determining TMDs, GPDs and PDFs.

Full PDF

aa r X i v : . [ h e p - l a t ] N ov Efﬁcient computation of baryon interpolating ﬁelds in Lat-tice QCD

Eloy

Romero , ∗ and Kostas

Orginos , , ∗∗ Department of Computer Science, The College of William & Mary, USA Department of Physics, The College of William & Mary, USA Je ﬀ erson Laboratory, USA Abstract.

In this work we present an e ﬃ cient construction of baryon inter-polating ﬁelds for lattice QCD computations of two and three point functions.These are essential building blocks of computations of nucleon parton distri-bution functions (PDFs), generalized parton distribution functions (GPDs) andtransverse momentum dependent distributions functions (TMDs). Lattice QCDcomputations of these quantities can provide additional input to assist with theglobal ﬁts on experimental data for determining TMDs, GPDs and PDFs. A vital component of our long term project of determining hadronic structure from lat-tice QCD is the ability to compute a large class of matrix elements with high precision, bothstatistical and systematic. To that extent, we plan to capitalize on distillation [1], for con-structing suitable interpolating ﬁelds for the nucleon that has already been very successful inspectroscopy computations.The basic idea of distillation is to restrict the operators to a small subspace (the distilla-tion basis) containing substantial contributions of the relevant eigenstates. The reduction inthe rank of the operators dramatically cuts down the cost of computing all elements of thepropagation matrix, allowing for measuring more complex hadron correlation functions.Still, the amount of computation and storage overgrows with the lattice size N and therank of the distillation basis, n . The optimal rank of the distillation basis is determinateexperimentally, but it is roughly proportional to the volume of the spatial dimensions of thelattice. We give an idea of the costs by showing the amount of ﬂoating-point operationsand the footprint memory requirements in big-O notation considering a 4D lattice that alldimension have equal size in which the volume of the spatial dimensions is proportional tothe distillation basis rank, n . In these terms, the most expensive parts of the computations arethe formation of the matrix elements, which are tensors generated from contracting matrices(basis), requiring roughly n . operations for mesons (two matrices are contracted) and n . for baryons (three matrices are contracted). The perambulators are square matrices generatedby projecting the inverse of the Dirac operator and require n . operations. At the ﬁnal step ofthe computation, matrix elements and perambulators are contracted together and that requires n and n operations for mesons and baryons respectively. Table 1 details the time that eachtask takes for baryons in the example used along within this document.In this project, we focus on accelerating the generation of baryon elementals, whose timedominates over the rest of the tasks. Despite the apparent simplicity of tensor contractions, ∗ e-mail: [email protected] ∗∗ e-mail: [email protected] able 1. Asymptotic computational cost, reference time per conﬁguration and time-slice source, andsoftware involved in the time-consuming tasks in estimating the two point correlation functions for alattice 32 ×

64 in a single femto[4]’s node.

Operations Memory ExampleComputation cost footprint time Main librariesDistillation basis n . n . n . n .

16 h HaromPerambulators 100 n . n . / mg_protoContractions n n . ﬀ orts have to be spe-cialized for the characteristics of the tensors and the computing device. Unlike the multi-plication of matrices, few libraries, such as CFT[2] and libtensor[3], provide a powerful andﬂexible way to specify the tensor contractions if one is willing to sacriﬁce some performance.The ﬁrst optimization that we propose does not reduce the number of operations, butinstead reorder the operations to minimize memory requests and operands dependency, in-creasing the utilization of the many arithmetic-logic units available on modern CPUs. Byrelying on high-performance implementations of matrix-matrix multiplication, we take ad-vantage of optimizations speciﬁc to the computing device and the dimensions of the tensorsalready available in these libraries. The second approach addresses not only the computa-tional time but also the demanding I / O requirements of the calculation. We have explored atechnique that ﬁnds a sparse approximate representation of the distillation basis in a way thatthe baryon elementals are also sparse.

BLAS acceleration of baryon elementals

To explain the caveats of the approach, we introduce a simpliﬁed description of the actualtensor contractions that correspond to the computation of the baryon elementals for a par-ticular time-slice, gauge conﬁguration, and displacements combination. The operands of thecontraction include • three 5D tensors (which can be the distillation basis or an operator acting on the basis), v , w , y , with components on the 3D space lattice L N of dimension N , the three color spacecomponents C = , ,

3, and the last index is the distillation basis column from 1 to n ; • a 4D tensor z (which is a phase, z ( x , l ) = e − i p · x ), which also has components on L N and anindex for the momenta; and ﬁnally, • the color contraction is indicated with ǫ , which is a sparse, antisymmetric tensor represent-ing the color contraction, ǫ (1 , , =

1, and ǫ ( α,α,β ) = ǫ ( α,β,γ ) = − ǫ ( β,α,γ ) = − ǫ ( α,γ,β ) .The tensors are operated as follows: B ( i , j , k , l ) = X x ∈ L N , α,β,γ ∈ C ǫ ( α,β,γ ) v ( x ,α, i ) w ( x ,β, j ) y ( x ,γ, k ) z ( x , l ) , for 1 ≤ i , j , k ≤ n , ≤ l ≤ M . (1)We studied all alternatives for implementing the tensor contraction Eq. (1) by groupingthe operands into two groups. Table 2 shows all the relevant possibilities. For instance,creating the tensor with ǫ , v , w , and y and then contracted with z requires the minimumnumber of ﬂoating-point operations. However, its performance is bounded by the memorybandwidth of the computing node. In modern CPUs with many arithmetic-logic units, thegrouping with better performance, despite doing three times more ﬂoating-point operations able 2. Asymptotic extra auxiliary memory and ﬂoating-point operations (FLOPs) in computing thetensor contraction at Eq. (1) depending on how the operands are grouped, for a lattice of volume N ,and basis v , w , y of rank n , and basis z of rank M . Grouping Aux. memory FLOPs ǫ v w y z M n N ( ǫ v ) ( w y z ) 6 M n N M n N ( ǫ v w ) ( y z ) 3 n N M n N ( ǫ v w y ) ( z ) n N M n N Table 3.

Performance comparison in time ﬂoating-point operations per seconds (GFLOPS) of theoriginal version (Grouping ( ǫ v w y ) ( z )) and the new implementation (Grouping ( ǫ v w ) ( y z )) incomputing the baryon elementals for a conﬁguration with a lattice 32 ×

64, 19 momenta, and 18displacements on a femto’s node. Reported times do not include reading / writing disk operations.Maximum peak performance for a femto’s node is 1800 GFLOPs, and bounded by memory bandwidthis 40 GFLOPS. Grouping ( ǫ v w y ) ( z ) Grouping ( ǫ v w ) ( y z ) n Time (s) GFLOPS Time (s) GFLOPS32 4,923 38 529 1,06664 33,191 45 4,048 1,115128 – 29,219 1,235256 – 162,542 1,776than the previous variant, contracts the auxiliary tensor f , formed with ǫ , v , and w , with thetensor g , formed with y and z .For contracting the tensors f and g , we propose to bypass most of the e ﬀ ort of developingand maintaining a high-performance tensor contraction by relying on optimized BLAS li-braries for computing matrix-matrix multiplications, such as OpenBLAS and MKL. The useof BLAS is a state-of-the-art practice in tensor contraction on CPUs[5] and GPUs[6].We developed the implementation inside the library harom , which is part of the softwaresuite for spectroscopy at Je ﬀ erson Laboratory. Like the rest of the harom’s code, our im-plementation supports shared memory (with OpenMP) and distributed memory (with MPI)paradigms. In harom, the lattice dimensions of the operands are distributed among the pro-cesses. So our code ﬁrst contracts the local part of the basis v , w , y , and z , and, in the end,a single global reduction adds the partial results at every process. If threading is used, thethreads independently work on di ﬀ erent partitions of the i , j indices of the baryon elemental.We have tested the performance of the new implementation mostly on Intel Skylake andPhi processors. The results show that the new implementation is ten times faster on averagethan the original code in computing the baryon elementals. On Tab. 3, we report the resultson femto [4], a cluster at The Collage of William & Mary. As the range of the distillationbasis increases, the overheads of copying back and forth from the harom representation ofthe tensors to the formats imposed by BLAS diminishes, and the performance of the newimplementation gets closer to the peak performance of the node.A new bottleneck has appeared after the drastic reduction in computing the baryon el-ementals: the time for writing the baryon elementals on the global ﬁle systems starts todominate. For a distillation basis with a n =

256 rank, the time that takes writing the baryonelementals on disk is double that the time expended in computing them. The approach that https: // github.com / Je ﬀ ersonLab / harom e introduce in the following aims at reducing not only the computational time but also thestorage of the baryon elementals. Blocked distillation basis

The distillation basis consists of the eigenvectors from the lower part of the spectrum of aLaplacian-like operator ∇ , ∇ x , y ) = δ x , y − X j ∈{ (1 , , , (0 , , , (0 , , } U ( x ) j δ x + j , y + U ( x − j ) j † δ x − j , y , where U is the gauge ﬁeld restricted to a particular time-slice that may have been smeared,which is not relevant in this context. Like the eigenvectors of the Laplacian, those eigen-vectors have common components on the local scale [7]. The use of a local support basis toapproximate the lower spectrum is a critical ingredient in the construction of the prolongatorand restrictor operators in multigrid [8, 9]. Also, it has been used to compress the eigenvec-tors [10].The baryon elementals generated from this local-supported, sparse bases can be computedfaster, and the resulting tensors are sparse also. These sparse tensors are faster to write ondisk and to contract together with other tensors.The new approach for generating the distillation basis of rank n in a time-slice is asfollowing. First, we divide the lattice in d equally sized domains and restrict f ( ∇ ) to eachof the domains, where f is function on ℜ whose output is non-negative real numbers. Thenew basis is composed by taken n / d eigenvectors from each subdomain with the largesteigenvalues. The function f controls the weight of the eigenvectors. For instance, if f isconstant, all eigenvectors matter equally.We avoid the evaluation of f ( ∇ ) by working with a truncated approximate eigendecom-position of f ( ∇ ), f ( ∇ ) V = V Λ . Then the eigendecomposition of V f ( ∇ ) V † restricted toeach subdomain s , V s † f ( Λ ) V s W s = W s ˜ Λ s is related to the singular value decomposition of V s f ( Λ ): (cid:16) V f ( Λ ) V † (cid:17) s w si = σ si w si ⇔ V s f ( Λ ) = W s Σ s Q s † . (2)Picking more directions on each subdomain than n / d increases the overlap between theresulting basis w i and the original basis v i , although only n directions from the over-ranked w i basis are going to be used as the distillation basis, W ρ , with ρ † ρ = I n , which are theorthogonalization of the projections of v i onto the subspace spanned by w i . In other words, ρ is the Q -factor of the QR factorization of W † V .The function f plays the role of weighing the importance of every direction. To reinforcethe presence of the smallest eigenvalues of ∇ into the new basis, f should be decreasing if welook for the largest singular values of V s f ( Λ ). We have tested a simple family of functionswith a single tunable parameter, α , f ( λ ) = exp( αλ ). Figure 1 shows how, as α increases, theinterior eigenvectors get more presence in detriment of the smallest eigenvectors.When neglecting the perambulators, the hadron correlation functions are reduced to thecorrelation between baryon elementals at di ﬀ erent time-slices. If B ( v ) and B ′ ( v ) are twobaryon elementals at di ﬀ erent time slices, then we want the correlations between B ( v ) and B ′ ( v ), and between B ( w ) and B ′ ( w ) to be highly correlated. In this way, we propose the fol-lowing simple heuristic to tune α based on the transitive property of correlations for highlycorrelated data: to maximize the average correlation between the baryon elementals gener-ated with the original basis and with the new basis for a few time-slices. In principle, thecorrelation between the baryon elementals is not invariant under rotations of the basis, unlike . . i -th column of the original basis, v i c o s ∠ ( W α , v i ) α = − . α = − . α = − . α = − . α = − . Figure 1.

Cosine of the angles between each of the columns of the original basis, v i , and the generatedbasis W α with weight function f ( λ ) = exp( αλ ) for di ﬀ erent values of α . The spatial dimension of thelattice is 32 , blocking 2 in every direction, starting with n ′ =

96 and picking 128 directions ( b = − −

20 00 . . α used to generate w with rank 64 C o rr e l a ti onb e t w ee n B ( v ) a nd B ( w ) ~ p = [0 0 0] ~ p = [1 0 0] ~ p = [2 0 0] ~ p = [3 0 0] 1 8 16 32 640 . .

91 Rank of the distillation bases v and w C o rr e l a ti onb e t w ee n B ( v ) a nd B ( w ) ~ p = [0 0 0] ~ p = [1 0 0] ~ p = [2 0 0] ~ p = [3 0 0] Figure 2.

Average correlation between the baryon elementals generated with the original distillationbasis v and the blocked basis w with weight function f ( λ ) = exp( αλ ), ﬁxing either both bases rank to64 (left) or the parameter α to the optimal value of − . B ( v ) ( i , j , k ) = P x ,α,β,γ ǫ ( α,β,γ ) v ( x ,α, i ) v ( x ,β, j ) v ( x ,γ, k ) e − i ( ~ p , x ) . the actual hadron correlation functions. We address this issue in part by selecting ρ insteadso that W ρ is a close representation of v i , ρ = argmin ρ † ρ = I n k V − W ρ k F = ˆ U ˆ V † , where ˆ U ˆ Σ ˆ V † is the singular value decomposition of V † W .We found that the average correlation function between baryons of both bases is concaveand smooth on α , as Fig. 2 (left) shows. Also, the optimal value of α seems to depend onneither the momenta nor the displacements of the baryon elemental, and the spectra of ∇ fordi ﬀ erent time-slices and conﬁgurations are similar enough to use the same value of α for alltime-slices. However, we recommend retuning α when changing the rank of the distillationbasis as the correlation quickly drops as the rank increases, as Fig. 2 (right) shows.

10 20 3010 − − Distance in lattice units (r)Average Ψ ( v , r ) for | r | = r Average Ψ ( w , r ) for | r | = r −

10 0 10 − x coordinate y c oo r d i n a t e Average Ψ ( v , r ) × − for r = [ x , y, −

10 0 10 − x coordinate y c oo r d i n a t e Average Ψ ( w , r ) × − for r = [ x , y, −

10 0 10 − . . . . . . . . . . . x coordinate y c oo r d i n a t e Average Ψ ( v , r ) × − for r = [ x , y, −

10 0 10 − . . . . . . . . . . . . . x coordinate y c oo r d i n a t e Average Ψ ( w , r ) × − for r = [ x , y, Figure 3.

Comparison of the spatial distribution of the distillation basis Ψ of the original basis v andthe blocked basis w in function of the distance (top), and at particular directions (bottom). Moreover, we tested the spatial distribution of the blocked basis. For that, we study Ψ ( r )as done in [1], Ψ ( v , r ) = X x ∈ L N s X α ∈ C , ≤ i ≤ n v ( x ,α, i ) v ( x + r ,α, i ) † , (3)which measures the degree of smearing on the ﬁeld after restricting it to the distillation basis.At short distances, both bases seem to keep the rotational symmetry (see Fig 3 top). At longdistances, the blocked basis breaks rotational symmetry more than the original basis, as Fig. 3bottom shows.Finally, we present the e ﬀ ective masses using the original basis and the blocked basisin Fig. 4 up to momentum 3. Energies computed with both bases are in agreement withinstatistical error. Thus we do not detect systematic errors introduced by the new basis due to . . λ original χ / N dof = . m t = = . ± . . . λ blocked χ / N dof = . m t = = . ± . . . λ original χ / N dof = . m t = = . ± . . . λ blocked χ / N dof = . m t = = . ± . . . λ original χ / N dof = . m t = = . ± . . . λ blocked χ / N dof = . m t = = . ± . . . λ original χ / N dof = . m t = = . ± . t / a . . λ blocked χ / N dof = . m t = = . ± . t / a Figure 4. E ﬀ ective masses for time-slice shift of 5 using the original distillation basis (left) and theblocked basis (right) up to momenta 3. the more pronounced breaking of rotational symmetry. However, the new basis introducesmore phisical uncertenties in estimating the mass at higher momenta.To verify the performance beneﬁts, we also developed a prototype code that generates thebaryon elementals exploiting the sparsity patterns on the operands of the tensor contraction.The challenge in this development is controlling the overhead of manipulating the spare rep-resentation of tensors. We propose to store the nonzero blocks of the distillation basis and thetensors in a similar way as the Block Sparse Row format does for matrices (see a descriptionat [11]).It may be still useful to accelerate the contraction of tensors by the BLAS’ matrix-matrixmultiplication applied to the contraction of the nonzero tensor blocks. Although, some e ﬀ ortis required to reduce the overhead of calling the BLAS function with small matrices. This isespecially dramatic for the baryon elementals with displacement, which have a more intricatenonzero pattern of small-size blocks. We got some improvement by static linking with BLASand choosing the layout of the tensor’s matriﬁcation carefully to improve data locality. Butthere is probably still room for improvement. We got a speedup of 2.8 in time, which is farfrom the expected speedup of 51 if the ﬂoat-point operations limit the performance. However, able 4. Expected and measured speedups in the time for computing and writing on the globalﬁlesystem the baryon elementals.

Time speedup Storage reductionBaryon elemental, B ( v ) ( i , j , k ) = P x ,α,β,γ ǫ ( α,β,γ ) e − i ( ~ p , x ) · · · Expected Measured Expected Measured · · · v ( x ,α, i ) v ( x ,β, j ) v ( x ,γ, k )

64 5.2 8 8 · · · v ( x ,α, i ) v ( x ,β, j ) ( D ~ x v ) ( x ,γ, k ) for ~ x ∈ { ~ , ~ , ~ }

50 3.0 4 4 · · · v ( x ,α, i ) v ( x ,β, j ) ( D ~ x D ~ x v ) ( x ,γ, k ) for ~ x ∈ { ~ , ~ , ~ }

50 2.7 4 4 · · · v ( x ,α, i ) v ( x ,β, j ) ( D ~ y D ~ x v ) ( x ,γ, k ) for ~ x , ~ y ∈ { ~ , ~ , ~ }

50 2.6 2 2 · · · v ( x ,α, i ) ( D ~ x v ) ( x ,β, j ) ( D ~ x v ) ( x ,γ, k ) for ~ x ∈ { ~ , ~ , ~ }

50 2.8 2 2 · · · v ( x ,α, i ) ( D ~ y v ) ( x ,β, j ) ( D ~ x v ) ( x ,γ, k ) for ~ x < ~ y ∈ { ~ , ~ , ~ }

50 2.8 2 2Average: 51 2.8 3 3 the reduction in the time spent in writing the elementals on the global ﬁlesystem is 2.9, asexpected. The detailed results are shown on Tab. 4.

Conclusions

In this work, we propose two techniques to reduce the leading computational costs in esti-mating baryon correlation functions as the rank of the distillation basis increases, costs thatare dominated by the generation and storage of the baryon elementals.First, we study the performance of generating baryon elementals, which consists of thetensor contraction of the distillation basis, and we propose an implementation that maximizesthe CPU utilization. Most of the ﬁne-tuning is avoided by relying on a high-performanceBLAS library. Second, we propose to approximate the distillation basis in a spatial localsupport basis (blocked basis) by exploiting the local coherence of the lattice Laplacian’seigenvectors. The new basis generates very sparse baryon elementals that are more e ﬃ cientto compute and store, but also contaminates the basis with higher modes that may increasethe statistical uncertainty in estimating the correlation functions. Further research will eval-uate the e ﬀ ects of the new basis at higher statistical accuracy and on other hadronic states,including excited multihadron states.Although the asymptotic costs remain the same, the resulting reduction in costs pushes abit further the practical limits in the lattice volume and the distillation basis’ rank employedin these calculations. Acknowledgements

This work was primarily supported by the center for nuclear femtography. The gauge ﬁeldconﬁgurations used in this work are from Je ﬀ erson Laboratory. The Chroma software suite[12] together with Qphix / mg_proto [13] and PRIMME [14] have been also used in this work.This work was performed in part using computing facilities at William & Mary which wereprovided by contributions from the National Science Foundation (MRI grant PHY-1626177),the Commonwealth of Virginia Equipment Trust Fund and the O ﬃ ce of Naval Research. Inparticular, the majority of this work was performed on the Femto cluster at William & Mary.

References [1] M. Peardon, J. Bulava, J. Foley, C. Morningstar, J. Dudek, R.G. Edwards, B. Joó,H.W. Lin, D.G. Richards, K.J. Juge (Hadron Spectrum Collaboration), Phys. Rev. D , 054506 (2009)2] E. Solomonik, D. Matthews, J.R. Hammond, J.F. Stanton, J. Demmel, Journal of Paral-lel and Distributed Computing , 3176 (2014)[3] E. Epifanovsky, M. Wormit, T. Ku´s, A. Landau, D. Zuev, K. Khistyaev, P. Manohar,I. Kaliman, A. Dreuw, A.I. Krylov, Journal of Computational Chemistry , 2293(2013), https://onlinelibrary.wiley.com/doi/pdf/10.1002/jcc.23377 [4] William & Mary cluster femto , [5] E.D. Napoli, D. Fabregat-Traver, G. Quintana-Ortí, P. Bientinesi, Applied Mathematicsand Computation , 454 (2014)[6] A. Abdelfattah, M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar,I. Karlin, T. Kolev, I. Masliah et al., Procedia Computer Science , 108 (2016), inter-national Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, SanDiego, California, USA[7] M. Lüscher, Journal of High Energy Physics , 081 (2007)[8] J. Brannick, R.C. Brower, M. Clark, J.C. Osborn, C. Rebbi, Physical review letters ,041601 (2008)[9] R. Babich, J. Brannick, R.C. Brower, M.A. Clark, T.A. Manteu ﬀ el, S.F. McCormick,J.C. Osborn, C. Rebbi, Phys. Rev. Lett. , 201602 (2010)[10] K. Clark, C. Jung, C. Lehner, PoS The 35th International Symposium on LatticeField Theory, LAT2017 (2017)[11] Y. Saad,

SPARSKIT: a basic tool kit for sparse matrix computations (1994)[12] R.G. Edwards, B. Joó, Nuclear Physics B - Proceedings Supplements , 832–834(2005)[13] B. Joó, D.D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V.W. Lee,P. Dubey, W. Watson,

Lattice QCD on Intel ® Xeon PhiTM Coprocessors , in

Supercom-puting , edited by J.M. Kunkel, T. Ludwig, H.W. Meuer (Springer Berlin Heidelberg,Berlin, Heidelberg, 2013), pp. 40–54, ISBN 978-3-642-38750-0[14] A. Stathopoulos, J.R. McCombs, ACM Transactions on Mathematical Software37