Efficient computation of baryon interpolating fields in Lattice QCD
aa r X i v : . [ h e p - l a t ] N ov Efficient computation of baryon interpolating fields in Lat-tice QCD
Eloy
Romero , ∗ and Kostas
Orginos , , ∗∗ Department of Computer Science, The College of William & Mary, USA Department of Physics, The College of William & Mary, USA Je ff erson Laboratory, USA Abstract.
In this work we present an e ffi cient construction of baryon inter-polating fields for lattice QCD computations of two and three point functions.These are essential building blocks of computations of nucleon parton distri-bution functions (PDFs), generalized parton distribution functions (GPDs) andtransverse momentum dependent distributions functions (TMDs). Lattice QCDcomputations of these quantities can provide additional input to assist with theglobal fits on experimental data for determining TMDs, GPDs and PDFs. A vital component of our long term project of determining hadronic structure from lat-tice QCD is the ability to compute a large class of matrix elements with high precision, bothstatistical and systematic. To that extent, we plan to capitalize on distillation [1], for con-structing suitable interpolating fields for the nucleon that has already been very successful inspectroscopy computations.The basic idea of distillation is to restrict the operators to a small subspace (the distilla-tion basis) containing substantial contributions of the relevant eigenstates. The reduction inthe rank of the operators dramatically cuts down the cost of computing all elements of thepropagation matrix, allowing for measuring more complex hadron correlation functions.Still, the amount of computation and storage overgrows with the lattice size N and therank of the distillation basis, n . The optimal rank of the distillation basis is determinateexperimentally, but it is roughly proportional to the volume of the spatial dimensions of thelattice. We give an idea of the costs by showing the amount of floating-point operationsand the footprint memory requirements in big-O notation considering a 4D lattice that alldimension have equal size in which the volume of the spatial dimensions is proportional tothe distillation basis rank, n . In these terms, the most expensive parts of the computations arethe formation of the matrix elements, which are tensors generated from contracting matrices(basis), requiring roughly n . operations for mesons (two matrices are contracted) and n . for baryons (three matrices are contracted). The perambulators are square matrices generatedby projecting the inverse of the Dirac operator and require n . operations. At the final step ofthe computation, matrix elements and perambulators are contracted together and that requires n and n operations for mesons and baryons respectively. Table 1 details the time that eachtask takes for baryons in the example used along within this document.In this project, we focus on accelerating the generation of baryon elementals, whose timedominates over the rest of the tasks. Despite the apparent simplicity of tensor contractions, ∗ e-mail: [email protected] ∗∗ e-mail: [email protected] able 1. Asymptotic computational cost, reference time per configuration and time-slice source, andsoftware involved in the time-consuming tasks in estimating the two point correlation functions for alattice 32 ×
64 in a single femto[4]’s node.
Operations Memory ExampleComputation cost footprint time Main librariesDistillation basis n . n . n . n .
16 h HaromPerambulators 100 n . n . / mg_protoContractions n n . ff orts have to be spe-cialized for the characteristics of the tensors and the computing device. Unlike the multi-plication of matrices, few libraries, such as CFT[2] and libtensor[3], provide a powerful andflexible way to specify the tensor contractions if one is willing to sacrifice some performance.The first optimization that we propose does not reduce the number of operations, butinstead reorder the operations to minimize memory requests and operands dependency, in-creasing the utilization of the many arithmetic-logic units available on modern CPUs. Byrelying on high-performance implementations of matrix-matrix multiplication, we take ad-vantage of optimizations specific to the computing device and the dimensions of the tensorsalready available in these libraries. The second approach addresses not only the computa-tional time but also the demanding I / O requirements of the calculation. We have explored atechnique that finds a sparse approximate representation of the distillation basis in a way thatthe baryon elementals are also sparse.
BLAS acceleration of baryon elementals
To explain the caveats of the approach, we introduce a simplified description of the actualtensor contractions that correspond to the computation of the baryon elementals for a par-ticular time-slice, gauge configuration, and displacements combination. The operands of thecontraction include • three 5D tensors (which can be the distillation basis or an operator acting on the basis), v , w , y , with components on the 3D space lattice L N of dimension N , the three color spacecomponents C = , ,
3, and the last index is the distillation basis column from 1 to n ; • a 4D tensor z (which is a phase, z ( x , l ) = e − i p · x ), which also has components on L N and anindex for the momenta; and finally, • the color contraction is indicated with ǫ , which is a sparse, antisymmetric tensor represent-ing the color contraction, ǫ (1 , , =
1, and ǫ ( α,α,β ) = ǫ ( α,β,γ ) = − ǫ ( β,α,γ ) = − ǫ ( α,γ,β ) .The tensors are operated as follows: B ( i , j , k , l ) = X x ∈ L N , α,β,γ ∈ C ǫ ( α,β,γ ) v ( x ,α, i ) w ( x ,β, j ) y ( x ,γ, k ) z ( x , l ) , for 1 ≤ i , j , k ≤ n , ≤ l ≤ M . (1)We studied all alternatives for implementing the tensor contraction Eq. (1) by groupingthe operands into two groups. Table 2 shows all the relevant possibilities. For instance,creating the tensor with ǫ , v , w , and y and then contracted with z requires the minimumnumber of floating-point operations. However, its performance is bounded by the memorybandwidth of the computing node. In modern CPUs with many arithmetic-logic units, thegrouping with better performance, despite doing three times more floating-point operations able 2. Asymptotic extra auxiliary memory and floating-point operations (FLOPs) in computing thetensor contraction at Eq. (1) depending on how the operands are grouped, for a lattice of volume N ,and basis v , w , y of rank n , and basis z of rank M . Grouping Aux. memory FLOPs ǫ v w y z M n N ( ǫ v ) ( w y z ) 6 M n N M n N ( ǫ v w ) ( y z ) 3 n N M n N ( ǫ v w y ) ( z ) n N M n N Table 3.
Performance comparison in time floating-point operations per seconds (GFLOPS) of theoriginal version (Grouping ( ǫ v w y ) ( z )) and the new implementation (Grouping ( ǫ v w ) ( y z )) incomputing the baryon elementals for a configuration with a lattice 32 ×
64, 19 momenta, and 18displacements on a femto’s node. Reported times do not include reading / writing disk operations.Maximum peak performance for a femto’s node is 1800 GFLOPs, and bounded by memory bandwidthis 40 GFLOPS. Grouping ( ǫ v w y ) ( z ) Grouping ( ǫ v w ) ( y z ) n Time (s) GFLOPS Time (s) GFLOPS32 4,923 38 529 1,06664 33,191 45 4,048 1,115128 – 29,219 1,235256 – 162,542 1,776than the previous variant, contracts the auxiliary tensor f , formed with ǫ , v , and w , with thetensor g , formed with y and z .For contracting the tensors f and g , we propose to bypass most of the e ff ort of developingand maintaining a high-performance tensor contraction by relying on optimized BLAS li-braries for computing matrix-matrix multiplications, such as OpenBLAS and MKL. The useof BLAS is a state-of-the-art practice in tensor contraction on CPUs[5] and GPUs[6].We developed the implementation inside the library harom , which is part of the softwaresuite for spectroscopy at Je ff erson Laboratory. Like the rest of the harom’s code, our im-plementation supports shared memory (with OpenMP) and distributed memory (with MPI)paradigms. In harom, the lattice dimensions of the operands are distributed among the pro-cesses. So our code first contracts the local part of the basis v , w , y , and z , and, in the end,a single global reduction adds the partial results at every process. If threading is used, thethreads independently work on di ff erent partitions of the i , j indices of the baryon elemental.We have tested the performance of the new implementation mostly on Intel Skylake andPhi processors. The results show that the new implementation is ten times faster on averagethan the original code in computing the baryon elementals. On Tab. 3, we report the resultson femto [4], a cluster at The Collage of William & Mary. As the range of the distillationbasis increases, the overheads of copying back and forth from the harom representation ofthe tensors to the formats imposed by BLAS diminishes, and the performance of the newimplementation gets closer to the peak performance of the node.A new bottleneck has appeared after the drastic reduction in computing the baryon el-ementals: the time for writing the baryon elementals on the global file systems starts todominate. For a distillation basis with a n =
256 rank, the time that takes writing the baryonelementals on disk is double that the time expended in computing them. The approach that https: // github.com / Je ff ersonLab / harom e introduce in the following aims at reducing not only the computational time but also thestorage of the baryon elementals. Blocked distillation basis
The distillation basis consists of the eigenvectors from the lower part of the spectrum of aLaplacian-like operator ∇ , ∇ x , y ) = δ x , y − X j ∈{ (1 , , , (0 , , , (0 , , } U ( x ) j δ x + j , y + U ( x − j ) j † δ x − j , y , where U is the gauge field restricted to a particular time-slice that may have been smeared,which is not relevant in this context. Like the eigenvectors of the Laplacian, those eigen-vectors have common components on the local scale [7]. The use of a local support basis toapproximate the lower spectrum is a critical ingredient in the construction of the prolongatorand restrictor operators in multigrid [8, 9]. Also, it has been used to compress the eigenvec-tors [10].The baryon elementals generated from this local-supported, sparse bases can be computedfaster, and the resulting tensors are sparse also. These sparse tensors are faster to write ondisk and to contract together with other tensors.The new approach for generating the distillation basis of rank n in a time-slice is asfollowing. First, we divide the lattice in d equally sized domains and restrict f ( ∇ ) to eachof the domains, where f is function on ℜ whose output is non-negative real numbers. Thenew basis is composed by taken n / d eigenvectors from each subdomain with the largesteigenvalues. The function f controls the weight of the eigenvectors. For instance, if f isconstant, all eigenvectors matter equally.We avoid the evaluation of f ( ∇ ) by working with a truncated approximate eigendecom-position of f ( ∇ ), f ( ∇ ) V = V Λ . Then the eigendecomposition of V f ( ∇ ) V † restricted toeach subdomain s , V s † f ( Λ ) V s W s = W s ˜ Λ s is related to the singular value decomposition of V s f ( Λ ): (cid:16) V f ( Λ ) V † (cid:17) s w si = σ si w si ⇔ V s f ( Λ ) = W s Σ s Q s † . (2)Picking more directions on each subdomain than n / d increases the overlap between theresulting basis w i and the original basis v i , although only n directions from the over-ranked w i basis are going to be used as the distillation basis, W ρ , with ρ † ρ = I n , which are theorthogonalization of the projections of v i onto the subspace spanned by w i . In other words, ρ is the Q -factor of the QR factorization of W † V .The function f plays the role of weighing the importance of every direction. To reinforcethe presence of the smallest eigenvalues of ∇ into the new basis, f should be decreasing if welook for the largest singular values of V s f ( Λ ). We have tested a simple family of functionswith a single tunable parameter, α , f ( λ ) = exp( αλ ). Figure 1 shows how, as α increases, theinterior eigenvectors get more presence in detriment of the smallest eigenvectors.When neglecting the perambulators, the hadron correlation functions are reduced to thecorrelation between baryon elementals at di ff erent time-slices. If B ( v ) and B ′ ( v ) are twobaryon elementals at di ff erent time slices, then we want the correlations between B ( v ) and B ′ ( v ), and between B ( w ) and B ′ ( w ) to be highly correlated. In this way, we propose the fol-lowing simple heuristic to tune α based on the transitive property of correlations for highlycorrelated data: to maximize the average correlation between the baryon elementals gener-ated with the original basis and with the new basis for a few time-slices. In principle, thecorrelation between the baryon elementals is not invariant under rotations of the basis, unlike . . i -th column of the original basis, v i c o s ∠ ( W α , v i ) α = − . α = − . α = − . α = − . α = − . Figure 1.
Cosine of the angles between each of the columns of the original basis, v i , and the generatedbasis W α with weight function f ( λ ) = exp( αλ ) for di ff erent values of α . The spatial dimension of thelattice is 32 , blocking 2 in every direction, starting with n ′ =
96 and picking 128 directions ( b = − −
20 00 . . α used to generate w with rank 64 C o rr e l a ti onb e t w ee n B ( v ) a nd B ( w ) ~ p = [0 0 0] ~ p = [1 0 0] ~ p = [2 0 0] ~ p = [3 0 0] 1 8 16 32 640 . .
91 Rank of the distillation bases v and w C o rr e l a ti onb e t w ee n B ( v ) a nd B ( w ) ~ p = [0 0 0] ~ p = [1 0 0] ~ p = [2 0 0] ~ p = [3 0 0] Figure 2.
Average correlation between the baryon elementals generated with the original distillationbasis v and the blocked basis w with weight function f ( λ ) = exp( αλ ), fixing either both bases rank to64 (left) or the parameter α to the optimal value of − . B ( v ) ( i , j , k ) = P x ,α,β,γ ǫ ( α,β,γ ) v ( x ,α, i ) v ( x ,β, j ) v ( x ,γ, k ) e − i ( ~ p , x ) . the actual hadron correlation functions. We address this issue in part by selecting ρ insteadso that W ρ is a close representation of v i , ρ = argmin ρ † ρ = I n k V − W ρ k F = ˆ U ˆ V † , where ˆ U ˆ Σ ˆ V † is the singular value decomposition of V † W .We found that the average correlation function between baryons of both bases is concaveand smooth on α , as Fig. 2 (left) shows. Also, the optimal value of α seems to depend onneither the momenta nor the displacements of the baryon elemental, and the spectra of ∇ fordi ff erent time-slices and configurations are similar enough to use the same value of α for alltime-slices. However, we recommend retuning α when changing the rank of the distillationbasis as the correlation quickly drops as the rank increases, as Fig. 2 (right) shows.
10 20 3010 − − Distance in lattice units (r)Average Ψ ( v , r ) for | r | = r Average Ψ ( w , r ) for | r | = r −
10 0 10 − x coordinate y c oo r d i n a t e Average Ψ ( v , r ) × − for r = [ x , y, −
10 0 10 − x coordinate y c oo r d i n a t e Average Ψ ( w , r ) × − for r = [ x , y, −
10 0 10 − . . . . . . . . . . . x coordinate y c oo r d i n a t e Average Ψ ( v , r ) × − for r = [ x , y, −
10 0 10 − . . . . . . . . . . . . . x coordinate y c oo r d i n a t e Average Ψ ( w , r ) × − for r = [ x , y, Figure 3.
Comparison of the spatial distribution of the distillation basis Ψ of the original basis v andthe blocked basis w in function of the distance (top), and at particular directions (bottom). Moreover, we tested the spatial distribution of the blocked basis. For that, we study Ψ ( r )as done in [1], Ψ ( v , r ) = X x ∈ L N s X α ∈ C , ≤ i ≤ n v ( x ,α, i ) v ( x + r ,α, i ) † , (3)which measures the degree of smearing on the field after restricting it to the distillation basis.At short distances, both bases seem to keep the rotational symmetry (see Fig 3 top). At longdistances, the blocked basis breaks rotational symmetry more than the original basis, as Fig. 3bottom shows.Finally, we present the e ff ective masses using the original basis and the blocked basisin Fig. 4 up to momentum 3. Energies computed with both bases are in agreement withinstatistical error. Thus we do not detect systematic errors introduced by the new basis due to . . λ original χ / N dof = . m t = = . ± . . . λ blocked χ / N dof = . m t = = . ± . . . λ original χ / N dof = . m t = = . ± . . . λ blocked χ / N dof = . m t = = . ± . . . λ original χ / N dof = . m t = = . ± . . . λ blocked χ / N dof = . m t = = . ± . . . λ original χ / N dof = . m t = = . ± . t / a . . λ blocked χ / N dof = . m t = = . ± . t / a Figure 4. E ff ective masses for time-slice shift of 5 using the original distillation basis (left) and theblocked basis (right) up to momenta 3. the more pronounced breaking of rotational symmetry. However, the new basis introducesmore phisical uncertenties in estimating the mass at higher momenta.To verify the performance benefits, we also developed a prototype code that generates thebaryon elementals exploiting the sparsity patterns on the operands of the tensor contraction.The challenge in this development is controlling the overhead of manipulating the spare rep-resentation of tensors. We propose to store the nonzero blocks of the distillation basis and thetensors in a similar way as the Block Sparse Row format does for matrices (see a descriptionat [11]).It may be still useful to accelerate the contraction of tensors by the BLAS’ matrix-matrixmultiplication applied to the contraction of the nonzero tensor blocks. Although, some e ff ortis required to reduce the overhead of calling the BLAS function with small matrices. This isespecially dramatic for the baryon elementals with displacement, which have a more intricatenonzero pattern of small-size blocks. We got some improvement by static linking with BLASand choosing the layout of the tensor’s matrification carefully to improve data locality. Butthere is probably still room for improvement. We got a speedup of 2.8 in time, which is farfrom the expected speedup of 51 if the float-point operations limit the performance. However, able 4. Expected and measured speedups in the time for computing and writing on the globalfilesystem the baryon elementals.
Time speedup Storage reductionBaryon elemental, B ( v ) ( i , j , k ) = P x ,α,β,γ ǫ ( α,β,γ ) e − i ( ~ p , x ) · · · Expected Measured Expected Measured · · · v ( x ,α, i ) v ( x ,β, j ) v ( x ,γ, k )
64 5.2 8 8 · · · v ( x ,α, i ) v ( x ,β, j ) ( D ~ x v ) ( x ,γ, k ) for ~ x ∈ { ~ , ~ , ~ }
50 3.0 4 4 · · · v ( x ,α, i ) v ( x ,β, j ) ( D ~ x D ~ x v ) ( x ,γ, k ) for ~ x ∈ { ~ , ~ , ~ }
50 2.7 4 4 · · · v ( x ,α, i ) v ( x ,β, j ) ( D ~ y D ~ x v ) ( x ,γ, k ) for ~ x , ~ y ∈ { ~ , ~ , ~ }
50 2.6 2 2 · · · v ( x ,α, i ) ( D ~ x v ) ( x ,β, j ) ( D ~ x v ) ( x ,γ, k ) for ~ x ∈ { ~ , ~ , ~ }
50 2.8 2 2 · · · v ( x ,α, i ) ( D ~ y v ) ( x ,β, j ) ( D ~ x v ) ( x ,γ, k ) for ~ x < ~ y ∈ { ~ , ~ , ~ }
50 2.8 2 2Average: 51 2.8 3 3 the reduction in the time spent in writing the elementals on the global filesystem is 2.9, asexpected. The detailed results are shown on Tab. 4.
Conclusions
In this work, we propose two techniques to reduce the leading computational costs in esti-mating baryon correlation functions as the rank of the distillation basis increases, costs thatare dominated by the generation and storage of the baryon elementals.First, we study the performance of generating baryon elementals, which consists of thetensor contraction of the distillation basis, and we propose an implementation that maximizesthe CPU utilization. Most of the fine-tuning is avoided by relying on a high-performanceBLAS library. Second, we propose to approximate the distillation basis in a spatial localsupport basis (blocked basis) by exploiting the local coherence of the lattice Laplacian’seigenvectors. The new basis generates very sparse baryon elementals that are more e ffi cientto compute and store, but also contaminates the basis with higher modes that may increasethe statistical uncertainty in estimating the correlation functions. Further research will eval-uate the e ff ects of the new basis at higher statistical accuracy and on other hadronic states,including excited multihadron states.Although the asymptotic costs remain the same, the resulting reduction in costs pushes abit further the practical limits in the lattice volume and the distillation basis’ rank employedin these calculations. Acknowledgements
This work was primarily supported by the center for nuclear femtography. The gauge fieldconfigurations used in this work are from Je ff erson Laboratory. The Chroma software suite[12] together with Qphix / mg_proto [13] and PRIMME [14] have been also used in this work.This work was performed in part using computing facilities at William & Mary which wereprovided by contributions from the National Science Foundation (MRI grant PHY-1626177),the Commonwealth of Virginia Equipment Trust Fund and the O ffi ce of Naval Research. Inparticular, the majority of this work was performed on the Femto cluster at William & Mary.
References [1] M. Peardon, J. Bulava, J. Foley, C. Morningstar, J. Dudek, R.G. Edwards, B. Joó,H.W. Lin, D.G. Richards, K.J. Juge (Hadron Spectrum Collaboration), Phys. Rev. D , 054506 (2009)2] E. Solomonik, D. Matthews, J.R. Hammond, J.F. Stanton, J. Demmel, Journal of Paral-lel and Distributed Computing , 3176 (2014)[3] E. Epifanovsky, M. Wormit, T. Ku´s, A. Landau, D. Zuev, K. Khistyaev, P. Manohar,I. Kaliman, A. Dreuw, A.I. Krylov, Journal of Computational Chemistry , 2293(2013), https://onlinelibrary.wiley.com/doi/pdf/10.1002/jcc.23377 [4] William & Mary cluster femto , [5] E.D. Napoli, D. Fabregat-Traver, G. Quintana-Ortí, P. Bientinesi, Applied Mathematicsand Computation , 454 (2014)[6] A. Abdelfattah, M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar,I. Karlin, T. Kolev, I. Masliah et al., Procedia Computer Science , 108 (2016), inter-national Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, SanDiego, California, USA[7] M. Lüscher, Journal of High Energy Physics , 081 (2007)[8] J. Brannick, R.C. Brower, M. Clark, J.C. Osborn, C. Rebbi, Physical review letters ,041601 (2008)[9] R. Babich, J. Brannick, R.C. Brower, M.A. Clark, T.A. Manteu ff el, S.F. McCormick,J.C. Osborn, C. Rebbi, Phys. Rev. Lett. , 201602 (2010)[10] K. Clark, C. Jung, C. Lehner, PoS The 35th International Symposium on LatticeField Theory, LAT2017 (2017)[11] Y. Saad,
SPARSKIT: a basic tool kit for sparse matrix computations (1994)[12] R.G. Edwards, B. Joó, Nuclear Physics B - Proceedings Supplements , 832–834(2005)[13] B. Joó, D.D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V.W. Lee,P. Dubey, W. Watson,
Lattice QCD on Intel ® Xeon PhiTM Coprocessors , in
Supercom-puting , edited by J.M. Kunkel, T. Ludwig, H.W. Meuer (Springer Berlin Heidelberg,Berlin, Heidelberg, 2013), pp. 40–54, ISBN 978-3-642-38750-0[14] A. Stathopoulos, J.R. McCombs, ACM Transactions on Mathematical Software37