Three Dirac operators on two architectures with one piece of code and no hassle
Stephan Dürr∗
University of Wuppertal, D-42119 Wuppertal, Germany
IAS/JSC, Forschungszentrum Jülich, D-52425 Jülich, Germany
E-mail: durr (AT) itp.unibe.ch
A simple-minded approach to implement three discretizations of the Dirac operator (staggered, Wilson, Brillouin) on two architectures (KNL and Core i7) is presented. The idea is to use a high-level compiler along with OpenMP parallelization and SIMD pragmas, but to stay away from cache-line optimization and/or assembly tuning. The implementation is for N_v right-hand sides, and this extra index is used to fill the SIMD pipeline. On one KNL node single-precision performance figures for N_c = N_v = 12 read 475 Gflop/s, 345 Gflop/s, and 790 Gflop/s for the three discretization schemes, respectively.
The 36th Annual International Symposium on Lattice Field Theory - LATTICE2018
22-28 July, 2018
Michigan State University, East Lansing, Michigan, USA

∗Speaker.

© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). https://pos.sissa.it/
1. Introduction
Recent years brought plenty of machines with peak performances in the multi-petaflop/s range, but it gets increasingly difficult to achieve good strong-scaling behavior with actual scientific codes. Lattice QCD is still in a fortunate position to harness these capabilities [1, 2, 3], as it does not require any run-time dependent data structures. On the other hand, an un-optimized non-parallel code tends to have O(…) lines. This means that significant human resources must be spent to parallelize a lattice code and to obtain good performance figures on a given architecture.

Ideally one would have a piece of code, written in a high-level language, which parallelizes and reaches decent (read: non-optimal but non-disastrous) performance upon compilation on a given new architecture. In these proceedings I report on an attempt to do this on the single-node level, based on OpenMP (OMP) threads and again OpenMP pragmas for SIMD pipelining. I concentrate on the part which takes most time in actual computations – the matrix-times-vector operation for a given Dirac operator, considering the Susskind (“staggered”), Wilson and Brillouin varieties.
2. Vector layout options
The routines are written in Fortran 2008, using the stride notation, as this allows for compact source files (like in Matlab). The gauge field U is defined as a 7-dimensional array through complex(kind=sp), dimension(Nc,Nc,4,Nx,Ny,Nz,Nt) :: U, with parameters like Nc=3 and sp, dp (for single and double precision, respectively) specified at compile time. Hence U(:,:,3,x,y,z,t) defines N_c^2 complex numbers, arranged contiguously in memory.

With Wilson-type vectors arranged in blocks of N_v right-hand sides, the layout options include vec(Nc,4,Nv,...), vec(4,Nc,Nv,...), vec(Nc,Nv,4,...), vec(Nv,Nc,4,...), vec(4,Nv,Nc,...), vec(Nv,4,Nc,...), where the dots stand for Nx*Ny*Nz*Nt. This restriction of having the space-time index as the slowest (right-most) index precludes sophisticated SIMD strategies (see Refs. [1, 2]), but it may facilitate the use of PGAS concepts (see below). With Susskind-type vectors arranged in blocks of N_v right-hand sides, the layout options under the same restriction are suv(Nc,Nv,Nx*Ny*Nz*Nt) and suv(Nv,Nc,Nx*Ny*Nz*Nt).

Our task is to optimize the performance under this self-imposed set of restrictions. An important ingredient in the code is that all contributions to the “out” vector are collected in the thread-private variable site, which for each space-time index is written once. This avoids write collisions among threads in a natural way. We have the same 6 layout options as for vec to define site as an array of dimension 3 in the Wilson case (or the same 2 options in the staggered case).
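As an illustration, a minimal and purely schematic set of declarations for one such layout choice is given below; the module name, the toy lattice extents and the chosen orderings are assumptions made for this sketch, not a quote from the actual code.

   ! Schematic layout sketch: one Wilson-type and one Susskind-type vector layout,
   ! with the space-time index slowest, as described in the text.
   module layout_sketch
      implicit none
      integer, parameter :: sp = kind(1.0)                  ! single-precision kind
      integer, parameter :: Nc = 3, Nv = 12                 ! colors, right-hand sides
      integer, parameter :: Nx = 8, Ny = 8, Nz = 8, Nt = 8  ! toy lattice extents (illustrative only)

      ! gauge field: color indices fastest, direction next, space-time slowest
      complex(kind=sp), dimension(Nc,Nc,4,Nx,Ny,Nz,Nt) :: U

      ! one of the six Wilson-type layouts: vec(Nc,4,Nv,...)
      complex(kind=sp), dimension(Nc,4,Nv,Nx,Ny,Nz,Nt) :: vec_in, vec_out

      ! one of the two Susskind-type layouts: suv(Nc,Nv,...)
      complex(kind=sp), dimension(Nc,Nv,Nx*Ny*Nz*Nt) :: suv_in, suv_out
   end module layout_sketch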
3. Staggered kernel details and performance
The Susskind (“staggered”) Dirac operator is defined through

D_\mathrm{S}(x,y) = \sum_\mu \eta_\mu(x) \left[ V_\mu(x)\,\delta_{x+\hat\mu,y} - V_\mu^\dagger(x-\hat\mu)\,\delta_{x-\hat\mu,y} \right]    (3.1)

with \eta_1(x)=1, \eta_2(x)=(-1)^{x_1}, \eta_3(x)=(-1)^{x_1+x_2}, \eta_4(x)=(-1)^{x_1+x_2+x_3}. Here V_\mu(x) represents a smeared version of the (original) gauge link U_\mu(x), i.e. a gauge-covariant parallel transporter from x+\hat\mu to x, to reduce taste-symmetry breaking. From an HPC viewpoint, a clear advantage of this operator with precomputed V_\mu is that its stencil is restricted to sites which are at most one hop away. Still, it is not trivial to reach an acceptable performance on a many-core architecture [4, 5].

Figure 1: Log-log scaling plot of the single-processor performance in Gflop/s versus the number of OMP threads for the staggered Dirac operator in sp (left) and dp (right). The top panels feature a KNL processor with 68 cores, the bottom panels a Broadwell chip with 6 cores. The lattice size is 34 × …

In our framework we have 2 options for the suv layout, 2 options for site, and 2 reasonable loop nestings. It is thus possible to implement all these options, and to compare the timings. For one choice (the one performing best on the KNL architecture) thread-scaling results are shown in Fig. 1. Performance on the Broadwell architecture seems far less sensitive to these choices.
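To make the structure of such a kernel concrete, the following heavily simplified sketch shows one staggered variant (layout suv(Nc,Nv,vol), accumulator site(Nc,Nv), OMP threads over space-time, SIMD over the right-hand-side index). The neighbour table nbr, the phase array eta and the packing of forward/backward links into V(:,:,1:8,:) are assumptions made for this illustration; the actual routines differ in detail.

   subroutine stag_apply_sketch(out, in, V, eta, nbr, Nc, Nv, vol)
      implicit none
      integer, intent(in)  :: Nc, Nv, vol
      complex, intent(in)  :: in(Nc,Nv,vol)       ! N_v right-hand sides
      complex, intent(in)  :: V(Nc,Nc,8,vol)      ! 1..4: V_mu(x); 5..8: V_mu(x-mu^)^dagger (assumed packing)
      real,    intent(in)  :: eta(4,vol)          ! staggered phases eta_mu(x)
      integer, intent(in)  :: nbr(8,vol)          ! nbr(mu,x)=x+mu^, nbr(mu+4,x)=x-mu^ (assumed)
      complex, intent(out) :: out(Nc,Nv,vol)
      complex :: site(Nc,Nv)
      integer :: i, mu, a, b, n, jp, jm
      !$omp parallel do private(site,mu,a,b,n,jp,jm) schedule(static)
      do i = 1, vol
         site = (0.0,0.0)
         do mu = 1, 4
            jp = nbr(mu  ,i)                      ! index of x+mu^
            jm = nbr(mu+4,i)                      ! index of x-mu^
            do b = 1, Nc
               do a = 1, Nc
                  !$omp simd
                  do n = 1, Nv                    ! SIMD pipeline filled by the rhs index
                     site(a,n) = site(a,n) + eta(mu,i) * &
                          ( V(a,b,mu,i)*in(b,n,jp) - V(a,b,mu+4,i)*in(b,n,jm) )
                  end do
               end do
            end do
         end do
         out(:,:,i) = site                        ! each output site is written exactly once
      end do
      !$omp end parallel do
   end subroutine stag_apply_sketch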
4. Wilson kernel details and performance
The Wilson Dirac operator is defined through

D_\mathrm{W}(x,y) = \sum_\mu \gamma_\mu \nabla^\mathrm{std}_\mu(x,y) - \frac{a}{2}\,\triangle^\mathrm{std}(x,y) + m\,\delta_{x,y} - \frac{c_\mathrm{SW}}{2} \sum_{\mu<\nu} \sigma_{\mu\nu} F_{\mu\nu}\,\delta_{x,y} ,    (4.1)

where \nabla^\mathrm{std}_\mu is a 2-point discretization of the covariant derivative

a\nabla^\mathrm{std}_\mu(x,y) = \frac{1}{2} \left[ V_\mu(x)\,\delta_{x+\hat\mu,y} - V_{-\mu}(x)\,\delta_{x-\hat\mu,y} \right]    (4.2)
and \triangle^\mathrm{std} is a 9-point discretization of the covariant Laplacian (sum over 4 pos. and 4 neg. indices)

a^2\triangle^\mathrm{std}(x,y) = -8\,\delta_{x,y} + \sum_\mu V_\mu(x)\,\delta_{x+\hat\mu,y} .    (4.3)

Also this operator is HPC friendly, since its stencil contains at most 1-hop terms.

In our framework we have 6 options for the vec layout, times 6 options for site, times a few reasonable loop nestings. For the standard Laplacian \triangle^\mathrm{std} in the Wilson operator the latter set of options is 4, resulting in a total of 144 routines. Evidently, this brings some limitations to the “best of breed” ansatz, since it may be a little awkward to test each routine with each possible thread count (e.g. 1 through 272 on the KNL). In practice, it seems reasonable to restrict the selection process to just a few thread counts (e.g. 68, 136, and 272 on the KNL).

Full thread-scaling results for one such choice are displayed in Fig. 2. Similar to Fig. 1, we see an almost linear increase in performance up to 68 threads on the KNL. This is followed by another linear gain (albeit with a smaller slope) up to 136 threads. Beyond that point results wiggle a bit out to 272 threads. The maximum appears at a thread count (154 in sp, 138 in dp) which seems hard to predict. Again, the code seems to perform (without any change) reasonably well on the Broadwell architecture, too. Having more than 2 threads per core does not enhance performance, but the good news is that it is not really detrimental either. Next steps of performance tuning would naturally include eo-decomposition and gauge compression (from N_c^2 to N_c^2 − …).

Figure 2: Log-log scaling plot of the single-processor performance in Gflop/s versus the number of OMP threads for the Wilson Dirac operator at c_SW = … in sp (left) and dp (right). The top panels feature a KNL processor with 68 cores, the bottom panels a Broadwell chip with 6 cores. The lattice size is 34 × …

W_\mu(x) = V_\mu(x)   (smeared link, \mu \in \{\pm1,\pm2,\pm3,\pm4\})
W_{\mu+\nu}(x) = \frac{1}{2!} [ V_\mu(x) V_\nu(x+\hat\mu) + perm. ]
W_{\mu+\nu+\rho}(x) = \frac{1}{3!} [ V_\mu(x) V_\nu(x+\hat\mu) V_\rho(x+\hat\mu+\hat\nu) + perms. ]
W_{\mu+\nu+\rho+\sigma}(x) = \frac{1}{4!} [ V_\mu(x) V_\nu(x+\hat\mu) V_\rho(x+\hat\mu+\hat\nu) V_\sigma(x+\hat\mu+\hat\nu+\hat\rho) + perms. ]
Table 1: Overview of the set of off-axis links W_\mathrm{dir}(x), with lengths ranging from 1 to 4 hops. Given a site x, 81 directions are possible, but one is trivial, and the remaining 80 can be reduced to 40 based on W_{-\mathrm{dir}}(x) = W_\mathrm{dir}^\dagger(x-\mathrm{dir}). In the code W is precomputed and stored in the array W(Nc,Nc,40,Nx,Ny,Nz,Nt). Note that for 36 of the 40 directions the entry is not special unitary, and no gauge compression is possible.
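As an illustration of how such off-axis links could be precomputed, the sketch below builds the 2-hop link W_{\mu+\nu}(x) as the average over its two path orderings. The neighbour table nbr and the routine name are assumptions made for this sketch; the mapping of the 40 directions onto the third index of W is not shown.

   subroutine build_W_2hop_sketch(W2, V, nbr, Nc, vol, mu, nu)
      implicit none
      integer, intent(in)  :: Nc, vol, mu, nu     ! mu, nu: two distinct positive directions
      complex, intent(in)  :: V(Nc,Nc,4,vol)      ! smeared on-axis links V_mu(x)
      integer, intent(in)  :: nbr(4,vol)          ! nbr(mu,x) = index of x+mu^ (assumed)
      complex, intent(out) :: W2(Nc,Nc,vol)       ! resulting 2-hop link W_{mu+nu}(x)
      integer :: x
      do x = 1, vol
         ! (1/2!) * [ V_mu(x) V_nu(x+mu^) + V_nu(x) V_mu(x+nu^) ]
         W2(:,:,x) = 0.5 * ( matmul( V(:,:,mu,x), V(:,:,nu,nbr(mu,x)) ) &
                           + matmul( V(:,:,nu,x), V(:,:,mu,nbr(nu,x)) ) )
      end do
   end subroutine build_W_2hop_sketch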
5. Brillouin kernel details and performance
The Brillouin Dirac operator is defined through [6]

D_\mathrm{B}(x,y) = \sum_\mu \gamma_\mu \nabla^\mathrm{iso}_\mu(x,y) - \frac{a}{2}\,\triangle^\mathrm{bri}(x,y) + m\,\delta_{x,y} - \frac{c_\mathrm{SW}}{2} \sum_{\mu<\nu} \sigma_{\mu\nu} F_{\mu\nu}\,\delta_{x,y} ,    (5.1)

where the isotropic derivative \nabla^\mathrm{iso}_\mu is a 54-point discretization of the covariant derivative

a\nabla^\mathrm{iso}_\mu(x,y) = \rho_1 \left[ W_\mu(x)\,\delta_{x+\hat\mu,y} - W_{-\mu}(x)\,\delta_{x-\hat\mu,y} \right]
                 + \rho_2 \sum^{\neq}_{(\nu;\mu)} \left[ W_{\mu+\nu}(x)\,\delta_{x+\hat\mu+\hat\nu,y} - (\mu \to -\mu) \right]
                 + \rho_3 \sum^{\neq}_{(\nu,\rho;\mu)} \left[ W_{\mu+\nu+\rho}(x)\,\delta_{x+\hat\mu+\hat\nu+\hat\rho,y} - (\mu \to -\mu) \right]
                 + \rho_4 \sum^{\neq}_{(\nu,\rho,\sigma;\mu)} \left[ W_{\mu+\nu+\rho+\sigma}(x)\,\delta_{x+\hat\mu+\hat\nu+\hat\rho+\hat\sigma,y} - (\mu \to -\mu) \right]    (5.2)

and the Brillouin Laplacian \triangle^\mathrm{bri} is an 81-point discretization of the covariant Laplacian

a^2\triangle^\mathrm{bri}(x,y) = \lambda_0\,\delta_{x,y} + \lambda_1 \sum_{\mu} W_\mu(x)\,\delta_{x+\hat\mu,y} + \lambda_2 \sum^{\neq}_{(\mu,\nu)} W_{\mu+\nu}(x)\,\delta_{x+\hat\mu+\hat\nu,y}
                 + \lambda_3 \sum^{\neq}_{(\mu,\nu,\rho)} W_{\mu+\nu+\rho}(x)\,\delta_{x+\hat\mu+\hat\nu+\hat\rho,y} + \lambda_4 \sum^{\neq}_{(\mu,\nu,\rho,\sigma)} W_{\mu+\nu+\rho+\sigma}(x)\,\delta_{x+\hat\mu+\hat\nu+\hat\rho+\hat\sigma,y}    (5.3)

with (\rho_1,\rho_2,\rho_3,\rho_4) \equiv (64,16,4,1)/432 and (\lambda_0,\lambda_1,\lambda_2,\lambda_3,\lambda_4) \equiv (-240,8,4,2,1)/64, respectively. In (5.2) the last sum extends over (pos. and neg.) indices (\nu,\rho,\sigma) which are mutually unequal and different from \mu. In (5.3) the last sum extends over indices (\mu,\nu,\rho,\sigma) which are pairwise unequal. In these formulas W_\mathrm{dir}(x) denotes a link in direction “dir” which may be on-axis (dir = \mu) or off-axis with length \sqrt{2} (dir = \mu+\nu), \sqrt{3} (dir = \mu+\nu+\rho) or 2 (dir = \mu+\nu+\rho+\sigma). More details are given in Tab. 1.

The Brillouin operator (5.1) brings new perspectives on PDFs [6, 7] and – when used in conjunction with the overlap procedure [8, 9] – on heavy-quark physics [10, 11]. From an HPC viewpoint the Brillouin operator is interesting, since its computational intensity exceeds that of the Wilson operator by a factor of about 2. Treating N_v right-hand sides simultaneously enhances the computational intensity of either operator, while keeping this ratio almost invariant [11]. In a very distant future, when cycles are totally irrelevant, one might opt for not precomputing W; this would trigger a massive enhancement of the computational intensity of the operator (5.1).
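As a quick cross-check of these stencil coefficients: for fixed \mu, equation (5.2) contains 1 on-axis, 6 two-hop, 12 three-hop and 8 four-hop terms (hence 2 × 27 = 54 points), each contributing 2a\partial_\mu at leading order, while equation (5.3) involves 8, 24, 32 and 16 off-axis directions plus the central site (hence 81 points). The coefficients quoted above then satisfy

2\,(\rho_1 + 6\rho_2 + 12\rho_3 + 8\rho_4) = \frac{2\,(64+96+48+8)}{432} = 1 ,
\lambda_0 + 8\lambda_1 + 24\lambda_2 + 32\lambda_3 + 16\lambda_4 = \frac{-240+64+96+64+16}{64} = 0 ,

i.e. the free-field derivative is correctly normalized and the Laplacian annihilates constant fields.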
Figure 3: Log-log scaling plot of the single-processor performance in Gflop/s versus the number of OMP threads for the Brillouin Dirac operator at c_SW = … in sp (left) and dp (right). The top panels feature a KNL processor with 68 cores, the bottom panels a Broadwell chip with 6 cores. The lattice size is 34 × …

Thread-scaling results are shown in Fig. 3. As in previous cases, we see nearly-perfect scaling behavior up to 68 threads on the KNL. After this, there is another performance increase up to 136 threads. Beyond that point, performance figures wiggle a bit out to 272 threads. And again, the (unchanged, just recompiled) code performs reasonably well on the Broadwell architecture, too. Unlike in the Wilson case, the author is unaware of simple recipes for further improvement.
6. Summary and outlook
In summary, thread-scaling results for the Susskind, Wilson and Brillouin varieties of the Dirac operator in lattice QCD were presented. They are based on straightforward implementations in Fortran 2008, with OpenMP pragmas for shared-memory parallelization and SIMD pipelining.

No cache-access optimization and no hand-assembly tuning have been applied, and still reasonable performance figures can be obtained. Key to the improvement over last year's version [12] is an ansatz where performance-critical routines are written for a variety of layouts of the in/out vectors, of accumulation variables, and possibly loop nestings. For a given architecture and compiler combination, all of these routines are compiled “out of the box”, and a few test calls quickly reveal which option fares best on a particular machine. With this “best of breed” ansatz, the winning combination is subsequently used for actual computations.
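Such a selection step might look roughly as follows; the wrapper interface apply_variant, the repetition count and the probed thread counts are assumptions made for this sketch, not a description of the actual test harness.

   ! "Best of breed" selection sketch: time every layout/nesting variant at a few
   ! representative thread counts and keep the overall fastest one.
   subroutine pick_best_variant(apply_variant, nvariants, best)
      use omp_lib, only: omp_get_wtime, omp_set_num_threads
      implicit none
      interface
         subroutine apply_variant(v)      ! hypothetical wrapper: applies variant v of the kernel
            integer, intent(in) :: v
         end subroutine apply_variant
      end interface
      integer, intent(in)  :: nvariants
      integer, intent(out) :: best
      integer, parameter   :: nthr(3) = [68, 136, 272]   ! thread counts probed on the KNL
      integer :: v, t, rep
      real(8) :: t0, t1, tbest, tcur
      tbest = huge(1.0d0)
      best  = 0
      do v = 1, nvariants
         tcur = huge(1.0d0)
         do t = 1, size(nthr)
            call omp_set_num_threads(nthr(t))
            t0 = omp_get_wtime()
            do rep = 1, 10                ! a few repetitions to smooth out timing noise
               call apply_variant(v)
            end do
            t1 = omp_get_wtime()
            tcur = min(tcur, (t1 - t0)/10)
         end do
         if (tcur < tbest) then           ! remember the fastest layout/nesting variant
            tbest = tcur
            best  = v
         end if
      end do
   end subroutine pick_best_variant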
The plots presented in this contribution illustrate that this strategy proves successful on the single-node level. What would be important for actual calculations, however, is a working concept (along these lines) on the multi-node level. The author's hope is that vendors will finally provide efficient PGAS (Coarray Fortran and/or Unified Parallel C) support for CPUs. In fact, this is the rationale for not sacrificing any of the space-time indices for the SIMD vectorization. In the event the GPU world offers this feature more promptly, this will be a clear case for switching to OpenACC.
Acknowledgements: Program development and test runs were performed on the DEEP-ER system at IAS/JSC in Jülich. The author would like to thank Eric Gregory for useful discussions.
References

[1] P. A. Boyle, PoS LATTICE, 013 (2017), doi:10.22323/1.256.0013 [arXiv:1702.00208 [hep-lat]].
[2] A. Rago, EPJ Web Conf., 01021 (2018), doi:10.1051/epjconf/201817501021 [arXiv:1711.01182].
[3] M. Lin, “Machines and Algorithms for Lattice QCD,” talk at Lattice 2018 (these proceedings).
[4] C. DeTar, D. Doerfler, S. Gottlieb, A. Jha, D. Kalamkar, R. Li and D. Toussaint, PoS LATTICE, 270 (2016), doi:10.22323/1.256.0270 [arXiv:1611.00728 [hep-lat]].
[5] C. DeTar, S. Gottlieb, R. Li and D. Toussaint, EPJ Web Conf., 02009 (2018), doi:10.1051/epjconf/201817502009 [arXiv:1712.00143 [hep-lat]].
[6] S. Durr and G. Koutsou, Phys. Rev. D, 114512 (2011) [arXiv:1012.3615 [hep-lat]].
[7] S. Durr, G. Koutsou and T. Lippert, Phys. Rev. D, 114514 (2012) [arXiv:1208.6270 [hep-lat]].
[8] H. Neuberger, Phys. Lett. B, 141 (1998) [hep-lat/9707022].
[9] H. Neuberger, Phys. Lett. B, 353 (1998) [hep-lat/9801031].
[10] Y. G. Cho, S. Hashimoto, A. Juttner, T. Kaneko, M. Marinkovic, J. I. Noaki and J. T. Tsang, JHEP, 072 (2015) [arXiv:1504.01630 [hep-lat]].
[11] S. Durr and G. Koutsou, arXiv:1701.00726 [hep-lat].
[12] S. Durr, EPJ Web Conf., 02001 (2018), doi:10.1051/epjconf/201817502001 [arXiv:1709.01828].