Lattice QCD package GWU-code and QUDA with HIP
Yu-Jiang Bi, Yi Xiao, Ming Gong, Wei-Yi Guo, Peng Sun, Shun Xu, Yi-Bo Yang
Yu-Jiang Bi∗†
Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
Yi Xiao
The Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
E-mail: [email protected]
Ming Gong
Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
Wei-Yi Guo
Department of Physics, University of Warwick, Coventry, CV4 7AL, United Kingdom
Peng Sun
Nanjing Normal University, Nanjing, Jiangsu 210023, China
E-mail:
Shun Xu
Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
Yi-Bo Yang
CAS Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
E-mail: [email protected]
The open source HIP platform for GPU computing provides a uniform framework to support both NVIDIA and AMD GPUs, and also the possibility of porting CUDA code to a HIP-compatible form. We present the porting progress on the overlap fermion inverter (GWU-code) and also the general Lattice QCD inverter package QUDA. A manual for using QUDA on HIP and tips for porting general CUDA code into the HIP framework are also provided.

∗Speaker.
†This work is supported in part by the Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDC01040100) and the National Key Research and Development Program of China (No. 2017YFB0203200). M. Gong is supported in part by the National Science Foundation of China (NSFC) under the project No. 11775229, P. Sun is supported by Natural Science Foundation of China under grant No. 11975127, Y. Yang is supported in part by the Chinese Academy of Sciences Pioneer Hundred Talents Program.

© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). https://pos.sissa.it/
1. Introduction and background
Engineering and energy-efficiency constraints push modern supercomputer architectures toward multi-level parallelism, and heterogeneous computing architectures such as CPU+GPU are widely used in Top500 supercomputers, including the two most recent fastest machines, Summit and Sierra. Most of the performance of such a computer comes from the NVIDIA V100 GPU, and efficient code is essential for the related Lattice QCD calculations to benefit from those machines. There are already quite a few packages that support the NVIDIA GPU platform CUDA with good performance and multi-GPU scaling, including QUDA (for most of the fermion actions) [1, 2, 3], GRID (for the domain wall fermion, etc.) [4], GWU-code (for the overlap and clover fermion) [5, 6], and so on.

On the other hand, efficient code on AMD GPUs falls behind, apart from some efforts with OpenCL (e.g., CL2QCD [7]), while the peak performance of AMD GPUs has caught up, and the exaflops supercomputer "Frontier" with AMD GPUs will be built in the US by 2021. In recent years, the open source HIP platform has been promoted to support both NVIDIA and AMD GPUs, which also provides the possibility of porting CUDA codes to the HIP platform. Based on this, writing code from scratch and re-implementing the hundreds of features needed by Lattice QCD calculations can be avoided by porting the existing CUDA codes. In these proceedings, we present our findings on using the packages GWU-code and QUDA on AMD GPUs through the HIP platform, with a summary of the known issues.
2. Porting tips and compiling manual of QUDA
Generally, the porting can be separated into three stages: convert the code with hipify-perl; patch the code manually to satisfy the requirements of the compiler; and replace unsupported features to avoid runtime crashes. Let us take the porting of QUDA as an example.

1) Convert the code with hipify-perl. hipify-perl is a Perl script that maps the names of the CUDA functions, and also the CUDA header files, to those of their HIP counterparts. If the script meets an unknown word starting with "cu", a message "warning: ... : unsupported device function" is emitted, while the conversion continues.

2) Patch the code manually to satisfy the requirements of the compiler. Currently, QUDA is compiled with hip-clang instead of HCC (Heterogeneous Compute Compiler). HCC is an HC compiler, where HC is a C++ AMP-style syntax language with HSA extensions [8]. HCC translates the HIP kernel syntax into C++ AMP syntax using functional or macro grid launch, and certain QUDA device functions that use complicated classes and templates cause syntax or runtime errors. On the other hand, hip-clang is an LLVM frontend that supports the HIP kernel syntax, as shown in Fig. 1. By setting the environment variable HIP_PLATFORM to clang, hip-clang takes over the compilation and compiles the CUDA-like syntax source code to LLVM IR directly [9]; the AMD GPU backend of LLVM then compiles the LLVM IR to binary. The patches we applied include:

2.1) Set CMAKE_CXX_SYSROOT_FLAG_CODE to add the .cu suffix to CMAKE_CXX_SOURCE_FILE_EXTENSIONS, and then use hip-clang to compile both the .cpp and .cu files. Note that the flag "-g" should be avoided and "-D__STRICT_ANSI__ -O3" is necessary
Figure 1: HIPCC compilation process. The clang compiler skips the step of generating the HC source code.

to make the code work well. Compiling clover_deriv_quda.cu and gauge_stout.cu crashes the compiler, and adding the flag "-fno-inline" is a choice to avoid it.

2.2) Fix the functions with different prototypes, like hipGetErrorString, hipGetErrorName, hipPointerGetAttributes, hipMemcpyHtoDAsync, etc. At the same time, some of the functions like blasCgetrfBatched and blasCgetriBatched are replaced with their CPU versions.

2.3) Add the __host__ flag in front of the __device__ functions which are used in the CPU code.

2.4) Rewrite the PTX code into normal C++ code.

2.5) Move the declarations of the shared memory into the body of the functions, as they cannot be located in the link stage.

3) Replace the unsupported features to avoid the runtime crashes. The major changes include:

3.2) Suppose the memoryType is host if hipPointerGetAttributes returns an error, and comment out checkCudaError() in the constructors;

3.3) Limit the maximum number of threads used in the tuning to 512 or an even smaller number, if a certain function crashes on the AMD GPU with more threads.

3.4) Use a global function to copy the constant array needed by multi_blas_kernel to global memory, as the function hipMemcpyToSymbol does not work correctly:

```cpp
__global__ void set_Amatix(signed char *ref) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= MAX_MATRIX_SIZE) return;
    Amatrix_d[idx] = ref[idx];
}

signed char *A_d;
hipMalloc(&A_d, MAX_MATRIX_SIZE);
hipMemcpy(A_d, A, MAX_MATRIX_SIZE, hipMemcpyHostToDevice);
set_Amatix<<<256, MAX_MATRIX_SIZE/256>>>(A_d);
hipDeviceSynchronize();
hipFree(A_d);
```
Module name                     Ported  Tested
QUDA_DIRAC_WILSON               yes     yes
QUDA_DIRAC_CLOVER               yes     yes
QUDA_CONTRACT                   yes     yes
QUDA_COVDEV                     yes     yes
QUDA_DIRAC_STAGGERED            yes     no
QUDA_FORCE_GAUGE                yes     no
QUDA_DIRAC_DOMAIN_WALL          no      no
QUDA_DIRAC_TWISTED_MASS         no      no
QUDA_DIRAC_TWISTED_CLOVER       no      no
QUDA_DIRAC_CLOVER_HASENBUSCH    no      no
QUDA_DIRAC_NDEG_TWISTED_MASS    no      no
QUDA_LINK_ASQTAD                no      no
QUDA_LINK_HISQ                  no      no
QUDA_FORCE_HISQ                 no      no
QUDA_GAUGE_TOOLS                no      no
QUDA_GAUGE_ALG                  no      no
QUDA_DYNAMIC_CLOVER             no      no
Table 1: Summary of the porting progress in terms of the QUDA build options. All the parts needed for a multigrid inverter of the clover fermion have been done.

Passing the whole argument class by value to a kernel, as in __global__ void func(Arg arg) {...}, can make the performance extremely low. Such a problem can be avoided by copying the argument class to the GPU memory first, and using its reference as the argument:

```cpp
__global__ void func(Arg &arg) {...}

Arg *arg_d;
hipMalloc(&arg_d, sizeof(Arg));
hipMemcpy(arg_d, &arg, sizeof(Arg), hipMemcpyHostToDevice);
func(*arg_d);  // launch configuration elided as in the original listing
hipDeviceSynchronize();
hipFree(arg_d);
```
Other minor hacks can be found in the present branch on GitHub: https://github.com/lattice/quda/tree/rocm-devel.
The present code has merged the recently released QUDA 1.0.0. In terms of the QUDA build options, the porting progress is listed in Table 1.
              Peak FP32 (TFlops)   Bandwidth (TB/s)   GWU-code (TFlops)   QUDA (TFlops)
Nvidia V100   14.7                 0.9                0.80                1.11
AMD MI60      14 (for PCIe)        up to 1.0          0.54                0.48
Table 2:
The single precision peak performance and memory bandwidth of V100 and MI60, vs. the D-slash performances using GWU-code and QUDA with the same single precision. The SU(3) matrix is compressed to 12 real numbers in both the GWU-code and QUDA cases to save memory bandwidth. Note that the tests on V100 use CUDA, not HIP.
3. GWU-code case
Compared to QUDA, porting GWU-code is much simpler. GWU-code uses macros to generate the D-slash GPU kernel without any device functions, and implements the vector operators on the GPU with CUDA Thrust [5, 6]. Thus one just needs to replace the Thrust library with rocThrust after the code has been converted with hipify-perl. The function thrust::reduce can be very slow in double precision in certain versions, but it can be replaced by other functions with normal performance.
4. Performance and issues
Our tests are based on the AMD MI60 GPU and the NVIDIA V100 GPU. The D-slash performance is summarized in Table 2. The QUDA D-slash performance is somewhat lower, as this kernel is much more complicated and thus cannot be fully optimized with hip-clang at present.

As a practical application of QUDA, the multigrid inverter works correctly while the performance is not very promising. With the largest 96 × 192 lattice we tested, the total performance with 324 MI60 GPUs is around 10 TFlops, using 3-level multigrid layouts (4,4,4,4) and (2,2,2,2). On the GWU-code side, 200 pairs of the overlap eigensystem on the HYP smeared 24 × lattice with a = 0.11 fm can be generated with 4 MI60 GPUs in 4 hours, while a similar calculation with E5-2698v3 at 2.3 GHz requires 1024 cores for 2 hours. Tests with larger sizes are in progress.
5. Summary
In summary, we have ported two Lattice QCD CUDA packages, GWU-code and QUDA, to the AMD GPU platform using HIP. The performance is around half of that with CUDA when the memory bandwidths are similar. The multigrid inverter of QUDA works correctly with lattices as large as 96 × 192.

References

[1] M. A. Clark, R. Babich, K. Barros, R. C. Brower, and C. Rebbi. Solving Lattice QCD systems of equations using mixed precision solvers on GPUs. Comput. Phys. Commun., 181:1517-1528, 2010.

[2] R. Babich, M. A. Clark, B. Joo, G. Shi, R. C. Brower, and S. Gottlieb. Scaling Lattice QCD beyond 100 GPUs. In SC11 International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, Washington, November 12-18, 2011, 2011.

[3] M. A. Clark, Balint Joo, Alexei Strelchenko, Michael Cheng, Arjun Gambhir, and Richard Brower. Accelerating Lattice QCD Multigrid on GPUs Using Fine-Grained Parallelization. 2016.

[4] Peter Boyle, Azusa Yamaguchi, Guido Cossu, and Antonin Portelli. Grid: A next generation data parallel C++ QCD library. 2015.

[5] A. Alexandru, C. Pelissier, B. Gamari, and F. Lee. Multi-mass solvers for lattice QCD on GPUs. J. Comput. Phys., 231:1866-1878, 2012.

[6] Andrei Alexandru, Michael Lujan, Craig Pelissier, Ben Gamari, and Frank X. Lee. Efficient implementation of the overlap operator on multi-GPUs. In Proceedings, 2011 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'11): Knoxville, Tennessee, July 19-20, 2011, pages 123-130, 2011.

[7] Owe Philipsen, Christopher Pinke, Alessandro Sciarra, and Matthias Bach. CL2QCD - Lattice QCD based on OpenCL. PoS.

[9] In Proceedings of the 2016 International Symposium on Code Generation and Optimization, pages 105-116. ACM, 2016.