IPMACC: Open Source OpenACC to CUDA/OpenCL Translator
Ahmad Lashgar (University of Victoria), Alireza Majidi (Texas A&M University), Amirali Baniasadi (University of Victoria)
September 26, 2018
Abstract
In this paper we introduce IPMACC, a framework for translating OpenACC applications to CUDA or OpenCL. IPMACC is composed of a set of translators that translate OpenACC for C applications to CUDA or OpenCL. The framework uses the system compiler (e.g., nvcc) for generating the final accelerator binary. The framework can be used for extending the OpenACC API, executing OpenACC applications, or obtaining the CUDA or OpenCL code equivalent to an OpenACC code. We verify the correctness of our framework on several benchmarks from the Rodinia Benchmark Suite and the CUDA SDK. We also compare the performance of the CUDA versions of the benchmarks to the OpenACC versions compiled by our framework. By comparing the CUDA and OpenACC versions, we discuss the limitations of OpenACC in achieving performance close to that of highly-optimized CUDA code.
In CUDA [8], an application is composed of host and device code. The host code executes on the CPU and the device code executes on the system's accelerator. The host controls the operations of the device through procedure calls to the CUDA API. CUDA allows programmers to explicitly allocate memory on the device and transfer data between the host and the device. The device obtains the device code from the kernel and executes it with thousands of light-weight threads, in SIMT style [6]. All threads share a common off-chip DRAM memory, or global memory. In software, threads are grouped into coarser scheduling elements, referred to as thread blocks. Threads within the same block execute concurrently and communicate through a fast, per-block, on-chip software-managed cache, referred to as shared memory. Shared memory is much faster than global memory; e.g., under the GTX 280, the latencies of global memory and shared memory are 440 and 38 core cycles, respectively [17].

The OpenACC API introduces a set of compiler directives, library routines, and environment variables to offload a region of code from the CPU to the system's accelerator [15]. We refer to this region as the accelerator region. OpenACC has two classes of directives: i) data management and ii) parallelism control. Each directive has clauses providing finer-grain control. Data management directives perform data allocation on the accelerator, data transfer between the host and the accelerator, and passing of pointers to the accelerator. Parallelism control directives allow the programmer to mark regions of code, usually work-sharing loops, intended to run in parallel on the accelerator. They also control parallelism granularity, variable sharing/privatization, and variable reduction. OpenACC introduces four levels of parallelism: gang, worker, vector, and thread. In CUDA terminology, these terms best map to kernel, thread block, warp, and thread, respectively.
Listings 1a and 1b illustrate a simple matrix-matrix multiplication in OpenACC and CUDA, respectively. Ignoring the directive lines, Listing 1a shows the baseline serial multiplication of a and b, storing the result in c. Each matrix is LEN*LEN in size. The outer loops, iterated by the i and j variables, can be executed in parallel. Listing 1a shows how these loops can be parallelized using OpenACC. In this code, the kernels directive marks a region intended to be executed on the accelerator, and the loop directive guides the compiler to treat the loop as a parallel work-sharing loop. Programmers can control the parallelism using the kernels and loop directives. As an example of parallelism control, the independent clause is used to force the compiler to parallelize the loop; this clause overrides the compiler's auto-vectorization and loop dependency checking. In Listing 1a, the data clauses hint the compiler to copy the a, b, and c arrays from the host to the accelerator, and to copy them back from the accelerator to the host. For each array, the [start:n] pair indicates that n elements should be copied, starting from the start element of the array. Listing 1b shows how the parallelization can be exploited in CUDA. __global__ indicates the declaration of kernel code. Parallel threads execute the kernel and operate on different matrix elements, based on their unique indexes (i and j). Inside the host code, device memory is allocated for a, b, and c, keeping the pointers in a_d, b_d, and c_d, respectively. The input matrices are then copied into device memory, and a total of LEN*LEN light-weight accelerator threads are launched on the device to execute the matrixMul kernel. After kernel completion, the resulting matrix c_d is copied back to the host memory. As presented in Listing 1, OpenACC significantly reduces the accelerator programming effort in comparison to CUDA, hiding the details of device memory management and kernel launch from the programmer.

Listing 1: OpenACC and CUDA matrix-matrix multiplications.

    #pragma acc kernels copy(a[0:LEN*LEN], b[0:LEN*LEN], c[0:LEN*LEN])
    #pragma acc loop independent
    for (i = 0; i < LEN; ++i) {
        #pragma acc loop independent
        for (j = 0; j < LEN; ++j) {
            float sum = 0;
            for (l = 0; l < LEN; ++l)
                sum += a[i*LEN+l] * b[l*LEN+j];
            c[i*LEN+j] = sum;
        }
    }

(a) OpenACC.

    __global__ void matrixMul(int *a, int *b, int *c, int len) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int j = threadIdx.y + blockIdx.y * blockDim.y;
        int sum = 0;
        for (int l = 0; l < len; ++l)
            sum += a[i*len+l] * b[l*len+j];
        c[i*len+j] = sum;
    }

    int main() {
        ...
        bytes = LEN * LEN * sizeof(int);
        cudaMalloc(&a_d, bytes);
        cudaMalloc(&b_d, bytes);
        cudaMalloc(&c_d, bytes);
        cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);
        dim3 gridSize(LEN/16, LEN/16), blockSize(16, 16);
        matrixMul<<<gridSize, blockSize>>>(a_d, b_d, c_d, LEN);
        cudaMemcpy(c, c_d, bytes, cudaMemcpyDeviceToHost);
        ...
    }

(b) CUDA.

The IPMACC compilation process starts with an input C/C++ code which is enhanced by the OpenACC API to take advantage of accelerators. The output of the process can be an object code, a binary, or C/C++ source code targeted at either a CUDA- or OpenCL-capable accelerator. Figure 1 shows the diagram of the compilation process, which consists of four major stages. In this section, we describe the compilation process in more detail. The descriptions in this section can be used to modify IPMACC to generate customized code.

Figure 1: IPMACC compilation pipeline.
Stage 1: Normalizing and validating the input code

This stage performs pre-processing to normalize and verify the syntax of the input C/C++ code and its OpenACC API usage. We use uncrustify [16] to validate the C/C++ syntax and normalize its notation. For example, the bodies of control (if) and loop (while, for) statements are fully-bracketed. Such polishing passes mark the scope associated with each OpenACC region, simplifying subsequent stages. We have also developed a set of scanners to validate the OpenACC API syntax. The scanners assert the validity of each directive with respect to the OpenACC API and the validity of each clause with respect to its directive. They also assert the OpenACC restrictions on nested directives.
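As a minimal illustration of this normalization (the exact output depends on the uncrustify configuration), an unbracketed statement such as

    for (i = 0; i < LEN; ++i)
        if (a[i] > 0)
            sum += a[i];

is rewritten in the fully-bracketed form

    for (i = 0; i < LEN; ++i) {
        if (a[i] > 0) {
            sum += a[i];
        }
    }

so that the scope of every region is explicit to the later stages.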
Stage 2: Transforming the input code to an intermediate XML form

This stage transforms the amended input code into an intermediate XML form. The XML form has only three types of tags: C code, OpenACC pragma, and for loop. During compilation, the code enclosed in C code tags remains unmodified. OpenACC pragma tags are replaced by the proper calls implementing the accelerator-oriented operations. for loop tags are either parallelized on the accelerator or left as they are (i.e., serial). This decision is made based on the preceding OpenACC directive (e.g., a loop directive) or auto-vectorization optimizations. This intermediate representation separates the host and accelerator codes and greatly facilitates the OpenACC translation in the next stage.
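For illustration, the parallelized loop of Listing 1a could be represented in this intermediate form roughly as follows (the tag names reflect the three tag types described above; the exact notation of IPMACC's intermediate form may differ):

    <ccode> /* preceding serial host code */ </ccode>
    <pragma> acc kernels copy(a[0:LEN*LEN], b[0:LEN*LEN], c[0:LEN*LEN]) </pragma>
    <pragma> acc loop independent </pragma>
    <forloop>
        for (i = 0; i < LEN; ++i) { ... }
    </forloop>
    <ccode> /* remaining host code */ </ccode>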
Stage 3: Translating OpenACC API to target code (CUDA/OpenCL)

There are nine sequential steps to translate the intermediate XML form into the final CUDA/OpenCL source code.
This step returns the XML form to C/C++ form, replacing accelerator-related code (OpenACC pragma tags) with dummy function calls, while maintaining the OpenACC information related to each dummy function. This information is later used to generate the corresponding CUDA/OpenCL code. The process conceptually splits the code into two codeblocks (linked through the dummy functions): i) the code bounded by the OpenACC API (referred to as Regions code), and ii) the code placed outside the OpenACC boundaries (referred to as Original code). Original code is executed by the host. Regions code, which includes OpenACC data clauses and kernels regions, is executed either by the host or by the accelerator. In this terminology, at this step, each Regions code is replaced by a dummy function call in the Original code, yielding flat host code plus a number of dummy function calls. At the end of compilation, the dummy functions are replaced by the target-accelerator code launching computations on the target accelerator, controlling memory transfers between the accelerator and the host, and synchronizing host-accelerator operations.
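Conceptually, after this step the kernels region of Listing 1a would appear in the Original code as a placeholder call, along the following lines (the dummy-function naming scheme here is hypothetical):

    /* Original (host) code, flattened: */
    ...
    __ipmacc_dummy_0(a, b, c);  /* stands in for the kernels region;
                                   its OpenACC clauses are kept aside */
    ...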
This step calculates the abstract syntax tree (AST) of the Original code. We use the AST representation to find further information about variables, types, and functions that are used in the Regions code but are not declared in its scope. In particular, this operation searches for the types of variables, the sizes of arrays, and the declarations of user-defined non-standard functions/types.
The scope of a Regions code contains the declarations/prototypes (of functions, variables, or types) that are used in the region. It consists of the global scope plus the scope from which the Regions code is called. Since a dummy function call is the representative of a Regions code, the scope of the dummy function's parent is the scope of the declarations/prototypes used in the Regions code. Accordingly, this step finds the parent function of each dummy call (notice that dummy functions are unique and have only one parent). The global scope and the scope of the parent function are then searched for the declarations/prototypes referred to in the Regions code.

Construct kernel code
This step constructs the body of the kernel, i.e., the code targeted to execute on the accelerator. This construction includes: i) specifying the available parallelism, ii) sharing loop iterations between concurrent accelerator threads, iii) performing variable reductions, iv) regenerating the declarations/prototypes defined outside the region, and v) specifying kernel arguments.
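For the work-sharing loops of Listing 1a, the constructed kernel might look roughly like the following sketch (the naming and the exact iteration mapping are illustrative, not the literal code IPMACC emits):

    __global__ void __ipmacc_kernel_0(float *a, float *b, float *c, int len) {
        /* ii) loop iterations shared between threads: iteration (i, j)
           is assigned to one accelerator thread */
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int j = threadIdx.y + blockIdx.y * blockDim.y;
        if (i < len && j < len) {  /* guard against over-provisioned threads */
            float sum = 0;         /* privatized per thread */
            for (int l = 0; l < len; ++l)
                sum += a[i * len + l] * b[l * len + j];
            c[i * len + j] = sum;
        }
    }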
The dummy function calls corresponding to OpenACC memory management clauses are replaced by OpenACC function calls that implement the data management operations (Table 1). Data management includes host-accelerator pointer exchange, data copy in/out to/from the accelerator, and memory allocation. Memory allocation requires the size of the memory. The size is either provided manually by the programmer (through the clause parameters) or detected automatically by the compiler (only fixed-size arrays are identifiable).
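As a sketch of what this replacement produces, consider a copy clause over an array a of LEN*LEN floats; the routine names below are hypothetical stand-ins for the runtime calls of Table 1:

    /* Hypothetical runtime-call names; the actual IPMACC routines differ,
       but the sequence of operations is the one described above. */
    float *a_d = (float *) acc_malloc(LEN * LEN * sizeof(float)); /* allocate */
    acc_copyin(a_d, a, LEN * LEN * sizeof(float));  /* host -> accelerator */
    /* ... kernel execution ... */
    acc_copyout(a, a_d, LEN * LEN * sizeof(float)); /* accelerator -> host */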
This step finds the non-standard types and non-built-in functions called in the Regions code. It then searches the AST of the Original code, generated in the second step, to find the declarations of these user-defined functions/types. These declarations are later appended to the final kernel code.
This step replaces each dummy function associated with a kernel call with a codeblock performing kernel body setup, kernel argument arrangement, and kernel invocation. In case of variable reduction (e.g., the reduction clause of the loop directive), extra code is generated to merge the results across different thread blocks. We implement variable reduction according to the algorithm proposed in [3].
At this step, the Original code has been enhanced to launch the computation on the target accelerator (a CUDA- or OpenCL-capable accelerator).
This step stores the enhanced Original code on disk, in the same path as the input C/C++ file, with an ipmacc.[cu/c/cpp] suffix.

Stage 4: Generating the final object code/binary
This stage invokes the system compiler (nvcc for the CUDA backend and g++ in other cases) to generate the target binary. The input to this stage is the source code generated in Stage 3, and the output can be an object code or an executable binary. The operations of this stage are controlled by the ipmacc compilation flags.
Benchmarks.
We use benchmarks from the NVIDIA CUDA SDK [9] and the Rodinia benchmark suite [2]. The NVIDIA CUDA SDK includes a large set of CUDA test-cases, each implementing the massively-parallel body of an application in CUDA very efficiently. Each test-case also includes a serial C/C++ implementation. We developed an OpenACC version of these benchmarks over the serial C/C++ code. Rodinia is a GPGPU benchmark suite composed of a wide set of workloads implemented in C/C++. Originally, each of these benchmarks was implemented in the CUDA and OpenCL parallel models. Recently, an OpenACC implementation of the benchmarks has been added by a third party [13]. We include Dyadic Convolution and N-Body simulation from the CUDA SDK and the remaining benchmarks from Rodinia.
OpenACC Compiler.
We use our in-house framework, IPMACC, for compiling OpenACC applications. IPMACC translates OpenACC to either CUDA or OpenCL and executes OpenACC applications over the CUDA or OpenCL runtime (e.g., on NVIDIA or AMD GPUs). We validated the correctness of our framework by comparing the results of the OpenACC benchmarks against the serial and CUDA versions.
Performance evaluations.
We compile the OpenACC versions of the benchmarks with our framework and run them over the CUDA runtime. We compare these to the CUDA implementations available in the CUDA SDK and Rodinia. To evaluate performance, we report the total time of kernel execution, kernel launch, and memory transfer between the host and the accelerator. We use nvprof [10] for measuring these times. For kernel execution and memory transfer times, we report the time that nvprof reports after kernel/transfer completion. For kernel launch time, we report the time nvprof measures in calls to the cudaLaunch, cudaSetupArgument, and cudaConfigureCall API procedures. Every reported number is the harmonic mean of 30 independent runs.
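For reference, the measurements can be reproduced with invocations along these lines:

    nvprof ./benchmark                     # kernel execution and memory transfer times
    nvprof --print-api-trace ./benchmark   # per-call timing of cudaLaunch,
                                           # cudaSetupArgument, cudaConfigureCall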
Platforms.
We perform the evaluations on a CUDA-capable accelerator, an NVIDIA Tesla K20c, using NVIDIA CUDA 6.0 [9] as the CUDA implementation backend. The other specifications of the system are as follows: CPU: Intel Xeon E5-2620, RAM: 16 GB, operating system: Scientific Linux release 6.5 (Carbon) x86_64. We use GNU GCC 4.4.7 for compiling C/C++ files.
Figure 2: Comparing the execution time of OpenACC to highly-optimized CUDA implementations. Each bar shows the time that the application spends on memory transfer, kernel execution, and kernel launch overhead.
In this section, we compare a set of OpenACC applications to their highly-optimized CUDA versions. Our goal is to identify the OpenACC programming limitations responsible for the performance gap between OpenACC and CUDA. See the Methodology section for the applications, compilers, and hardware setup.
Figure 2 reports the execution time of the OpenACC applications, compared to their CUDA versions. The figure reports the breakdown of the time spent on the accelerator: kernel launch (launch), kernel execution (kernel), and memory transfer between the host and the accelerator (memory). Kernel launch time includes the time spent on setting kernel arguments and launching the kernel on the accelerator. In most cases, CUDA's kernel launch/execution portion is shorter than OpenACC's. Memory transfer times are generally comparable under CUDA and OpenACC. There are exceptions where the OpenACC memory transfers are faster (e.g., Backprop.) or where the kernel times of CUDA and OpenACC are equal (e.g., Nearest.). We investigate the differences between CUDA and OpenACC in the following sections.

Investigating Performance Gap
In this section, we discuss each application separately, providing insight into why the CUDA and OpenACC implementations presented in Figure 2 have different kernel launch, kernel execution, and memory transfer times.
Back Propagation.
Back Propagation (Backprop.) is a machine-learning algorithm used to train the weights of a three-layer neural network. In both the OpenACC and CUDA versions, there are six back-to-back serial operations, where the output of each stage is fed to the immediately following stage as input. Each stage can be performed in parallel on the accelerator. The OpenACC implementation performs faster memory transfers and slower kernel launches/executions, compared to CUDA. This is explained by the fact that the OpenACC version executes all six stages on the GPU, while the CUDA version alternates between the CPU and the GPU for execution. Alternating between the CPU and the GPU imposes extra memory transfer overhead.
BFS.
BFS visits all the nodes in a graph and computes the visiting cost of each node. Each node is visited only once. Parallel threads of a kernel visit the nodes belonging to the same graph depth concurrently, and the algorithm traverses the depths iteratively. The operation stops once there is no child left to visit. Compared to the CUDA version, the OpenACC version of BFS spends less time on memory transfers. This can be explained by the fact that the OpenACC version performs data initialization on the GPU, whereas the CUDA version initializes the inputs on the host and transfers them to the GPU. Compared to the CUDA version, OpenACC spends more time on kernel execution, since it forces a debilitating reduction on a global variable. The global variable is a boolean indicating whether any nodes remain to visit. CUDA avoids the global reduction by initializing the variable to FALSE on the host and imposing a control-flow divergence in the kernel to guard the global variable from FALSE writes (allowing only TRUE writes).
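The CUDA guard is conceptually the following idiom (a simplified sketch with made-up names, not the literal Rodinia code):

    __global__ void visit_level(const bool *frontier, int n, bool *d_more) {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid < n && frontier[tid]) {
            /* ... visit this node's children; if an unvisited child
               is found, record it for the next level and then: */
            *d_more = true;  /* benign race: all writers store TRUE,
                                so no reduction or atomic is needed */
        }
    }

The host resets d_more to FALSE before each launch and re-launches the kernel while the flag copied back is TRUE.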
Dyadic Convolution.
Dyadic Convolution (dyadic.) is an algebraic operation calculating the XOR-convolution of two sequences. The OpenACC implementation parallelizes the output calculations, where each thread calculates one output element. Although this implementation is fast to develop, it exhibits a high number of irregular memory accesses. To mitigate the irregular memory accesses, the CUDA version uses the Fast Walsh-Hadamard Transform (FWHT) to implement the dyadic convolution (as described in [1]). As reported in Figure 2, both the OpenACC and CUDA versions spend almost the same amount of time on memory transfers. While the CUDA version launches several kernels, OpenACC launches only one kernel. This explains why the CUDA version imposes a higher kernel launch overhead. In CUDA, the kernels' execution time is 82% shorter than in OpenACC. This is due to the fact that the CUDA version uses FWHT to mitigate irregular memory accesses. Although OpenACC can implement the dyadic convolution using FWHT, the same FWHT algorithm used in CUDA cannot be implemented in OpenACC. The CUDA FWHT uses shared memory to share intermediate writes locally between neighbor threads, which is not possible under the OpenACC standard.
Hotspot.
Hotspot simulates chip characteristics to model the temperature of individual units. At every iteration, the algorithm reads the temperature and power consumption of each unit and calculates new temperatures. Although both OpenACC and CUDA spend the same amount of time on memory transfers, the CUDA kernel is much faster. In Hotspot, the temperature of each unit depends on its power consumption and its neighbors' temperatures. The CUDA kernel exploits this behavior to localize the communication and reduce global memory accesses as follows. In CUDA, threads of the same thread block calculate the temperatures of neighbor units, and the new temperatures of neighbor units are updated locally by the threads of that block. This local communication reduces the number of kernel launches used to synchronize the temperatures across all thread blocks, explaining why the CUDA version spends less time on kernel launches and comes with a shorter execution time. In OpenACC, unlike CUDA, the software-managed cache cannot be exploited for local communication. Hence, in OpenACC there is a higher number of global synchronizations and kernel launches, which in turn harms performance.
Matrix Multiplication.
Matrix Multiplication (Matrix Mul.) multiplies two 1024x1024 matrices. Both the CUDA and OpenACC implementations use output parallelization, calculating each element of the output matrix in parallel. The CUDA version differs from OpenACC in that it processes the input matrices tile by tile. By processing in tiles, the CUDA version fetches the input tiles into the software-managed cache in a few well-coalesced accesses and shares the tiles among the threads of the same thread block. While the kernel launch and memory transfer times are nearly the same under CUDA and OpenACC, the CUDA kernel time is much lower than OpenACC's. The CUDA version takes advantage of the software-managed cache in two ways. First, it merges the data required by a thread block and fetches them once, minimizing redundant memory accesses across the threads of the same thread block. Second, the software-managed cache removes cache conflict misses, since the replacement policy is controlled by the programmer. Under OpenACC, although the threads have very high spatial locality, parsing the matrix one row at a time highly pollutes the cache, yielding a high number of conflict misses. Having multiple thread blocks per SM exacerbates this effect.
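The tiling of the CUDA version follows the well-known shared-memory pattern sketched below (a simplified version; it assumes len is a multiple of the tile size):

    #define TILE 16
    __global__ void matrixMulTiled(const int *a, const int *b, int *c, int len) {
        __shared__ int a_s[TILE][TILE];  /* software-managed cache for a tile of a */
        __shared__ int b_s[TILE][TILE];  /* and a tile of b */
        int i = blockIdx.y * TILE + threadIdx.y;
        int j = blockIdx.x * TILE + threadIdx.x;
        int sum = 0;
        for (int t = 0; t < len / TILE; ++t) {
            /* each tile is fetched once, with coalesced accesses, then
               reused by all threads of the block */
            a_s[threadIdx.y][threadIdx.x] = a[i * len + t * TILE + threadIdx.x];
            b_s[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * len + j];
            __syncthreads();
            for (int l = 0; l < TILE; ++l)
                sum += a_s[threadIdx.y][l] * b_s[l][threadIdx.x];
            __syncthreads();
        }
        c[i * len + j] = sum;
    }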
N-Body simulation.
N-Body models a system of particles under the influence of the gravity force. In each timestep, operations of O(N^2) complexity are performed (for a system of N particles) to calculate the forces between all pairs of particles. There are inherently many redundant memory reads, since the mass and position of each particle are fetched N-1 times to calculate its interaction with the other particles. While the CUDA and OpenACC memory transfers take about the same time, the CUDA kernels are much faster. The CUDA version tiles the computations to reduce redundant memory reads [11], exploiting shared memory to share the particles among all threads of a thread block. In OpenACC, however, the redundant memory accesses are not filtered out by the software-managed cache. As reported, redundant memory accesses can degrade performance significantly.

Nearest Neighbor.
Nearest Neighbor (Nearest.) finds the five closest points to a target position. The Euclidean distance between the target position and each of the points is calculated, and the top five points with the lowest distance are returned. The OpenACC and CUDA versions both calculate the Euclidean distances of all points in parallel, and they spend about the same time on kernel launch, kernel execution, and memory transfer. This is explained by the similarity of the parallelization methods applied in the two versions.
Needleman-Wunsch.
Needleman-Wunsch (Needle.) is a sequence-alignment algorithm used in bioinformatics. Both the CUDA and OpenACC versions traverse a 2D matrix and update the costs. Upon updating a cost, four memory locations are read and one location is written. Although both versions spend the same amount of time on memory transfers, the CUDA kernel launches/executions are much faster than the OpenACC kernels. The CUDA version fetches a chunk of the cost matrix into shared memory and traverses the matrix at the shared memory bandwidth. This mechanism comes with three advantages: i) filtering redundant global memory accesses through shared memory, ii) minimizing global communication by sharing intermediate results stored in shared memory, and iii) reducing the number of kernel launches and global communications. The lower number of kernel launches explains why the launch time of CUDA is much lower than that of OpenACC.
Pathfinder.
In the Pathfinder (Pathfin.) kernel, every working element iteratively finds the minimum of three consecutive elements in an array. The CUDA version of Pathfinder performs two optimizations: i) finding the minimum by accessing the data from shared memory, and ii) sharing the updated minimum locally among neighbor threads for a certain number of iterations and only then reflecting the changes globally to the other threads. Such local communication reduces the number of global synchronizations and kernel launches. OpenACC's API, however, is not flexible enough to allow the programmer to exploit shared memory in a similar way. Consequently, neighbor threads in the OpenACC version do not communicate via shared memory: each thread fetches the same data multiple times, and threads communicate only through global memory. Communication through global memory is implemented through consecutive kernel launches. This explains why OpenACC imposes a higher kernel launch overhead.
Speckle reducing anisotropic diffusion.
Speckle reducing anisotropic diffusion (SRAD) is an image processing benchmark performing noise reduction through partial differential equations, iteratively. Compared to CUDA, the kernel time of the OpenACC version is lower. The iterative computation body of this benchmark consists of three code blocks: one reduction region and two data-parallel computations. Our evaluation shows that the OpenACC version performs 5% slower than CUDA on the two data-parallel computations. However, OpenACC outperforms CUDA in executing the reduction portion. This is explained by the difference in the reduction implementations. Our OpenACC framework performs the reduction in two levels: reducing across the threads of a thread block on the GPU, and reducing across the thread blocks on the CPU. In the CUDA version, however, the reduction is performed by multiple serial kernel launches, all on the GPU. The OpenACC version spends less time on executing the kernel, as part of the computation is carried out on the host. Meanwhile, performing the reduction in two levels imposes the overhead of copying the intermediate data from the GPU to the CPU. This explains why the OpenACC version spends slightly more time on memory transfers and less time on kernel launches/executions.
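The two-level scheme can be sketched as follows (a simplified illustration of the generated code; blockDim.x is assumed to be a power of two):

    __global__ void reduce_partial(const float *in, float *block_sums, int n) {
        extern __shared__ float s[];              /* per-block scratch */
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;
        s[tid] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                s[tid] += s[tid + stride];        /* tree reduction in a block */
            __syncthreads();
        }
        if (tid == 0)
            block_sums[blockIdx.x] = s[0];        /* one partial sum per block */
    }
    /* Host side: block_sums is copied back (the extra transfer noted above)
       and the final reduction across blocks runs serially on the CPU. */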
Reyes et al. [12] introduce an open-source tool, named accULL, to execute OpenACC applications on accelerators. The tool consists of a source-to-source compiler and a runtime library. The compiler translates OpenACC notations to the runtime library routines, which are implemented in both CUDA and OpenCL. Tian et al. [14] introduce an OpenACC implementation integrated into OpenUH [5]. They evaluate the impact of mapping loop iterations over GPU parallel work-items.

Lee and Vetter [4] introduce a framework for compiling, debugging, and profiling OpenACC applications. They also introduce openarc directives allowing the OpenACC programmer to map OpenACC arrays to CUDA memory spaces, including the shared and texture memory spaces. They do not investigate the effectiveness of their proposal for these mappings. Based on the short introduction that they present, we believe their proposal for utilizing shared memory differs from ours in two ways. First, while the openarc directive requires the programmer to separate the shared memory array and the corresponding global memory array in the code, fcw separates the arrays automatically, based on the information provided by the programmer. Second, while the openarc directive gives the OpenACC programmer fine-grained control over fetch, synchronization, and writeback, fcw handles fetch, synchronization, and writeback implicitly. Based on these differences, we consider fcw a high-level proposal for utilizing the software-managed cache (SMC) and openarc a low-level, fine-grained control over the SMC.

Nakao et al. [7] introduce XACC as an alternative to the MPI+OpenACC programming model to harness the processing power of clusters of accelerators. XACC offers higher productivity, since the XACC abstractions reduce the programming effort. Under small and medium problem sizes, XACC performs up to 2.7 times faster than MPI+OpenACC. This higher performance comes from the PEACH2 interface through which XACC communicates; PEACH2 performs faster than GPUDirect RDMA over InfiniBand for data transfer sizes below 256KB. As the problem size increases, XACC and MPI+OpenACC perform comparably, since the latencies of PEACH2 and GPUDirect RDMA over InfiniBand become equal.