
Publication


Featured research published by Rengan Xu.


Languages and Compilers for Parallel Computing | 2013

Compiling a High-level Directive-Based Programming Model for GPGPUs

Xiaonan Tian; Rengan Xu; Yonghong Yan; Zhifeng Yun; Sunita Chandrasekaran; Barbara M. Chapman

OpenACC is an emerging directive-based programming model for programming accelerators that enables non-expert programmers to achieve portable and productive performance for their applications. In this paper, we present the research and development challenges, and our solutions, in creating an open-source OpenACC compiler in a mainstream compiler framework (OpenUH, a branch of Open64). We discuss in detail our loop mapping techniques, i.e. how to distribute loop iterations over the GPGPU's threading architecture, as well as their impact on performance. The runtime support for this programming model is also presented. The compiler was evaluated with several commonly used benchmarks, and delivered performance similar to that obtained using a commercial compiler. We hope this implementation will serve as compiler infrastructure for researchers to explore advanced compiler techniques, to extend OpenACC to other programming languages, or to build performance tools used with OpenACC programs.


IEEE International Conference on High Performance Computing, Data and Analytics | 2014

SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance

Guido Juckeland; William C. Brantley; Sunita Chandrasekaran; Barbara M. Chapman; Shuai Che; Mathew E. Colgrove; Huiyu Feng; Alexander Grund; Robert Henschel; Wen-mei W. Hwu; Huian Li; Matthias S. Müller; Wolfgang E. Nagel; Maxim Perminov; Pavel Shelepugin; Kevin Skadron; John A. Stratton; Alexey Titov; Ke Wang; G. Matthijs van Waveren; Brian Whitney; Sandra Wienke; Rengan Xu; Kalyan Kumaran

Hybrid nodes with hardware accelerators are becoming very common in systems today. Users often find it difficult to characterize and understand the performance advantage of such accelerators for their applications. The SPEC High Performance Group (HPG) has developed a set of performance metrics to evaluate the performance and power consumption of accelerators for various science applications. The new benchmark comprises two suites of applications written in OpenCL and OpenACC and measures the performance of accelerators with respect to a reference platform. The first set of published results demonstrates the viability and relevance of the new metrics in comparing accelerator performance. This paper discusses the benchmark suites and selected published results in great detail.


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

Exploring Programming Multi-GPUs Using OpenMP and OpenACC-Based Hybrid Model

Rengan Xu; Sunita Chandrasekaran; Barbara M. Chapman

Heterogeneous computing comes with tremendous potential and is a leading candidate for scientific applications that are becoming more and more complex. Accelerators such as GPUs, whose computing momentum is growing faster than ever, offer application performance gains when the compute-intensive portions of an application are offloaded to them. It is quite evident that future computing architectures are moving towards hybrid systems consisting of multiple GPUs and multi-core CPUs. A variety of high-level languages and software tools can simplify programming these systems. Directive-based programming models are being embraced since they not only ease programming of complex systems but also abstract low-level details from the programmer. OpenMP has long made programming CPUs easy and portable. Similarly, OpenACC is a directive-based programming model for accelerators that is gaining popularity, since directives play an important role in developing portable software for GPUs. A combination of OpenMP and OpenACC, a hybrid model, is a plausible solution for porting scientific applications to heterogeneous architectures, especially when there is more than one GPU on a single node. However, OpenACC, which targets accelerators, does not yet support multiple GPUs; using OpenMP, we can conveniently exploit features such as for and sections to distribute compute-intensive kernels to more than one GPU. We demonstrate the effectiveness of this hybrid approach with several case studies in this paper.


Languages and Compilers for Parallel Computing | 2014

NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model

Rengan Xu; Xiaonan Tian; Sunita Chandrasekaran; Yonghong Yan; Barbara M. Chapman

The broad adoption of accelerators boosts the interest in accelerator programming. Accelerators such as GPGPUs are optimized for throughput and offer high GFLOPS and memory bandwidth. CUDA has been adopted quite rapidly, but it is proprietary and only applicable to GPUs, and the difficulty of writing efficient CUDA code has kindled the necessity for higher-level programming approaches such as OpenACC. Directive-based programming models such as OpenMP and OpenACC offer programmers an option to rapidly create prototype applications by adding annotations that guide compiler optimizations. In this paper we study the effectiveness of a high-level directive-based programming model, OpenACC, for parallelizing the NAS Parallel Benchmarks (NPB) on GPGPUs. We present the application of techniques such as array privatization, memory coalescing, and cache optimization, and examine their impact on the performance of the benchmarks. The right choice or combination of techniques/hints is crucial for compilers to generate highly efficient code tuned to a particular type of accelerator; a poorly selected combination can lead to degraded performance. We also propose a new clause, 'scan', that handles scan operations for arbitrary input array sizes. We hope that the practices discussed in this paper will provide useful guidance to users who want to effectively migrate their sequential/CPU-parallel codes to GPGPU architectures and achieve optimal performance.


International Parallel and Distributed Processing Symposium | 2014

A Validation Testsuite for OpenACC 1.0

Cheng Wang; Rengan Xu; Sunita Chandrasekaran; Barbara M. Chapman; Oscar R. Hernandez

Directive-based programming models provide a high level of abstraction, hiding complex low-level details of the underlying hardware from the programmer. One such model is OpenACC, a portable programming model that allows programmers to write applications that offload portions of work from a host CPU to an attached accelerator (a GPU or similar device). The model is gaining popularity and is being used to accelerate many types of applications, ranging from molecular dynamics codes to particle physics models. It is critical to evaluate the correctness of OpenACC implementations and determine their conformance to the specification. In this paper, we present a robust and scalable testing infrastructure that serves this purpose. We worked very closely with the three main vendors that offer compiler support for OpenACC and assisted them in identifying and resolving compiler bugs, helping them improve the quality of their compilers. The testsuite also aims to identify and resolve ambiguities within the OpenACC specification. It has been integrated into the harness infrastructure of the TITAN machine at Oak Ridge National Lab and is being used in production. The testsuite consists of test cases for all the directives and clauses of OpenACC, for both C and Fortran. The version discussed in this paper focuses on the OpenACC 1.0 feature set, but the framework is robust enough to create test cases for 2.0 and future releases. This work is in progress.


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

Filesystem Aware Scalable I/O Framework for Data-Intensive Parallel Applications

Rengan Xu; Mauricio Araya-Polo; Barbara M. Chapman

The growing speed gap between CPU and memory makes I/O the main bottleneck of many industrial applications. Some applications need to perform I/O operations on very large volumes of data frequently, which seriously harms performance. This work's motivation is geophysical applications used for oil and gas exploration. These applications process terabyte-size datasets in HPC facilities; the datasets represent subsurface models and field-recorded data. In general terms, these applications read huge amounts of data as inputs and write huge amounts as intermediate/final results, where the underlying algorithms implement seismic imaging techniques. Traditional sequential I/O, even when coupled with advanced storage systems, cannot complete all I/O operations on such large volumes of data in an acceptable time. Parallel I/O is the general strategy for solving such problems. However, because of the dynamic nature of many of these applications, each parallel process does not know the size of the data it needs to write until its computation is done, and it also cannot identify the position in the file at which to write. In order to write correctly and efficiently, communication and synchronization are required among all processes to fully exploit the parallel I/O paradigm. To tackle these issues, we use a dynamic load balancing framework that is general enough for most of these applications, and to reduce the expensive synchronization and communication overhead, we introduce an I/O node that only handles I/O requests and lets compute nodes perform I/O operations in parallel. Using both POSIX I/O and memory-mapping interfaces, our experiments indicate that the approach is scalable. For instance, with 16 processes, the bandwidth of parallel reading can reach the theoretical peak performance (2.5 GB/s) of the storage infrastructure, and parallel writing can be up to 4.68x (POSIX I/O) and 7.23x (memory-mapping) faster than the serial I/O implementation. Since most geophysical applications are I/O bound, these results positively impact the overall performance of the application and confirm the chosen strategy as a path to follow.


Concurrency and Computation: Practice and Experience | 2016

Compiler transformation of nested loops for general purpose GPUs

Xiaonan Tian; Rengan Xu; Yonghong Yan; Sunita Chandrasekaran; Deepak Eachempati; Barbara M. Chapman

Manycore accelerators have the potential to significantly improve the performance of scientific applications when computationally intensive program portions are offloaded to them. Directive-based high-level programming models, such as OpenACC and OpenMP, are used to create applications for accelerators by annotating the regions of code meant for offloading. OpenACC is an emerging directive-based programming model for accelerators that enables inexperienced programmers to achieve portable and productive performance within applications. In this paper, we present the research and development challenges, and our solutions, in creating an open-source OpenACC compiler in an industrial framework (OpenUH as a branch of Open64). We then discuss in detail the techniques we developed for loop scheduling of reduction operations on general purpose GPUs. The compiler is evaluated with benchmarks from the NAS Parallel Benchmarks suite and self-written micro-benchmarks for reduction operations. This implementation has been designed to serve as a compiler infrastructure for researchers to explore advanced compiler techniques, extend OpenACC to other programming models, and build performance tools used in conjunction with OpenACC programs.


International Parallel and Distributed Processing Symposium | 2017

Implementing the OpenACC Data Model

Michael Wolfe; Seyong Lee; Jungwon Kim; Xiaonan Tian; Rengan Xu; Sunita Chandrasekaran; Barbara M. Chapman

Programming accelerators today usually requires managing separate virtual and physical memories, such as allocating space in and copying data between host and device memories. The OpenACC API provides data directives and clauses to control this behavior where it is required. This paper describes how the data model is supported in current OpenACC implementations, ranging from research compilers (OpenUH and OpenARC) to a commercial compiler (the PGI OpenACC compiler). This includes implementation of the data directives and clauses, testing whether the data is already present on the device, OpenCL support, managing asynchronous data transfers and memory allocations, handling aliased data, reusing device memory, managing partially present data, and support for shared memory between host and device. Lastly, it also discusses on-going work to manage large, complex dynamic data structures.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

An Analytical Model-Based Auto-tuning Framework for Locality-Aware Loop Scheduling

Rengan Xu; Sunita Chandrasekaran; Xiaonan Tian; Barbara M. Chapman

HPC developers aim to deliver the very best performance; to do so they constantly think about memory bandwidth, memory hierarchy, locality, floating point performance, power/energy constraints, and so on. Application scientists, on the other hand, aim to write performance-portable code while exploiting the rich feature set of the hardware. By providing adequate hints to the compiler in the form of directives, appropriate executable code is generated. There are tremendous benefits to using directive-based programming. However, applications are becoming more and more complex, and we need sophisticated tools such as auto-tuning to better explore the optimization space. Loops typically form a major and time-consuming portion of an application. Scheduling these loops involves mapping the loop iteration space to the underlying platform, for example GPU threads. The user tries different scheduling techniques until the best one is identified, but this process can be quite tedious and time consuming, especially for a relatively large application, as the user needs to record the performance of every schedule's run. This paper offers a better solution: an auto-tuning framework that adopts an analytical model guiding the compiler and the runtime to automatically choose an appropriate schedule for the loops and determine the launch configuration for each loop schedule. Our experiments show that the loop schedule predicted by our framework achieves a speedup of 1.29x on average over the default loop schedule chosen by the compiler.


Proceedings of the First Workshop on Accelerator Programming using Directives | 2014

Accelerating Kirchhoff migration on GPU using directives

Rengan Xu; Maxime R. Hugues; Henri Calandra; Sunita Chandrasekaran; Barbara M. Chapman

Accelerators offer the potential to significantly improve the performance of scientific applications when compute-intensive portions of programs are offloaded to them. However, effectively tapping their full potential is difficult owing to the programmability challenges users face when mapping computation algorithms to massively parallel architectures such as GPUs. Directive-based programming models offer programmers an option to rapidly create prototype applications by annotating regions of code for offloading with hints to the compiler, which is critical for improving productivity in production code. In this paper, we study the effectiveness of a high-level directive-based programming model, OpenACC, for parallelizing a seismic migration application called Kirchhoff Migration on GPU architectures. Kirchhoff Migration is a real-world production code in the oil and gas industry. Because it is compute intensive, we focus on the computation part and explore different mechanisms to effectively harness the GPU's computation capabilities and memory hierarchy. We also analyze different loop transformation techniques in different OpenACC compilers and compare their performance. Compared to one socket (10 CPU cores) on the experimental platform, one GPU achieved maximum speedups of 20.54x and 6.72x for the interpolation and extrapolation kernel functions, respectively.

Collaboration


Dive into Rengan Xu's collaborations.

Top Co-Authors


Jungwon Kim

Oak Ridge National Laboratory


Seyong Lee

Oak Ridge National Laboratory


Alexander Grund

Dresden University of Technology


Guido Juckeland

Helmholtz-Zentrum Dresden-Rossendorf
