Publication


Featured research published by Xiaonan Tian.


Languages and Compilers for Parallel Computing | 2013

Compiling a High-level Directive-Based Programming Model for GPGPUs

Xiaonan Tian; Rengan Xu; Yonghong Yan; Zhifeng Yun; Sunita Chandrasekaran; Barbara M. Chapman

OpenACC is an emerging directive-based programming model for accelerators that enables non-expert programmers to achieve portable, productive performance in their applications. In this paper, we present the research and development challenges, and our solutions, in creating an open-source OpenACC compiler within a mainstream compiler framework (OpenUH, a branch of Open64). We discuss in detail our loop-mapping techniques, i.e. how loop iterations are distributed over the GPGPU's threading architecture, as well as their impact on performance. The runtime support for this programming model is also presented. The compiler was evaluated with several commonly used benchmarks and delivered performance similar to that obtained using a commercial compiler. We hope this implementation serves as compiler infrastructure for researchers to explore advanced compiler techniques, extend OpenACC to other programming languages, or build performance tools for use with OpenACC programs.
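
As a concrete illustration of the loop mapping described above, the sketch below (generic OpenACC, not code from the paper; the array names and sizes are invented) distributes the outer loop across gangs, which typically map to GPU thread blocks, and the inner loop across vector lanes, which map to threads. The paper evaluates several such mapping schemes; this shows only the basic form.

    #include <stdio.h>
    #define N 1024

    int main(void) {
        static float a[N][N], b[N][N];

        /* Outer loop -> gangs (thread blocks); inner loop -> vector lanes (threads). */
        #pragma acc parallel loop gang copyout(a) copyin(b)
        for (int i = 0; i < N; i++) {
            #pragma acc loop vector
            for (int j = 0; j < N; j++) {
                a[i][j] = 2.0f * b[i][j];
            }
        }

        printf("a[0][0] = %f\n", a[0][0]);
        return 0;
    }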


Languages and Compilers for Parallel Computing | 2014

NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model

Rengan Xu; Xiaonan Tian; Sunita Chandrasekaran; Yonghong Yan; Barbara M. Chapman

The broad adoption of accelerators has boosted interest in accelerator programming. Accelerators such as GPGPUs are optimized for throughput and offer high GFLOPS and memory bandwidth. CUDA has been adopted quite rapidly, but it is proprietary and applicable only to GPUs, and the difficulty of writing efficient CUDA code has kindled the need for higher-level programming approaches such as OpenACC. Directive-based programming models such as OpenMP and OpenACC offer programmers an option to rapidly create prototype applications by adding annotations that guide compiler optimizations. In this paper we study the effectiveness of a high-level directive-based programming model, OpenACC, for parallelizing the NAS Parallel Benchmarks (NPB) on GPGPUs. We present the application of techniques such as array privatization, memory coalescing, and cache optimization, and examine their impact on the performance of the benchmarks. The right choice or combination of techniques/hints is crucial for compilers to generate highly efficient code tuned to a particular type of accelerator; a poorly selected combination can lead to degraded performance. We also propose a new clause, 'scan', that handles scan operations for arbitrary input array sizes. We hope that the practices discussed in this paper will provide useful guidance for users to effectively migrate their sequential/CPU-parallel codes to GPGPU architectures and achieve optimal performance.
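
To make one of the named techniques concrete, the hedged sketch below (generic code, not taken from NPB; x, tmp, and N are invented) shows array privatization in OpenACC: without the private clause, all gangs would race on the shared scratch array tmp. Running the vector loop over the contiguous index j also yields coalesced memory accesses, another technique the abstract mentions.

    #define N 512
    float x[N][N];

    void smooth(void) {
        float tmp[N];  /* scratch array; privatized so each gang gets its own copy */
        #pragma acc parallel loop gang private(tmp) copy(x)
        for (int i = 0; i < N; i++) {
            #pragma acc loop vector  /* consecutive j -> consecutive threads: coalesced */
            for (int j = 0; j < N; j++) {
                tmp[j] = 0.5f * x[i][j];
            }
            #pragma acc loop vector
            for (int j = 0; j < N; j++) {
                x[i][j] = tmp[j];
            }
        }
    }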


Concurrency and Computation: Practice and Experience | 2016

Compiler transformation of nested loops for general purpose GPUs

Xiaonan Tian; Rengan Xu; Yonghong Yan; Sunita Chandrasekaran; Deepak Eachempati; Barbara M. Chapman

Manycore accelerators can significantly improve the performance of scientific applications by offloading computationally intensive program portions to the accelerator. Directive-based high-level programming models, such as OpenACC and OpenMP, create applications for accelerators by annotating the regions of code meant for offloading. OpenACC is an emerging directive-based programming model for accelerators that enables inexperienced programmers to achieve portable, productive performance in their applications. In this paper, we present the challenges we faced, and our solutions, in creating an open-source OpenACC compiler within an industrial framework (OpenUH, a branch of Open64). We then discuss in detail the techniques we developed for loop scheduling and reduction operations on general-purpose GPUs. The compiler is evaluated with benchmarks from the NAS Parallel Benchmarks suite and self-written micro-benchmarks for reduction operations. This implementation has been designed to serve as a compiler infrastructure for researchers to explore advanced compiler techniques, extend OpenACC to other programming models, and build performance tools used in conjunction with OpenACC programs.
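
For readers unfamiliar with the reduction operations mentioned here, the minimal sketch below (my own example, not one of the paper's micro-benchmarks) shows a standard OpenACC sum reduction. The compiler must lower the reduction clause into a parallel tree reduction across gangs and vector lanes, and how that lowering interacts with the chosen loop schedule is the question the paper examines.

    #include <stdio.h>
    #define N (1 << 20)

    int main(void) {
        static float a[N];
        float sum = 0.0f;

        for (int i = 0; i < N; i++) {
            a[i] = 1.0f;
        }

        /* The reduction clause directs the compiler to combine per-thread
           partial sums into a single result. */
        #pragma acc parallel loop reduction(+:sum) copyin(a)
        for (int i = 0; i < N; i++) {
            sum += a[i];
        }

        printf("sum = %f (expected %d)\n", sum, N);
        return 0;
    }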


International Parallel and Distributed Processing Symposium | 2017

Implementing the OpenACC Data Model

Michael Wolfe; Seyong Lee; Jungwon Kim; Xiaonan Tian; Rengan Xu; Sunita Chandrasekaran; Barbara M. Chapman

Programming accelerators today usually requires managing separate virtual and physical memories, such as allocating space in and copying data between host and device memories. The OpenACC API provides data directives and clauses to control this behavior where it is required. This paper describes how the data model is supported in current OpenACC implementations, ranging from research compilers (OpenUH and OpenARC) to a commercial compiler (the PGI OpenACC compiler). This includes implementation of the data directives and clauses, testing whether the data is already present on the device, OpenCL support, managing asynchronous data transfers and memory allocations, handling aliased data, reusing device memory, managing partially present data, and support for shared memory between host and device. Lastly, it discusses ongoing work to manage large, complex dynamic data structures.
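
A minimal sketch of the data directives and the present clause under discussion, using standard OpenACC on invented arrays u and v (this illustrates the model the paper's implementations support, not code from the paper):

    #define N 4096
    float u[N], v[N];

    void axpy(float alpha) {
        /* 'present' asserts the arrays are already on the device, so the
           runtime consults its present table instead of copying. */
        #pragma acc parallel loop present(u, v)
        for (int i = 0; i < N; i++) {
            u[i] += alpha * v[i];
        }
    }

    void driver(void) {
        /* copy(u): to device at region entry, back to host at exit.
           copyin(v): to device at entry only. */
        #pragma acc data copy(u) copyin(v)
        {
            axpy(2.0f);
            axpy(3.0f);  /* reuses the device copies; no extra transfers */
        }
    }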


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

An Analytical Model-Based Auto-tuning Framework for Locality-Aware Loop Scheduling

Rengan Xu; Sunita Chandrasekaran; Xiaonan Tian; Barbara M. Chapman

HPC developers aim to deliver the very best performance. To do so, they constantly think about memory bandwidth, the memory hierarchy, locality, floating-point performance, and power/energy constraints. Application scientists, on the other hand, aim to write performance-portable code while exploiting the rich feature set of the hardware. By providing adequate hints to the compiler in the form of directives, appropriate executable code is generated. Directive-based programming offers tremendous benefits; however, applications are becoming more and more complex, and we need sophisticated tools such as auto-tuning to better explore the optimization space. Loops typically form the major, most time-consuming portion of an application. Scheduling these loops involves mapping the loop iteration space onto the underlying platform, for example GPU threads. The user tries different scheduling techniques until the best one is identified. However, this process can be quite tedious and time-consuming, especially for a relatively large application, as the user needs to record the performance of every schedule's run. This paper offers a better solution: an auto-tuning framework that adopts an analytical model to guide the compiler and the runtime in automatically choosing an appropriate schedule for each loop and determining the launch configuration for each loop schedule. Our experiments show that the loop schedule predicted by our framework achieves a speedup of 1.29x on average over the default loop schedule chosen by the compiler.
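
To make "loop schedule" concrete, the two variants below (illustrative only; the framework's analytical model and search space are described in the paper, and the arrays and sizes here are invented) express alternative OpenACC schedules and launch configurations for the same nested loop. Choosing among such variants and their launch configurations is what the proposed framework automates.

    #define N 2048
    float a[N][N], c[N][N];

    void scale_v1(void) {
        /* Schedule 1: outer loop over gangs, inner loop over vector lanes. */
        #pragma acc parallel loop gang vector_length(128)
        for (int i = 0; i < N; i++) {
            #pragma acc loop vector
            for (int j = 0; j < N; j++) {
                c[i][j] = 2.0f * a[i][j];
            }
        }
    }

    void scale_v2(void) {
        /* Schedule 2: collapse both loops and spread the combined iteration
           space over gangs and vector lanes, with an explicit launch size. */
        #pragma acc parallel loop gang vector collapse(2) num_gangs(256)
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                c[i][j] = 2.0f * a[i][j];
            }
        }
    }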


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Performance and Power Characteristics of Matrix Multiplication Algorithms on Multicore and Shared Memory Machines

Yonghong Yan; Jeremy Kemp; Xiaonan Tian; Abid Muslim Malik; Barbara M. Chapman

For many scientific applications, dense matrix multiplication is one of the most important and computation-intensive linear algebra operations. An efficient matrix multiplication on high-performance parallel computers requires optimizing how matrices are decomposed and exchanged between computational nodes, to reduce communication and synchronization overhead, as well as efficiently exploiting the memory hierarchy within a node to improve both spatial and temporal data locality. In this paper, we present our study of the performance, cache behavior, and energy efficiency of multiple parallel matrix multiplication algorithms on a multicore desktop computer and a medium-size shared-memory machine, both considered reference node sizes for building medium- and large-scale computational clusters for high-performance computing in industry and national laboratories. Our results highlight both performance and energy efficiency, and also indicate the memory and resource pressures of these algorithms. We hope this helps users choose the appropriate implementation for their specific data sets when composing larger-scale scientific applications that use parallel matrix multiplication kernels on a node.
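
As a generic example of the node-level locality techniques such a study concerns, the sketch below (not one of the paper's measured implementations; N and the block size BS are invented tunables) is a cache-blocked, OpenMP-parallel matrix multiply: blocking keeps a BS x BS working set in cache to improve temporal locality, while the parallel loop spreads tiles across cores.

    #define N  1024
    #define BS 64  /* block size; tune to the cache hierarchy */

    /* Assumes C is zero-initialized and N is a multiple of BS. */
    void matmul_blocked(const float *A, const float *B, float *C) {
        #pragma omp parallel for collapse(2)
        for (int ii = 0; ii < N; ii += BS) {
            for (int jj = 0; jj < N; jj += BS) {
                for (int kk = 0; kk < N; kk += BS) {
                    for (int i = ii; i < ii + BS; i++) {
                        for (int k = kk; k < kk + BS; k++) {
                            float aik = A[i * N + k];  /* reused across the j loop */
                            for (int j = jj; j < jj + BS; j++) {
                                C[i * N + j] += aik * B[k * N + j];
                            }
                        }
                    }
                }
            }
        }
    }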


Programming Models and Applications for Multicores and Manycores | 2017

Assessing One-to-One Parallelism Levels Mapping for OpenMP Offloading to GPUs

Chen Shen; Xiaonan Tian; Dounia Khaldi; Barbara M. Chapman

The proliferation of accelerators in modern clusters makes efficient coprocessor programming a key requirement if application codes are to achieve high levels of performance with acceptable energy consumption on such platforms. This has led to considerable effort to provide suitable programming models for these accelerators, especially within the OpenMP community. While OpenMP 4.5 offers a rich set of directives, clauses, and runtime calls to fully utilize accelerators, an efficient implementation of OpenMP 4.5 for GPUs remains a non-trivial task, given their multiple levels of thread parallelism. In this paper, we describe a new implementation of the corresponding features of OpenMP 4.5 for GPUs, based on a one-to-one mapping of its loop hierarchy parallelism onto the GPU thread hierarchy. We assess the impact of this mapping, in particular the use of GPU warps to handle innermost-loop execution, on the performance of GPU execution via a set of benchmarks that includes a version of the NAS Parallel Benchmarks developed specifically for this research, as well as matrix-matrix multiplication, Jacobi, Gauss, and Laplacian kernels.
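
The one-to-one mapping can be hinted at in standard OpenMP 4.5 (a sketch under invented names; in the paper's scheme the innermost simd level is executed by warp lanes, a choice generic code like this cannot express directly):

    #define N 1024
    float a[N][N], b[N][N];

    void scale(void) {
        /* teams -> thread blocks; parallel for -> threads; simd -> (here) warp lanes. */
        #pragma omp target teams distribute map(tofrom: a) map(to: b)
        for (int i = 0; i < N; i++) {
            #pragma omp parallel for simd
            for (int j = 0; j < N; j++) {
                a[i][j] = 2.0f * b[i][j];
            }
        }
    }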


International Conference on Parallel Processing | 2016

Optimizing GPU Register Usage: Extensions to OpenACC and Compiler Optimizations

Xiaonan Tian; Dounia Khaldi; Deepak Eachempati; Rengan Xu; Barbara M. Chapman

Using compiler directives to program accelerator-based systems through APIs such as OpenACC or OpenMP has gained increasing popularity due to the portability and productivity advantages it offers. However, compared with the performance typically achieved through lower-level programming interfaces such as CUDA or OpenCL, directive-based approaches may entail a significant performance penalty. To support massively parallel computations, accelerators such as GPGPUs offer an expansive set of registers, larger than even the L1 cache, to hold the temporary state of each thread. Scalar variables are the most likely candidates to be assigned to these registers by the compiler. Hence, scalar replacement is a key enabling optimization for effectively improving the utilization of register files on accelerator devices, thereby substantially reducing the cost of memory operations. However, aggressive application of scalar replacement may require a large number of registers, limiting the applicability of the technique unless mitigating approaches such as those described in this paper are taken. In this paper, we propose solutions to optimize register usage within computations offloaded using OpenACC directives. We first present a compiler optimization called SAFARA that extends the classical scalar replacement algorithm to improve register file utilization on GPUs. Moreover, we extend the OpenACC interface with new clauses, namely dim and small, that reduce the number of scalars to replace. SAFARA prioritizes the most beneficial data for allocation in registers based on frequency of use and memory access latency, and it uses a static feedback strategy to retrieve low-level register information to guide the compiler in carrying out the scalar replacement transformation. The new clauses then greatly reduce the number of scalars to replace, eliminating the need for additional registers. We evaluate SAFARA and the new clauses using the SPEC and NAS OpenACC benchmarks; our results suggest that these approaches are effective for improving the overall performance of code executing on GPUs, with speedups of up to 2.5x on NAS and 2.08x on SPEC.
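
Scalar replacement itself is a classic transformation and easy to illustrate. The before/after pair below is a generic example (SAFARA's feedback-driven prioritization and the proposed dim and small clauses are specific to the paper; the function and array names are invented): the repeated load of s[i] is replaced by a scalar the compiler can keep in a register for the whole inner loop.

    #define N 4096

    void before(float *restrict x, const float *restrict s) {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                x[i * N + j] += s[i] * x[i * N + j];  /* s[i] re-loaded every iteration */
            }
        }
    }

    void after(float *restrict x, const float *restrict s) {
        for (int i = 0; i < N; i++) {
            float si = s[i];  /* scalar replacement: one load, held in a register */
            for (int j = 0; j < N; j++) {
                x[i * N + j] += si * x[i * N + j];
            }
        }
    }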


Scientific Programming | 2015

Multi-GPU Support on Single Node Using Directive-Based Programming Model

Rengan Xu; Xiaonan Tian; Sunita Chandrasekaran; Barbara M. Chapman



Collaboration


Dive into Xiaonan Tian's collaborations.

Top Co-Authors

Rengan Xu, University of Houston

Chen Shen, University of Houston

Jungwon Kim, Oak Ridge National Laboratory