Chunye Gong
National University of Defense Technology
Publications
Featured research published by Chunye Gong.
international conference on parallel processing | 2010
Chunye Gong; Jie Liu; Qiang Zhang; Haitao Chen; Zhenghu Gong
Cloud computing has emerged as one of the hottest topics in information technology. It builds on several other computing research areas, including HPC, virtualization, utility computing, and grid computing. To clarify the essence of cloud computing, we propose the characteristics that make cloud computing what it is and that distinguish it from other research areas. Cloud computing has its own conceptual, technical, economic, and user-experience characteristics. Its main characteristics are service orientation, loose coupling, strong fault tolerance, its business model, and ease of use. Clear insights into cloud computing will help the development and adoption of this evolving technology in both academia and industry.
Computer Physics Communications | 2012
Chunye Gong; Jie Liu; Haowei Huang; Zhenghu Gong
Abstract The discontinuous finite element discrete ordinates method, which inverts an operator by iteratively sweeping across a mesh from multiple directions, is commonly used to solve the time-dependent particle transport equation. The Graphics Processing Unit (GPU) offers great capability for scientific applications, but particle transport on unstructured grids raises several challenges when implemented on a GPU. This paper presents an efficient implementation of particle transport on unstructured grids under a 2D cylindrical Lagrangian coordinate system on a fine-grained, data-parallel GPU platform, addressing three aspects. The first is determining the sweep order of elements for each angular direction. The second is mapping the sweep calculation onto the GPU thread execution model. The last is using on-chip memory efficiently to improve performance. To the authors' knowledge, this is the first implementation of a general-purpose particle transport simulation on unstructured grids on a GPU. Experimental results show that the speedup of an NVIDIA M2050 GPU with double-precision floating-point operations ranges from 11.03 to 17.96 compared with serial implementations on an Intel Xeon X5355 and a Core Q6600.
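The first step named in the abstract, determining the sweep order of elements for a given angular direction, amounts to a topological sort of the element dependency graph: an element can be swept only after all of its upwind neighbours. A minimal sketch under that reading (the face/normal mesh representation below is a hypothetical simplification, not the paper's actual data structures):

```python
from collections import deque

def sweep_order(num_elements, faces, direction):
    """Topologically sort mesh elements for one angular direction.

    faces: list of (elem_a, elem_b, outward_normal) tuples, where
    outward_normal points from elem_a into elem_b.  An element may be
    swept only after every upwind neighbour has been swept.
    """
    indegree = [0] * num_elements
    downwind = [[] for _ in range(num_elements)]
    for a, b, normal in faces:
        dot = sum(n * d for n, d in zip(normal, direction))
        if dot > 0:        # flux flows a -> b: a is upwind of b
            downwind[a].append(b)
            indegree[b] += 1
        elif dot < 0:      # flux flows b -> a: b is upwind of a
            downwind[b].append(a)
            indegree[a] += 1
    # Kahn's algorithm; elements that become ready together are
    # mutually independent and could be swept by concurrent threads
    order = []
    ready = deque(i for i in range(num_elements) if indegree[i] == 0)
    while ready:
        e = ready.popleft()
        order.append(e)
        for nb in downwind[e]:
            indegree[nb] -= 1
            if indegree[nb] == 0:
                ready.append(nb)
    return order
```

Reversing the angular direction reverses the dependency graph, which is why each direction needs its own sweep order.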
international conference on algorithms and architectures for parallel processing | 2010
Chunye Gong; Jie Liu; Zhenghu Gong; Jin Qin; Jing Xie
As a powerful and flexible processor, the Graphics Processing Unit (GPU) offers great capability for many high-performance computing applications. Sweep3D, which deterministically simulates single-group, time-independent discrete ordinates (Sn) neutron transport on a 3D Cartesian geometry, represents the key part of a real ASCI application. The wavefront process for parallel computation in Sweep3D limits the number of concurrent threads on the GPU. In this paper, we present multi-dimensional optimization methods for Sweep3D that can be efficiently implemented on the fine-grained parallel architecture of the GPU. Our results show that the overall performance of Sweep3D on a CPU-GPU hybrid platform can be improved by up to 2.25 times compared with the CPU-based implementation.
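The wavefront limitation can be made concrete: in a sweep entering at one corner of a Cartesian grid, cell (i, j, k) depends on its three upwind neighbours, so all cells with the same index sum i + j + k form one diagonal wavefront and are mutually independent. Only one wavefront at a time is available for concurrent GPU threads, which is what caps the parallelism. A small illustrative sketch (not the Sweep3D code itself):

```python
def wavefronts(nx, ny, nz):
    """Group cells of an nx*ny*nz Cartesian grid into wavefronts.

    For a sweep entering at corner (0,0,0), cell (i,j,k) depends on
    (i-1,j,k), (i,j-1,k), and (i,j,k-1); all cells sharing the same
    i+j+k are independent and could be processed by one batch of
    concurrent GPU threads.
    """
    levels = [[] for _ in range(nx + ny + nz - 2)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                levels[i + j + k].append((i, j, k))
    return levels
```

The early and late wavefronts contain very few cells, so thread utilization is poor at the start and end of every sweep; this is the imbalance the paper's multi-dimensional optimizations target.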
Journal of Information Processing Systems | 2011
Chunye Gong; Jie Liu; Haitao Chen; Jing Xie; Zhenghu Gong
Abstract — As a powerful and flexible processor, the Graphics Processing Unit (GPU) offers great capability for many high-performance computing applications. Sweep3D, which deterministically simulates single-group, time-independent discrete ordinates (Sn) neutron transport on a 3D Cartesian geometry, represents the key part of a real ASCI application. The wavefront process for parallel computation in Sweep3D limits the number of concurrent threads on the GPU. In this paper, we present multi-dimensional optimization methods for Sweep3D that can be efficiently implemented on the fine-grained parallel architecture of the GPU. Our results show that the overall performance of Sweep3D on the CPU-GPU hybrid platform can be improved by up to 4.38 times compared with the CPU-based implementation.
Keywords — Sweep3D, Neutron Transport, GPU, CUDA
1. INTRODUCTION
When the first GPU was introduced in 1999, GPUs were mainly used to transform, light, and rasterize triangles in three-dimensional (3D) graphics applications [1]. The performance of the GPU doubles about every six to nine months, which means that it outpaces the Central Processing Unit (CPU) by a wide margin [2]. Modern GPUs are throughput-oriented parallel processors that can offer peak performance of up to 2.72 Tflops in single-precision and 544 Gflops in double-precision floating point [3]. At the same time, GPU programming models such as NVIDIA's Compute Unified Device Architecture (CUDA) [4], AMD/ATI's Stream Computing [5], and OpenCL [6] have matured, simplifying the process of developing non-graphics applications. The improvements in computing performance, together with the development of programming models and software, make the GPU more and more suitable for general-purpose computing. At present, the GPU has been successfully applied to medical imaging, universe exploration, physics simulation, linear system solvers, and other computation-intensive domains [7].
There is a growing need to accurately simulate physical systems whose evolution depends on the transport of subatomic particles coupled with other complex physics [8]. In many simulations, particle transport calculations consume the majority of the computational resources. For example, the time devoted to particle transport problems in multi-physics simulations takes up
international conference on computer science and information technology | 2010
Chunye Gong; Jie Liu; Jin Qin; Qingfeng Hu; Zhenghu Gong
The Embarrassingly Parallel (EP) benchmark is one of the kernel benchmarks of the NAS Parallel Benchmarks (NPB). EP generates pairs of Gaussian random deviates from large sequences of random numbers produced by a Linear Congruential Generator (LCG). In this paper, the Hybrid EP is efficiently implemented on a CPU/GPU heterogeneous platform. Experimental results show that the Hybrid EP is 11.98 times faster than an equivalent multicore CPU implementation and outperforms an equivalent GPU implementation by 7.52%.
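The EP kernel's structure — uniform deviates from a multiplicative LCG, then pairs converted to Gaussian deviates by acceptance-rejection — can be sketched as follows. The LCG constants (a = 5^13, modulus 2^46) match the published NPB definition; everything else is a simplified serial illustration, not the hybrid CPU/GPU code:

```python
import math

# NPB EP constants: multiplicative LCG x_{k+1} = a * x_k mod 2^46
A, MOD = 5 ** 13, 2 ** 46

def lcg(seed, n):
    """Yield n uniform deviates in (0, 1) from the NPB LCG."""
    x = seed
    for _ in range(n):
        x = (A * x) % MOD
        yield x / MOD

def gaussian_pairs(seed, n_pairs):
    """Turn consecutive uniform pairs into Gaussian pairs via the
    acceptance-rejection (polar) scheme used by the EP kernel."""
    uniforms = list(lcg(seed, 2 * n_pairs))
    pairs = []
    for u1, u2 in zip(uniforms[0::2], uniforms[1::2]):
        x, y = 2.0 * u1 - 1.0, 2.0 * u2 - 1.0
        t = x * x + y * y
        if 0.0 < t <= 1.0:                      # accept points inside the unit disc
            f = math.sqrt(-2.0 * math.log(t) / t)
            pairs.append((x * f, y * f))        # two independent Gaussian deviates
    return pairs
```

Since each pair is generated independently, the work partitions trivially across threads — hence "embarrassingly parallel"; the only subtlety on a GPU is giving each thread its own non-overlapping LCG substream.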
ICNAAM 2010: International Conference of Numerical Analysis and Applied Mathematics 2010 | 2010
Chunye Gong; Jie Liu; Lihua Chi; Qingfeng Hu; Li Deng; Zhenghu Gong
Pseudo-random number generators (PRNGs) are used intensively in many stochastic algorithms in particle simulations, artificial neural networks, and other scientific computations. The PRNG in the Monte Carlo N-Particle Transport Code (MCNP) requires a long period, high quality, flexible jump-ahead, and fast generation. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processing Units (GPUs) using the CUDA programming model. Results show that speedups of 3.80 to 8.10 are achieved compared with 4- to 6-core CPUs, and that more than 679.18 million double-precision random numbers can be generated per second on the GPU.
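The "flexible jump" requirement is what lets each GPU thread own an independent, non-overlapping substream: an LCG can be advanced n steps in O(log n) time by repeated squaring of its affine update map. A generic sketch of that jump-ahead (the multiplier and modulus in the test are illustrative textbook constants, not MCNP's actual parameters):

```python
def jump_ahead(a, c, m, seed, n):
    """Advance the LCG x -> (a*x + c) mod m by n steps in O(log n).

    Maintains the composed transform x -> A*x + C (mod m) and squares
    the current power of the update map at each bit of n.
    """
    A, C = 1, 0                 # accumulated transform, initially identity
    aa, cc = a % m, c % m       # current power of the update map
    while n > 0:
        if n & 1:
            # fold the current power into the accumulated transform
            A, C = (A * aa) % m, (C * aa + cc) % m
        # square the current power: f(f(x)) = aa^2*x + aa*cc + cc
        aa, cc = (aa * aa) % m, (cc * aa + cc) % m
        n >>= 1
    return (A * seed + C) % m
```

On a GPU, thread t would seed its substream with jump_ahead(a, c, m, seed, t * chunk), guaranteeing the substreams never overlap within the generator's period.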
international conference on education technology and computer | 2010
Chunye Gong; Jie Liu; Jin Qin; Qingfeng Hu; Zhenghu Gong
The Embarrassingly Parallel (EP) benchmark is one of the kernel benchmarks of the NAS Parallel Benchmarks (NPB), a set of programs designed to help evaluate the performance of parallel supercomputers. In the EP benchmark, two-dimensional statistics are accumulated from a large number of Gaussian pseudo-random numbers produced by a Linear Congruential Generator (LCG). In this paper, we present the design and implementation of EP on the powerful Tesla T10 Graphics Processing Unit with CUDA. While keeping the main framework of NPB EP, our GPU-based implementation achieves up to 871.57 Mop/s. This is roughly 1.38 times faster than the throughput previously achieved on the same GPU and outperforms an equivalent 4-core CPU by about 11.33 times.
Concurrency and Computation: Practice and Experience | 2018
Xinhai Chen; Peizhen Xie; Lihua Chi; Jie Liu; Chunye Gong
Sparse matrix-vector multiplication (SpMV) is an essential kernel in sparse linear algebra and has been studied extensively on all modern processor and accelerator architectures. Compressed Sparse Row (CSR) is a frequently used storage format for sparse matrices. However, CSR-based SpMV performs poorly on processors with vector units. To take full advantage of SIMD acceleration in SpMV, we propose a new matrix storage format called CSR-SIMD. The new format compresses the non-zero elements into many variable-length data fragments with consecutive memory access addresses. Thus, the data locality of the sparse matrix A and the dense vector x improves, and the floating-point operations for each fragment can be calculated entirely by a vectorized implementation on wide SIMD units. Our experimental results indicate that CSR-SIMD has better storage efficiency and low overhead for format conversion. Moreover, the new format achieves high scalability on wide SIMD units. In comparison with CSR-based and BCSR-based SpMV, CSR-SIMD obtains better performance on the FT1500A, Intel Xeon, and Intel Xeon Phi.
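For reference, the baseline CSR format and its SpMV loop look as follows; the irregular-length inner loop over each row is precisely what defeats vector units and what CSR-SIMD repacks into fixed-stride fragments. This is a plain-Python sketch of standard CSR, not the paper's CSR-SIMD code:

```python
def csr_from_dense(dense):
    """Build CSR arrays (values, col_idx, row_ptr) from a dense 2-D list."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))   # row r spans values[row_ptr[r]:row_ptr[r+1]]
    return values, col_idx, row_ptr

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a CSR matrix.  The inner loop has a different trip
    count per row, which blocks straightforward SIMD vectorization."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

CSR-SIMD's contribution, as described above, is to regroup these non-zeros into fragments whose length matches the SIMD width, so each fragment's multiply-accumulate maps onto one vector instruction.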
Science in China Series F: Information Sciences | 2018
Xinbiao Gan; Yikun Hu; Jie Liu; Lihua Chi; Han Xu; Chunye Gong; Shengguo Li; Yihui Yan
HPL is a Linpack benchmark package widely used in high-performance computing tests. Customizing HPL is crucial for a heterogeneous system equipped with a CPU and the China accelerator because of the complexity of the accelerator and the specific matrix-multiplication interface built into it. It is therefore advisable to use delicate partitioning and encapsulation of matrices (DPEM) to expose a friendly testing configuration. More importantly, we propose an orchestrating algorithm for matrix multiplication (OAMM) to enhance the efficiency of the heterogeneous system composed of the CPU and the China accelerator. Furthermore, optimization at the vectorization level (OPTVEC) is applied to shield the architectural details of the vector processing elements (VPEs) equipped in the China accelerator. The experimental results validate DPEM, OPTVEC, and OAMM: OPTVEC optimizations speed up matrix multiplication more than twofold; moreover, OAMM improves productivity by up to 10% compared with the traditional HPL tested on a heterogeneous system.
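The partition-and-encapsulate idea behind DPEM can be illustrated with an ordinary blocked matrix multiplication: the matrices are tiled, and each tile product is a self-contained unit that could be handed to the accelerator's built-in matrix-multiplication interface. A plain-Python sketch under that assumption (the dispatch to the China accelerator is only indicated in the comment; the paper's actual partitioning scheme is not reproduced here):

```python
def blocked_matmul(A, B, block):
    """C = A @ B computed tile by tile.

    In the heterogeneous HPL setting, each (i0, k0, j0) tile product
    would be encapsulated and dispatched to the accelerator's
    matrix-multiplication interface instead of running this inner loop
    on the CPU.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # one encapsulated tile product: C-tile += A-tile @ B-tile
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        aik = A[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            C[i][j] += aik * B[k][j]
    return C
```

Choosing the block size to match the accelerator's interface is exactly the kind of "delicate partitioning" decision DPEM is meant to expose as a testing configuration.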
IEEE Conference Anthology | 2013
Bo Yang; Qingfeng Hu; Jie Liu; Chunye Gong
Monte Carlo particle transport algorithms are ideally suited to parallel processing architectures and so are good candidates for acceleration on a Graphics Processing Unit (GPU). As the foundation of the Monte Carlo N-Particle Transport Code (MCNP), the Pseudo-Random Number Generator (PRNG) must provide certain properties, such as a long period, high quality, and fast generation. Newer GPUs based on NVIDIA's Fermi architecture offer a dramatic performance improvement in double precision, which provides a good foundation for an effective implementation of a PRNG. This paper presents an effective GPU implementation of the 48-bit PRNG algorithm used in the MPI version of MCNP. After optimizing the GPU memory utilization and execution parameters of our PRNG, experimental results show that the speedup of one NVIDIA M2050 GPU with full double-precision floating-point operations is up to a factor of 11 compared with the parallel implementation on one multi-core Intel Xeon X5670.