Chunye Gong
National University of Defense Technology
Publications
Featured research published by Chunye Gong.
international conference on parallel processing | 2010
Chunye Gong; Jie Liu; Qiang Zhang; Haitao Chen; Zhenghu Gong
Cloud computing has emerged as one of the hottest topics in information technology. It builds on several other computing research areas, including HPC, virtualization, utility computing, and grid computing. To clarify the essence of cloud computing, we propose the characteristics that make cloud computing what it is and that distinguish it from other research areas. Cloud computing has its own conceptual, technical, economic, and user-experience characteristics. Its main characteristics are service orientation, loose coupling, strong fault tolerance, its business model, and ease of use. Clear insights into cloud computing will help the development and adoption of this evolving technology in both academia and industry.
Computer Physics Communications | 2012
Chunye Gong; Jie Liu; Haowei Huang; Zhenghu Gong
Abstract The discontinuous finite element discrete ordinates method, which inverts an operator by iteratively sweeping across a mesh from multiple directions, is commonly used to solve the time-dependent particle transport equation. The Graphics Processing Unit (GPU) offers great capability for scientific applications, but particle transport on unstructured grids raises several challenges when implemented on a GPU. This paper presents an efficient implementation of particle transport on unstructured grids under a 2D cylindrical Lagrangian coordinate system on a fine-grained, data-parallel GPU platform, addressing three aspects. The first is determining the sweep order of elements for each angular direction. The second is mapping the sweep calculation onto the GPU thread execution model. The last is using on-chip memory efficiently to improve performance. To the authors' knowledge, this is the first implementation of a general-purpose particle transport simulation on unstructured grids on a GPU. Experimental results show that the speedup of an NVIDIA M2050 GPU with double-precision floating-point operations ranges from 11.03 to 17.96 compared with serial implementations on an Intel Xeon X5355 and a Core Q6600.
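The first step named in the abstract, determining the sweep order of elements for a given angular direction, amounts to a topological sort of the element dependency graph: an element can be swept only after all of its upwind neighbours. A minimal sketch under that reading (the face/normal mesh representation below is a hypothetical simplification, not the paper's actual data structures):

```python
from collections import deque

def sweep_order(num_elements, faces, direction):
    """Topologically sort mesh elements for one angular direction.

    faces: list of (elem_a, elem_b, outward_normal) tuples, where
    outward_normal points from elem_a into elem_b.  An element may be
    swept only after every upwind neighbour has been swept.
    """
    indegree = [0] * num_elements
    downwind = [[] for _ in range(num_elements)]
    for a, b, normal in faces:
        dot = sum(n * d for n, d in zip(normal, direction))
        if dot > 0:        # flux flows a -> b: a is upwind of b
            downwind[a].append(b)
            indegree[b] += 1
        elif dot < 0:      # flux flows b -> a: b is upwind of a
            downwind[b].append(a)
            indegree[a] += 1
    # Kahn's algorithm; elements that become ready together are
    # mutually independent and could be swept by concurrent threads
    order = []
    ready = deque(i for i in range(num_elements) if indegree[i] == 0)
    while ready:
        e = ready.popleft()
        order.append(e)
        for nb in downwind[e]:
            indegree[nb] -= 1
            if indegree[nb] == 0:
                ready.append(nb)
    return order
```

Reversing the angular direction reverses the dependency graph, which is why each direction needs its own sweep order.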
international conference on algorithms and architectures for parallel processing | 2010
Chunye Gong; Jie Liu; Zhenghu Gong; Jin Qin; Jing Xie
As a powerful and flexible processor, the Graphics Processing Unit (GPU) offers great capability for many high-performance computing applications. Sweep3D, which deterministically simulates single-group, time-independent discrete ordinates (Sn) neutron transport on a 3D Cartesian geometry, represents the key part of a real ASCI application. The wavefront process for parallel computation in Sweep3D limits the number of concurrent threads on the GPU. In this paper, we present multi-dimensional optimization methods for Sweep3D that can be efficiently implemented on the fine-grained parallel architecture of the GPU. Our results show that the overall performance of Sweep3D on a CPU-GPU hybrid platform can be improved by up to 2.25 times compared with the CPU-based implementation.
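The wavefront limitation can be made concrete: in a sweep entering at one corner of a Cartesian grid, cell (i, j, k) depends on its three upwind neighbours, so all cells with the same index sum i + j + k form one diagonal wavefront and are mutually independent. Only one wavefront at a time is available for concurrent GPU threads, which is what caps the parallelism. A small illustrative sketch (not the Sweep3D code itself):

```python
def wavefronts(nx, ny, nz):
    """Group cells of an nx*ny*nz Cartesian grid into wavefronts.

    For a sweep entering at corner (0,0,0), cell (i,j,k) depends on
    (i-1,j,k), (i,j-1,k), and (i,j,k-1); all cells sharing the same
    i+j+k are independent and could be processed by one batch of
    concurrent GPU threads.
    """
    levels = [[] for _ in range(nx + ny + nz - 2)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                levels[i + j + k].append((i, j, k))
    return levels
```

The early and late wavefronts contain very few cells, so thread utilization is poor at the start and end of every sweep; this is the imbalance the paper's multi-dimensional optimizations target.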
Journal of Information Processing Systems | 2011
Chunye Gong; Jie Liu; Haitao Chen; Jing Xie; Zhenghu Gong
Abstract — As a powerful and flexible processor, the Graphics Processing Unit (GPU) offers great capability for many high-performance computing applications. Sweep3D, which deterministically simulates single-group, time-independent discrete ordinates (Sn) neutron transport on a 3D Cartesian geometry, represents the key part of a real ASCI application. The wavefront process for parallel computation in Sweep3D limits the number of concurrent threads on the GPU. In this paper, we present multi-dimensional optimization methods for Sweep3D that can be efficiently implemented on the fine-grained parallel architecture of the GPU. Our results show that the overall performance of Sweep3D on the CPU-GPU hybrid platform can be improved by up to 4.38 times compared with the CPU-based implementation.
Keywords — Sweep3D, Neutron Transport, GPU, CUDA
1. INTRODUCTION
When the first GPU was introduced in 1999, GPUs were mainly used to transform, light, and rasterize triangles in three-dimensional (3D) graphics applications [1]. The performance of the GPU doubles about every six to nine months, which means that it outpaces the Central Processing Unit (CPU) by a wide margin [2]. Modern GPUs are throughput-oriented parallel processors that can offer peak performance of up to 2.72 Tflops in single-precision and 544 Gflops in double-precision floating point [3]. At the same time, GPU programming models such as NVIDIA's Compute Unified Device Architecture (CUDA) [4], AMD/ATI's Stream Computing [5], and OpenCL [6] have matured, simplifying the process of developing non-graphics applications. The improvements in computing performance, together with the development of programming models and software, make the GPU more and more suitable for general-purpose computing. At present, the GPU has been successfully applied to medical imaging, universe exploration, physics simulation, linear system solvers, and other computation-intensive domains [7].
There is a growing need to accurately simulate physical systems whose evolution depends on the transport of subatomic particles coupled with other complex physics [8]. In many simulations, particle transport calculations consume the majority of the computational resources. For example, the time devoted to particle transport problems in multi-physics simulations takes up
international conference on computer science and information technology | 2010
Chunye Gong; Jie Liu; Jin Qin; Qingfeng Hu; Zhenghu Gong
The Embarrassingly Parallel (EP) benchmark is one of the kernel benchmarks of the NAS Parallel Benchmarks (NPB). EP generates pairs of Gaussian random deviates from large sequences of random numbers produced by a Linear Congruential Generator (LCG). In this paper, the Hybrid EP is efficiently implemented on a CPU/GPU heterogeneous platform. Experimental results show that the Hybrid EP is 11.98 times faster than an equivalent multicore CPU implementation and outperforms an equivalent GPU implementation by 7.52%.
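The EP kernel's structure — uniform deviates from a multiplicative LCG, then pairs converted to Gaussian deviates by acceptance-rejection — can be sketched as follows. The LCG constants (a = 5^13, modulus 2^46) match the published NPB definition; everything else is a simplified serial illustration, not the hybrid CPU/GPU code:

```python
import math

# NPB EP constants: multiplicative LCG x_{k+1} = a * x_k mod 2^46
A, MOD = 5 ** 13, 2 ** 46

def lcg(seed, n):
    """Yield n uniform deviates in (0, 1) from the NPB LCG."""
    x = seed
    for _ in range(n):
        x = (A * x) % MOD
        yield x / MOD

def gaussian_pairs(seed, n_pairs):
    """Turn consecutive uniform pairs into Gaussian pairs via the
    acceptance-rejection (polar) scheme used by the EP kernel."""
    uniforms = list(lcg(seed, 2 * n_pairs))
    pairs = []
    for u1, u2 in zip(uniforms[0::2], uniforms[1::2]):
        x, y = 2.0 * u1 - 1.0, 2.0 * u2 - 1.0
        t = x * x + y * y
        if 0.0 < t <= 1.0:                      # accept points inside the unit disc
            f = math.sqrt(-2.0 * math.log(t) / t)
            pairs.append((x * f, y * f))        # two independent Gaussian deviates
    return pairs
```

Since each pair is generated independently, the work partitions trivially across threads — hence "embarrassingly parallel"; the only subtlety on a GPU is giving each thread its own non-overlapping LCG substream.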
ICNAAM 2010: International Conference of Numerical Analysis and Applied Mathematics 2010 | 2010
Chunye Gong; Jie Liu; Lihua Chi; Qingfeng Hu; Li Deng; Zhenghu Gong
Pseudo-random number generators (PRNGs) are used intensively in many stochastic algorithms in particle simulations, artificial neural networks, and other scientific computations. The PRNG in the Monte Carlo N-Particle Transport Code (MCNP) requires a long period, high quality, flexible jump-ahead, and fast generation. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processing Units (GPUs) using the CUDA programming model. Results show that speedups of 3.80 to 8.10 are achieved compared with 4- to 6-core CPUs, and that more than 679.18 million double-precision random numbers can be generated per second on the GPU.
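The "flexible jump" requirement is what lets each GPU thread own an independent, non-overlapping substream: an LCG can be advanced n steps in O(log n) time by repeated squaring of its affine update map. A generic sketch of that jump-ahead (the multiplier and modulus in the test are illustrative textbook constants, not MCNP's actual parameters):

```python
def jump_ahead(a, c, m, seed, n):
    """Advance the LCG x -> (a*x + c) mod m by n steps in O(log n).

    Maintains the composed transform x -> A*x + C (mod m) and squares
    the current power of the update map at each bit of n.
    """
    A, C = 1, 0                 # accumulated transform, initially identity
    aa, cc = a % m, c % m       # current power of the update map
    while n > 0:
        if n & 1:
            # fold the current power into the accumulated transform
            A, C = (A * aa) % m, (C * aa + cc) % m
        # square the current power: f(f(x)) = aa^2*x + aa*cc + cc
        aa, cc = (aa * aa) % m, (cc * aa + cc) % m
        n >>= 1
    return (A * seed + C) % m
```

On a GPU, thread t would seed its substream with jump_ahead(a, c, m, seed, t * chunk), guaranteeing the substreams never overlap within the generator's period.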
international conference on education technology and computer | 2010
Chunye Gong; Jie Liu; Jin Qin; Qingfeng Hu; Zhenghu Gong
The Embarrassingly Parallel (EP) benchmark is one of the kernel benchmarks of the NAS Parallel Benchmarks (NPB), a set of programs designed to help evaluate the performance of parallel supercomputers. In the EP benchmark, two-dimensional statistics are accumulated from a large number of Gaussian pseudo-random numbers produced by a Linear Congruential Generator (LCG). In this paper, we present the design and implementation of EP on the powerful Tesla T10 Graphics Processing Unit with CUDA. While keeping the main framework of NPB EP, our GPU-based implementation achieves up to 871.57 Mop/s. This is roughly 1.38 times faster than the throughput previously achieved on the same GPU and outperforms an equivalent 4-core CPU by about 11.33 times.
Concurrency and Computation: Practice and Experience | 2018
Xinhai Chen; Peizhen Xie; Lihua Chi; Jie Liu; Chunye Gong
Sparse matrix-vector multiplication (SpMV) is an essential kernel in sparse linear algebra and has been studied extensively on all modern processor and accelerator architectures. Compressed Sparse Row (CSR) is a frequently used storage format for sparse matrices. However, CSR-based SpMV performs poorly on processors with vector units. To take full advantage of SIMD acceleration in SpMV, we propose a new matrix storage format called CSR-SIMD. The new format compresses the non-zero elements into many variable-length data fragments with consecutive memory access addresses. Thus, the data locality of the sparse matrix A and the dense vector x improves, and the floating-point operations for each fragment can be calculated entirely by a vectorized implementation on wide SIMD units. Our experimental results indicate that CSR-SIMD has better storage efficiency and low overhead for format conversion. Moreover, the new format achieves high scalability on wide SIMD units. In comparison with CSR-based and BCSR-based SpMV, CSR-SIMD obtains better performance on the FT1500A, Intel Xeon, and Intel Xeon Phi.
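For reference, the baseline CSR format and its SpMV loop look as follows; the irregular-length inner loop over each row is precisely what defeats vector units and what CSR-SIMD repacks into fixed-stride fragments. This is a plain-Python sketch of standard CSR, not the paper's CSR-SIMD code:

```python
def csr_from_dense(dense):
    """Build CSR arrays (values, col_idx, row_ptr) from a dense 2-D list."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))   # row r spans values[row_ptr[r]:row_ptr[r+1]]
    return values, col_idx, row_ptr

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a CSR matrix.  The inner loop has a different trip
    count per row, which blocks straightforward SIMD vectorization."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

CSR-SIMD's contribution, as described above, is to regroup these non-zeros into fragments whose length matches the SIMD width, so each fragment's multiply-accumulate maps onto one vector instruction.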
Science in China Series F: Information Sciences | 2018
Xinbiao Gan; Yikun Hu; Jie Liu; Lihua Chi; Han Xu; Chunye Gong; Shengguo Li; Yihui Yan
HPL is a Linpack benchmark package widely used in high-performance computing tests. Customizing HPL is crucial for a heterogeneous system equipped with a CPU and the China accelerator because of the complexity of the accelerator and the specific matrix-multiplication interface built into it. It is therefore advisable to use delicate partitioning and encapsulation of matrices (DPEM) to expose a friendly testing configuration. More importantly, we propose an orchestrating algorithm for matrix multiplication (OAMM) to enhance the efficiency of the heterogeneous system composed of the CPU and the China accelerator. Furthermore, optimization at the vectorization level (OPTVEC) is applied to shield the architectural details of the vector processing elements (VPEs) equipped in the China accelerator. The experimental results validate DPEM, OPTVEC, and OAMM: OPTVEC optimizations speed up matrix multiplication more than twofold; moreover, OAMM improves productivity by up to 10% compared with the traditional HPL tested on a heterogeneous system.
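The partition-and-encapsulate idea behind DPEM can be illustrated with an ordinary blocked matrix multiplication: the matrices are tiled, and each tile product is a self-contained unit that could be handed to the accelerator's built-in matrix-multiplication interface. A plain-Python sketch under that assumption (the dispatch to the China accelerator is only indicated in the comment; the paper's actual partitioning scheme is not reproduced here):

```python
def blocked_matmul(A, B, block):
    """C = A @ B computed tile by tile.

    In the heterogeneous HPL setting, each (i0, k0, j0) tile product
    would be encapsulated and dispatched to the accelerator's
    matrix-multiplication interface instead of running this inner loop
    on the CPU.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # one encapsulated tile product: C-tile += A-tile @ B-tile
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        aik = A[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            C[i][j] += aik * B[k][j]
    return C
```

Choosing the block size to match the accelerator's interface is exactly the kind of "delicate partitioning" decision DPEM is meant to expose as a testing configuration.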
IEEE Conference Anthology | 2013
Bo Yang; Qingfeng Hu; Jie Liu; Chunye Gong
Monte Carlo particle transport algorithms are ideally suited to parallel processing architectures and so are good candidates for acceleration on a Graphics Processing Unit (GPU). As the foundation of the Monte Carlo N-Particle Transport Code (MCNP), the Pseudo-Random Number Generator (PRNG) must provide certain properties, such as a long period, high quality, and fast generation. Newer GPUs based on NVIDIA's Fermi architecture offer a dramatic performance improvement in double precision, which provides a good foundation for an effective implementation of a PRNG. This paper presents an effective GPU implementation of the 48-bit PRNG algorithm used in the MPI version of MCNP. After optimizing the GPU memory utilization and execution parameters of our PRNG, experimental results show that the speedup of one NVIDIA M2050 GPU with full double-precision floating-point operations is up to a factor of 11 compared with the parallel implementation on one multi-core Intel Xeon X5670.