Nengxiong Xu

China University of Geosciences

Publication

Featured researches published by Nengxiong Xu.

Royal Society Open Science | 2017

Accelerating Adaptive IDW Interpolation Algorithm on a Single GPU

Gang Mei; Liangliang Xu; Nengxiong Xu

This paper focuses on designing and implementing parallel adaptive inverse distance weighting (AIDW) interpolation algorithms by using the graphics processing unit (GPU). The AIDW is an improved version of the standard IDW, which can adaptively determine the power parameter according to the data points’ spatial distribution pattern and achieve more accurate predictions than those predicted by IDW. In this paper, we first present two versions of the GPU-accelerated AIDW, i.e. the naive version without profiting from the shared memory and the tiled version taking advantage of the shared memory. We also implement the naive version and the tiled version using two data layouts, structure of arrays and array of aligned structures, on both single and double precision. We then evaluate the performance of parallel AIDW by comparing it with its corresponding serial algorithm on three different machines equipped with the GPUs GT730M, M5000 and K40c. The experimental results indicate that: (i) there is no significant difference in the computational efficiency when different data layouts are employed; (ii) the tiled version is always slightly faster than the naive version; and (iii) on single precision the achieved speed-up can be up to 763 (on the GPU M5000), while on double precision the obtained highest speed-up is 197 (on the GPU K40c). To benefit the community, all source code and testing data related to the presented parallel AIDW algorithm are publicly available.

This paper focuses on the design and implementation of a GPU-accelerated Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm. The AIDW is an improved version of the standard IDW, which can adaptively determine the power parameter according to the spatial points’ distribution pattern and achieve more accurate predictions than those of IDW. In this paper, we first present two versions of the GPU-accelerated AIDW: the naive version, which does not profit from shared memory, and the tiled version, which takes advantage of shared memory. We also implement the naive version and the tiled version using two data layouts, Structure of Arrays (SoA) and Array of aligned Structures (AoaS), on single and double precision. We then evaluate the performance of the GPU-accelerated AIDW by comparing it with its original CPU version. Experimental results show that on single precision the naive version and the tiled version can achieve speedups of approximately 270 and 400, respectively. In addition, on single precision the implementations using the SoA layout are always slightly faster than those using the AoaS layout. However, on double precision, the speedup is only about 8, and we have also observed that: (1) there are no performance gains obtained from the tiled version against the naive version; and (2) the use of SoA and AoaS does not lead to significant differences in computational efficiency.
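
As a rough illustration of the idea behind AIDW, the serial Python sketch below computes an IDW prediction whose power parameter depends on how densely the data points surround the query. The function names, the linear mapping in `adaptive_power`, the bounds `p_min` and `p_max`, and the default `k` are assumptions made for this example; they are not taken from the paper or its GPU code.

```python
import math

def idw(points, values, q, power):
    """Standard IDW prediction at query q for a given power parameter."""
    num = den = 0.0
    for (x, y), v in zip(points, values):
        d = math.hypot(q[0] - x, q[1] - y)
        if d == 0.0:
            return v  # query coincides with a data point
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den

def adaptive_power(points, q, area, k=3, p_min=1.0, p_max=5.0):
    """Hypothetical rule: compare the observed mean distance to the k nearest
    data points with the expected nearest-neighbor distance of a random
    pattern, then map the clamped ratio linearly to [p_min, p_max]."""
    dists = sorted(math.hypot(q[0] - x, q[1] - y) for x, y in points)
    r_obs = sum(dists[:k]) / k
    r_exp = 1.0 / (2.0 * math.sqrt(len(points) / area))
    ratio = min(r_obs / r_exp, 2.0) / 2.0  # clamp to [0, 1]
    return p_min + ratio * (p_max - p_min)

def aidw(points, values, q, area, k=3):
    """IDW with a power parameter chosen per query point."""
    return idw(points, values, q, adaptive_power(points, q, area, k))
```

In a GPU version each interpolated point would typically be handled by one thread; this sketch only shows the per-point arithmetic.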

SpringerPlus | 2016

Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search

Gang Mei; Nengxiong Xu; Liangliang Xu

This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on modern Graphics Processing Unit (GPU). The presented algorithm is an improvement of our previous GPU-accelerated AIDW algorithm by adopting fast k-nearest neighbors (kNN) search. AIDW needs to find several nearest neighboring data points for each interpolated point to adaptively determine the power parameter; the desired prediction value of the interpolated point is then obtained by weighted interpolation using the power parameter. In this work, we develop a fast kNN search approach based on the space-partitioning data structure, even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of the stages of kNN search and weighted interpolation. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the utilization of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm.
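
The grid-based kNN search mentioned above can be sketched serially in Python as follows. This is a minimal sketch under stated assumptions: `build_grid` and `knn_grid` are hypothetical names, the cells are squares of a user-chosen side `cell`, and the ring-by-ring expansion is one common way to search an even grid, not necessarily the scheme used in the paper.

```python
import math
from collections import defaultdict

def build_grid(points, cell):
    """Bin point indices into square cells of side `cell`."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(i)
    return grid

def knn_grid(points, grid, cell, q, k):
    """Return indices of the k nearest points to q, nearest first."""
    k = min(k, len(points))
    if k <= 0:
        return []
    cx, cy = math.floor(q[0] / cell), math.floor(q[1] / cell)
    found = []  # (distance, index) pairs collected so far
    ring = 0
    while True:
        for ix in range(cx - ring, cx + ring + 1):
            for iy in range(cy - ring, cy + ring + 1):
                if max(abs(ix - cx), abs(iy - cy)) != ring:
                    continue  # visit only cells on the current ring
                for i in grid.get((ix, iy), ()):
                    found.append((math.dist(q, points[i]), i))
        found.sort()
        # Every unvisited point lies at least ring * cell away from q,
        # so the search can stop once the k-th candidate is that close.
        if found[k - 1:] and found[k - 1][0] <= ring * cell:
            return [i for _, i in found[:k]]
        ring += 1
```

The stopping test matters: returning as soon as k candidates are found could miss a closer point in a neighboring cell.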

International Journal of Parallel Programming | 2018

Performance Evaluation of GPU-Accelerated Spatial Interpolation Using Radial Basis Functions for Building Explicit Surfaces

Zengyu Ding; Gang Mei; Salvatore Cuomo; Nengxiong Xu; Hong Tian

This paper focuses on evaluating the computational performance of parallel spatial interpolation with Radial Basis Functions (RBFs) that is developed by utilizing modern GPUs. The RBFs can be used in spatial interpolation to build explicit surfaces such as Discrete Elevation Models. When interpolating with a large number of data points and interpolated points for building explicit surfaces, the computational cost can be quite high. To improve the computational efficiency, we specifically develop a parallel RBF spatial interpolation algorithm on many-core GPUs, and compare it with the parallel version implemented on multi-core CPUs. Five groups of experimental tests are conducted on two machines to evaluate the computational efficiency of the presented GPU-accelerated RBF spatial interpolation algorithm. Experimental results indicate that, in most cases, the parallel RBF interpolation algorithm on many-core GPUs does not have any significant advantages over the parallel version on multi-core CPUs in terms of computational efficiency. This unsatisfactory performance of the GPU-accelerated RBF interpolation algorithm is due to: (1) the limited size of global memory residing on the GPU, and (2) the need to solve a system of linear equations in each GPU thread to calculate the weights and prediction value of each interpolated point.

International Journal of Parallel Programming | 2018

MeshCleaner: A Generic and Straightforward Algorithm for Cleaning Finite Element Meshes

Gang Mei; Salvatore Cuomo; Hong Tian; Nengxiong Xu; Linjun Peng

Mesh cleaning is the procedure of removing duplicate nodes, sequencing the indices of remaining nodes, and then updating the mesh connectivity for a topologically invalid Finite Element mesh. To the best of our knowledge, there has been no previously reported work specifically focused on the cleaning of large Finite Element meshes. In this paper we specifically present a generic and straightforward algorithm, MeshCleaner, for cleaning large Finite Element meshes. The presented mesh cleaning algorithm is composed of (1) the stage of compacting and reordering nodes and (2) the stage of updating mesh topology. The basic ideas for performing the above two stages efficiently both in sequential and in parallel are introduced. Furthermore, one serial and two parallel implementations of the algorithm MeshCleaner are developed on multi-core CPU and/or many-core GPU. To evaluate the performance of our algorithm, three groups of experimental tests are conducted. Experimental results indicate that the algorithm MeshCleaner is capable of cleaning large meshes very efficiently, both in sequential and in parallel. The presented mesh cleaning algorithm MeshCleaner is generic, simple, and practical.
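
The two stages named above (compacting and reordering nodes, then updating mesh topology) can be illustrated with a short serial Python sketch. The function name `clean_mesh` and the use of a dictionary keyed on exact coordinates are assumptions for this example; the paper's own serial and parallel implementations are not reproduced here.

```python
def clean_mesh(nodes, elements):
    """Remove duplicate nodes, renumber the survivors, and remap connectivity.

    nodes: list of coordinate tuples.
    elements: list of tuples of node indices into `nodes`.
    """
    new_index = {}     # coordinate -> compact index of the kept node
    old_to_new = []    # old node index -> compact index
    unique_nodes = []
    # Stage 1: compact the node list, keeping the first copy of each node.
    for coord in nodes:
        if coord not in new_index:
            new_index[coord] = len(unique_nodes)
            unique_nodes.append(coord)
        old_to_new.append(new_index[coord])
    # Stage 2: rewrite every element with the compact indices.
    new_elements = [tuple(old_to_new[i] for i in elem) for elem in elements]
    return unique_nodes, new_elements
```

Duplicates are detected by exact coordinate equality here; nodes that differ only by rounding error would need a tolerance-based comparison instead.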

Heliyon | 2018

A sample implementation for parallelizing Divide-and-Conquer algorithms on the GPU

Gang Mei; Jiayin Zhang; Nengxiong Xu; Kunyang Zhao

We present a novel GPU-accelerated implementation of the QuickHull algorithm for calculating convex hulls of planar point sets. We also describe a practical solution to demonstrate how to efficiently implement a typical Divide-and-Conquer algorithm on the GPU. We make heavy use of the parallel primitives provided by the library Thrust, such as the parallel segmented scan, for better efficiency and simplicity. To evaluate the performance of our implementation, we carry out four groups of experimental tests using two groups of point sets in two modes on the GPU K20c. Experimental results indicate that our implementation can achieve speedups of up to 10.98x over the state-of-the-art CPU-based convex hull implementation Qhull [16]. In addition, our implementation can find the convex hull of 20M points in about 0.2 seconds.

The strategy of Divide-and-Conquer (D&C) is one of the frequently used programming patterns to design efficient algorithms in computer science, which has been parallelized on shared memory systems and distributed memory systems. Tzeng and Owens specifically developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, by following the generic paradigm proposed by Tzeng and Owens, we provide a new and publicly available GPU implementation of the famous D&C algorithm, QuickHull, to give a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our research objective in this paper is to present a sample GPU implementation of a classical D&C algorithm to help interested readers develop their own efficient GPU implementations with less effort.
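
For readers unfamiliar with QuickHull, the divide-and-conquer structure can be seen in the following serial, recursive Python sketch. It is only an illustration of the classical algorithm: the helper names are hypothetical, and it does not use Thrust, segmented scans, or any other part of the GPU implementation described above.

```python
def cross(o, a, b):
    """Twice the signed area of triangle o-a-b (> 0 if b is left of o->a)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _hull_side(a, b, pts):
    """Hull vertices strictly left of the directed edge a->b, ordered a to b."""
    left = [p for p in pts if cross(a, b, p) > 0]
    if not left:
        return []
    far = max(left, key=lambda p: cross(a, b, p))  # farthest from edge a->b
    # Divide: points inside triangle a-far-b are dropped; recurse on the rest.
    return _hull_side(a, far, left) + [far] + _hull_side(far, b, left)

def quickhull(points):
    """Convex hull of 2D points as a list of vertices in clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    a, b = pts[0], pts[-1]  # extreme points in x always lie on the hull
    return [a] + _hull_side(a, b, pts) + [b] + _hull_side(b, a, pts)
```

Points lying exactly on a hull edge are excluded because the side test is strict.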

Future Generation Computer Systems | 2019

Efficient method for identifying influential vertices in dynamic networks using the strategy of local detection and updating

Shuangyan Wang; Salvatore Cuomo; Gang Mei; Wuyi Cheng; Nengxiong Xu

The identification of influential vertices in complex networks can facilitate understanding and prediction of the behaviour of real systems. In this paper, we propose an efficient method for identifying influential vertices in dynamic networks by exploiting the strategy of local detection and updating. The essential strategy of the proposed local detection and updating method is to locally detect the altered vertices in dynamic networks and locally update the influence metrics of the altered vertices, without the need to globally calculate the influence of all vertices. To evaluate the computational efficiency of the proposed local detection and updating method, we design 15 groups of experimental tests for three types of complex networks (the Barabási–Albert (BA) scale-free network, the Watts–Strogatz (WS) small-world network, and the Erdős–Rényi (ER) random network). Experimental results demonstrate that: (1) the sequential version of the proposed method is approximately 3 times faster than the global calculation method for the small-world networks and random networks; (2) the parallel version of the proposed method, which was developed on a multi-core CPU, is approximately 10 times faster than the global calculation method for the scale-free networks. The proposed local detection and updating method can be employed to efficiently identify the influential vertices and predict the changes in influence of specified sets of vertices in dynamic networks.

International Journal of Parallel Programming | 2018

Comparison of Estimating Missing Values in IoT Time Series Data Using Different Interpolation Algorithms

Zengyu Ding; Gang Mei; Salvatore Cuomo; Yixuan Li; Nengxiong Xu

When collecting Internet of Things (IoT) data using various sensors or other devices, some values of interest may be missed. In this paper, we focus on estimating the missing values in IoT time series data using three interpolation algorithms: (1) Radial Basis Functions, (2) Moving Least Squares (MLS), and (3) Adaptive Inverse Distance Weighted. To evaluate the performance of estimating missing values, we estimate the missing values in eight selected sets of IoT time series data, and compare them with those imputed by the standard kNN estimator. Our experiments indicate that in most cases the estimation based on Lancaster’s MLS is the best. It is also found that the number of nearest observed values used for reference and the distribution of missing values can strongly affect the accuracy of imputation.

arXiv: Distributed, Parallel, and Cluster Computing | 2015

On the Accelerating of Two-dimensional Smart Laplacian Smoothing on the GPU

Kunyang Zhao; Gang Mei; Nengxiong Xu; Jiayin Zhang

This paper presents a GPU-accelerated implementation of two-dimensional Smart Laplacian smoothing. This implementation is developed under the guideline of our paradigm for accelerating Laplacian-based mesh smoothing [13]. Two types of commonly used data layouts, Array-of-Structures (AoS) and Structure-of-Arrays (SoA), are used to represent triangular meshes in our implementation. Two iteration forms that have different choices of the swapping of intermediate data are also adopted. Furthermore, the feature CUDA Dynamic Parallelism (CDP) is employed to realize the nested parallelization in Smart Laplacian smoothing. Experimental results demonstrate that: (1) our implementation can achieve speedups of up to 44x on the GPU GT640; (2) the data layout AoS can always obtain better efficiency than the SoA layout; (3) the form that needs to swap intermediate nodal coordinates is always slower than the one that does not swap data; (4) the version of our implementation with the use of the feature CDP is slightly faster than the version where the CDP is not adopted.
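
To make the "smart" variant concrete, the serial Python sketch below moves an interior node to the centroid of its neighbors only when doing so improves the mean quality of the triangles around it. The quality measure in `tri_quality`, the in-place update order, and all names are assumptions chosen for this example; the paper's GPU implementation, data layouts, and iteration forms are not reproduced.

```python
import math

def tri_quality(a, b, c):
    """Quality of triangle a-b-c: 1 for equilateral, <= 0 if flat or inverted.
    Assumes counterclockwise vertex order."""
    area = 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    edges = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                for p, q in ((a, b), (b, c), (c, a)))
    return 4.0 * math.sqrt(3.0) * area / edges if edges > 0 else 0.0

def smart_laplacian(nodes, triangles, interior, iterations=1):
    """Smooth the listed interior nodes; return the new node coordinates."""
    nodes = list(nodes)
    for _ in range(iterations):
        for n in interior:
            incident = [t for t in triangles if n in t]
            neighbors = {v for t in incident for v in t if v != n}
            if not neighbors:
                continue
            centroid = (sum(nodes[v][0] for v in neighbors) / len(neighbors),
                        sum(nodes[v][1] for v in neighbors) / len(neighbors))

            def mean_quality(pos):
                # Mean quality of incident triangles with node n placed at pos.
                total = 0.0
                for t in incident:
                    a, b, c = (pos if v == n else nodes[v] for v in t)
                    total += tri_quality(a, b, c)
                return total / len(incident)

            # The "smart" test: accept the Laplacian move only if it helps.
            if mean_quality(centroid) > mean_quality(nodes[n]):
                nodes[n] = centroid
    return nodes
```

Because nodes are updated in place, later nodes see the already smoothed positions of earlier ones; a parallel version would instead read from one coordinate buffer and write to another.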

International Journal of Rock Mechanics and Mining Sciences | 2016