Wangdong Yang
Hunan University
Publication
Featured research published by Wangdong Yang.
IEEE Transactions on Parallel and Distributed Systems | 2015
Kenli Li; Wangdong Yang; Keqin Li
This paper presents a unique method of performance analysis and optimization for sparse matrix-vector multiplication (SpMV) on GPU. The method adapts to many types of sparse matrices, unlike existing methods that suit only particular ones. In addition, it needs no additional benchmarks to obtain optimized parameters, which are calculated directly from the probability mass function (PMF). We make the following contributions. (1) We present a PMF that describes precisely the distribution pattern of non-zero elements in a sparse matrix. The PMF provides a theoretical basis for the compression of a sparse matrix. (2) The compression efficiency of COO, CSR, ELL, and HYB can be analyzed precisely through the PMF and, combined with the hardware parameters of the GPU, the performance of SpMV based on COO, CSR, ELL, and HYB can be estimated. Furthermore, the most appropriate format for SpMV can be selected according to the estimated performance. Experiments show that the theoretical estimates and the measured values are highly consistent. (3) For HYB, the optimal segmentation threshold can be found through the PMF to achieve the best SpMV performance. Our performance modeling and analysis are accurate. The order of magnitude of the estimated speedup and that of the measured speedup are the same for each of the ten tested sparse matrices in the three formats COO, CSR, and ELL. The relative difference between an estimated value and a measured value is less than 20 percent in over 80 percent of cases. The performance improvement of our algorithm is also significant. The average performance improvement of the optimal solution for HYB is over 15 percent compared with the automatic solution provided by the cuSPARSE library.
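The idea of choosing a HYB threshold from a PMF of row lengths can be illustrated with a minimal sketch. It is not the paper's model: the cost function below counts storage words only, and the per-entry costs `ELL_WORDS` and `COO_WORDS`, the function names, and the sample row lengths are all assumptions made for illustration. The paper combines the PMF with GPU hardware parameters, which this sketch leaves out.

```python
from collections import Counter

ELL_WORDS = 2  # assumed cost: value + column index per ELL slot
COO_WORDS = 3  # assumed cost: value + row index + column index per COO entry

def row_length_pmf(row_lengths):
    """Return {k: P(a row holds exactly k non-zeros)}."""
    n_rows = len(row_lengths)
    return {k: c / n_rows for k, c in sorted(Counter(row_lengths).items())}

def hyb_words(pmf, n_rows, threshold):
    """Expected storage if the first `threshold` non-zeros per row go to ELL
    (every row padded to `threshold` slots) and the overflow goes to COO."""
    ell_slots = n_rows * threshold
    overflow = n_rows * sum(p * (k - threshold) for k, p in pmf.items() if k > threshold)
    return ELL_WORDS * ell_slots + COO_WORDS * overflow

def best_threshold(pmf, n_rows):
    """Scan every candidate threshold and keep the cheapest one."""
    return min(range(max(pmf) + 1), key=lambda t: hyb_words(pmf, n_rows, t))

# Hypothetical matrix: seven short rows and one long row.
row_lengths = [2, 2, 2, 2, 2, 2, 2, 10]
pmf = row_length_pmf(row_lengths)
threshold = best_threshold(pmf, len(row_lengths))
```

For this sample the scan settles on a threshold of 2: the seven regular rows fit in ELL without padding, and only the overflow of the long row is stored as COO.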
IEEE Transactions on Parallel and Distributed Systems | 2016
Kenli Li; Wangdong Yang; Keqin Li
Quasi-tridiagonal systems of linear equations arise from numerical simulations, and as problem sizes grow, some solvers struggle with quasi-tridiagonal systems that have millions of dimensions. We present a solving method that mixes direct and iterative methods and needs less storage space during computation. Our method splits a quasi-tridiagonal matrix into a tridiagonal matrix and a sparse matrix, so the tridiagonal equation can be solved by direct methods within each iteration. Because the approximate solutions obtained by the direct methods are closer to the exact solutions, the convergence speed of solving the quasi-tridiagonal system of linear equations can be improved. Furthermore, we present an improved cyclic reduction algorithm that uses a partition strategy to solve tridiagonal equations on GPU, with the intermediate data stored in shared memory to significantly reduce memory access latency. In our experiments on 10 test cases, the average number of iterations with our method is significantly lower than with Jacobi, GS, GMRES, and BiCG, and close to that of BiCGSTAB, BiCRSTAB, and TFQMR. In parallel mode, the partition strategy raises the parallel computing efficiency of our method, and its performance is better than that of the commonly used iterative and direct methods because each iteration requires less computation.
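The splitting described above can be sketched as a fixed-point iteration: write A = T + S with T tridiagonal, then repeatedly solve T x_new = b - S x_old with a direct tridiagonal solver. This is a serial sketch under assumptions, not the paper's GPU algorithm: it uses the Thomas algorithm in place of the improved cyclic reduction, a dense list-of-lists input for brevity, and hypothetical function names. It converges only when T dominates S, as in the diagonally dominant example.

```python
def thomas(lower, diag, upper, rhs):
    """Direct solve of a tridiagonal system (lower[0], upper[-1] unused)."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def split_solve(A, b, tol=1e-12, max_iter=200):
    """Iterate T x_new = b - S x_old, where A = T + S and T is tridiagonal."""
    n = len(A)
    lower = [A[i][i - 1] if i > 0 else 0.0 for i in range(n)]
    diag = [A[i][i] for i in range(n)]
    upper = [A[i][i + 1] if i < n - 1 else 0.0 for i in range(n)]
    # S keeps only the entries outside the three central diagonals.
    rest = [[(j, A[i][j]) for j in range(n) if abs(i - j) > 1 and A[i][j] != 0]
            for i in range(n)]
    x = [0.0] * n
    for _ in range(max_iter):
        rhs = [b[i] - sum(v * x[j] for j, v in rest[i]) for i in range(n)]
        x_new = thomas(lower, diag, upper, rhs)
        if max(abs(p - q) for p, q in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x

# Hypothetical 4x4 system: tridiagonal plus two corner entries.
A = [[4.0, 1.0, 0.0, 0.5],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.5, 0.0, 1.0, 4.0]]
b = [8.0, 12.0, 18.0, 19.5]  # chosen so the exact solution is [1, 2, 3, 4]
x = split_solve(A, b)
```

Each iteration costs one tridiagonal solve plus one product with the sparse remainder S, which is the source of the small per-iteration cost the abstract mentions.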
Journal of Parallel and Distributed Computing | 2017
Wangdong Yang; Kenli Li; Keqin Li
Sparse matrix–vector multiplication (SpMV) is an important operation in scientific computing and engineering applications. The performance of SpMV can be improved using parallel computing, and the implementation and optimization of SpMV on GPU are active research topics. Because sparse matrices are irregular, a single compression format is often unsatisfactory. The hybrid storage format can widen the range of matrices that the compression algorithms handle well. However, because non-zero elements are unevenly distributed, the parallel computing capability of a GPU cannot be fully utilized. The parallel computing capability of CPUs is also rising as their core counts increase. However, while a GPU is computing, the CPU only controls the process instead of contributing to the computational work, which leaves the computing power of the CPU under-utilized. Given the characteristics of sparse matrices, the data can be split into two parts using the hybrid storage format and allocated to the CPU and the GPU for simultaneous computing. To take full advantage of the computing resources of both, this paper adopts the CPU–GPU heterogeneous computing model to improve the performance of SpMV. Based on an analysis of the characteristics of CPU and GPU, an optimization strategy for sparse matrix partitioning using a distribution function is proposed to improve the computing performance of SpMV on the heterogeneous computing platform. The experimental results on two test machines demonstrate noticeable performance improvement.
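The two-part split can be sketched as follows, assuming a HYB-style cut at a fixed number of non-zeros per row. The function names and the sample matrix are hypothetical, both partial products run serially here, and the sketch does not decide which part goes to which device. What it shows is that the two parts are independent and their results simply add.

```python
def split_hyb(rows, threshold):
    """rows[i] is a list of (column, value) pairs for row i.
    The first `threshold` entries of each row form the regular (ELL-like) part;
    the overflow is kept as (row, column, value) triples (COO-like).
    A real ELL layout would pad short rows; this sketch keeps them ragged."""
    regular = [row[:threshold] for row in rows]
    overflow = [(i, c, v) for i, row in enumerate(rows) for c, v in row[threshold:]]
    return regular, overflow

def spmv_regular(regular, x):
    return [sum(v * x[c] for c, v in row) for row in regular]

def spmv_overflow(overflow, x, n_rows):
    y = [0.0] * n_rows
    for i, c, v in overflow:
        y[i] += v * x[c]
    return y

def spmv_hyb(rows, x, threshold):
    regular, overflow = split_hyb(rows, threshold)
    # The two partial products share no state, so they could run on
    # different devices at the same time; here they run one after another.
    y_a = spmv_regular(regular, x)
    y_b = spmv_overflow(overflow, x, len(rows))
    return [a + b for a, b in zip(y_a, y_b)]

# Hypothetical 3x3 matrix with rows of length 2, 1, and 3.
rows = [[(0, 1.0), (2, 2.0)],
        [(1, 3.0)],
        [(0, 4.0), (1, 5.0), (2, 6.0)]]
y = spmv_hyb(rows, [1.0, 2.0, 3.0], threshold=1)
```

The result is the same for any threshold; the threshold only changes how much work lands in each part, which is the quantity a partitioning strategy would tune.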
The Journal of Supercomputing | 2017
Wangdong Yang; Kenli Li; Keqin Li
Solving block-tridiagonal systems is one of the key issues in numerical simulations of many scientific and engineering problems. For most block-tridiagonal matrices, non-zero elements are mainly concentrated in the blocks on the main diagonal, and the blocks above and below the main diagonal have few non-zero elements. Therefore, we present a solving method which mixes direct and iterative methods. In our method, the submatrices on the main diagonal are solved by direct methods within each iteration. Because the approximate solutions obtained by the direct methods are closer to the exact solutions, the convergence speed of solving the block-tridiagonal system of linear equations can be improved. Some direct methods perform well on small-scale equations, and the sub-equations can be solved in parallel. We present an improved algorithm that solves the sub-equations with thread blocks on GPU and stores the intermediate data in shared memory to significantly reduce memory access latency. Furthermore, we analyze a cloud resource scheduling model and obtain ten block-tridiagonal matrices produced by the simulation of the cloud-computing system. The computing performance of solving these block-tridiagonal systems of linear equations can be improved using our method.
Journal of Intelligent and Fuzzy Systems | 2016
Xiongwei Fei; Kenli Li; Wangdong Yang
In the open environment of cloud computing, a large amount of user data needs to be encrypted and decrypted quickly to maintain confidentiality and provide high quality of service. Advanced Encryption Standard (AES), the standard encryption algorithm, has better security and efficiency than competing algorithms, so it is widely used in cloud computing and other fields. However, software implementations of AES still suffer from low efficiency, whereas hardware implementations require purchasing special-purpose devices. Using special instruction sets can resolve both drawbacks. Therefore, we propose a fast parallel cryptographic algorithm, NIPAES, which is based on the AES-NI (New Instructions) instruction set and multiple CPU cores. NIPAES makes use of the block property of AES and the parallel property of Counter (CTR) mode, and adopts OpenMP to distribute workloads evenly across threads, each of which executes AES-NI instructions to complete encryption/decryption. Compared to CPU serial AES based on lookup tables, CPU parallel AES, and serial AES based on AES-NI, NIPAES significantly improves performance. The experimental results show that NIPAES achieves average speedups of 3197.78x, 196.12x, and 7.71x over these three algorithms, respectively.
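The parallel property of CTR mode that the abstract relies on can be shown with a short sketch: keystream block i depends only on the key, the nonce, and the counter i, so contiguous chunks can be processed independently and concatenated. This is not NIPAES. The block function below is a SHA-256 stand-in rather than AES (Python's standard library has no AES), the thread pool stands in for OpenMP, and all names are hypothetical. It must not be used for real encryption.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # bytes per block, matching the AES block size

def keystream_block(key, nonce, counter):
    # Stand-in for AES_encrypt(key, nonce || counter). NOT AES, NOT secure.
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()[:BLOCK]

def ctr_chunk(key, nonce, data, first_counter):
    """XOR `data` with the keystream starting at block index `first_counter`."""
    out = bytearray(len(data))
    for off in range(0, len(data), BLOCK):
        ks = keystream_block(key, nonce, first_counter + off // BLOCK)
        for i, byte in enumerate(data[off:off + BLOCK]):
            out[off + i] = byte ^ ks[i]
    return bytes(out)

def ctr_parallel(key, nonce, data, workers=4):
    """Split data into block-aligned chunks and process each one independently."""
    n_blocks = (len(data) + BLOCK - 1) // BLOCK
    step = max(1, (n_blocks + workers - 1) // workers) * BLOCK
    jobs = [(data[s:s + step], s // BLOCK) for s in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda job: ctr_chunk(key, nonce, job[0], job[1]), jobs)
        return b"".join(parts)

key, nonce = b"k" * 16, b"n" * 8
message = bytes(range(256)) * 3 + b"tail"
ciphertext = ctr_parallel(key, nonce, message)
recovered = ctr_parallel(key, nonce, ciphertext)  # CTR decryption is the same XOR
```

Because each chunk carries its own starting counter, the parallel result is byte-for-byte identical to a serial pass. In CPython the threads mainly illustrate the data independence; a compiled implementation is where the even workload split pays off.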
Journal of Computer and System Sciences | 2018
Wangdong Yang; Kenli Li; Keqin Li
For large-scale sparse matrices, SpMV cannot be processed on GPU using the common storage formats because of memory limitations. In addition, general formats parallelize poorly for sparse matrices with an extremely uneven distribution of non-zero elements, which degrades performance. This paper presents an optimal partitioning strategy based on the distribution of non-zero elements in a sparse matrix to improve the performance of SpMV, and uses a hybrid format, which mixes the CSR and ELL formats, to store the blocks partitioned from the sparse matrix. The hybrid blocked format compresses better and distributes non-zero elements more uniformly, which makes it suitable for more types of sparse matrices. Our partitioning strategy is proven to be optimal in that it yields the minimum parallel execution time on GPU.
IEEE Transactions on Big Data | 2017
Xiongwei Fei; Kenli Li; Wangdong Yang; Keqin Li
In cloud computing, the data produced by massive numbers of users form a data stream that needs to be protected by encryption to maintain confidentiality. Traditional serial encryption algorithms perform poorly and consume more energy because they do not consider the properties of streams. Therefore, we propose a velocity-aware parallel encryption algorithm with low energy consumption (LECPAES) for streams in cloud computing. The algorithm parallelizes Advanced Encryption Standard (AES) on a heterogeneous many-core architecture, adopts a sliding window to stabilize burst flows, senses the velocity of streams using window thresholds computed from frequency ratios, and dynamically scales the frequency of Graphics Processing Units (GPUs) to reduce energy consumption. Experiments on streams at different velocities and comparisons with related algorithms show that the algorithm reduces energy consumption while only slightly increasing the retransmission rate and slightly decreasing throughput. Therefore, LECPAES is an excellent algorithm for fast and energy-saving stream encryption.
IEEE Transactions on Computers | 2015
Wangdong Yang; Kenli Li; Zeyao Mo; Keqin Li
Parallel Computing | 2016
Xiongwei Fei; Kenli Li; Wangdong Yang; Keqin Li
Concurrency and Computation: Practice and Experience | 2016
Xiongwei Fei; Kenli Li; Wangdong Yang; Keqin Li