Yongxian Wang
National University of Defense Technology
Publication
Featured research published by Yongxian Wang.
Journal of Computational Physics | 2014
Chuanfu Xu; Xiaogang Deng; Lilun Zhang; Jianbin Fang; Guangxue Wang; Yi Jiang; Wei Cao; Yonggang Che; Yongxian Wang; Zhenghua Wang; Wei Liu; Xinghua Cheng
Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when coordinating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, using a tri-level hybrid, heterogeneous programming model combining MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact finite difference schemes, WCNS and HDCS, that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform specific kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we use the CPU and GPU collaboratively for HOSTA instead of a naive GPU-only approach, and we present a novel scheme to balance the load between the memory-poor GPU and the memory-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×; meanwhile, the collaborative approach improves performance by around 45% compared with the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize the number of PCI-e data transfers for the ghost and singularity data of 3D grid blocks, and we overlap the collaborative computation and communication as much as possible using advanced CUDA and MPI features. Scalability tests show that HOSTA achieves a parallel efficiency above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To the best of our knowledge, these are the largest-scale CPU-GPU collaborative simulations to solve realistic CFD problems with both complex configurations and high-order schemes.
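The gather/scatter optimization and the computation/communication overlap described in this abstract can be pictured with a minimal CUDA sketch. The kernel and buffer names below are invented for illustration and do not come from HOSTA: ghost-cell values are gathered into one contiguous device buffer so that a single PCI-e transfer replaces many small ones, the copy runs in a dedicated stream and overlaps interior computation in another stream, and the halo is then exchanged with MPI.

    // Minimal sketch of the gather/scatter + overlap idea (hypothetical names, not HOSTA code).
    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void pack_ghost(const double* field, const int* ghost_idx,
                               double* ghost_buf, int n_ghost) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_ghost) ghost_buf[i] = field[ghost_idx[i]];      // gather into contiguous buffer
    }

    __global__ void smooth_interior(double* field, int n_interior) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_interior) field[i] *= 0.999;                    // stand-in for the real stencil kernel
    }

    void exchange_and_compute(double* d_field, const int* d_ghost_idx, double* d_ghost_buf,
                              double* h_ghost_buf, int n_ghost, int n_interior,
                              cudaStream_t s_copy, cudaStream_t s_comp,
                              MPI_Comm comm, int neighbor) {
        int t = 256;
        pack_ghost<<<(n_ghost + t - 1) / t, t, 0, s_copy>>>(d_field, d_ghost_idx, d_ghost_buf, n_ghost);
        cudaMemcpyAsync(h_ghost_buf, d_ghost_buf, n_ghost * sizeof(double),
                        cudaMemcpyDeviceToHost, s_copy);           // one PCI-e transfer in the copy stream
        smooth_interior<<<(n_interior + t - 1) / t, t, 0, s_comp>>>(d_field, n_interior); // overlaps the copy
        cudaStreamSynchronize(s_copy);
        MPI_Sendrecv_replace(h_ghost_buf, n_ghost, MPI_DOUBLE, neighbor, 0,
                             neighbor, 0, comm, MPI_STATUS_IGNORE); // halo exchange between nodes
        // The received halo would then be copied back asynchronously and scattered by a mirror kernel.
        // h_ghost_buf should be pinned (cudaMallocHost) for the copy to be truly asynchronous.
    }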
International Conference on Computer Application and System Modeling | 2010
Wei Cao; Lu Yao; Zongzhe Li; Yongxian Wang; Zhenghua Wang
The sparse matrix-vector product (SpMV) is a key operation in engineering and scientific computing, and methods for implementing it efficiently in parallel are critical to the performance of many applications. Modern graphics processing units (GPUs), coupled with the advent of general-purpose programming environments such as NVIDIA's CUDA, have gained interest as a viable architecture for data-parallel general-purpose computation. SpMV implementations using CUDA based on common sparse matrix formats have already appeared; among them, the implementation based on the ELLPACK-R format performs best. However, in this implementation, when the maximum number of nonzeros per row differs substantially from the average, threads suffer from load imbalance. This paper proposes a new matrix storage format called ELLPACK-RP, which combines the ELLPACK-R format with the JAD format, and implements SpMV with CUDA on top of it. The results show that it reduces load imbalance and improves SpMV performance.
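A minimal CUDA kernel for SpMV in the baseline ELLPACK-R format (one thread per row, column-major storage for coalesced loads) makes the load-imbalance problem concrete: the loop length rl[row] varies from thread to thread within a warp. This sketch shows only the standard ELLPACK-R kernel, not the paper's ELLPACK-RP variant, which additionally applies a JAD-style row reordering.

    // Minimal ELLPACK-R SpMV kernel (one thread per row), for illustration only.
    // val/col are stored column-major with num_rows rows and max_nnz_per_row padded columns;
    // rl[row] holds the actual number of nonzeros in each row.
    __global__ void spmv_ellpack_r(int num_rows, const double* val, const int* col,
                                   const int* rl, const double* x, double* y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= num_rows) return;
        double dot = 0.0;
        int len = rl[row];                       // per-row loop length: the source of warp divergence
        for (int k = 0; k < len; ++k) {
            int idx = row + k * num_rows;        // column-major layout keeps loads coalesced
            dot += val[idx] * x[col[idx]];
        }
        y[row] = dot;
    }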
International Supercomputing Conference | 2013
Chuanfu Xu; Xiaogang Deng; Lilun Zhang; Yi Jiang; Wei Cao; Jianbin Fang; Yonggang Che; Yongxian Wang; Wei Liu
In this paper, using MPI+CUDA, we present a dual-level parallelization of a high-order CFD software package for 3D multi-block structured grids on the TianHe-1A supercomputer. A self-developed compact high-order finite difference scheme, HDCS, is used in the CFD software. Our GPU parallelization efficiently exploits both fine-grained data-level parallelism within a grid block and coarse-grained task-level parallelism among multiple grid blocks. Further, we perform multiple systematic optimizations for the high-order CFD scheme at both the CUDA-device level and the cluster level. We present performance results using up to 256 GPUs (with 114K+ processing cores) on TianHe-1A. We achieve a speedup of over 10 when comparing our GPU code on a Tesla M2050 with the serial code on a Xeon X5670, and our implementation scales well on TianHe-1A. With our method, we successfully simulate a flow over a high-lift airfoil configuration using 400 GPUs. To the best of the authors' knowledge, our work involves the largest-scale simulation on GPU-accelerated systems that solves a realistic CFD problem with complex configurations and high-order schemes.
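The dual-level scheme can be pictured as one CUDA stream per grid block (coarse-grained task-level parallelism) with the usual thread grid inside each kernel launch (fine-grained data-level parallelism). The sketch below uses invented names and a trivial placeholder kernel; it illustrates only the launch structure, not the HDCS scheme itself.

    // Illustrative launch structure only (hypothetical names): coarse-grained parallelism across
    // grid blocks via one CUDA stream per block, fine-grained parallelism across cells via threads.
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void advance_block(double* cells, int n_cells) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_cells) cells[i] += 1.0;                  // placeholder for the per-cell update
    }

    void advance_all_blocks(const std::vector<double*>& d_blocks, const std::vector<int>& sizes) {
        std::vector<cudaStream_t> streams(d_blocks.size());
        for (size_t b = 0; b < d_blocks.size(); ++b) cudaStreamCreate(&streams[b]);
        int t = 256;
        for (size_t b = 0; b < d_blocks.size(); ++b)       // kernels from different grid blocks
            advance_block<<<(sizes[b] + t - 1) / t, t, 0, streams[b]>>>(d_blocks[b], sizes[b]);
        for (size_t b = 0; b < d_blocks.size(); ++b) {     // may execute concurrently on the GPU
            cudaStreamSynchronize(streams[b]);
            cudaStreamDestroy(streams[b]);
        }
    }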
The Computer Journal | 2015
Yonggang Che; Chuanfu Xu; Jianbin Fang; Yongxian Wang; Zhenghua Wang
This paper studies the performance characteristics of computational fluid dynamics (CFD) applications on the Intel Many Integrated Core (MIC) architecture. Three CFD applications, BT-MZ, LM3D, and HOSTA, are evaluated on the Intel Knights Corner (KNC) coprocessor, the first public MIC product. The results show that the pure OpenMP scalability of these applications is not sufficient to exploit the potential of a KNC coprocessor. While the hybrid MPI/OpenMP programming model helps to improve parallel scalability, the maximum parallel speedup relative to a single thread is still not satisfactory. The OpenCL version of BT-MZ performs better than the OpenMP version but is not comparable to the MPI version or the hybrid MPI/OpenMP version. At the micro-architecture level, although the three CFD applications achieve reasonable instruction execution rates and L1 data cache hit rates and use a large percentage of vector instructions, they have low arithmetic intensity, incur very high branch misprediction rates, and do not use the Vector Processing Unit efficiently. As a result, they achieve very low single-thread floating-point efficiency. For these applications to attain performance on the MIC architecture competitive with Xeon processors, both the parallel scalability and the single-thread performance must be improved, which is a difficult task.
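For reference, the hybrid MPI/OpenMP model evaluated here generally takes the shape of the skeleton below: several MPI ranks per coprocessor, each running an OpenMP thread team. This is a generic sketch, not code from BT-MZ, LM3D, or HOSTA.

    // Generic hybrid MPI + OpenMP skeleton of the kind evaluated above (not code from the paper).
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);  // only the master thread calls MPI
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        #pragma omp parallel                                            // OpenMP team inside each MPI rank
        {
            #pragma omp for
            for (int i = 0; i < 1000000; ++i) { /* per-cell work of this rank's subdomain */ }
            #pragma omp master
            printf("rank %d ran %d threads\n", rank, omp_get_num_threads());
        }
        MPI_Finalize();
        return 0;
    }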
Concurrency and Computation: Practice and Experience | 2016
Dali Li; Chuanfu Xu; Yongxian Wang; Zhifang Song; Min Xiong; Xiang Gao; Xiaogang Deng
The lattice Boltzmann method (LBM) is a widely used computational fluid dynamics (CFD) method for flow problems with complex geometries and various boundary conditions. Large-scale LBM simulations with increasing resolution and extended temporal range require massive high-performance computing (HPC) resources, motivating us to port the method onto modern many-core heterogeneous supercomputers such as Tianhe-2. Although many-core accelerators such as graphics processing units (GPUs) and the Intel MIC have a dramatic advantage over CPUs in floating-point performance and power efficiency, they also pose a tough challenge for parallelizing and optimizing CFD codes on large-scale heterogeneous systems.
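For readers unfamiliar with the method, the core of an LBM time step is a purely local BGK collision followed by streaming of the distribution functions, which is what makes it attractive for many-core accelerators. The sketch below shows a minimal D2Q9 BGK collision for one lattice node; it is purely illustrative and unrelated to the authors' Tianhe-2 code.

    // Minimal D2Q9 BGK collision for a single lattice node (illustrative only).
    // f holds the 9 distribution functions of one node; tau is the relaxation time.
    void bgk_collide(double f[9], double tau) {
        const double w[9]  = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/36, 1.0/36, 1.0/36, 1.0/36};
        const int    cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
        const int    cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
        double rho = 0.0, ux = 0.0, uy = 0.0;
        for (int i = 0; i < 9; ++i) { rho += f[i]; ux += cx[i] * f[i]; uy += cy[i] * f[i]; }
        ux /= rho; uy /= rho;                              // macroscopic velocity
        double usq = ux * ux + uy * uy;
        for (int i = 0; i < 9; ++i) {
            double cu = cx[i] * ux + cy[i] * uy;
            double feq = w[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq);
            f[i] -= (f[i] - feq) / tau;                    // relax toward local equilibrium
        }
        // A full time step would follow this with streaming: propagate f[i] to the neighbor in (cx, cy).
    }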
International Parallel and Distributed Processing Symposium | 2014
Chuanfu Xu; Lilun Zhang; Xiaogang Deng; Jianbin Fang; Guangxue Wang; Wei Cao; Yonggang Che; Yongxian Wang; Wei Liu
HOSTA is an in-house high-order CFD software package that can simulate complex flows with complex geometries. Large-scale high-order CFD simulations using HOSTA require massive HPC resources, motivating us to port it onto modern GPU-accelerated supercomputers such as Tianhe-1A. To achieve a greater speedup and fully tap the potential of Tianhe-1A, we use the CPU and GPU collaboratively for HOSTA instead of a naive GPU-only approach. We present multiple novel techniques to balance the load between the memory-poor GPU and the memory-rich CPU and to overlap the collaborative computation and communication as much as possible. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per Tianhe-1A node for HOSTA by 2.3×; meanwhile, the collaborative approach improves performance by around 45% compared with the GPU-only approach. Scalability tests show that HOSTA achieves a parallel efficiency above 60% on 1024 Tianhe-1A nodes. With our method, we have successfully simulated China's large civil airplane configuration, the C919, containing 150M grid cells. To the best of our knowledge, this is the first paper to report a CPU-GPU collaborative high-order accurate aerodynamic simulation result with such a complex grid geometry.
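The collaborative CPU-GPU idea can be sketched as splitting the cells of each compute phase between the device and the host by a tunable ratio, with the host share processed by OpenMP threads while the GPU kernel runs asynchronously. The names and the simple static split below are assumptions for illustration, not HOSTA's actual load-balancing techniques.

    // Illustrative CPU-GPU work split (hypothetical names): a fraction gpu_share of the cells is
    // processed on the GPU while the rest runs on the CPU with OpenMP, so the memory-rich host
    // both enlarges the feasible problem size and contributes compute.
    #include <omp.h>
    #include <cuda_runtime.h>

    __global__ void update_cells_gpu(double* cells, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) cells[i] *= 1.01;                       // placeholder per-cell update
    }

    void update_cells_collaborative(double* d_cells, double* h_cells, int n_total, double gpu_share) {
        int n_gpu = static_cast<int>(n_total * gpu_share); // tuned so GPU and CPU finish together
        int n_cpu = n_total - n_gpu;
        int t = 256;
        if (n_gpu > 0)
            update_cells_gpu<<<(n_gpu + t - 1) / t, t>>>(d_cells, n_gpu);  // asynchronous w.r.t. the host
        #pragma omp parallel for
        for (int i = 0; i < n_cpu; ++i)
            h_cells[i] *= 1.01;                            // CPU processes its own share concurrently
        cudaDeviceSynchronize();                           // both shares done before the next phase
    }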
The Journal of Supercomputing | 2014
Yonggang Che; Lilun Zhang; Yongxian Wang; Chuanfu Xu; Wei Liu; Zhenghua Wang
This paper comparatively evaluates the microarchitectural performance of two representative computational fluid dynamics (CFD) applications on the Intel Many Integrated Core (MIC) product, the Intel Knights Corner (KNC) coprocessor, and on the Intel Sandy Bridge (SNB) processor. A Performance Monitoring Unit (PMU)-based measurement method is used, along with a two-phase measurement procedure and precautions to minimize errors and instability. The results show that the CFD applications are sensitive to architectural factors. Their single-thread performance and efficiency on KNC are much lower than those on SNB. Branch prediction and memory access are the two primary factors behind the performance difference; the applications' low computational intensity and inefficient vector instruction usage are two additional factors. To be more efficient for CFD applications, the MIC architecture needs to improve its branch prediction mechanism and memory hierarchy. Fine-tuning of application codes is also crucial, though it is hard work.
International Conference on Intelligent Human-Machine Systems and Cybernetics | 2010
Lu Yao; Wei Cao; Zongzhe Li; Yongxian Wang; Zhenghua Wang
The independent set ordering algorithm is a heuristic, based on finding maximal independent sets of vertices in the matrix adjacency graph, that is commonly used for parallel matrix factorization. However, it has disadvantages when applied to large-scale sparse linear systems. In this paper, we propose an improved algorithm that finds an independent set of optimal size in each elimination step rather than a maximal independent set; both theoretical analysis and a parallel implementation show the improvement to be effective.
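As a point of reference, the baseline heuristic being improved builds an elimination ordering by repeatedly extracting an independent set from the matrix adjacency graph. The greedy sketch below (invented names) shows that baseline, with the set size capped per step as a stand-in for the paper's optimal-size idea; the paper's actual rule for choosing the size is not reproduced here.

    // Greedy sketch of independent-set ordering on the matrix adjacency graph (illustrative only).
    #include <vector>
    #include <cstddef>

    // adj[v] lists the neighbours of vertex v. Returns vertices grouped by elimination step;
    // max_set_size caps each step's set (set it to adj.size() for the maximal-set baseline).
    std::vector<std::vector<int>> independent_set_ordering(const std::vector<std::vector<int>>& adj,
                                                           std::size_t max_set_size) {
        std::size_t n = adj.size(), done = 0;
        std::vector<bool> eliminated(n, false);
        std::vector<std::vector<int>> levels;
        while (done < n) {
            std::vector<bool> blocked(n, false);   // blocked = adjacent to a vertex chosen this step
            std::vector<int> level;
            for (std::size_t v = 0; v < n && level.size() < max_set_size; ++v) {
                if (eliminated[v] || blocked[v]) continue;
                level.push_back(static_cast<int>(v));          // v joins the independent set
                for (int u : adj[v]) blocked[u] = true;        // its neighbours must wait
            }
            for (int v : level) eliminated[v] = true;
            done += level.size();
            levels.push_back(level);               // vertices in one level can be eliminated in parallel
        }
        return levels;
    }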
ACA | 2014
Yonggang Che; Lilun Zhang; Yongxian Wang; Chuanfu Xu; Wei Liu; Xinghua Cheng
This paper reports our experience optimizing the performance of a high-order, highly accurate computational fluid dynamics (CFD) application (HOSTA) on state-of-the-art multicore processors and the emerging Intel Many Integrated Core (MIC) coprocessor. We focus on effective loop vectorization and memory access optimization. A series of techniques, including data structure transformations, procedure inlining, compiler SIMDization, OpenMP loop collapsing, and the use of huge pages, is explored. Detailed execution times and event counts from the Performance Monitoring Units are measured. The results show that our optimizations improve the performance of HOSTA by 1.61× on a compute node with two Intel Sandy Bridge processors and by 1.97× on an Intel Knights Corner coprocessor, the first public MIC product. The microarchitecture-level effects of these optimizations are also discussed.
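Two of the listed techniques, OpenMP loop collapsing and compiler SIMDization, can be illustrated on a generic stencil-style loop nest. This is a made-up example, not a kernel from HOSTA: collapse(2) exposes the fused j-k iteration space to the large thread count of a KNC coprocessor, and omp simd asks the compiler to vectorize the unit-stride inner loop.

    // Generic illustration of OpenMP loop collapsing plus compiler SIMDization (not HOSTA code).
    void smooth(const double* in, double* out, int nj, int nk, int ni) {
        #pragma omp parallel for collapse(2)          // fuse j and k loops for the thread team
        for (int j = 1; j < nj - 1; ++j) {
            for (int k = 1; k < nk - 1; ++k) {
                const double* row = in  + (j * nk + k) * ni;
                double*       dst = out + (j * nk + k) * ni;
                #pragma omp simd                       // vectorize the unit-stride inner loop
                for (int i = 1; i < ni - 1; ++i)
                    dst[i] = 0.5 * row[i] + 0.25 * (row[i - 1] + row[i + 1]);
            }
        }
    }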
Biomedical Engineering and Informatics | 2011
Lu Yao; Zhenghua Wang; Zongzhe Li; Wei Cao; Yongxian Wang
When a multilevel scheme is applied to the static graph partitioning problem, the state-of-the-art coarsening schemes, which depend mainly on finding maximal matchings to obtain successively coarser graphs, have shortcomings and limitations, especially when partitioning irregular graphs; these can cause the multilevel algorithms to produce poor-quality solutions. This paper proposes an improved coarsening scheme that re-collapses the matching in each coarsening step. The new coarsening scheme produces higher-quality coarsenings, as demonstrated by both theoretical analysis and experimental results.
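As background, the standard coarsening step being improved contracts the graph along a heavy-edge matching. The sketch below (invented names) shows only that baseline; the paper's re-collapsing refinement of the matching is not reproduced here.

    // Standard heavy-edge matching coarsening step (illustrative baseline only).
    #include <vector>
    #include <utility>

    // adj[v] holds (neighbour, edge weight) pairs. Returns cmap: vertex -> coarse vertex id.
    std::vector<int> match_and_coarsen(const std::vector<std::vector<std::pair<int, int>>>& adj) {
        int n = static_cast<int>(adj.size());
        std::vector<int> match(n, -1), cmap(n, -1);
        for (int v = 0; v < n; ++v) {                      // visit vertices and match each unmatched one
            if (match[v] != -1) continue;
            int best = -1, best_w = -1;
            for (const auto& e : adj[v])                   // pick the heaviest edge to an unmatched vertex
                if (match[e.first] == -1 && e.first != v && e.second > best_w) {
                    best = e.first; best_w = e.second;
                }
            match[v] = (best == -1) ? v : best;            // unmatched vertices are matched with themselves
            match[match[v]] = v;
        }
        int next = 0;
        for (int v = 0; v < n; ++v)                        // each matched pair collapses to one coarse vertex
            if (cmap[v] == -1) { cmap[v] = next; cmap[match[v]] = next; ++next; }
        return cmap;
    }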