Chuanfu Xu
National University of Defense Technology
Publications
Featured research published by Chuanfu Xu.
Journal of Computational Physics | 2014
Chuanfu Xu; Xiaogang Deng; Lilun Zhang; Jianbin Fang; Guangxue Wang; Yi Jiang; Wei Cao; Yonggang Che; Yongxian Wang; Zhenghua Wang; Wei Liu; Xinghua Cheng
Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when coordinating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, using a tri-level hybrid and heterogeneous programming model of MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact finite difference schemes, WCNS and HDCS, that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform kernel optimizations specific to high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach, and we present a novel scheme to balance the loads between the memory-poor GPU and the memory-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, while the collaborative approach improves performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization that minimizes PCI-e transfers for the ghost and singularity data of 3D grid blocks, and we overlap the collaborative computation and communication as far as possible using advanced CUDA and MPI features. Scalability tests show that HOSTA achieves a parallel efficiency above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To the best of our knowledge, these are the largest-scale CPU-GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.
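The tri-level MPI + OpenMP + CUDA pattern described above can be illustrated with a minimal sketch (not HOSTA itself): each MPI rank owns a set of grid blocks and splits them between the GPU (CUDA kernel) and the CPU cores (OpenMP loop) according to an assumed load-balance ratio. The kernel, block counts, and the 0.7 GPU share are hypothetical placeholders.

```cuda
// Minimal sketch of tri-level MPI + OpenMP + CUDA heterogeneous
// collaboration: MPI across nodes, a CPU/GPU split of grid blocks
// within a rank, OpenMP on the CPU side and CUDA on the GPU side.
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void smooth_gpu(const double *in, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.5 * (in[i - 1] + in[i + 1]);   // toy stand-in stencil
}

static void smooth_cpu(const double *in, double *out, int n) {
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i)
        out[i] = 0.5 * (in[i - 1] + in[i + 1]);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);               // outer level: one rank per node
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nblocks = 8, n = 1 << 20;   // grid blocks owned by this rank
    const double gpu_share = 0.7;         // assumed load-balance ratio
    const int gpu_blocks = (int)(gpu_share * nblocks);

    double *in  = (double *)malloc(n * sizeof(double));
    double *out = (double *)malloc(n * sizeof(double));
    double *din, *dout;
    cudaMalloc(&din,  n * sizeof(double));
    cudaMalloc(&dout, n * sizeof(double));

    for (int b = 0; b < nblocks; ++b) {
        for (int i = 0; i < n; ++i) in[i] = (double)(b + i);
        if (b < gpu_blocks) {             // GPU path (CUDA level)
            cudaMemcpy(din, in, n * sizeof(double), cudaMemcpyHostToDevice);
            smooth_gpu<<<(n + 255) / 256, 256>>>(din, dout, n);
            cudaMemcpy(out, dout, n * sizeof(double), cudaMemcpyDeviceToHost);
        } else {                          // CPU path (OpenMP level)
            smooth_cpu(in, out, n);
        }
    }
    if (rank == 0) printf("out[1] = %f\n", out[1]);
    cudaFree(din); cudaFree(dout); free(in); free(out);
    MPI_Finalize();
    return 0;
}
```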
international supercomputing conference | 2013
Chuanfu Xu; Xiaogang Deng; Lilun Zhang; Yi Jiang; Wei Cao; Jianbin Fang; Yonggang Che; Yongxian Wang; Wei Liu
In this paper, using MPI+CUDA, we present a dual-level parallelization of a high-order CFD software for 3D, multi-block structured grids on the TianHe-1A supercomputer. A self-developed compact high-order finite difference scheme, HDCS, is used in the CFD software. Our GPU parallelization efficiently exploits both fine-grained data-level parallelism within a grid block and coarse-grained task-level parallelism among multiple grid blocks. Further, we perform multiple systematic optimizations for the high-order CFD scheme at both the CUDA-device level and the cluster level. We present performance results using up to 256 GPUs (with 114K+ processing cores) on TianHe-1A. We achieve a speedup of over 10 when comparing our GPU code on a Tesla M2050 with the serial code on a Xeon X5670, and our implementation scales well on TianHe-1A. With our method, we successfully simulate a flow over a high-lift airfoil configuration using 400 GPUs. To the best of the authors' knowledge, our work involves the largest-scale simulation on GPU-accelerated systems that solves a realistic CFD problem with complex configurations and high-order schemes.
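A hedged sketch of the dual-level idea above: coarse-grained task-level parallelism across grid blocks maps to one CUDA stream per block, while fine-grained data-level parallelism within a block maps to the CUDA thread grid. The smoothing kernel and sizes are illustrative stand-ins, not the HDCS scheme.

```cuda
// One CUDA stream per grid block (coarse-grained), one thread per
// cell within a block (fine-grained). Kernels for independent blocks
// may overlap on the device.
#include <cuda_runtime.h>
#include <stdio.h>

#define NBLOCKS 4
#define N (1 << 18)

__global__ void smooth(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}

int main(void) {
    float *d_in[NBLOCKS], *d_out[NBLOCKS];
    cudaStream_t s[NBLOCKS];
    for (int b = 0; b < NBLOCKS; ++b) {
        cudaMalloc(&d_in[b],  N * sizeof(float));
        cudaMalloc(&d_out[b], N * sizeof(float));
        cudaMemset(d_in[b], 0, N * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    // Independent grid blocks are issued to different streams.
    for (int b = 0; b < NBLOCKS; ++b)
        smooth<<<(N + 255) / 256, 256, 0, s[b]>>>(d_in[b], d_out[b], N);
    cudaDeviceSynchronize();
    for (int b = 0; b < NBLOCKS; ++b) {
        cudaStreamDestroy(s[b]);
        cudaFree(d_in[b]); cudaFree(d_out[b]);
    }
    printf("all blocks updated\n");
    return 0;
}
```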
The Computer Journal | 2015
Yonggang Che; Chuanfu Xu; Jianbin Fang; Yongxian Wang; Zhenghua Wang
This paper studies the performance characteristics of computational fluid dynamics (CFD) applications on the Intel Many Integrated Core (MIC) architecture. Three CFD applications, BT-MZ, LM3D and HOSTA, are evaluated on the Intel Knights Corner (KNC) coprocessor, the first public MIC product. The results show that the pure OpenMP scalability of these applications is not sufficient to exploit the potential of a KNC coprocessor. While the hybrid MPI/OpenMP programming model helps to improve parallel scalability, the maximum parallel speedup relative to a single thread is still not satisfactory. The OpenCL version of BT-MZ performs better than the OpenMP version but is not comparable to the MPI version or the hybrid MPI/OpenMP version. At the micro-architecture level, while the three CFD applications achieve reasonable instruction execution rates and L1 data cache hit rates and use a large percentage of vector instructions, they have low arithmetic density, incur very high branch misprediction rates, and do not utilize the Vector Processing Unit efficiently. As a result, they achieve very low single-thread floating-point efficiency. For these applications to attain performance on the MIC architecture competitive with that on Xeon processors, both the parallel scalability and the single-thread performance must be improved, which is a difficult task.
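For reference, the hybrid MPI/OpenMP model evaluated above follows the usual skeleton below: MPI ranks are placed across the coprocessor and OpenMP threads parallelize loops within each rank. This is a generic sketch, not code from BT-MZ, LM3D, or HOSTA; the reduction loop is a stand-in for real CFD work.

```cuda
// Generic hybrid MPI/OpenMP skeleton: ranks across the device,
// threads within a rank, a reduction as placeholder compute.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+ : local)
    for (int i = 0; i < 1000000; ++i)
        local += 1.0 / (1.0 + i);          // stand-in compute loop

    double global;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f (threads per rank: %d)\n",
               global, omp_get_max_threads());
    MPI_Finalize();
    return 0;
}
```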
Cluster Computing | 2014
Wei Cao; Chuanfu Xu; Zhenghua Wang; Lu Yao; Hua-yong Liu
High-order schemes have attracted increasing attention in computational fluid dynamics (CFD) simulations. As a class of high-order schemes, weighted compact nonlinear schemes (WCNSs) have been widely applied in large eddy simulations, direct numerical simulations, etc. However, due to their computational complexity, WCNSs require high-performance platforms. In recent years, the highly parallel graphics processing unit (GPU) has rapidly matured as a powerful engine for high-performance computing. In this paper, we present a high-order, double-precision solver for three-dimensional, compressible viscous flow using multi-block structured grids on GPU clusters. The solver uses the high-order WCNS scheme for spatial discretization and the Jacobi iteration method for temporal discretization. To exploit the computational capability of both CPU and GPU, we present a workload-balancing model for distributing work among CPUs and GPUs, and we design two strategies to overlap computation with communication. The performance analyses show that the single-GPU solver achieves about an 8× speedup relative to serial computation on a CPU core. The performance results validate the workload distribution scheme, and the strong and weak scaling analyses show that GPU clusters offer a significant performance advantage.
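The workload-balancing idea can be made concrete with a small sketch: given measured per-cell throughputs for the CPU cores and the GPU (the numbers below are assumptions, not figures from the paper), assign each device a share of cells so both finish a time step at roughly the same time.

```cuda
// Equal-time static split between CPU and GPU from assumed
// per-cell throughputs: t = n_gpu/gpu_rate = n_cpu/cpu_rate.
#include <stdio.h>

int main(void) {
    const double cpu_rate = 1.2e7;   // cells/s on the CPU cores (assumed)
    const double gpu_rate = 9.6e7;   // cells/s on one GPU (assumed)
    const long   ncells   = 64L * 1024 * 1024;

    double gpu_frac = gpu_rate / (gpu_rate + cpu_rate);
    long n_gpu = (long)(gpu_frac * ncells);
    long n_cpu = ncells - n_gpu;

    printf("GPU share %.1f%% -> %ld cells, CPU -> %ld cells\n",
           100.0 * gpu_frac, n_gpu, n_cpu);
    printf("expected step time: %.3f s\n", n_gpu / gpu_rate);
    return 0;
}
```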
Concurrency and Computation: Practice and Experience | 2016
Dali Li; Chuanfu Xu; Yongxian Wang; Zhifang Song; Min Xiong; Xiang Gao; Xiaogang Deng
The lattice Boltzmann method (LBM) is a widely used computational fluid dynamics (CFD) method for flow problems with complex geometries and various boundary conditions. Large-scale LBM simulations with increasing resolution and extended temporal range require massive high-performance computing (HPC) resources, motivating us to port the method onto modern many-core heterogeneous supercomputers like Tianhe-2. Although many-core accelerators such as graphics processing units and the Intel MIC have a dramatic advantage over CPUs in floating-point performance and power efficiency, they also pose tough challenges for parallelizing and optimizing CFD codes on large-scale heterogeneous systems.
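As background for the port described above, a minimal D2Q9 BGK collision kernel shows why LBM maps well to many-core hardware: the update is cell-local, with no dependencies between lattice sites. The 256×256 grid, tau = 0.6, and SoA layout are illustrative choices; streaming and boundary conditions are omitted.

```cuda
// Single-relaxation-time (BGK) D2Q9 collision step in CUDA.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define Q 9
__constant__ float w[Q]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                            1.f/36, 1.f/36, 1.f/36, 1.f/36};
__constant__ int   cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
__constant__ int   cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

__global__ void collide(float *f, int ncells, float tau) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= ncells) return;
    float rho = 0.f, ux = 0.f, uy = 0.f;
    for (int q = 0; q < Q; ++q) {            // macroscopic moments
        float fq = f[q * ncells + n];
        rho += fq; ux += cx[q] * fq; uy += cy[q] * fq;
    }
    ux /= rho; uy /= rho;
    float usq = ux * ux + uy * uy;
    for (int q = 0; q < Q; ++q) {            // BGK relaxation toward feq
        float cu  = cx[q] * ux + cy[q] * uy;
        float feq = w[q] * rho
                  * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * usq);
        f[q * ncells + n] += (feq - f[q * ncells + n]) / tau;
    }
}

int main(void) {
    const int ncells = 256 * 256;
    const float hw[Q] = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                         1.f/36, 1.f/36, 1.f/36, 1.f/36};
    float *hf = (float *)malloc(Q * ncells * sizeof(float));
    for (int q = 0; q < Q; ++q)              // rest state: rho = 1, u = 0
        for (int n = 0; n < ncells; ++n)
            hf[q * ncells + n] = hw[q];
    float *f; cudaMalloc(&f, Q * ncells * sizeof(float));
    cudaMemcpy(f, hf, Q * ncells * sizeof(float), cudaMemcpyHostToDevice);
    collide<<<(ncells + 255) / 256, 256>>>(f, ncells, 0.6f);
    cudaDeviceSynchronize();
    printf("one collision step done\n");
    cudaFree(f); free(hf);
    return 0;
}
```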
international parallel and distributed processing symposium | 2014
Chuanfu Xu; Lilun Zhang; Xiaogang Deng; Jianbin Fang; Guangxue Wang; Wei Cao; Yonggang Che; Yongxian Wang; Wei Liu
HOSTA is an in-house high-order CFD software package that can simulate complex flows with complex geometries. Large-scale high-order CFD simulations using HOSTA require massive HPC resources, motivating us to port it onto modern GPU-accelerated supercomputers like Tianhe-1A. To achieve a greater speedup and fully tap the potential of Tianhe-1A, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present multiple novel techniques to balance the loads between the memory-poor GPU and the memory-rich CPU, and to overlap the collaborative computation and communication as far as possible. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per Tianhe-1A node for HOSTA by 2.3×, while the collaborative approach improves performance by around 45% compared to the GPU-only approach. Scalability tests show that HOSTA can achieve a parallel efficiency above 60% on 1024 Tianhe-1A nodes. With our method, we have successfully simulated China's large civil airplane configuration C919, containing 150M grid cells. To the best of our knowledge, this is the first paper to report a CPU-GPU collaborative high-order accurate aerodynamic simulation result with such a complex grid geometry.
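One of the overlap techniques mentioned above can be sketched as follows: interior cells are updated in one CUDA stream while ghost-layer data moves over PCI-e in another, using pinned host memory and cudaMemcpyAsync so the copy proceeds concurrently with the kernel. The kernel, sizes, and ghost-buffer shape are hypothetical; in the real code the ghost data would feed an MPI halo exchange.

```cuda
// Overlapping interior computation with ghost-data PCI-e transfer
// using two CUDA streams and a pinned host buffer.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void interior_update(float *u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 1 && i < n - 1) u[i] += 1.0f;   // stand-in for a flux update
}

int main(void) {
    const int n = 1 << 22, ghost = 4096;
    float *d, *h_ghost;
    cudaMalloc(&d, n * sizeof(float));
    cudaMallocHost(&h_ghost, ghost * sizeof(float));  // pinned buffer
    cudaMemset(d, 0, n * sizeof(float));

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // The ghost transfer runs in its own stream, so it can overlap
    // with the interior kernel issued to the compute stream.
    cudaMemcpyAsync(h_ghost, d, ghost * sizeof(float),
                    cudaMemcpyDeviceToHost, copy);
    interior_update<<<(n + 255) / 256, 256, 0, compute>>>(d, n);

    cudaStreamSynchronize(copy);      // ghost data ready for MPI here
    cudaStreamSynchronize(compute);
    printf("ghost[0] = %f\n", h_ghost[0]);

    cudaStreamDestroy(compute); cudaStreamDestroy(copy);
    cudaFreeHost(h_ghost); cudaFree(d);
    return 0;
}
```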
The Journal of Supercomputing | 2014
Yonggang Che; Lilun Zhang; Yongxian Wang; Chuanfu Xu; Wei Liu; Zhenghua Wang
This paper comparatively evaluates the microarchitectural performance of two representative computational fluid dynamics (CFD) applications on the Intel Many Integrated Core (MIC) product, the Intel Knights Corner (KNC) coprocessor, and on the Intel Sandy Bridge (SNB) processor. A Performance Monitoring Unit (PMU)-based measurement method is used, along with a two-phase measurement scheme and several precautions to minimize errors and instabilities. The results show that the CFD applications are sensitive to architectural factors. Their single-thread performance and efficiency on KNC are much lower than on SNB. Branch prediction and memory access are the two primary factors behind the performance difference; the applications' low computational intensity and inefficient vector instruction usage are two additional factors. To be more efficient for these CFD applications, the MIC architecture needs to improve its branch prediction mechanism and memory hierarchy. Fine-tuning of application codes is also crucial, though it is hard work.
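To illustrate the kind of analysis behind these conclusions, the snippet below derives CPI, branch misprediction rate, and floating-point efficiency from raw PMU event counts. All counter values are invented for demonstration; they are not measurements from the paper.

```cuda
// Deriving microarchitectural metrics from raw PMU event counts.
#include <stdio.h>

int main(void) {
    // Hypothetical raw counts from a profiling run
    double cycles       = 4.8e11;
    double instructions = 2.1e11;
    double branches     = 2.4e10;
    double br_mispred   = 2.9e9;
    double flops        = 6.0e10;
    double peak_flops_s = 1.0e12;    // device peak (assumed)
    double elapsed_s    = 400.0;

    printf("CPI                  : %.2f\n", cycles / instructions);
    printf("branch mispred. rate : %.1f%%\n", 100.0 * br_mispred / branches);
    printf("FP efficiency        : %.2f%% of peak\n",
           100.0 * (flops / elapsed_s) / peak_flops_s);
    return 0;
}
```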
The Journal of Supercomputing | 2017
Dali Li; Chuanfu Xu; Bin Cheng; Min Xiong; Xiang Gao; Xiaogang Deng
As a typical Gauss–Seidel method, the lower-upper symmetric Gauss–Seidel (LU-SGS) method has inherently strong data dependencies that pose tough challenges for shared-memory parallelization. On early multi-core processors, the pipelined parallel LU-SGS approach achieves promising scalability. However, on emerging many-core processors such as the Xeon Phi, experience from our in-house high-order CFD program shows that the parallel efficiency drops dramatically, to less than 25%. In this paper, we model and analyze the performance of the pipelined parallel LU-SGS algorithm and present a two-level pipeline (TL-Pipeline) approach using nested OpenMP to further exploit fine-grained parallelism and mitigate the parallel performance bottlenecks. Our TL-Pipeline approach achieves 20% performance gains for a regular problem.
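The baseline (single-level) pipelined parallelization that TL-Pipeline builds on can be sketched on a model 2D lower sweep, where u[i][j] depends on u[i-1][j] and u[i][j-1]: columns are partitioned among threads, and thread t may process row i only after thread t-1 has finished it. This is a simplified illustration, not the paper's code; the TL-Pipeline approach would nest a second pipeline inside each chunk.

```cuda
// Single-level pipelined Gauss-Seidel-style sweep with OpenMP.
// Synchronization uses relaxed OpenMP atomics, which suffices on
// x86; fully portable code would add flushes or acquire/release.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NI 512
#define NJ 512

int main(void) {
    double *u = (double *)calloc((size_t)NI * NJ, sizeof(double));
    for (int j = 0; j < NJ; ++j) u[j] = 1.0;       // i = 0 boundary
    for (int i = 0; i < NI; ++i) u[i * NJ] = 1.0;  // j = 0 boundary

    int nt = omp_get_max_threads();
    int *progress = (int *)calloc(nt, sizeof(int)); // rows done per thread

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int chunk = (NJ - 1 + nt - 1) / nt;         // columns per thread
        int j0 = 1 + t * chunk;
        int j1 = (j0 + chunk < NJ) ? j0 + chunk : NJ;
        for (int i = 1; i < NI; ++i) {
            if (t > 0) {                   // wait for the left neighbor
                int done;
                do {
                    #pragma omp atomic read
                    done = progress[t - 1];
                } while (done < i);
            }
            for (int j = j0; j < j1; ++j)
                u[i * NJ + j] = 0.5 * (u[(i - 1) * NJ + j]
                                     + u[i * NJ + j - 1]);
            #pragma omp atomic write       // publish completion of row i
            progress[t] = i;
        }
    }
    printf("u[NI-1][NJ-1] = %f\n", u[(size_t)(NI - 1) * NJ + NJ - 1]);
    free(u); free(progress);
    return 0;
}
```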
International Journal of Computational Fluid Dynamics | 2017
Xiang Gao; Yidao Dong; Chuanfu Xu; Min Xiong; Zhenghua Wang; Xiaogang Deng
ACA | 2014
Yonggang Che; Lilun Zhang; Yongxian Wang; Chuanfu Xu; Wei Liu; Xinghua Cheng