Yonggang Che
National University of Defense Technology
Publication
Featured research published by Yonggang Che.
Journal of Computational Physics | 2014
Chuanfu Xu; Xiaogang Deng; Lilun Zhang; Jianbin Fang; Guangxue Wang; Yi Jiang; Wei Cao; Yonggang Che; Yongxian Wang; Zhenghua Wang; Wei Liu; Xinghua Cheng
Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when coordinating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, using a tri-level hybrid, heterogeneous programming model combining MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact finite difference schemes, WCNS and HDCS, that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform targeted kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we let the CPU and GPU collaborate in HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the memory-poor GPU and the memory-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×; meanwhile, the collaborative approach improves performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-E data transfer times for ghost and singularity data of 3D grid blocks, and we overlap the collaborative computation and communication as far as possible using advanced CUDA and MPI features. Scalability tests show that HOSTA achieves a parallel efficiency above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells.
To the best of our knowledge, these are the largest-scale CPU-GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.
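The CPU-GPU load-balancing idea described above can be sketched as a throughput-proportional split of grid cells, capped by the GPU's smaller device memory. This is an illustrative reconstruction, not the paper's actual algorithm; the function name and parameters are hypothetical.

```python
def split_cells(total_cells, gpu_speed, cpu_speed, gpu_capacity):
    """Assign the GPU a share of cells proportional to its throughput,
    capped by its (smaller) device-memory capacity; the CPU takes the rest.

    gpu_speed / cpu_speed are relative throughputs (e.g. cells per second);
    gpu_capacity is the maximum number of cells that fit in GPU memory.
    """
    gpu_share = int(total_cells * gpu_speed / (gpu_speed + cpu_speed))
    gpu_cells = min(gpu_share, gpu_capacity)  # the memory-poor device caps out first
    return gpu_cells, total_cells - gpu_cells
```

For example, with a GPU three times faster than the CPU but limited device memory, the memory cap (not throughput) can become the binding constraint, which is why accounting for it raises the maximum problem size per node.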
International Supercomputing Conference | 2013
Chuanfu Xu; Xiaogang Deng; Lilun Zhang; Yi Jiang; Wei Cao; Jianbin Fang; Yonggang Che; Yongxian Wang; Wei Liu
In this paper, using MPI + CUDA, we present a dual-level parallelization of a high-order CFD software for 3D, multi-block structured grids on the TianHe-1A supercomputer. A self-developed compact high-order finite difference scheme, HDCS, is used in the CFD software. Our GPU parallelization can efficiently exploit both fine-grained data-level parallelism within a grid block and coarse-grained task-level parallelism among multiple grid blocks. Further, we perform multiple systematic optimizations for the high-order CFD scheme at the CUDA-device level and the cluster level. We present performance results using up to 256 GPUs (with 114K+ processing cores) on TianHe-1A. We achieve a speedup of over 10 when comparing our GPU code on a Tesla M2050 with the serial code on a Xeon X5670, and our implementation scales well on TianHe-1A. With our method, we successfully simulate a flow over a high-lift airfoil configuration using 400 GPUs. To the authors' best knowledge, our work involves the largest-scale simulation on GPU-accelerated systems that solves a realistic CFD problem with complex configurations and high-order schemes.
The Computer Journal | 2015
Yonggang Che; Chuanfu Xu; Jianbin Fang; Yongxian Wang; Zhenghua Wang
This paper studies the performance characteristics of computational fluid dynamics (CFD) applications on the Intel Many Integrated Core (MIC) architecture. Three CFD applications, BT-MZ, LM3D and HOSTA, are evaluated on the Intel Knights Corner (KNC) coprocessor, the first public MIC product. The results show that the pure OpenMP scalability of these applications is not sufficient to exploit the potential of a KNC coprocessor. While the hybrid MPI/OpenMP programming model helps to improve parallel scalability, the maximum parallel speedup relative to a single thread is still not satisfactory. The OpenCL version of BT-MZ performs better than the OpenMP version but is not comparable to the MPI version or the hybrid MPI/OpenMP version. At the micro-architecture level, while the three CFD applications achieve reasonable instruction execution rates and L1 data cache hit rates and execute a large percentage of vector instructions, they have low arithmetic density, incur very high branch misprediction rates, and do not utilize the Vector Processing Unit efficiently. As a result, they achieve very low single-thread floating-point efficiency. For these applications to attain performance on the MIC architecture competitive with that on Xeon processors, both the parallel scalability and the single-thread performance must be improved, which is a difficult task.
International Parallel and Distributed Processing Symposium | 2014
Chuanfu Xu; Lilun Zhang; Xiaogang Deng; Jianbin Fang; Guangxue Wang; Wei Cao; Yonggang Che; Yongxian Wang; Wei Liu
HOSTA is an in-house high-order CFD software that can simulate complex flows with complex geometries. Large-scale high-order CFD simulations using HOSTA require massive HPC resources, motivating us to port it onto modern GPU-accelerated supercomputers like Tianhe-1A. To achieve a greater speedup and fully tap the potential of Tianhe-1A, we let the CPU and GPU collaborate in HOSTA instead of using a naive GPU-only approach. We present multiple novel techniques to balance the loads between the memory-poor GPU and the memory-rich CPU, and to overlap the collaborative computation and communication as far as possible. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per Tianhe-1A node for HOSTA by 2.3×; meanwhile, the collaborative approach improves performance by around 45% compared to the GPU-only approach. Scalability tests show that HOSTA achieves a parallel efficiency above 60% on 1024 Tianhe-1A nodes. With our method, we have successfully simulated China's large civil airplane configuration C919, containing 150M grid cells. To the best of our knowledge, this is the first paper that reports a CPU-GPU collaborative high-order accurate aerodynamic simulation result with such a complex grid geometry.
The Journal of Supercomputing | 2014
Yonggang Che; Lilun Zhang; Yongxian Wang; Chuanfu Xu; Wei Liu; Zhenghua Wang
This paper comparatively evaluates the microarchitectural performance of two representative Computational Fluid Dynamics (CFD) applications on the Intel Many Integrated Core (MIC) product, the Intel Knights Corner (KNC) coprocessor, and the Intel Sandy Bridge (SNB) processor. A Performance Monitoring Unit (PMU)-based measurement method is used, along with a two-phase measurement procedure and several precautions to minimize errors and instabilities. The results show that the CFD applications are sensitive to architectural factors. Their single-thread performance and efficiency on KNC are much lower than on SNB. Branch prediction and memory access are the two primary factors behind the performance difference; the applications' low computational intensity and inefficient vector instruction usage are two additional factors. To be more efficient for these CFD applications, the MIC architecture needs to improve its branch prediction mechanism and memory hierarchy. Fine-tuning the application codes is also crucial, though laborious.
ACA | 2014
Yonggang Che; Lilun Zhang; Yongxian Wang; Chuanfu Xu; Wei Liu; Xinghua Cheng
This paper reports our experience optimizing the performance of a high-order, highly accurate Computational Fluid Dynamics (CFD) application (HOSTA) on a state-of-the-art multicore processor and the emerging Intel Many Integrated Core (MIC) coprocessor. We focus on effective loop vectorization and memory access optimization. A series of techniques, including data structure transformations, procedure inlining, compiler SIMDization, OpenMP loop collapsing, and the use of Huge Pages, is explored. Detailed execution times and event counts from Performance Monitoring Units are measured. The results show that our optimizations improve the performance of HOSTA by 1.61× on a compute node with two Intel Sandy Bridge processors and by 1.97× on an Intel Knights Corner coprocessor, the first public MIC product. The microarchitecture-level effects of these optimizations are also discussed.
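One of the data structure transformations mentioned above can be illustrated as converting an array of structures (AoS) into a structure of arrays (SoA), which gives inner loops unit-stride access that a vectorizing compiler can SIMDize. This is a minimal sketch under assumed field names (the paper does not list HOSTA's actual layout, and HOSTA itself is not Python):

```python
def aos_to_soa(cells):
    """Convert a list of per-cell tuples (rho, u, v, w, p) — array of
    structures — into a dict mapping each field to its own list —
    structure of arrays. Loops over a single field then touch
    contiguous memory, which is what loop vectorization needs."""
    fields = ("rho", "u", "v", "w", "p")
    return {name: [cell[i] for cell in cells] for i, name in enumerate(fields)}
```

A usage example: `aos_to_soa([(1, 2, 3, 4, 5), (6, 7, 8, 9, 10)])["u"]` yields the contiguous velocity list `[2, 7]`, which an inner loop can sweep with stride one instead of hopping across whole records.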
Computer and Information Technology | 2010
Chuanfu Xu; Yonggang Che; Jianbin Fang; Zhenghua Wang
This paper addresses the optimization of parallel simulators for large-scale parallel systems and applications. Such simulators are often based on parallel discrete event simulation, with conservative or optimistic protocols to synchronize the simulating processes. The paper considers how available future information about events and application behaviors can be efficiently extracted and exploited to improve the performance of adaptive optimistic protocols. First, we extract information about future events and their dependencies from application traces to guide adaptive adjustments of the time window in trace-driven parallel simulation. Second, we use information about application behaviors, specifically the iterative behavior found in many applications, to avoid unnecessary adjustments of the time window. These techniques are implemented in the BigSim simulator and tested with real-world and standard benchmark applications, including Jacobi3D and HPL. The results show that our optimization approaches reduce simulation execution times by 11% to 32%. Moreover, our methods are easy to implement and do not require augmenting compilers or modifying the core code of parallel simulators.
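The second idea above, skipping time-window adjustments once iterative behavior is detected, can be sketched with a simple rollback-ratio heuristic and a steadiness test. The thresholds, function name, and steadiness criterion are illustrative assumptions, not BigSim's actual logic:

```python
def adjust_window(window, rollbacks, committed, history,
                  min_w=1.0, max_w=1e6):
    """Adaptively resize the optimistic time window.

    history holds recent rollback ratios; if the application is in a
    steady iterative phase (recent ratios nearly constant), the window
    is left alone, avoiding unnecessary adjustments."""
    ratio = rollbacks / max(committed, 1)
    steady = len(history) >= 3 and max(history[-3:]) - min(history[-3:]) < 0.05
    history.append(ratio)
    if steady:
        return window                      # iterative behavior: no change
    if ratio > 0.2:
        return max(min_w, window * 0.5)    # many rollbacks: shrink window
    if ratio < 0.05:
        return min(max_w, window * 2.0)    # few rollbacks: widen window
    return window
```

The design point is that halving/doubling reacts quickly to misprediction bursts, while the steadiness test damps oscillation during the long repetitive phases typical of iterative codes like Jacobi3D.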
2016 International Conference on Software Analysis, Testing and Evolution (SATE) | 2016
Yonggang Che; Chuanfu Xu; Zhenghua Wang
Powering (exponentiation) is an important operation in many computation-intensive workloads. This paper investigates the performance of different styles of calculating powering operations at the application level. A series of small benchmark codes that calculate powering operations in different ways is designed, and their performance is evaluated on Intel Xeon CPUs under Intel compilation environments. The results show that the number of floating-point operations and the resulting runtime are sensitive to the value of the exponent Y and how it is expressed. When Y is an immediate integer whose value is known at compile time, the cost of powering is much lower than when Y is an integer variable whose value is known only at runtime. When Y is declared as a real variable, the cost of powering is always high, whether or not its value equals an integer. Based on these investigations, performance optimizations are applied to a kernel subroutine from a real-world supersonic combustion simulation code that makes intensive use of powering operations. The result shows that the performance of that subroutine is improved by a factor of 13.25 on the Intel Xeon E5-2692 CPU.
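The effect described above, where a compile-time integer exponent is far cheaper than a generic library call, comes from strength reduction: the compiler replaces the call with a short chain of multiplications rather than an exp/log evaluation. A sketch of the underlying technique, exponentiation by squaring (illustrative, not the paper's benchmark code):

```python
def pow_int(x, n):
    """Raise x to a non-negative integer power n using exponentiation by
    squaring: O(log n) multiplications, versus an exp/log-based pow()
    evaluation for a general real exponent."""
    result = 1.0
    while n:
        if n & 1:          # current low bit set: fold this square into result
            result *= x
        x *= x             # square the base
        n >>= 1            # shift to the next bit of the exponent
    return result
```

For instance, `pow_int(x, 10)` needs only four squarings and two extra multiplies, which is the kind of rewrite a compiler can emit when the exponent is an immediate integer but cannot when it is a runtime real variable.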
Parallel Computing | 2018
Yonggang Che; Meifang Yang; Chuanfu Xu; Yutong Lu
Combustion simulation is complex and computationally expensive, as it integrates fundamental chemical kinetics with multidimensional Computational Fluid Dynamics (CFD) models. This paper presents our efforts porting a real-world supersonic combustion simulation application to a heterogeneous architecture consisting of multi-core CPUs and Intel Many Integrated Core (MIC) coprocessors. Scalable OpenMP parallelization is added to exploit the large number of cores on the CPUs and MIC coprocessors. Single-thread performance optimizations are applied to improve computational efficiency. A CPU-MIC collaborative algorithm, along with a series of techniques to improve data transfer efficiency and load balance, is employed. Performance evaluation is performed on the Tianhe-2 supercomputer. The results show that on a single node, the optimized CPU-only version is 8.33 times faster than the baseline version, and the CPU + MIC heterogeneous version is a further 3.07 times faster than the optimized CPU-only version. The resulting code scales effectively to 5120 nodes (998,400 cores) on a mesh with 27.46 billion cells. Even though our optimizations reduce the total number of floating-point operations by about 10 times, the heterogeneous version still achieves a sustained double-precision floating-point performance of 0.46 Pflops on 5120 nodes. This demonstrates Petascale heterogeneous computing capability for real-world supersonic combustion problems.
Parallel Computing | 2013
Chuanfu Xu; Wei Cao; Lilun Zhang; Guangxue Wang; Yonggang Che; Yongxian Wang; Wei Liu
In this paper, we present an MPI-CUDA implementation of our in-house CFD software HOSTA to accelerate large-scale high-order CFD simulations on the TianHe-1A supercomputer. HOSTA employs a fifth-order weighted compact nonlinear scheme (WCNS-E5) for flux calculation and a Runge-Kutta method for time integration. In our GPU parallelization scheme, we use CUDA thread blocks to efficiently exploit fine-grained parallelism within a 3D grid block, and CUDA multiple streams to exploit coarse-grained parallelism among multiple grid blocks. At the CUDA-device level, we decompose complex flux kernels to optimize GPU performance. At the cluster level, we present a scatter-gather optimization to reduce PCI-E data transfer times for 3D block boundary/singularity data, and we overlap MPI communication with GPU execution. We achieve a speedup of about 10 when comparing our GPU code on a Tesla M2050 with the serial code on a Xeon X5670, and our implementation scales well to 128 GPUs on TianHe-1A.
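The scatter-gather optimization above amounts to packing a block's scattered boundary faces into one contiguous buffer on the device, so all ghost data crosses the PCI-E bus in a single transfer instead of many small strided copies. A minimal host-side sketch of the gather step, with a flat row-major array standing in for a 3D grid block (the function and indexing are illustrative, not HOSTA's code):

```python
def gather_faces(block, nx, ny, nz):
    """Pack the six boundary faces of an nx*ny*nz grid block (stored as a
    flat row-major list) into one contiguous buffer. Corner/edge cells
    appear once per face they belong to; the point is a single
    contiguous transfer rather than faithful deduplication."""
    def idx(i, j, k):
        return (i * ny + j) * nz + k
    buf = []
    for j in range(ny):
        for k in range(nz):
            buf.append(block[idx(0, j, k)])       # i = 0 face
            buf.append(block[idx(nx - 1, j, k)])  # i = nx-1 face
    for i in range(nx):
        for k in range(nz):
            buf.append(block[idx(i, 0, k)])       # j = 0 face
            buf.append(block[idx(i, ny - 1, k)])  # j = ny-1 face
    for i in range(nx):
        for j in range(ny):
            buf.append(block[idx(i, j, 0)])       # k = 0 face
            buf.append(block[idx(i, j, nz - 1)])  # k = nz-1 face
    return buf
```

In the real code the gather runs on the GPU and the packed buffer is moved with one `cudaMemcpy`; the scatter step unpacks received ghost data in the reverse order. Collapsing many small transfers into one matters because each PCI-E transfer carries a fixed latency cost.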