Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Lilun Zhang is active.

Publication


Featured research published by Lilun Zhang.


Journal of Computational Physics | 2014

Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

Chuanfu Xu; Xiaogang Deng; Lilun Zhang; Jianbin Fang; Guangxue Wang; Yi Jiang; Wei Cao; Yonggang Che; Yongxian Wang; Zhenghua Wang; Wei Liu; Xinghua Cheng

Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when CPUs and accelerators must collaborate to fully tap the potential of heterogeneous systems. In this paper, using a tri-level hybrid and heterogeneous programming model combining MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact finite difference schemes, WCNS and HDCS, that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3X, while the collaborative approach improves performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-E data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To the best of our knowledge, these are the largest-scale CPU-GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.
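
The load-balancing idea above, splitting work between a fast but memory-limited ("store-poor") GPU and a slower, memory-rich CPU, can be sketched as a static partition of grid cells. This is an illustrative reconstruction, not the authors' actual scheme; the speedup ratio, bytes per cell, and memory size are hypothetical:

```python
def split_cells(total_cells, gpu_speedup, cell_bytes, gpu_mem_bytes):
    """Split grid cells between CPU and GPU proportionally to relative
    throughput, capped by what fits in the smaller GPU memory."""
    gpu_share = gpu_speedup / (1.0 + gpu_speedup)  # GPU fraction by speed
    gpu_cells = int(total_cells * gpu_share)
    gpu_cap = gpu_mem_bytes // cell_bytes          # memory-limited cap
    gpu_cells = min(gpu_cells, gpu_cap)
    return total_cells - gpu_cells, gpu_cells      # (cpu_cells, gpu_cells)

# Illustrative numbers: GPU ~1.3x faster than the CPUs, 3 GB usable GPU
# memory, 200 bytes of state per cell.
cpu_cells, gpu_cells = split_cells(10_000_000, 1.3, 200, 3 * 1024**3)
```

When the memory cap binds, the leftover cells spill back to the CPU, which is consistent with how a collaborative split can fit a larger problem per node than a GPU-only one.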


international supercomputing conference | 2013

Parallelizing a High-Order CFD Software for 3D, Multi-block, Structural Grids on the TianHe-1A Supercomputer

Chuanfu Xu; Xiaogang Deng; Lilun Zhang; Yi Jiang; Wei Cao; Jianbin Fang; Yonggang Che; Yongxian Wang; Wei Liu

In this paper, with MPI+CUDA, we present a dual-level parallelization of a high-order CFD software for 3D, multi-block structured grids on the TianHe-1A supercomputer. A self-developed compact high-order finite difference scheme, HDCS, is used in the CFD software. Our GPU parallelization can efficiently exploit both fine-grained data-level parallelism within a grid block and coarse-grained task-level parallelism among multiple grid blocks. Further, we perform multiple systematic optimizations for the high-order CFD scheme at the CUDA-device level and the cluster level. We present performance results using up to 256 GPUs (with 114K+ processing cores) on TianHe-1A. We achieve a speedup of over 10 when comparing our GPU code on a Tesla M2050 with the serial code on a Xeon X5670, and our implementation scales well on TianHe-1A. With our method, we successfully simulate a flow over a high-lift airfoil configuration using 400 GPUs. To the best of the authors' knowledge, our work involves the largest-scale simulation on GPU-accelerated systems that solves a realistic CFD problem with complex configurations and high-order schemes.
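
The dual-level scheme described here maps cells within a block to fine-grained GPU threads and distributes whole blocks at a coarser level. The coarse level can be sketched as a greedy assignment of blocks to concurrent queues (stand-ins for CUDA streams; the largest-first heuristic is an assumption, not the paper's algorithm):

```python
def assign_blocks_to_streams(block_sizes, n_streams):
    """Coarse-grained level: distribute grid blocks over n_streams queues
    (stand-ins for CUDA streams) so blocks can run concurrently; within
    each block, cells would map to GPU threads (fine-grained level)."""
    streams = [[] for _ in range(n_streams)]
    # Largest-first greedy keeps per-stream work roughly balanced.
    for idx in sorted(range(len(block_sizes)), key=lambda i: -block_sizes[i]):
        lightest = min(streams, key=lambda s: sum(block_sizes[i] for i in s))
        lightest.append(idx)
    return streams
```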


international parallel and distributed processing symposium | 2014

Balancing CPU-GPU Collaborative High-Order CFD Simulations on the Tianhe-1A Supercomputer

Chuanfu Xu; Lilun Zhang; Xiaogang Deng; Jianbin Fang; Guangxue Wang; Wei Cao; Yonggang Che; Yongxian Wang; Wei Liu

HOSTA is an in-house high-order CFD software that can simulate complex flows with complex geometries. Large-scale high-order CFD simulations using HOSTA require massive HPC resources, motivating us to port it onto modern GPU-accelerated supercomputers like Tianhe-1A. To achieve a greater speedup and fully tap the potential of Tianhe-1A, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present multiple novel techniques to balance the loads between the store-poor GPU and the store-rich CPU, and overlap the collaborative computation and communication as far as possible. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per Tianhe-1A node for HOSTA by 2.3X, while the collaborative approach improves performance by around 45% compared to the GPU-only approach. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 Tianhe-1A nodes. With our method, we have successfully simulated China's large civil airplane configuration C919, containing 150M grid cells. To the best of our knowledge, this is the first paper that reports a CPU-GPU collaborative high-order accurate aerodynamic simulation result with such a complex grid geometry.
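
The scalability figure quoted here (above 60% parallel efficiency on 1024 nodes) follows the usual strong-scaling definition, which is easy to state explicitly; the run times in the example are hypothetical:

```python
def parallel_efficiency(t_ref, t_n, n):
    """Strong-scaling parallel efficiency: speedup over n nodes divided
    by n, where t_ref is the single-node time and t_n the n-node time."""
    return (t_ref / t_n) / n

# Hypothetical illustration: a run taking 1024 s on one node and 1.6 s
# on 1024 nodes has efficiency 0.625, i.e. above 60%.
eff = parallel_efficiency(1024.0, 1.6, 1024)
```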


The Journal of Supercomputing | 2014

Microarchitectural performance comparison of Intel Knights Corner and Intel Sandy Bridge with CFD applications

Yonggang Che; Lilun Zhang; Yongxian Wang; Chuanfu Xu; Wei Liu; Zhenghua Wang

This paper comparatively evaluates the microarchitectural performance of two representative Computational Fluid Dynamics (CFD) applications on the Intel Many Integrated Core (MIC) product, the Intel Knights Corner (KNC) coprocessor, and on the Intel Sandy Bridge (SNB) processor. A Performance Monitoring Unit (PMU)-based measurement method is used, along with a two-phase measurement method and several considerations to minimize errors and instabilities. The results show that the CFD applications are sensitive to architectural factors. Their single-thread performance and efficiency on KNC are much lower than on SNB. Branch prediction and memory access are the two primary factors behind the performance difference; the applications' low computational intensity and inefficient vector instruction usage are two additional factors. To be more efficient for CFD applications, the MIC architecture needs to improve its branch prediction mechanism and memory hierarchy. Fine-tuning of application codes is also crucial, though laborious.
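
The two-phase measurement idea, measuring the harness separately from the region of interest and repeating runs to suppress instability, can be sketched with wall-clock timers. The paper uses hardware PMU counters; this timer-based version is only an analogy in spirit, not the authors' method:

```python
import time

def _best(f, repeats):
    """Minimum over repeats filters out scheduling and warm-up noise."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        f()
        times.append(time.perf_counter() - t0)
    return min(times)

def measure(region, repeats=5):
    overhead = _best(lambda: None, repeats)  # phase 1: empty harness
    raw = _best(region, repeats)             # phase 2: region of interest
    return max(raw - overhead, 0.0)          # overhead-corrected time
```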


ACA | 2014

Performance Optimization of a CFD Application on Intel Multicore and Manycore Architectures

Yonggang Che; Lilun Zhang; Yongxian Wang; Chuanfu Xu; Wei Liu; Xinghua Cheng

This paper reports our experience optimizing the performance of a high-order, high-accuracy Computational Fluid Dynamics (CFD) application (HOSTA) on a state-of-the-art multicore processor and the emerging Intel Many Integrated Core (MIC) coprocessor. We focus on effective loop vectorization and memory access optimization. A series of techniques, including data structure transformations, procedure inlining, compiler SIMDization, OpenMP loop collapsing, and the use of Huge Pages, are explored. Detailed execution times and event counts from Performance Monitoring Units are measured. The results show that our optimizations improve the performance of HOSTA by 1.61× on a compute node with two Intel Sandy Bridge processors and by 1.97× on an Intel Knights Corner coprocessor, the publicly available MIC product. The microarchitecture-level effects of these optimizations are also discussed.
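
Of the techniques listed, the data-structure transformation is the easiest to illustrate: converting an array of structures (AoS) into a structure of arrays (SoA) makes each field contiguous so loops over it can be SIMD-vectorized. A minimal sketch, with hypothetical flow-state field names:

```python
def aos_to_soa(cells):
    """Turn a list of per-cell records (AoS) into one contiguous list per
    field (SoA), the layout compilers can vectorize over."""
    if not cells:
        return {}
    return {field: [c[field] for c in cells] for field in cells[0]}

# Illustrative records; "rho" and "u" are hypothetical field names.
aos = [{"rho": 1.0, "u": 2.0}, {"rho": 3.0, "u": 4.0}]
soa = aos_to_soa(aos)
```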


Computers & Fluids | 2018

Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores

Yongxian Wang; Lilun Zhang; Wei Liu; Xinghua Cheng; Yu Zhuang; Anthony T. Chronopoulos

For computational fluid dynamics (CFD) applications with a large number of grid points/cells, parallel computing is a common and efficient strategy to reduce the computational time. How to achieve the best performance on a modern supercomputer, especially one with heterogeneous computing resources such as hybrid CPU+GPU or CPU + Intel Xeon Phi (MIC) coprocessors, is still a great challenge. An in-house parallel CFD code capable of simulating three-dimensional structured grid applications is developed and tested in this study. Several methods of parallelization, performance optimization, and code tuning, both on CPU-only homogeneous systems and on heterogeneous systems, are proposed, based on identifying the potential parallelism of applications, balancing the workload among all kinds of computing devices, tuning the multi-threaded code toward better performance within a machine node with hundreds of CPU/MIC cores, and optimizing the communication among nodes, among cores, and between CPUs and MICs. Several benchmark cases from model and/or industrial CFD applications are tested on the Tianhe-1A and Tianhe-2 supercomputers to evaluate the performance. Among these CFD cases, the maximum number of grid cells reached 780 billion. The tuned solver successfully scales up to half of the entire Tianhe-2 supercomputer system, with over 1.376 million heterogeneous cores. The test results and performance analysis are discussed in detail.


parallel computing | 2013

Accelerating High-Order CFD Simulations for Multi-block Structured Grids on the TianHe-1A Supercomputer

Chuanfu Xu; Wei Cao; Lilun Zhang; Guangxue Wang; Yonggang Che; Yongxian Wang; Wei Liu

In this paper, we present an MPI-CUDA implementation of our in-house CFD software HOSTA to accelerate large-scale high-order CFD simulations on the TianHe-1A supercomputer. HOSTA employs a fifth-order weighted compact nonlinear scheme (WCNS-E5) for flux calculation and a Runge-Kutta method for time integration. In our GPU parallelization scheme, we use CUDA thread blocks to efficiently exploit fine-grained parallelism within a 3D grid block, and CUDA multiple streams to exploit coarse-grained parallelism among multiple grid blocks. At the CUDA-device level, we decompose complex flux kernels to optimize GPU performance. At the cluster level, we present a Scatter-Gather optimization to reduce the PCI-E data transfer times for 3D block boundary/singularity data, and we overlap MPI communication and GPU execution. We achieve a speedup of about 10 when comparing our GPU code on a Tesla M2050 with the serial code on a Xeon X5670, and our implementation scales well to 128 GPUs on TianHe-1A.
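
The Scatter-Gather optimization mentioned here packs many small boundary/singularity arrays into one contiguous buffer so a single PCI-E transfer replaces one transfer per face. A minimal sketch of the packing and unpacking (face names and list-based buffers are illustrative stand-ins for device memory):

```python
def gather(faces):
    """Pack named ghost/boundary arrays into one contiguous buffer,
    recording (offset, length) per face for the reverse scatter."""
    buf, offsets, pos = [], {}, 0
    for name, data in faces.items():
        offsets[name] = (pos, len(data))
        buf.extend(data)
        pos += len(data)
    return buf, offsets

def scatter(buf, offsets):
    """Unpack the contiguous buffer back into per-face arrays."""
    return {name: buf[o:o + n] for name, (o, n) in offsets.items()}

# Illustrative face names for a block's boundary data.
faces = {"xmin": [1, 2], "xmax": [3], "ymin": [4, 5, 6]}
buf, offsets = gather(faces)
```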


international conference on performance engineering | 2014

Test-driving Intel Xeon Phi

Jianbin Fang; Henk J. Sips; Lilun Zhang; Chuanfu Xu; Yonggang Che; Ana Lucia Varbanescu


arXiv: Distributed, Parallel, and Cluster Computing | 2013

An Empirical Study of Intel Xeon Phi

Jianbin Fang; Ana Lucia Varbanescu; Henk J. Sips; Lilun Zhang; Yonggang Che; Chuanfu Xu


Archive | 2013

Benchmarking Intel Xeon Phi to Guide Kernel Design

Jianbin Fang; Ana Lucia Varbanescu; Henk J. Sips; Lilun Zhang; Yonggang Che; Chuanfu Xu

Collaboration


Dive into Lilun Zhang's collaborations.

Top Co-Authors

Chuanfu Xu, National University of Defense Technology
Yonggang Che, National University of Defense Technology
Wei Liu, National University of Defense Technology
Yongxian Wang, National University of Defense Technology
Zhenghua Wang, National University of Defense Technology
Jianbin Fang, Delft University of Technology
Xinghua Cheng, National University of Defense Technology
Wei Cao, National University of Defense Technology
Guangxue Wang, China Aerodynamics Research and Development Center
Xiaogang Deng, National University of Defense Technology