Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Yunfei Du is active.

Publication


Featured research published by Yunfei Du.


International Conference on Cluster Computing | 2010

Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing

Canqun Yang; Feng Wang; Yunfei Du; Juan Chen; Jie Liu; Huizhan Yi; Kai Lu

In this paper, we describe our experience developing an implementation of the Linpack benchmark for TianHe-1, a petascale CPU/GPU supercomputer system and the largest GPU-accelerated system attempted to date. An adaptive optimization framework is presented to balance the workload distribution across the GPUs and CPUs with negligible runtime overhead, yielding better performance than static or training-based partitioning methods. The CPU-GPU communication overhead is effectively hidden by a software pipelining technique, which is particularly useful for large memory-bound applications. Combined with other traditional optimizations, the Linpack implementation we optimized using the adaptive framework achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result using the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list released in November 2009.
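The adaptive load balancing described in the abstract can be illustrated with a minimal sketch (all names here are hypothetical and illustrative, not the authors' implementation): after each iteration, the fraction of work assigned to the GPU is adjusted from the measured CPU and GPU times so that both devices are predicted to finish together.

```python
def adapt_split(ratio, t_cpu, t_gpu):
    """Adjust the fraction of work given to the GPU so that measured
    CPU and GPU times converge. Toy sketch of adaptive partitioning."""
    # Estimated per-unit cost on each device under the current split.
    cost_gpu = t_gpu / ratio
    cost_cpu = t_cpu / (1.0 - ratio)
    # New split equalizes predicted finish times: r*cost_gpu = (1-r)*cost_cpu.
    return cost_cpu / (cost_cpu + cost_gpu)

# Simulated devices: the GPU is 4x faster per unit of work.
def simulate(ratio, gpu_speed=4.0, work=100.0):
    return work * (1 - ratio), work * ratio / gpu_speed  # (t_cpu, t_gpu)

ratio = 0.5
for _ in range(10):
    t_cpu, t_gpu = simulate(ratio)
    ratio = adapt_split(ratio, t_cpu, t_gpu)
print(round(ratio, 3))  # converges to 0.8: the GPU gets 4/5 of the work
```

With constant per-unit costs the split converges in one step; in practice the measured times are noisy, which is why the framework re-adapts at runtime rather than relying on a static or pre-trained partition.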


Journal of Computer Science and Technology | 2011

Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer

Feng Wang; Canqun Yang; Yunfei Du; Juan Chen; Huizhan Yi; Weixia Xu

In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer in China and the largest GPU-accelerated heterogeneous system attempted to date. A hybrid programming model consisting of MPI, OpenMP, and streaming computing is described to exploit the task, thread, and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU-GPU communication, we present a software pipelining technique to hide the communication overhead. Combined with other traditional optimizations, the Linpack implementation we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained using the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.


IEEE Transactions on Parallel and Distributed Systems | 2009

FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing

Xuejun Yang; Yunfei Du; Panfeng Wang; Hongyi Fu; Jia Jia

As the size of large-scale computer systems increases, their mean time between failures is becoming significantly shorter than the execution time of many current scientific applications. To complete their execution, scientific applications must therefore tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of all such protocols is no less than the time between the last checkpoint and the crash. In this paper, we propose a new application-level fault-tolerant approach for parallel applications called the fault-tolerant parallel algorithm (FTPA), which provides fast self-recovery. When fail-stop failures occur and are detected, all surviving processes recompute the workload of the failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. To ease FTPA implementation, we developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler tool that automates the FTPA implementation. We evaluate the performance of FTPA with parallel matrix multiplication and five kernels of the NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that the performance of FTPA is better than that of the traditional checkpointing approach.
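The parallel-recomputing idea behind FTPA can be sketched with a toy model (hypothetical names, not the GiFT-generated code): when one worker fails, its share of the data is repartitioned across the survivors, so recovery time shrinks with the number of surviving processes instead of being paid by a single processor.

```python
def partition(n_items, n_workers):
    """Strided split of item indices among workers."""
    return [list(range(n_items))[w::n_workers] for w in range(n_workers)]

def run_with_recovery(n_items, n_workers, failed):
    """Each worker squares its items; the survivors then recompute the
    failed worker's share in parallel. Illustrative sketch only."""
    parts = partition(n_items, n_workers)
    results = {}
    for w in range(n_workers):
        if w == failed:
            continue  # fail-stop failure: this worker's results are lost
        for i in parts[w]:
            results[i] = i * i
    # Parallel recomputing: split the failed worker's items among survivors.
    survivors = [w for w in range(n_workers) if w != failed]
    for k, i in enumerate(parts[failed]):
        _recomputed_by = survivors[k % len(survivors)]  # round-robin owner
        results[i] = i * i
    return [results[i] for i in range(n_items)]

print(run_with_recovery(8, 4, failed=2) == [i * i for i in range(8)])  # True
```

In a real MPI setting the recomputation requires each survivor to hold (or re-derive) the failed process's inputs, which is why FTPA needs the inter-process definition-use analysis described in the companion paper below.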


International Conference on Parallel Architectures and Compilation Techniques | 2007

The Fault Tolerant Parallel Algorithm: the Parallel Recomputing Based Failure Recovery

Xuejun Yang; Yunfei Du; Panfeng Wang; Hongyi Fu; Jia Jia; Zhiyuan Wang; Guang Suo

This paper addresses the issue of fault tolerance in parallel computing and proposes a new method named parallel recomputing. This method achieves fault recovery automatically by using surviving processes to recompute the workload of failed processes in parallel. The paper first defines the fault-tolerant parallel algorithm (FTPA) as a parallel algorithm that tolerates failures by parallel recomputing. Furthermore, the paper proposes an inter-process definition-use relationship analysis, based on conventional definition-use analysis, for revealing the relationships between variables in different processes. Under the guidance of this analysis, principles of fault-tolerant parallel algorithm design are given. Finally, the authors present the design of FTPAs for matrix-matrix multiplication and the NPB kernels, and evaluate them through experiments on a cluster system. The experimental results show that the overhead of FTPA is less than that of checkpointing.


High Performance Computing and Communications | 2008

Static Analysis for Application-Level Checkpointing of MPI Programs

Panfeng Wang; Yunfei Du; Hongyi Fu; Xuejun Yang; Haifang Zhou

Application-level checkpointing is a promising technology in the domain of large-scale scientific computing. The consistency of the global checkpoint must be carefully guaranteed in order to correctly restore the computation. Usually, complex coordinated protocols are employed to ensure this consistency, which require logging orphan or in-transit messages during checkpointing. These protocols complicate the recovery of the computation and increase the checkpoint overhead due to message logging. In this paper, a new method is proposed that ensures the consistency of the global checkpoint by static analysis. The method identifies the safe checkpointing regions in MPI programs, where the global checkpoint is always strongly consistent, and places all checkpoints in those regions. During checkpointing, the method logs no messages and introduces no extra overhead. The method was implemented and integrated into ALEC, a source-to-source precompiler for automating application-level checkpointing. The experimental results show that our method is effective.


Computer and Information Technology | 2007

Building Single Fault Survivable Parallel Algorithms for Matrix Operations Using Redundant Parallel Computation

Yunfei Du; Panfeng Wang; Hongyi Fu; Jia Jia; Haifang Zhou; Xuejun Yang

As the size of today's high-performance computers continues to grow, node failures in these computers are becoming frequent events. Although checkpointing is the typical technique to tolerate such failures, it often introduces considerable overhead and has shown poor scalability on today's large-scale systems. In this paper we define a new term, fault-tolerant parallel algorithm, meaning an algorithm that produces the correct answer despite the failure of nodes. The fault-tolerance approach, in which the data of failed processes is recovered by modifying applications to recompute it on all surviving processes, is checkpoint-free. In particular, if no failure occurs, the fault-tolerant parallel algorithms are identical to the original algorithms. We show the practicality of this technique by applying it to parallel dense matrix-matrix multiplication and Gaussian elimination to tolerate a single process failure. Experimental results demonstrate that a process failure can be tolerated with good scalability by the two fault-tolerant parallel algorithms, and that the proposed fault-tolerant parallel dense matrix-matrix multiplication survives a process failure with very low performance overhead. The main drawback of this approach is that it is non-transparent and algorithm-dependent.


International Conference on Distributed Computing Systems | 2008

Compiler-Assisted Application-Level Checkpointing for MPI Programs

Xuejun Yang; Panfeng Wang; Hongyi Fu; Yunfei Du; Zhiyuan Wang; Jia Jia

Application-level checkpointing can decrease the overhead of fault tolerance by minimizing the amount of checkpoint data. However, this technique requires the programmer to manually choose the critical data that should be saved. In this paper, we first propose a live-variable analysis method for MPI programs. Then, we provide an optimization method of data saving for application-level checkpointing based on this analysis. On this theoretical foundation, we implement a source-to-source precompiler (ALEC) to automate application-level checkpointing. Finally, we evaluate the performance of five FORTRAN/MPI programs, transformed by ALEC to integrate checkpointing features, on a 512-CPU cluster system. The experimental results show that i) application-level checkpointing based on live-variable analysis for MPI programs can efficiently reduce the amount of checkpoint data, thereby decreasing the overhead of checkpoint and restart; and ii) ALEC is capable of automating application-level checkpointing correctly and effectively.
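The data-saving optimization rests on a standard compiler idea: only variables that are live at the checkpoint, i.e. read again before being overwritten, need to be saved. A toy sketch of that rule (hypothetical variable sets, not ALEC's actual analysis):

```python
def live_at_checkpoint(reads_after, writes_after):
    """Return the variables that must be checkpointed: those read by some
    later statement before any redefinition. Toy live-variable rule over a
    straight-line sequence of (reads, writes) pairs after the checkpoint."""
    live = set()
    dead = set()  # variables already redefined, so their old value is unneeded
    for stmt_reads, stmt_writes in zip(reads_after, writes_after):
        for v in stmt_reads:
            if v not in dead:
                live.add(v)
        dead |= set(stmt_writes)
    return live

# Statements after the checkpoint, as (reads, writes) pairs:
#   t = a + b    reads {a, b}, writes {t}
#   c = 0        reads {},     writes {c}
#   d = t + c    reads {t, c}, writes {d}
reads  = [{"a", "b"}, set(), {"t", "c"}]
writes = [{"t"}, {"c"}, {"d"}]
print(sorted(live_at_checkpoint(reads, writes)))  # ['a', 'b']
```

Here `t`, `c`, and `d` are redefined before any use of their pre-checkpoint values, so only `a` and `b` go into the checkpoint; a real analysis must of course handle branches, loops, and MPI communication, which is what the paper's method addresses.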


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Automated application-level checkpointing based on live-variable analysis in MPI programs

Panfeng Wang; Xuejun Yang; Hongyi Fu; Yunfei Du; Zhiyun Wang; Jia Jia

This paper proposes an optimization method of data saving for application-level checkpointing based on a live-variable analysis method for MPI programs. We present the implementation of a source-to-source precompiler (CAC) for automating application-level checkpointing based on this optimization method. The experiments show that CAC is capable of automating application-level checkpointing correctly and reducing checkpoint data effectively.


International Symposium on Parallel and Distributed Processing and Applications | 2006

A parallel mutual information based image registration algorithm for applications in remote sensing

Yunfei Du; Haifang Zhou; Panfeng Wang; Xuejun Yang; Hengzhu Liu

Image registration is the classical problem of finding a geometric transformation that best aligns two images. Since the amount of multisensor remote sensing imagery is growing tremendously, the search for a matching transformation with mutual information is very time-consuming, and fast, automatic registration of images from different sensors has become critical in the remote sensing framework. Implementations of automatic mutual information based image registration methods on high-performance machines therefore need to be investigated. This paper first presents a parallel implementation of a mutual information based image registration algorithm. It takes advantage of cluster machines by partitioning the data according to the algorithm's characteristics. The parallel registration method is then evaluated both in theory and in experiments, showing that the parallel algorithm has good parallel performance and scalability.
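The registration criterion named in the abstract, mutual information, is computed from a joint histogram of pixel intensities. A minimal pure-Python sketch (illustrative only, not the paper's parallel implementation):

```python
from collections import Counter
from math import log2

def mutual_information(img_a, img_b):
    """Mutual information of two equal-size images, flattened to 1-D
    intensity lists, via their joint intensity histogram."""
    n = len(img_a)
    joint = Counter(zip(img_a, img_b))  # joint histogram
    pa = Counter(img_a)                 # marginal histogram of image A
    pb = Counter(img_b)                 # marginal histogram of image B
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab * log2(p_ab / (p_a * p_b)), with the n factors folded in
        mi += p_ab * log2(p_ab * n * n / (pa[a] * pb[b]))
    return mi

a = [0, 0, 1, 1]
print(mutual_information(a, a))            # 1.0 (identical images: MI = entropy)
print(mutual_information(a, [0, 1, 0, 1])) # 0.0 (statistically independent)
```

Registration maximizes this quantity over candidate transformations; because each evaluation touches every pixel, the histogram accumulation is the natural target for the data partitioning the paper describes.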


International Conference on Parallel and Distributed Systems | 2009

Solving 2D Nonlinear Unsteady Convection-Diffusion Equations on Heterogeneous Platforms with Multiple GPUs

Canqun Yang; Zhen Ge; Juan Chen; Feng Wang; Yunfei Du

Solving complex convection-diffusion equations is very important for many practical mathematical and physical problems. After finite difference discretization, most of the solution time is spent in sparse linear equation solvers. In this paper, our goal is to solve 2D nonlinear unsteady convection-diffusion equations by accelerating an iterative algorithm, Jacobi-preconditioned QMRCGSTAB, on a heterogeneous platform composed of a multi-core processor and multiple GPUs. First, a basic implementation and evaluation for adapting the problem to this kind of platform is given. Then, we propose two optimization methods to improve the performance: kernel merging and matrix boundary data processing. Our experimental evaluation on an AMD Opteron quad-core processor 2380 linked to an NVIDIA Tesla S1070 platform with four GPUs delivers a peak performance of 33 GFLOPS (double precision), a speedup of nearly a factor of 32 compared to the same problem running on 4 cores of the same CPU.
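Jacobi preconditioning, the technique named in the abstract, simply scales the residual by the inverse diagonal of the matrix. A minimal sketch inside a preconditioned Richardson iteration (illustrative only; the paper uses the more elaborate QMRCGSTAB solver, and the matrix here is a hypothetical dense toy rather than the paper's sparse systems):

```python
def jacobi_preconditioned_solve(A, b, iters=200):
    """Solve A x = b by Richardson iteration preconditioned with
    M = diag(A): x <- x + M^{-1} (b - A x). Converges when A is
    diagonally dominant; a sketch of the preconditioning idea only."""
    n = len(b)
    x = [0.0] * n
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # M^{-1} as a vector
    for _ in range(iters):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + inv_diag[i] * r[i] for i in range(n)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]  # diagonally dominant toy system
b = [1.0, 2.0]
x = jacobi_preconditioned_solve(A, b)
residual = max(abs(b[i] - sum(A[i][j] * x[j] for j in range(2))) for i in range(2))
print(residual < 1e-8)  # True
```

The residual computation and the diagonal scaling above are exactly the memory-bound kernels that dominate such solvers, which is why the paper's kernel-merging optimization, fusing adjacent GPU kernels to avoid extra passes over memory, pays off.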

Collaboration


Dive into Yunfei Du's collaborations.

Top Co-Authors

Canqun Yang, National University of Defense Technology
Juan Chen, National University of Defense Technology
Feng Wang, National University of Defense Technology
Huizhan Yi, National University of Defense Technology
Chun Huang, National University of Defense Technology
Panfeng Wang, National University of Defense Technology
Xuejun Yang, National University of Defense Technology
Kejia Zhao, National University of Defense Technology
Hongyi Fu, National University of Defense Technology
Jia Jia, National University of Defense Technology