Publication


Featured research published by Xingfu Wu.


Measurement and Modeling of Computer Systems | 2003

Prophesy: an infrastructure for performance analysis and modeling of parallel and grid applications

Valerie E. Taylor; Xingfu Wu; Rick Stevens

Performance is an important issue with any application, especially grid applications. Efficient execution of applications requires insight into how system features impact application performance. This insight generally results from significant experimental analysis and possibly the development of performance models. This paper presents the Prophesy system, for which the novel component is model development. In particular, this paper discusses the use of our coupling parameter (i.e., a metric that attempts to quantify the interaction between the kernels that compose an application) to develop application models. We discuss how this modeling technique can be used in the analysis of grid applications.
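
A rough sketch of the idea (the paper's exact definition may differ in detail): for two kernels i and j, the coupling parameter compares their measured execution when run together against the sum of their isolated executions,

    c_{ij} = \frac{T_{ij}}{T_i + T_j}

where T_i is the time of kernel i alone and T_{ij} the time of i and j executed together. A value of c_{ij} below 1 suggests constructive interaction (e.g., cache reuse across kernels), while a value above 1 suggests destructive interference. A model then scales summed kernel times by the relevant coupling values instead of assuming the kernels are independent.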


Computer Science - Research and Development | 2012

Power-aware predictive models of hybrid (MPI/OpenMP) scientific applications on multicore systems

Charles W. Lively; Xingfu Wu; Valerie E. Taylor; Shirley Moore; Hung-Ching Chang; Chun-Yi Su; Kirk W. Cameron

Predictive models enable a better understanding of the performance characteristics of applications on multicore systems. Previous work has utilized performance counters in a system-centered approach to model power consumption for the system, CPU, and memory components. Often, these approaches use the same group of counters across different applications. In contrast, we develop application-centric models (based upon performance counters) for the runtime and power consumption of the system, CPU, and memory components. Our work analyzes four hybrid (MPI/OpenMP) applications: the NAS Parallel Multizone Benchmarks (BT-MZ, SP-MZ, LU-MZ) and the Gyrokinetic Toroidal Code (GTC). Our models show that cache utilization (L1/L2), branch instructions, TLB data misses, and system resource stalls affect the performance of each application and each modeled component differently. We show that the L2 total cache hits counter affects performance across all applications. The models are validated against measured system and component power with an error rate of less than 3%.
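
As an illustrative sketch only, an application-centric model of this kind is a linear combination of performance-counter event rates. The function name, counter set, and coefficient values below are hypothetical, not the paper's fitted model; the paper derives its coefficients per application via regression.

    #include <stdio.h>

    /* Hypothetical application-centric power model: system power (watts)
     * as a linear combination of performance-counter event rates
     * (events per cycle). All coefficients here are invented for
     * illustration. */
    static double predict_system_power(double l2_hit_rate,
                                       double branch_rate,
                                       double tlb_miss_rate,
                                       double stall_rate)
    {
        const double b0 = 120.0;  /* base power, hypothetical */
        const double b1 = 35.0;   /* weight for L2 total cache hits */
        const double b2 = 12.0;   /* weight for branch instructions */
        const double b3 = 48.0;   /* weight for TLB data misses */
        const double b4 = -9.0;   /* weight for resource stalls */
        return b0 + b1 * l2_hit_rate + b2 * branch_rate
                  + b3 * tlb_miss_rate + b4 * stall_rate;
    }

    int main(void)
    {
        /* Example counter rates, again purely illustrative. */
        printf("predicted power: %.1f W\n",
               predict_system_power(0.30, 0.15, 0.002, 0.05));
        return 0;
    }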


IEEE International Conference on High Performance Computing, Data and Analytics | 2011

Energy and performance characteristics of different parallel implementations of scientific applications on multicore systems

Charles W. Lively; Xingfu Wu; Valerie E. Taylor; Shirley Moore; Hung-Ching Chang; Kirk W. Cameron

Energy consumption is a major concern with high-performance multicore systems. In this paper, we explore the energy consumption and performance (execution time) characteristics of different parallel implementations of scientific applications. In particular, the experiments focus on message-passing interface (MPI)-only versus hybrid MPI/OpenMP implementations of the NAS (NASA Advanced Supercomputing) BT (Block Tridiagonal) benchmark (strong scaling), a Lattice Boltzmann application (strong scaling), and the Gyrokinetic Toroidal Code (GTC, weak scaling), as well as central processing unit (CPU) frequency scaling. Experiments were conducted on a system instrumented to obtain power information; this system consists of eight nodes with four cores per node. The results indicate that, for 16 or fewer cores, the best implementation (MPI-only versus hybrid) depends upon the application. For the case of 32 cores, the hybrid implementation consistently resulted in less execution time and energy. With CPU frequency scaling, the best case for energy saving was not the best case for execution time.
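
The frequency-scaling tradeoff follows from the basic relation between energy, average power, and time; as a rough sketch (my notation, not the paper's):

    E = \bar{P} \, t, \qquad P_{\text{dyn}} \propto C V^2 f

Lowering the CPU frequency f reduces dynamic power draw but stretches the execution time t, so the frequency that minimizes the energy E need not be the one that minimizes t, which is consistent with the results above.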


High Performance Distributed Computing | 2002

Using kernel couplings to predict parallel application performance

Valerie E. Taylor; Xingfu Wu; Jonathan Geisler; Rick Stevens

Performance models provide significant insight into the performance relationships between an application and the system used for execution. The major obstacle to developing performance models is the lack of knowledge about the performance relationships between the different functions that compose an application. This paper addresses the issue by using a coupling parameter, which quantifies the interaction between kernels, to develop performance predictions. The results, using three NAS parallel application benchmarks, indicate that the predictions using the coupling parameter were greatly improved over a traditional technique of summing the execution times of the individual kernels in an application. In one case the coupling predictor had less than 1% relative error, in contrast to the summation methodology, which had over 20% relative error. Further, as the problem size and number of processors scale, the coupling values go through a finite number of major value changes that depends on the memory subsystem of the processor architecture.
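
A minimal sketch of a coupling-based predictor, assuming pairwise coupling values between adjacent kernels; the averaging of neighboring couplings here is an illustrative simplification, not the paper's exact formulation, and all numbers are made up:

    #include <stdio.h>

    #define NKERNELS 3

    /* Naive prediction: sum of isolated kernel times. */
    static double predict_sum(const double t[NKERNELS])
    {
        double total = 0.0;
        for (int i = 0; i < NKERNELS; i++)
            total += t[i];
        return total;
    }

    /* Coupling-based prediction: scale each kernel's time by the
     * coupling values it shares with its neighbors. */
    static double predict_coupled(const double t[NKERNELS],
                                  const double c[NKERNELS - 1])
    {
        double total = t[0] * c[0] + t[NKERNELS - 1] * c[NKERNELS - 2];
        for (int i = 1; i < NKERNELS - 1; i++)
            total += t[i] * 0.5 * (c[i - 1] + c[i]);
        return total;
    }

    int main(void)
    {
        double t[NKERNELS] = { 1.2, 3.4, 2.1 };   /* isolated times, invented */
        double c[NKERNELS - 1] = { 0.85, 1.10 };  /* coupling values, invented */
        printf("summed: %.3f  coupled: %.3f\n",
               predict_sum(t), predict_coupled(t, c));
        return 0;
    }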


High Performance Distributed Computing | 2000

Prophesy: an infrastructure for analyzing and modeling the performance of parallel and distributed applications

Valerie E. Taylor; Xingfu Wu; Jonathan Geisler; Xin Li; Zhiling Lan; Rick Stevens; Mark Hereld; Ivan R. Judson

Efficient execution of applications requires insight into how the system features impact the performance of the application. For distributed systems, the task of gaining this insight is complicated by the complexity of the system features. This insight generally results from significant experimental analysis and possibly the development of performance models. This paper presents the Prophesy project, an infrastructure that aids in gaining this needed insight based upon experience. The core component of Prophesy is a relational database that allows for the recording of performance data, system features and application details.
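
A minimal sketch of the kind of record such a database might hold; the field names and layout are hypothetical, not Prophesy's actual relational schema, which spans several tables for application details, system features, and performance data:

    #include <stdio.h>

    /* Hypothetical flattened performance record for illustration. */
    struct perf_record {
        char   application[64];   /* application name */
        char   function[64];      /* instrumented kernel/function */
        char   system[64];        /* execution platform */
        int    num_processors;    /* processor count for this run */
        long   problem_size;      /* input size parameter */
        double exec_time_sec;     /* measured execution time */
    };

    int main(void)
    {
        struct perf_record r = { "app_a", "kernel_a", "cluster_x",
                                 64, 100000L, 12.34 };
        printf("%s/%s on %s: %.2f s on %d procs\n",
               r.application, r.function, r.system,
               r.exec_time_sec, r.num_processors);
        return 0;
    }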


Measurement and Modeling of Computer Systems | 2011

Performance characteristics of hybrid MPI/OpenMP implementations of NAS parallel benchmarks SP and BT on large-scale multicore supercomputers

Xingfu Wu; Valerie E. Taylor

The NAS Parallel Benchmarks (NPB) are well-known applications with fixed algorithms for evaluating parallel systems and tools. Multicore supercomputers provide a natural programming paradigm for hybrid programs, whereby OpenMP is used for data sharing among the cores that comprise a node and MPI is used for communication between nodes. In this paper, we use the SP and BT benchmarks of MPI NPB 3.3 as a basis for a comparative approach to implement hybrid MPI/OpenMP versions of SP and BT. In particular, we compare the performance of the hybrid SP and BT with their MPI counterparts on large-scale multicore supercomputers. Our performance results indicate that the hybrid SP outperforms the MPI SP by up to 20.76%, and the hybrid BT outperforms the MPI BT by up to 8.58%, on up to 10,000 cores on BlueGene/P at Argonne National Laboratory and Jaguar (Cray XT4/5) at Oak Ridge National Laboratory. We also use performance tools and MPI trace libraries available on these supercomputers to further investigate the performance characteristics of the hybrid SP and BT.
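
A minimal hybrid MPI/OpenMP sketch of the paradigm described here (not the NPB code itself): MPI ranks handle communication between nodes while an OpenMP parallel region shares work among the cores of each node.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Request threaded MPI so OpenMP threads can run inside each rank. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;

        /* OpenMP: data-parallel work shared among the cores of a node. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += (double)i / nranks;

        /* MPI: communication between nodes. */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum: %.3e (%d ranks, %d threads/rank)\n",
                   global, nranks, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }

Compiled with an MPI compiler wrapper (e.g., mpicc -fopenmp) and launched with one rank per node, this mirrors at toy scale the structure the hybrid SP and BT variants use.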


International Parallel and Distributed Processing Symposium | 2009

Performance projection of HPC applications using SPEC CFP2006 benchmarks

Sameh S. Sharkawi; Don DeSota; Raj Panda; Rajeev Indukuru; Stephen Stevens; Valerie E. Taylor; Xingfu Wu

Performance projections of High Performance Computing (HPC) applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems, enable them to compare application performance across different existing and future systems, and help HPC users with system procurement and application refinements. In this paper, we present a method for projecting the node-level performance of HPC applications using published data for industry-standard benchmarks, the SPEC CFP2006 suite, and hardware performance counter data from one base machine. In particular, we project the performance of eight HPC applications onto four systems, utilizing processors from different vendors, using data from one base machine, the IBM p575. The projected performance of the eight applications was within an average difference of 7.2% with respect to measured runtimes for IBM POWER6 systems, with a standard deviation of 5.3%. For two Intel-based systems with a different microarchitecture and Instruction Set Architecture (ISA) from the base machine, the average projection difference from measured runtimes was 10.5%, with a standard deviation of 8.2%.
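
A heavily simplified sketch of ratio-based projection: scale a measured base-machine runtime by a weighted ratio of published per-benchmark SPEC CFP2006 scores. The paper's actual method derives the weights from hardware-counter characterization of the application on the base machine; every value below is invented for illustration.

    #include <stdio.h>

    /* Project a target-machine runtime from a base-machine runtime and
     * published SPEC CFP2006 ratios for n component benchmarks. Weights
     * (summing to 1) stand in for counter-derived application
     * characteristics; all numbers here are hypothetical. */
    static double project_runtime(double base_runtime,
                                  const double base_ratio[],
                                  const double target_ratio[],
                                  const double weight[],
                                  int n)
    {
        double scale = 0.0;
        for (int i = 0; i < n; i++)
            scale += weight[i] * (base_ratio[i] / target_ratio[i]);
        /* Higher target ratio means a faster machine, so a shorter runtime. */
        return base_runtime * scale;
    }

    int main(void)
    {
        double base[]   = { 18.0, 22.0 };  /* base-machine SPEC ratios */
        double target[] = { 27.0, 30.0 };  /* target-machine SPEC ratios */
        double w[]      = { 0.6, 0.4 };    /* counter-derived weights */
        printf("projected runtime: %.1f s\n",
               project_runtime(100.0, base, target, w, 2));
        return 0;
    }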


International Conference on Cluster Computing | 2006

Performance Analysis, Modeling and Prediction of a Parallel Multiblock Lattice Boltzmann Application Using Prophesy System

Xingfu Wu; Valerie E. Taylor; Shane Garrick; Dazhi Yu; Jacques C. Richard

The Lattice Boltzmann method is widely used in simulating fluid flows. In this paper, we present the performance analysis, modeling and prediction of a parallel multiblock Lattice Boltzmann application on up to 512 processors on three SMP clusters: two IBM SP systems at the San Diego Supercomputer Center (DataStar p655 and p690) and one IBM SP system at the DOE National Energy Research Scientific Computing Center (Seaborg), using the Prophesy system. By characterizing the performance of the Lattice Boltzmann application as the problem size and the number of processors increase, we can identify and eliminate performance bottlenecks, and predict the application performance. The experimental results indicate that the application with large problem sizes scales well across these three clusters, and performance models using the coupling method are accurate, with less than 4.8% average relative prediction error.


Journal of Computer and System Sciences | 2013

Performance modeling of hybrid MPI/OpenMP scientific applications on large-scale multicore supercomputers

Xingfu Wu; Valerie E. Taylor

In this paper, we present a performance modeling framework based on memory bandwidth contention time and a parameterized communication model to predict the performance of OpenMP, MPI, and hybrid applications with weak scaling on three large-scale multicore clusters: IBM POWER4, POWER5+, and Blue Gene/P, and we analyze the performance of these MPI, OpenMP, and hybrid applications. We use the STREAM memory benchmark to provide initial performance analysis and model validation of MPI and OpenMP applications on these multicore clusters, because the measured sustained memory bandwidth provides insight into the memory bandwidth that a system should sustain on scientific applications with the same amount of workload per core. In addition to these benchmarks, we also use a weak-scaling hybrid MPI/OpenMP large-scale scientific application, the Gyrokinetic Toroidal Code (GTC) for magnetic fusion, to validate our performance model of the hybrid application on these multicore clusters. The validation results show an error rate of less than 7.77% in predicting the performance of the hybrid MPI/OpenMP GTC on up to 512 cores on these multicore clusters.
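
As a rough sketch of the framework's structure (my paraphrase, not the paper's exact notation), per-process runtime decomposes into computation, memory-contention, and communication terms:

    T = T_{\text{comp}} + T_{\text{mem}} + T_{\text{comm}}, \qquad
    T_{\text{mem}} \approx \frac{M}{B(c)}

where M is the memory traffic per core and B(c) the sustained memory bandwidth (e.g., as measured by STREAM) when c cores per node are active. Contention shows up as B(c) growing more slowly than c, which is why per-core STREAM measurements anchor the model.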


The Computer Journal | 2012

Performance Characteristics of Hybrid MPI/OpenMP Implementations of NAS Parallel Benchmarks SP and BT on Large-Scale Multicore Clusters

Xingfu Wu; Valerie E. Taylor

The NAS Parallel Benchmarks (NPB) are well-known applications with fixed algorithms for evaluating parallel systems and tools. Multicore clusters provide a natural programming paradigm for hybrid programs, whereby OpenMP is used for data sharing among the cores that comprise a node, and MPI is used for communication between nodes. In this paper, we use the Scalar Pentadiagonal (SP) and Block Tridiagonal (BT) benchmarks of MPI NPB 3.3 as a basis for a comparative approach to implement hybrid MPI/OpenMP versions of SP and BT. In particular, we compare the performance of the hybrid SP and BT with their MPI counterparts on large-scale multicore clusters, Intrepid (BlueGene/P) at Argonne National Laboratory and Jaguar (Cray XT4/5) at Oak Ridge National Laboratory. Our performance results indicate that the hybrid SP outperforms the MPI SP by up to 20.76%, and the hybrid BT outperforms the MPI BT by up to 8.58%, on up to 10,000 cores on Intrepid and Jaguar. We also use performance tools and MPI trace libraries available on these clusters to further investigate the performance characteristics of the hybrid SP and BT.

Collaboration


Dive into Xingfu Wu's collaborations.

Top Co-Authors

Rick Stevens
Argonne National Laboratory

Jonathan Geisler
Argonne National Laboratory

Shirley Moore
University of Texas at El Paso