Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Panruo Wu is active.

Publication


Featured research published by Panruo Wu.


IEEE International Conference on High Performance Computing, Data and Analytics | 2013

Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach

Dong Li; Zizhong Chen; Panruo Wu; Jeffrey S. Vetter

Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many widely-used scientific computing kernels. However, in the context of the resilience ecosystem, ABFT is completely opaque to any underlying hardware resilience mechanisms. As a result, some data structures are over-protected by both ABFT and hardware, which leads to redundant costs in terms of performance and energy. In this paper, we rethink ABFT from an integrated view including both software and hardware, with the goal of improving the performance and energy efficiency of ABFT-enabled applications. In particular, we study how to coordinate ABFT and error-correcting codes (ECC) for main memory, and investigate the impact of this coordination on performance, energy, and resilience for ABFT-enabled applications. Scaling tests and analysis indicate that our approach saves up to 25% in system energy (and up to 40% in dynamic memory energy) with up to 18% performance improvement over traditional approaches combining ABFT with ECC.


High Performance Distributed Computing | 2014

FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK cholesky, QR, and LU factorization routines

Panruo Wu; Zizhong Chen

It is well known that soft errors in linear algebra operations can be detected off-line at the end of the computation using algorithm-based fault tolerance (ABFT). However, traditional ABFT usually cannot correct errors in Cholesky, QR, and LU factorizations, because an error in one matrix element propagates to many other elements and hence causes too many errors to correct. Although tremendous progress has recently been made on correcting errors in LU and QR factorizations, these new techniques correct errors off-line at the end of the computation, after errors have propagated and accumulated, which significantly complicates the error correction process and introduces overhead that grows at least quadratically with the number of errors. In this paper, we present the design and implementation of FT-ScaLAPACK, a fault-tolerant version of ScaLAPACK that is able to detect, locate, and correct errors in Cholesky, QR, and LU factorizations on-line, in the middle of the computation, before the errors propagate and accumulate. FT-ScaLAPACK has been validated with thousands of cores on Stampede at the Texas Advanced Computing Center. Experimental results demonstrate that FT-ScaLAPACK achieves performance and scalability comparable to the original ScaLAPACK.
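As a lightweight illustration of why on-line checking during a factorization is possible at all: if a row-sum checksum column is appended to the matrix, every row operation of the elimination preserves it, so the invariant can be verified after each step. The sketch below uses plain unblocked Gaussian elimination in NumPy, not FT-ScaLAPACK's actual distributed implementation; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)    # diagonally boosted, no pivoting
Ac = np.hstack([A, A.sum(axis=1, keepdims=True)])  # append row-sum checksum column

for k in range(n - 1):                 # right-looking elimination steps
    for i in range(k + 1, n):
        m = Ac[i, k] / Ac[k, k]
        Ac[i, k:] -= m * Ac[k, k:]     # the row operation covers the checksum too
    # on-line check after each step: row operations preserve the invariant,
    # so a soft error shows up as a checksum mismatch immediately
    assert np.allclose(Ac[:, :n].sum(axis=1), Ac[:, n])
```

Because the check holds at every intermediate step, an injected error can be caught before it spreads into the trailing submatrix, which is the essence of the on-line approach.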


IEEE Transactions on Signal Processing | 2017

Fast Discrete Distribution Clustering Using Wasserstein Barycenter With Sparse Support

Jianbo Ye; Panruo Wu; James Ze Wang; Jia Li

In a variety of research areas, the weighted bag of vectors and the histogram are widely used descriptors for complex objects. Both can be expressed as discrete distributions. D2-clustering pursues the minimum total within-cluster variation for a set of discrete distributions subject to the Kantorovich–Wasserstein metric. D2-clustering has a severe scalability issue, the bottleneck being the computation of a centroid distribution, called Wasserstein barycenter, that minimizes its sum of squared distances to the cluster members. In this paper, we develop a modified Bregman ADMM approach for computing the approximate discrete Wasserstein barycenter of large clusters. In the case when the support points of the barycenters are unknown and have low cardinality, our method achieves high accuracy empirically at a much reduced computational cost. The strengths and weaknesses of our method and its alternatives are examined through experiments, and we recommend scenarios for their respective usage. Moreover, we develop both serial and parallelized versions of the algorithm. By experimenting with large-scale data, we demonstrate the computational efficiency of the new methods and investigate their convergence properties and numerical stability. The clustering results obtained on several datasets in different domains are highly competitive in comparison with some widely used methods in the corresponding areas.
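For readers unfamiliar with discrete optimal transport, the sketch below illustrates the kind of computation involved using a plain Sinkhorn iteration between two discrete distributions. The paper itself uses a modified Bregman ADMM, not Sinkhorn, and all names and parameter values here are illustrative.

```python
import numpy as np

def sinkhorn_cost(mu, nu, C, eps=0.5, iters=500):
    """Entropy-regularized optimal transport cost between discrete
    distributions mu and nu, with ground cost matrix C."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)                # alternate marginal scalings
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]       # transport plan
    return float((P * C).sum())

x = np.arange(5, dtype=float)             # shared support points 0..4
C = (x[:, None] - x[None, :]) ** 2        # squared-distance ground cost
mu = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
nu = np.array([0.0, 0.0, 0.0, 0.5, 0.5])
cost = sinkhorn_cost(mu, nu, C)           # near the exact squared W2 of 9,
                                          # plus a small entropic bias
```

A Wasserstein barycenter minimizes the sum of such transport costs to all cluster members, which is why a fast solver for this subproblem is the scalability bottleneck the paper attacks.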


International Parallel and Distributed Processing Symposium | 2015

Investigating the Interplay between Energy Efficiency and Resilience in High Performance Computing

Li Tan; Shuaiwen Leon Song; Panruo Wu; Zizhong Chen; Rong Ge; Darren J. Kerbyson

Energy efficiency and resilience are two crucial challenges for HPC systems to reach exascale. While energy efficiency and resilience have been studied extensively in isolation, little has been done to understand their interplay in HPC systems. Decreasing the supply voltage associated with a given operating frequency for processors and other CMOS-based components can significantly reduce power consumption. However, this often raises system failure rates and consequently increases application execution time. In this work, we present an energy-saving undervolting approach that leverages mainstream resilience techniques to tolerate the increased failures caused by undervolting. Our strategy is directed by analytic models, which capture the impact of undervolting and the interplay between energy efficiency and resilience. Experimental results on a power-aware cluster demonstrate that our approach can save up to 12.1% energy compared to the baseline, and conserve up to 9.1% more energy than a state-of-the-art DVFS solution.
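The voltage/failure trade-off can be illustrated with a toy analytic model (the constants and functional forms below are made up for illustration and are not the paper's calibrated models): dynamic power falls roughly quadratically with voltage, while the failure rate, and hence time spent recovering, grows rapidly as voltage drops, yielding an interior energy-optimal voltage.

```python
import math

V0, P0, T0 = 1.0, 100.0, 1000.0   # nominal voltage (V), power (W), runtime (s)
lam0, k = 1e-4, 60.0              # nominal failure rate (1/s), voltage sensitivity
t_rec = 30.0                      # time lost recovering from one failure (s)

def energy(V):
    power = P0 * (V / V0) ** 2                 # dynamic power scales ~V^2
    lam = lam0 * math.exp(k * (V0 - V))        # failures grow as voltage drops
    runtime = T0 * (1.0 + lam * t_rec)         # first-order recovery penalty
    return power * runtime                     # joules

candidates = [0.80 + 0.01 * i for i in range(21)]  # sweep 0.80 V .. 1.00 V
best_V = min(candidates, key=energy)
# energy first falls with voltage, then rises once induced failures dominate;
# in this toy setting the optimum sits a few percent below nominal energy
```

The paper's analytic models play the same role as `energy(V)` here: they steer the undervolting decision so that the power savings are not eaten by resilience overhead.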


High Performance Distributed Computing | 2016

Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory

Panruo Wu; Dong Li; Zizhong Chen; Jeffrey S. Vetter; Sparsh Mittal

The emergence of many non-volatile memory (NVM) technologies is poised to revolutionize main memory systems because of the relatively high capacity and low lifetime power consumption of NVM. To avoid the typical limitations of NVM as main memory, NVM is usually combined with DRAM to form a hybrid NVM/DRAM system that gains the benefits of each. However, this integrated memory system raises the question of how to manage data placement and movement across NVM and DRAM, which is critical for maximizing its benefits. Existing solutions have several limitations that obstruct their adoption in the high performance computing (HPC) domain. In particular, they cannot take advantage of application semantics, thus losing critical optimization opportunities and demanding extensive hardware extensions, and they implement persistence semantics for resilience purposes while suffering large performance and energy overheads. In this paper, we re-examine current hybrid memory designs from the HPC perspective and aim to leverage the knowledge of numerical algorithms to direct data placement. With explicit algorithm management and limited hardware support, we optimize data movement between NVM and DRAM, improve data locality, and implement a relaxed memory persistency scheme in NVM. Our work demonstrates significant benefits of integrating algorithm knowledge into hybrid memory design to achieve multi-dimensional optimization (performance, energy, and resilience) in HPC.


IEEE Transactions on Parallel and Distributed Systems | 2015

Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition

Doug Hakkarinen; Panruo Wu; Zizhong Chen

Cholesky decomposition is a widely used algorithm for solving linear equations with a symmetric positive definite coefficient matrix. With large matrices, it is often performed on high performance supercomputers with a large number of processors. Assuming a constant failure rate per processor, the probability of a failure occurring during the execution increases linearly with additional processors. Fault tolerant methods attempt to reduce the expected execution time by allowing recovery from failure. This paper presents an analysis and implementation of a fault tolerant Cholesky factorization algorithm that does not require checkpointing for recovery from fail-stop failures. Rather, this algorithm uses redundant data held in an additional set of processes. This differs from previous work on algorithmic methods in that it addresses fail-stop failures rather than fail-continue cases. The proposed fault tolerance scheme is incorporated into ScaLAPACK and validated on the supercomputer Kraken. Experimental results demonstrate that this method has decreasing overhead relative to overall runtime as the matrix size increases, and thus shows promise for reducing the expected runtime of Cholesky factorizations on very large matrices.
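The core recovery idea, redundant processes holding checksum data from which any single lost row can be rebuilt, can be sketched in a few lines of NumPy. This is illustrative only; the paper works on block-cyclically distributed ScaLAPACK data, and the matrix here is just a small random SPD example.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4.0 * np.eye(4)         # symmetric positive definite input

Ac = np.vstack([A, A.sum(axis=0)])    # extra "process" holds column sums

lost = 2                              # a fail-stop failure takes out row 2
surviving = np.delete(Ac, lost, axis=0)

# the lost row equals the checksum row minus the surviving data rows
recovered = surviving[-1] - surviving[:-1].sum(axis=0)
assert np.allclose(recovered, A[lost])
```

Because the checksum relationship is maintained as the factorization proceeds, the same reconstruction works mid-computation, which is what removes the need for checkpointing.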


Journal of Computational Science | 2013

On-line soft error correction in matrix-matrix multiplication

Panruo Wu; Chong Ding; Longxiang Chen; Teresa Davies; Christer Karlsson; Zizhong Chen

Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Soft errors normally do not interrupt the execution of the affected program, but the affected computation results can no longer be trusted. A well known technique to correct soft errors in matrix-matrix multiplication is algorithm-based fault tolerance (ABFT). While ABFT achieves much better efficiency than triple modular redundancy (TMR), a traditional general-purpose technique for correcting soft errors, both ABFT and TMR detect errors off-line after the computation is finished. This paper extends the traditional ABFT technique from off-line to on-line, so that soft errors in matrix-matrix multiplication can be detected in the middle of the computation during program execution, and higher efficiency can be achieved by correcting the corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with negligible (i.e., less than 1%) performance penalty over the ATLAS dgemm().
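The classic ABFT encoding that this line of work builds on can be sketched as follows: append a column-sum checksum row to A and a row-sum checksum column to B; then a single corrupted element of C = A·B is located by the intersection of the violated row and column checksums and corrected from the residual. This is a minimal NumPy sketch of the encoding, not the paper's on-line blocked implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

Ar = np.vstack([A, A.sum(axis=0)])                  # checksum row (column sums)
Bc = np.hstack([B, B.sum(axis=1, keepdims=True)])   # checksum column (row sums)

C = Ar @ Bc            # (n+1) x (n+1) fully checksummed product

C[2, 3] += 7.0         # inject a soft error into one data element

# residuals of the checksum relations locate the corrupted element
row_resid = C[:-1, :-1].sum(axis=1) - C[:-1, -1]    # nonzero at the bad row
col_resid = C[:-1, :-1].sum(axis=0) - C[-1, :-1]    # nonzero at the bad column
i, j = int(np.argmax(np.abs(row_resid))), int(np.argmax(np.abs(col_resid)))
C[i, j] -= row_resid[i]                             # correct in place
assert (i, j) == (2, 3) and np.allclose(C[:-1, :-1], A @ B)
```

The on-line extension described in the paper performs such checks at intermediate steps of the blocked multiplication rather than once at the end, so a corrupted partial result is repaired before it contaminates later updates.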


High Performance Distributed Computing | 2016

New-Sum: A Novel Online ABFT Scheme For General Iterative Methods

Dingwen Tao; Shuaiwen Leon Song; Sriram Krishnamoorthy; Panruo Wu; Xin Liang; Eddy Z. Zhang; Darren J. Kerbyson; Zizhong Chen

Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering various types of soft errors in general iterative methods.
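The checksum idea for matrix-vector multiplication can be sketched as follows: precompute the column-sum vector of A once; then for every product y = Ax, the invariant sum(y) = c·x can be verified at the cost of two dot products. This is a minimal sketch with illustrative names and tolerances, not the paper's decoupled, adaptive-overhead scheme.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
c = A.sum(axis=0)            # one-time encoding: column-sum checksum of A

def checked_matvec(A, c, x, tol=1e-8):
    """Return y = A @ x plus a flag saying the checksum invariant held."""
    y = A @ x
    ref = c @ x              # what sum(y) must equal if A, x, and y are intact
    return y, abs(y.sum() - ref) <= tol * (1.0 + abs(ref))

x = rng.standard_normal(n)
y, ok = checked_matvec(A, c, x)
assert ok                            # a clean run passes the check

y[1] += 1.0                          # simulate a soft error in the result
assert abs(y.sum() - c @ x) > 1e-6   # the checksum invariant now fails
```

Since iterative solvers spend most of their time in such matrix-vector products and vector operations, protecting this kernel covers the bulk of the computation, and a detected violation triggers the rollback path described in the abstract.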


High Performance Distributed Computing | 2016

Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra

Panruo Wu; Qiang Guan; Nathan DeBardeleben; Sean Blanchard; Dingwen Tao; Xin Liang; Jieyang Chen; Zizhong Chen

Algorithm-based fault tolerance (ABFT) attracts renewed interest for its extremely low overhead and good scalability. However, the fault model used to design ABFT has been either abstract, simplistic, or both, leaving a gap between what occurs at the architecture level and what the algorithm expects. As the fault model is the deciding factor in choosing an effective checksum scheme, the resulting ABFT techniques have seen limited impact in practice. In this paper we seek to close the gap by directly using a comprehensive architectural fault model and devising a comprehensive ABFT scheme that can tolerate multiple architectural faults of various kinds. We implement the new ABFT scheme in the High Performance Linpack (HPL) benchmark to demonstrate its feasibility in a large-scale high performance benchmark. We conduct architectural fault injection experiments and large scale experiments to empirically validate its fault tolerance and to demonstrate the overhead of error handling, respectively.


IEEE International Conference on High Performance Computing, Data and Analytics | 2016

GreenLA: green linear algebra software for GPU-accelerated heterogeneous computing

Jieyang Chen; Li Tan; Panruo Wu; Dingwen Tao; Hongbo Li; Xin Liang; Sihuan Li; Rong Ge; Laxmi N. Bhuyan; Zizhong Chen

While many linear algebra libraries have been developed to optimize performance, no linear algebra library considers energy efficiency at library design time. In this paper, we present GreenLA, an energy efficient linear algebra software package that leverages linear algebra algorithmic characteristics to maximize energy savings with negligible overhead. GreenLA is (1) energy efficient: it saves up to several times more energy than the best existing energy saving approaches that do not modify library source code; (2) high performance: its performance is comparable to the highly optimized linear algebra library MAGMA; and (3) transparent to applications: with the same programming interface, existing MAGMA users do not need to modify their source code to benefit from GreenLA. Experimental results demonstrate that GreenLA saves up to three times more energy than the best existing energy saving approaches while delivering performance similar to the state-of-the-art linear algebra library MAGMA.

Collaboration


Dive into Panruo Wu's collaborations.

Top Co-Authors

Zizhong Chen, University of California
Dingwen Tao, University of California
Xin Liang, University of California
Jieyang Chen, University of California
Qiang Guan, Los Alamos National Laboratory
Sean Blanchard, Los Alamos National Laboratory
Li Tan, University of California
Longxiang Chen, University of California
Nathan DeBardeleben, Los Alamos National Laboratory
Rong Ge, Marquette University