Aiman Fang
University of Chicago
Publications
Featured research published by Aiman Fang.
International Conference on Computational Science | 2015
Andrew A. Chien; Pavan Balaji; Pete Beckman; Nan Dun; Aiman Fang; Hajime Fujita; Kamil Iskra; Zachary A. Rubenstein; Ziming Zheng; Robert Schreiber; Jeff R. Hammond; James Dinan; Ignacio Laguna; David F. Richards; Anshu Dubey; Brian van Straalen; Mark Hoemmen; Michael A. Heroux; Keita Teranishi; Andrew R. Siegel
Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort to add resilience. The required changes are small (< 2% LOC), localized, and machine-independent, requiring no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads < 2% are generally achieved. We conclude that GVR's interfaces and implementation are flexible and portable and create a gentle-slope path to tolerate growing error rates in future systems.
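To make the versioning pattern above concrete, here is a minimal, single-process C++ sketch; VersionedArray, commit_version, and restore_version are illustrative stand-ins, not GVR's actual API, which operates on distributed arrays.

```cpp
// Toy stand-in for the versioned-array idea described above.
// All names here are hypothetical, not GVR's real interface.
#include <cstdio>
#include <vector>

struct VersionedArray {
    std::vector<double> data;                  // current working copy
    std::vector<std::vector<double>> versions; // committed snapshots

    explicit VersionedArray(size_t n) : data(n, 0.0) {}
    void commit_version() { versions.push_back(data); }       // app-controlled snapshot
    void restore_version(size_t v) { data = versions.at(v); } // roll back to version v
};

int main() {
    VersionedArray a(4);
    for (int step = 0; step < 3; ++step) {
        for (double& x : a.data) x += 1.0;  // application computation
        a.commit_version();                 // version where/when it is productive
    }
    a.restore_version(0);  // a latent error detected later: roll back two steps
    std::printf("a[0] after restore = %.1f\n", a.data[0]);  // prints 1.0
    return 0;
}
```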
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
Nan Dun; D. Pleiter; Aiman Fang; Nicolas Vandenbergen; Andrew A. Chien
Resilience has become a major concern in high-performance computing (HPC) systems. Addressing the increasing risk of latent errors (or silent data corruption) is one of the biggest challenges. Multi-version checkpointing, which keeps multiple versions of the application state, has been proposed as a solution and implemented in Global View Resilience (GVR). The resulting more sophisticated data management introduces overheads, and its impact on performance needs to be investigated. In this paper we explore the performance of GVR on an HPC system with integrated non-volatile memories, namely Blue Gene Active Storage (BGAS). Our empirical study shows that the BGAS system provides a significantly more efficient basis for flexible error recovery using GVR's multi-versioning features than a standard external storage system attached to the same Blue Gene/Q installation. In particular, BGAS achieves at least a 10× performance boost for random traversal across multiple versions, owing to significantly better performance for small random I/O operations.
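The access pattern behind that result, small reads scattered randomly across versions, can be sketched as follows; the file naming and layout are assumptions for illustration only and do not reflect GVR or BGAS internals.

```cpp
// Sketch of "random traversal across versions": each access is a small
// random read from a randomly chosen version, the pattern that favors
// integrated NVM over external storage.
#include <cstdio>
#include <fstream>
#include <random>
#include <string>
#include <vector>

int main() {
    const int versions = 4, elems = 1024;
    // Write a few "version" files (stand-ins for versioned snapshots).
    for (int v = 0; v < versions; ++v) {
        std::ofstream out("version_" + std::to_string(v) + ".bin", std::ios::binary);
        std::vector<double> buf(elems, v);
        out.write(reinterpret_cast<const char*>(buf.data()), buf.size() * sizeof(double));
    }
    // Random traversal: pick a random version and offset for each read.
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick_v(0, versions - 1), pick_i(0, elems - 1);
    for (int k = 0; k < 8; ++k) {
        int v = pick_v(rng), i = pick_i(rng);
        std::ifstream in("version_" + std::to_string(v) + ".bin", std::ios::binary);
        in.seekg(i * sizeof(double));  // small random I/O
        double x;
        in.read(reinterpret_cast<char*>(&x), sizeof x);
        std::printf("version %d, elem %d -> %.0f\n", v, i, x);
    }
    return 0;
}
```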
International Journal of High Performance Computing Applications | 2017
Andrew A. Chien; Pavan Balaji; Nan Dun; Aiman Fang; Hajime Fujita; Kamil Iskra; Zachary A. Rubenstein; Ziming Zheng; Jeff R. Hammond; Ignacio Laguna; David F. Richards; A. Dubey; B. van Straalen; Mark Hoemmen; Michael A. Heroux; Keita Teranishi; Andrew R. Siegel
Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize for each application structure independently. This control is portable, and its embedding in application source makes it natural to express and easy to maintain. The ability to name multiple versions and “partially materialize” them efficiently makes ambitious forward recovery based on “data slices” across versions or data structures both easy to express and efficient. Using several large applications (OpenMC, preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort to add resilience. The required changes are small (< 2% lines of code (LOC)), localized and machine-independent, and perhaps most important, require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads < 2% are generally achieved. This overhead suggests that GVR can be implemented in large-scale codes and support portable error recovery with modest investment and runtime impact. Our results are drawn from both IBM BG/Q and Cray XC30 experiments, demonstrating portability. We also present two case studies of flexible error recovery, illustrating how GVR can be used for multi-version rollback recovery and several different forward-recovery schemes. GVR's multi-versioning enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient. Our results suggest that GVR is scalable, portable, and efficient. GVR interfaces are flexible, supporting a variety of recovery schemes, and altogether GVR embodies a gentle-slope path to tolerate growing error rates in future extreme-scale systems.
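A toy example of the forward-recovery idea: repair a single corrupted element from a slice of an older version, rather than rolling the entire state back. The update rule and names are our illustration, not code from the paper.

```cpp
// Forward recovery from a partially materialized "slice" of version k-1.
// Assumed toy structure: step k adds 1.0 to every element, so a corrupted
// element can be rebuilt from its value in the previous version alone.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> v0  = {1, 2, 3, 4};   // committed version k-1
    std::vector<double> cur = {2, 3, 99, 5};  // current state; cur[2] is corrupted
    size_t bad = 2;
    // Materialize only the needed slice of v0 and re-apply the step's
    // update to that one element, instead of rolling back everything.
    cur[bad] = v0[bad] + 1.0;
    std::printf("repaired cur[%zu] = %.1f\n", bad, cur[bad]);  // prints 4.0
    return 0;
}
```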
International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2017
Aurélien Cavelan; Aiman Fang; Andrew A. Chien; Yves Robert
This paper presents a model and performance study for Algorithm-Based Focused Recovery (ABFR) applied to N-body computations, subject to latent errors. We make a detailed comparison with the classical Checkpoint/Restart (CR) approach. While the model applies to general frameworks, the performance study is limited to perfect binary trees, due to the inherent difficulty of the analysis. With ABFR, the crucial parameter is the detection interval, which bounds the error latency. We show that the detection interval has a dramatic impact on the overhead, and that optimally choosing its value leads to significant gains over the CR approach.
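As a rough illustration of the trade-off (our schematic, not the paper's exact analysis), a first-order overhead model can be written as follows, where T is the detection interval, V the cost of one error check, λ the latent-error rate, and R(T) the expected recovery cost per error:

```latex
% Schematic first-order model (illustration only, not the paper's formulas).
% T = detection interval, V = verification cost, \lambda = latent-error rate,
% R(T) = expected recovery cost per error.
\[
  H(T) \;=\; \frac{V + \lambda T \, R(T)}{T}
\]
% Checkpoint/restart re-executes lost work, so roughly
% R_{\mathrm{CR}}(T) \approx T/2 + C, while ABFR recomputes only the part
% of the tree the error can have reached, so R_{\mathrm{ABFR}}(T) is much
% smaller and the optimal T can be correspondingly larger.
```

Minimizing H over T captures the tension the abstract describes: checking too often pays V repeatedly, while checking too rarely lets errors propagate and inflates R(T).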
International Conference on Cluster Computing | 2015
Nan Dun; Hajime Fujita; Aiman Fang; Yan Liu; Andrew A. Chien; Pavan Balaji; Kamil Iskra; Wesley Bland; Andrew R. Siegel
We present the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We briefly describe GVR's interfaces for distributed arrays, versioning, and cross-layer error recovery. We illustrate how GVR can be used for rollback recovery and a wide range of additional error recovery techniques, including forward recovery for latent errors or silent data corruptions. Application results demonstrate that GVR's interfaces and implementation are portable, flexible (supporting a variety of recovery models), and efficient, and create a gentle-slope path to tolerate growing error rates in future systems.
High Performance Distributed Computing | 2018
Aiman Fang; Andrew A. Chien
Exascale systems face high error rates due to increasing scale (10^9 cores), software complexity, and rising memory error rates. Increasingly, errors escape immediate hardware-level detection, silently corrupting application state. Such latent errors can often be detected by application-level tests, but typically at long latencies. We propose a new approach called application-based focused recovery (ABFR) that defines the application knowledge needed for efficient latent error recovery. This allows the application to pursue strategies exploiting a range of application semantics within a well-defined resilience framework. The ABFR runtime then exploits this knowledge to achieve efficient latent error tolerance. ABFR enables application designers to express resilience without concern for the underlying architectures and systems. Together, these ABFR properties support flexible application-based resilience. To demonstrate its generality, we apply ABFR to three varied scientific computations (stencil, N-body tree, and Monte Carlo). We measure latent error resilience performance for varied error rates; results indicate significant reductions in error recovery cost (up to 367x) and recovery latency (up to 24x). ABFR also achieves efficient and scalable recovery at scale with high latent error rates for these computations.
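A self-contained C++ toy of the focused-recovery idea on a 1-D stencil: after a latent error is detected, only the cells the error can have reached since the last version are recomputed from that version. The 3-point update rule, sizes, and names are our illustration, not the paper's code.

```cpp
// ABFR-style focused recovery on a toy 1-D stencil. A latent error found
// K steps after a version can only have spread K cells each way, so we
// recompute just that footprint (plus a K-cell halo) from the version.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int n = 16, K = 3;               // K = steps since last version
    auto step = [](const std::vector<double>& a) {
        std::vector<double> b(a.size(), 0.0);
        for (size_t i = 1; i + 1 < a.size(); ++i)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;  // 3-point stencil
        return b;
    };
    std::vector<double> saved(n, 1.0);     // versioned state at step k
    std::vector<double> cur = saved;
    for (int s = 0; s < K; ++s) cur = step(cur);
    cur[8] = 1e9;  // corruption detected at step k+K (worst case: injected at k)

    // Footprint of an error at cell 8 over K steps: cells [8-K, 8+K].
    int lo = std::max(1, 8 - K), hi = std::min(n - 2, 8 + K);
    // Recompute a window widened by K cells so the footprint is exact.
    int wlo = std::max(0, lo - K), whi = std::min(n - 1, hi + K);
    std::vector<double> w(saved.begin() + wlo, saved.begin() + whi + 1);
    for (int s = 0; s < K; ++s) w = step(w);
    for (int i = lo; i <= hi; ++i) cur[i] = w[i - wlo];  // splice the patch
    std::printf("cur[8] repaired to %.3f\n", cur[8]);    // prints 1.000
    return 0;
}
```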
Archive | 2016
Aiman Fang; Ignacio Laguna; Kento Sato; Tanzima Islam; Kathryn Mohror
Future high-performance computing systems may face frequent failures as they grow rapidly in scale and complexity. Resilience to faults has become a major challenge for large-scale applications running on supercomputers, which demands fault tolerance support for prevalent MPI applications. Among failure scenarios, process failures are one of the most severe issues, as they usually lead to termination of the application. However, widely used MPI implementations do not provide mechanisms for fault tolerance. We propose FTA-MPI (Fault Tolerance Assistant MPI), a programming model that provides support for failure detection, failure notification, and recovery. Specifically, FTA-MPI exploits a try/catch model that enables failure localization and transparent recovery of process failures in MPI applications. We demonstrate FTA-MPI with synthetic applications and the molecular dynamics code CoMD, and show that FTA-MPI provides high programmability for users and enables convenient and flexible recovery of process failures.
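Since FTA-MPI's interface is not shown here, the C++ sketch below only emulates the shape of the try/catch model, using standard MPI error handlers and exceptions; actual recovery from process failures requires FTA-MPI (or ULFM-like) runtime support, which plain MPI does not provide.

```cpp
// Emulating the try/catch failure-localization pattern with standard MPI.
#include <mpi.h>
#include <cstdio>
#include <stdexcept>

static void check(int rc, const char* what) {
    if (rc != MPI_SUCCESS) throw std::runtime_error(what);  // failure notification
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    // Have MPI return error codes instead of aborting the whole job.
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    try {
        int token = rank;
        check(MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD), "bcast failed");
        // ... application work ...
    } catch (const std::exception& e) {
        // Failure is localized here; an FTA-MPI-style runtime would repair
        // the communicator and let surviving ranks retry or continue.
        std::fprintf(stderr, "rank %d caught: %s\n", rank, e.what());
    }
    MPI_Finalize();
    return 0;
}
```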
European Conference on Parallel Processing | 2015
Aiman Fang; Hajime Fujita; Andrew A. Chien
We explore the post-recovery efficiency of shrinking and non-shrinking recovery schemes on high-performance computing systems using a synthetic benchmark, and study the impact of network topology on post-recovery communication performance. Our experiments on the IBM BG/Q system Mira show that shrinking recovery can deliver up to 7.5% better efficiency for a neighbor communication pattern, as non-shrinking recovery can reduce communication performance. We expected a similar situation for our synthetic benchmark with collective communication, but the situation is quite different: both shrinking and non-shrinking recovery reduce MPI performance (MPICH 3.1) dramatically on collective communication, up to 14× worse, swamping any differences between the two approaches. This suggests that making MPI performance less sensitive to irregularity and to communicator size is critical for both recovery approaches.
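A minimal sketch of the shrinking scheme with a simulated failure (standard MPI cannot survive a real one): MPI_Comm_split with MPI_UNDEFINED builds the smaller, irregular survivor communicator whose collective performance the experiments above measure.

```cpp
// Shrinking recovery sketch with a *simulated* failure of rank 1; real
// failure handling requires ULFM/MPIX extensions not used here.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int failed_rank = 1;  // pretend this rank has died

    // Shrinking: survivors split into a (size-1)-rank communicator.
    MPI_Comm shrunk;
    int color = (rank == failed_rank) ? MPI_UNDEFINED : 0;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &shrunk);
    if (shrunk != MPI_COMM_NULL) {
        int new_rank, new_size;
        MPI_Comm_rank(shrunk, &new_rank);
        MPI_Comm_size(shrunk, &new_size);
        // Collectives now run on an irregular, smaller communicator.
        std::printf("world %d -> shrunk %d of %d\n", rank, new_rank, new_size);
        MPI_Comm_free(&shrunk);
    }
    // Non-shrinking recovery would instead bring in a replacement rank
    // (e.g., via MPI_Comm_spawn) to keep the communicator at full size.
    MPI_Finalize();
    return 0;
}
```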
Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale | 2015
Aiman Fang; Andrew A. Chien
International Conference on Parallel Processing | 2017
Aiman Fang; Aurélien Cavelan; Yves Robert; Andrew A. Chien