Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where James John Elliott is active.

Publication


Featured research published by James John Elliott.


International Conference on Distributed Computing Systems | 2012

Combining Partial Redundancy and Checkpointing for HPC

James John Elliott; Kishor Kharbas; David Fiala; Frank Mueller; Kurt Brian Ferreira; Christian Engelmann

Today's largest High Performance Computing (HPC) systems exceed one petaflops (10^15 floating-point operations per second), and exascale systems are projected within seven years. However, reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are becoming the norm rather than the exception during execution of HPC applications. Current fault tolerance techniques in HPC focus on reactive ways to mitigate faults, namely via checkpoint and restart (C/R). Apart from storage overheads, C/R-based fault recovery comes at an additional cost in terms of application performance because normal execution is disrupted when checkpoints are taken. Studies have shown that applications running at large scale spend more than 50% of their total time saving checkpoints, restarting, and redoing lost work. Redundancy is another fault tolerance technique, which employs redundant processes performing the same task. If a process fails, a replica of it can take over its execution. Thus, redundant copies can decrease the overall failure rate. The downside of redundancy is that extra resources are required and there is additional overhead for communication and synchronization. This work contributes a model and analyzes the benefit of C/R in coordination with redundancy at different degrees to minimize the total wallclock time and resource utilization of HPC applications. We further conduct experiments with an implementation of redundancy within the MPI layer on a cluster. Our experimental results confirm the benefit of dual and triple redundancy, but not of partial redundancy, and show a close fit to the model. At ≈80,000 processes, dual redundancy requires twice the number of processing resources for an application but allows two jobs of 128 hours wallclock time to finish within the time of just one job without redundancy. For narrow ranges of processor counts, partial redundancy results in the lowest time. Once the count exceeds ≈770,000, triple redundancy has the lowest overall cost. Thus, redundancy allows one to trade off additional resource requirements against wallclock time, which provides a tuning knob for users to adapt to resource availability.
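
The abstract does not reproduce the paper's model, but the trade-off it describes can be illustrated with standard ingredients. The Python sketch below is a loose illustration, not the authors' model: it combines a Monte Carlo estimate of how many node failures an r-way redundant job absorbs before any rank loses all of its replicas with the classic Young/Daly first-order checkpoint-overhead approximation (optimal interval tau = sqrt(2 * delta * M)). The function names are mine and every parameter value in the demo is hypothetical.

import math
import random

def failures_absorbed(n_ranks: int, r: int, trials: int = 2000) -> float:
    """Monte Carlo estimate of the mean number of node failures a job with
    n_ranks r-way-replicated ranks absorbs before some rank loses its last
    replica. Failing ranks are drawn uniformly and replica depletion is
    ignored, a fine approximation while absorbed failures << r * n_ranks."""
    total = 0
    for _ in range(trials):
        hits = {}  # rank -> number of its replicas hit so far
        k = 0
        while True:
            k += 1
            rank = random.randrange(n_ranks)
            hits[rank] = hits.get(rank, 0) + 1
            if hits[rank] == r:  # rank lost its last replica: job interrupt
                break
        total += k
    return total / trials

def expected_wallclock(t_solve_h: float, n_ranks: int, r: int,
                       mttf_node_h: float, ckpt_cost_h: float) -> float:
    """First-order expected wallclock hours under checkpoint/restart plus
    r-way redundancy, using the Young/Daly optimal checkpoint interval."""
    failure_rate = (r * n_ranks) / mttf_node_h           # node failures per hour
    mtti = failures_absorbed(n_ranks, r) / failure_rate  # mean time to interrupt
    tau = math.sqrt(2 * ckpt_cost_h * mtti)              # Young/Daly interval
    waste = min(0.99, ckpt_cost_h / tau + tau / (2 * mtti))
    return t_solve_h / (1.0 - waste)

if __name__ == "__main__":
    # Hypothetical inputs for illustration only: 80,000 ranks, a 5-year
    # per-node MTTF, 6-minute checkpoints, 128 h of pure solve time.
    for r in (1, 2, 3):
        t = expected_wallclock(t_solve_h=128, n_ranks=80_000, r=r,
                               mttf_node_h=5 * 365 * 24, ckpt_cost_h=0.1)
        print(f"{r}-way redundancy ({r}x resources): ~{t:,.0f} h expected")

With these made-up inputs the sketch mirrors the qualitative behavior reported above: plain checkpointing at 80,000 processes loses most of its time to checkpoints and rework, while dual redundancy, despite doubling resources, brings the expected wallclock time back close to the pure solve time.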


Availability, Reliability and Security | 2009

Blue Gene/L Log Analysis and Time to Interrupt Estimation

Narate Taerat; Nichamon Naksinehaboon; Clayton Chandler; James John Elliott; Chokchai Leangsuksun; George Ostrouchov; Stephen L. Scott; Christian Engelmann

System- and application-level failures can be characterized by analyzing relevant log files. The resulting data can then inform numerous studies of, and future developments for, mission-critical, large-scale computational architectures, including fields such as failure prediction, reliability modeling, performance modeling, and power awareness. In this paper, system logs covering a six-month period of the Blue Gene/L supercomputer were obtained and subsequently analyzed. Temporal filtering was applied to remove duplicated log messages. Optimistic and pessimistic perspectives were then applied to the filtered log information to observe failure behavior within the system. Finally, various time-to-repair factors were applied to obtain the application time to interrupt, which will be exploited in further resilience modeling research.
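
The abstract does not spell out the filtering procedure, but temporal filtering of duplicated log messages is a standard log-preprocessing step. Below is a minimal sketch, not the paper's exact method: it assumes log records carry a timestamp, a source location, and a message, and it uses a hypothetical five-minute suppression window; the LogRecord and temporal_filter names are illustrative.

from datetime import datetime, timedelta
from typing import Iterable, NamedTuple

class LogRecord(NamedTuple):
    time: datetime
    location: str   # e.g., the node or component that emitted the message
    message: str

def temporal_filter(records: Iterable[LogRecord],
                    window: timedelta = timedelta(minutes=5)):
    """Yield records in time order, suppressing repeats of the same
    (location, message) pair that arrive within `window` of the last
    kept occurrence, collapsing bursts of duplicate log messages."""
    last_kept = {}  # (location, message) -> timestamp of last kept record
    for rec in sorted(records, key=lambda r: r.time):
        key = (rec.location, rec.message)
        prev = last_kept.get(key)
        if prev is None or rec.time - prev > window:
            last_kept[key] = rec.time
            yield rec

if __name__ == "__main__":
    t0 = datetime(2006, 1, 1)
    raw = [LogRecord(t0 + timedelta(seconds=s), "R32-M0", "torus receiver error")
           for s in (0, 10, 20, 400)]
    for rec in temporal_filter(raw):
        print(rec.time, rec.location, rec.message)  # keeps s=0 and s=400 only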


International Parallel and Distributed Processing Symposium | 2014

Evaluating the Impact of SDC on the GMRES Iterative Solver

James John Elliott; Mark Hoemmen; Frank Mueller


Archive | 2013

Quantifying the Impact of Single Bit Flips on Floating Point Arithmetic

James John Elliott; Frank Mueller; Miroslav K Stoyanov; Clayton G Webster


High Performance Distributed Computing | 2015

A Numerical Soft Fault Model for Iterative Linear Solvers

James John Elliott; Mark Hoemmen; Frank Mueller


Journal of Computational Science | 2016

Exploiting data representation for fault tolerance

James John Elliott; Mark Hoemmen; Frank Mueller


arXiv: Mathematical Software | 2014

Resilience in Numerical Methods: A Position on Fault Models and Methodologies

James John Elliott; Mark Hoemmen; Frank Mueller


arXiv: Distributed, Parallel, and Cluster Computing | 2014

Tolerating Silent Data Corruption in Opaque Preconditioners

James John Elliott; Mark Hoemmen; Frank Mueller


Archive | 2015

The cost of reliability: Iterative linear solvers and reactive fault tolerance

James John Elliott; Mark Hoemmen; Frank Mueller


Archive | 2014

Resilient iterative linear solvers via skeptical programming

Mark Hoemmen; James John Elliott; Frank Mueller

Collaboration


Dive into James John Elliott's collaborations.

Top Co-Authors

Frank Mueller
North Carolina State University

Mark Hoemmen
Sandia National Laboratories

Christian Engelmann
Oak Ridge National Laboratory

Clayton G Webster
North Carolina State University

David Fiala
North Carolina State University

Galen E. Turner
Louisiana Tech University

George Ostrouchov
Oak Ridge National Laboratory

Kishor Kharbas
North Carolina State University