Nichamon Naksinehaboon
Louisiana Tech University
Publications
Featured research published by Nichamon Naksinehaboon.
international parallel and distributed processing symposium | 2008
Yudan Liu; Raja Nassar; Chokchai Leangsuksun; Nichamon Naksinehaboon; Mihaela Paun; Stephen L. Scott
The increase in the physical size of high performance computing (HPC) platforms makes system reliability more challenging. In order to minimize the performance loss (rollback and checkpoint overheads) due to unexpected failures or the unnecessary overhead of fault tolerance mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy. Our scheme aims to address the fault tolerance challenge, especially in large-scale HPC systems, by providing optimal checkpoint placement techniques that are derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failures and a constant checkpoint interval, our model can deal with a varying checkpoint interval and with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.
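For context, the constant-interval, Poisson-failure baseline that this model generalizes can be written down directly using Young's first-order approximation. A minimal sketch follows; the checkpoint overhead and MTBF values are illustrative assumptions, not figures from the paper.

```python
import math

# Young's first-order approximation for the optimal *constant* checkpoint
# interval under Poisson failures -- the baseline this model generalizes.
# C and MTBF are illustrative values, not taken from the paper.
C = 60.0            # checkpoint overhead, seconds
MTBF = 8 * 3600.0   # mean time between failures, seconds

tau_opt = math.sqrt(2 * C * MTBF)
print("optimal constant interval ~ %.0f s (%.1f min)" % (tau_opt, tau_opt / 60))
```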
cluster computing and the grid | 2008
Nichamon Naksinehaboon; Yudan Liu; Chokchai Leangsuksun; Raja Nassar; Mihaela Paun; Stephen L. Scott
For a full checkpoint on a large-scale HPC system, a huge memory context must potentially be transferred through the network and saved to reliable storage. As such, the time taken to checkpoint becomes a critical issue that directly impacts the total execution time. Therefore, incremental checkpointing, as a less intrusive method to reduce the wasted time, has been gaining significant attention in the HPC community. In this paper, we build a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints. Moreover, a method to find the number of those incremental checkpoints is given. Furthermore, most of the comparison results between the incremental checkpoint model and the full checkpoint model (Liu et al., 2007) on the same failure data set show that the total wasted time in the incremental checkpoint model is significantly smaller than the wasted time in the full checkpoint model.
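To make the idea concrete, here is a minimal sketch of how the number of incremental checkpoints between two consecutive full checkpoints might be chosen by minimizing a rough expected-waste expression. The cost parameters, the exponential failure assumption, and the waste formula are illustrative assumptions, not the paper's model.

```python
import math

# Illustrative parameters (assumptions, not taken from the paper's model):
C_FULL = 120.0               # full checkpoint overhead, seconds
C_INCR = 15.0                # incremental checkpoint overhead, seconds
R_INCR = 10.0                # extra recovery cost per incremental checkpoint to replay
T_FULL = 3600.0              # spacing between two consecutive full checkpoints, seconds
LAMBDA = 1 / (12 * 3600.0)   # failure rate (exponential here for brevity)

def expected_waste(m):
    """Rough expected waste per full-checkpoint period with m incremental checkpoints."""
    seg = T_FULL / (m + 1)                     # work between consecutive checkpoints
    p_fail = 1.0 - math.exp(-LAMBDA * T_FULL)  # chance of a failure in the period
    rework = seg / 2.0                         # on average half a segment is recomputed
    replay = (m / 2.0) * R_INCR                # replay about half the incremental chain
    return m * C_INCR + C_FULL + p_fail * (rework + replay)

best_m = min(range(40), key=expected_waste)
print("incremental checkpoints per full-checkpoint period:", best_m)
```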
international conference on cluster computing | 2007
Yudan Liu; Raja Nassar; Chokchai Leangsuksun; Nichamon Naksinehaboon; Mihaela Paun; Stephen L. Scott
The increase in the physical size of high performance computing (HPC) platforms makes system reliability more challenging. In order to minimize the performance loss due to unexpected failures or the unnecessary overhead of fault tolerance mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy aimed at minimizing rollback and checkpoint overheads. Our scheme addresses the fault tolerance challenge, especially in large-scale HPC systems, by providing optimal checkpoint placement techniques that are derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failures and a constant checkpoint interval, our model can handle a varying checkpoint interval and deal with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.
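The varying checkpoint interval mentioned here can be illustrated with a simple placement rule: space checkpoints so that each interval carries the same increment of cumulative hazard, which shortens intervals where the hazard is high. The sketch below uses a Weibull model with illustrative parameters; it shows the general idea, not the paper's exact derivation.

```python
# Checkpoint placement by equal cumulative-hazard increments under a Weibull
# failure model.  Parameters are illustrative assumptions only.
ETA, BETA = 8 * 3600.0, 0.7    # Weibull scale and shape (shape < 1: decreasing hazard)
HORIZON = 24 * 3600.0          # planning horizon, seconds
N_CKPT = 12                    # number of checkpoints to place

def cum_hazard(t):
    return (t / ETA) ** BETA

def inv_cum_hazard(h):
    return ETA * h ** (1.0 / BETA)

total_h = cum_hazard(HORIZON)
placements = [inv_cum_hazard(total_h * k / (N_CKPT + 1)) for k in range(1, N_CKPT + 1)]
print([round(t / 60.0, 1) for t in placements])   # checkpoint times in minutes
```

With a decreasing hazard the intervals grow over time; a shape parameter above one would instead pack checkpoints more densely as the system ages.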
availability, reliability and security | 2009
Narate Taerat; Nichamon Naksinehaboon; Clayton Chandler; James John Elliott; Chokchai Leangsuksun; George Ostrouchov; Stephen L. Scott; Christian Engelmann
System- and application-level failures can be characterized by analyzing relevant log files. The resulting data can then be used in numerous studies on, and future developments for, mission-critical and large-scale computational architectures, including fields such as failure prediction, reliability modeling, performance modeling, and power awareness. In this paper, system logs covering a six-month period of the Blue Gene/L supercomputer were obtained and subsequently analyzed. Temporal filtering was applied to remove duplicated log messages. Optimistic and pessimistic perspectives were applied to the filtered log information to observe failure behavior within the system. Further, various time-to-repair factors were applied to obtain the application time to interrupt, which will be exploited in further resilience modeling research.
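The temporal filtering step can be sketched as follows: a message with the same source and text as a previously kept message is dropped if it arrives within a fixed window. The record layout and the window length below are assumptions for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=60)   # assumed de-duplication window

def temporal_filter(events):
    """events: iterable of (timestamp, source, message) tuples sorted by timestamp."""
    last_kept = {}   # (source, message) -> timestamp of the last kept occurrence
    kept = []
    for ts, source, message in events:
        key = (source, message)
        if key in last_kept and ts - last_kept[key] < WINDOW:
            continue             # duplicate within the window: drop it
        last_kept[key] = ts
        kept.append((ts, source, message))
    return kept

events = [
    (datetime(2006, 1, 1, 0, 0, 0), "R23-M1-N2", "machine check interrupt"),
    (datetime(2006, 1, 1, 0, 0, 5), "R23-M1-N2", "machine check interrupt"),
    (datetime(2006, 1, 1, 0, 2, 0), "R23-M1-N2", "machine check interrupt"),
]
print(len(temporal_filter(events)))   # -> 2
```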
international symposium on parallel and distributed processing and applications | 2010
Nichamon Naksinehaboon; Narate Taerat; Chokchai Leangsuksun; Clayton Chandler; Stephen L. Scott
Rejuvenation is a technique expected to mitigate failures in HPC systems by replacing, repairing, or resetting system components. Because of the small overhead required by software rejuvenation, we primarily focus on OS/kernel rejuvenation. In this paper, we propose three rejuvenation scheduling techniques. Moreover, we investigate the claim that software rejuvenation prolongs failure times in HPC systems. Also, we compare the lost computing times of the checkpoint/restart mechanism with and without rejuvenation after each checkpoint.
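The comparison described here can be sketched with a small simulation: failures follow a Weibull model with increasing hazard (aging), and rejuvenation after each checkpoint is modeled as resetting the hazard clock at a fixed cost. All parameters and the "as good as new" assumption are illustrative, not the paper's experimental setup.

```python
import random

# Illustrative assumptions: Weibull failures with increasing hazard (aging);
# rejuvenation after each checkpoint resets the hazard clock at cost REJUV.
SHAPE, SCALE = 1.5, 12 * 3600.0    # Weibull shape and scale, seconds
C, R, REJUV = 60.0, 300.0, 90.0    # checkpoint, restart, and rejuvenation costs
TAU, WORK = 3600.0, 48 * 3600.0    # checkpoint interval and total useful work

def lost_time(rejuvenate, trials=1000, seed=7):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        done = waste = age = 0.0               # age: time since the last "as new" state
        ttf = rng.weibullvariate(SCALE, SHAPE)
        while done < WORK:
            seg = min(TAU, WORK - done)
            step = seg + C + (REJUV if rejuvenate else 0.0)
            if age + step <= ttf:              # segment, checkpoint (and rejuvenation) succeed
                age += step
                done += seg
                waste += C + (REJUV if rejuvenate else 0.0)
                if rejuvenate:                 # system treated as good as new
                    age = 0.0
                    ttf = rng.weibullvariate(SCALE, SHAPE)
            else:                              # failure before the checkpoint completes
                waste += (ttf - age) + R       # lost work plus restart cost
                age = 0.0                      # restart renews the system in this sketch
                ttf = rng.weibullvariate(SCALE, SHAPE)
        total += waste
    return total / trials

print("lost time without rejuvenation: %.0f s" % lost_time(False))
print("lost time with rejuvenation:    %.0f s" % lost_time(True))
```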
International Journal of Foundations of Computer Science | 2010
Mihaela Paun; Nichamon Naksinehaboon; Raja Nassar; Chokchai Leangsuksun; Stephen L. Scott; Narate Taerat
The incremental checkpoint mechanism was introduced to reduce the high checkpoint overhead of regular (full) checkpointing, especially in high-performance computing systems. To gain an extra advantage from the incremental checkpoint technique, we propose an optimal checkpoint frequency function that globally minimizes the expected wasted time of the incremental checkpoint mechanism. Also, the re-computing time coefficient used to approximate the re-computing time is derived. Moreover, to reduce the complexity of recovery, full checkpoints are performed from time to time. In this paper we present an approach to evaluate the appropriate constant number of incremental checkpoints between two consecutive full checkpoints. Although the number of incremental checkpoints is constant, the checkpoint interval derived from the proposed model varies depending on the failure rate of the system. The checkpoint time is illustrated in the case of a Weibull distribution and can easily be simplified to the exponential case.
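To make the notion of an optimal checkpoint frequency function concrete, here is a minimal sketch of this style of derivation; it is illustrative and not necessarily the exact model or re-computing approximation used in the paper.

```latex
% Sketch of a frequency-function derivation (illustrative, not the paper's exact model).
% With checkpoint overhead $C$, failure density $f(t)$, and checkpoint frequency
% function $n(t)$, a first-order expression for the expected wasted time is
\[
  W[n] \;\approx\; \int_0^{\infty} \Big( C\, n(t) \;+\; \frac{f(t)}{2\, n(t)} \Big)\, dt ,
\]
% where $C\,n(t)$ is the checkpointing cost density and $f(t)/(2\,n(t))$ approximates
% the re-computing time (about half an interval is lost if a failure lands at time $t$).
% Minimizing the integrand pointwise in $n(t)$ gives
\[
  C - \frac{f(t)}{2\, n(t)^2} = 0
  \quad\Longrightarrow\quad
  n^{*}(t) = \sqrt{\frac{f(t)}{2C}} ,
\]
% so the interval $1/n^{*}(t)$ varies with the failure density; substituting a Weibull
% density gives a time-varying interval, with the exponential density as a special case.
```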
acm sigplan symposium on principles and practice of parallel programming | 2009
Stephen L. Scott; Christian Engelmann; Geoffroy Vallée; Thomas Naughton; Anand Tikotekar; George Ostrouchov; Chokchai Leangsuksun; Nichamon Naksinehaboon; Raja Nassar; Mihaela Paun; Frank Mueller; Chao Wang; Arun Babu Nagarajan; Jyothish Varma
In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.
acs/ieee international conference on computer systems and applications | 2011
Supada Laosooksathit; Nichamon Naksinehaboon; Chokchai Leangsuksun
Because the reliability and availability of a large-scale system are inversely related to the number of its computing elements, fault tolerance has become a major concern in high performance computing (HPC), including very large systems with GPGPUs. In this paper, we propose a checkpoint/restart mechanism model that employs a two-phase protocol and a latency-hiding technique, such as CUDA streams, in order to achieve a low checkpoint overhead. We introduce GPU checkpoint and restart protocols. Also, we show experimental results and analyze the impact of the mechanism, especially for long-running applications.
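A schematic sketch of the two-phase idea follows, in plain Python rather than CUDA: phase one snapshots device state into a host staging buffer in chunks, and phase two flushes that buffer to stable storage in the background so the application can resume sooner. The function names, chunk size, and fake device are illustrative assumptions; the actual mechanism overlaps device-to-host copies with kernel execution using CUDA streams.

```python
import threading

CHUNK = 1024 * 1024   # copy granularity in bytes (assumption)

def phase1_snapshot(device_read, size):
    """Phase 1: copy device state into a host staging buffer, chunk by chunk."""
    staging = bytearray(size)
    for off in range(0, size, CHUNK):
        n = min(CHUNK, size - off)
        staging[off:off + n] = device_read(off, n)   # stands in for an async D2H copy
    return staging

def phase2_flush(staging, path):
    """Phase 2: write the staging buffer to stable storage."""
    with open(path, "wb") as f:
        f.write(staging)

def checkpoint(device_read, size, path):
    staging = phase1_snapshot(device_read, size)            # application pauses only here
    flusher = threading.Thread(target=phase2_flush, args=(staging, path))
    flusher.start()                                         # phase 2 runs in the background
    return flusher                                          # application resumes; join() later

# Hypothetical usage with a fake "device" buffer:
fake_device = bytes(8 * 1024 * 1024)
reader = lambda off, n: fake_device[off:off + n]
flusher = checkpoint(reader, len(fake_device), "/tmp/gpu_ckpt.bin")
flusher.join()
```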
international symposium on parallel and distributed processing and applications | 2010
Narate Taerat; Chokchai Leangsuksun; Clayton Chandler; Nichamon Naksinehaboon
The number of failures occurring in large-scale high performance computing (HPC) systems is increasing significantly due to the large number of physical components found in such systems. Fault tolerance (FT) mechanisms help parallel applications mitigate the impact of failures. However, using such mechanisms incurs additional overhead. As such, failure prediction is needed in order to utilize FT mechanisms wisely; hence, the proficiency of a failure predictor determines the efficiency of FT mechanism utilization. The proficiency of a failure predictor in HPC is usually assessed with well-known error measurements, e.g., MSE, MAD, precision, and recall, where lower error implies greater proficiency. In this manuscript, we propose to view prediction proficiency from another aspect: lost computing time. We then discuss why error measurements are insufficient as HPC failure prediction proficiency metrics from the perspective of lost computing time, and propose novel metrics that address these issues.
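One way to make the lost-computing-time viewpoint concrete is the following accounting sketch: each correctly predicted failure costs a preventive checkpoint plus the short window before the failure, each false alarm costs a wasted preventive checkpoint, and each unpredicted failure costs the rework since the last periodic checkpoint. The policy, costs, and matching rule are illustrative assumptions, not the metrics actually proposed in the paper.

```python
# Illustrative lost-computing-time accounting for a failure predictor
# (assumed policy, not the paper's exact metric definitions).
CKPT_COST = 60.0       # cost of a preventive checkpoint triggered by a prediction, seconds
CKPT_PERIOD = 3600.0   # periodic checkpoints taken regardless of predictions, seconds
LEAD_TIME = 300.0      # a prediction counts as correct if a failure follows within this window

def lost_compute_time(predicted, actual):
    """predicted, actual: sorted lists of event times in seconds."""
    lost, matched = 0.0, set()
    for p in predicted:
        hit = next((a for a in actual if 0.0 <= a - p <= LEAD_TIME and a not in matched), None)
        if hit is None:
            lost += CKPT_COST                 # false alarm: wasted preventive action
        else:
            matched.add(hit)
            lost += CKPT_COST + (hit - p)     # acted in time; only the short window is lost
    for a in actual:
        if a not in matched:
            lost += a % CKPT_PERIOD           # unpredicted: rework since the last periodic checkpoint
    return lost

print(lost_compute_time(predicted=[1000.0, 9000.0], actual=[1200.0, 20000.0]))
```

Two predictors with similar precision and recall can score very differently under this accounting, which illustrates why the abstract argues that error measurements alone are insufficient.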
international conference on parallel processing | 2018
Ramin Izadpanah; Nichamon Naksinehaboon; Jim M. Brandt; Ann C. Gentile; Damian Dechev
The growth of High Performance Computing (HPC) systems increases the complexity of understanding resource utilization, system management, and performance issues. While raw performance data is increasingly exposed at the component level, the usefulness of that data depends on the ability to perform meaningful analysis on actionable timescales. However, current system monitoring infrastructures largely focus on data collection, with analysis performed off-system in post-processing mode. This increases the time required to provide analysis and feedback to a variety of consumers. In this work, we enhance the architecture of a monitoring system used on large-scale computational platforms to integrate streaming analysis capabilities at arbitrary locations within its data collection, transport, and aggregation facilities. We leverage the flexible communication topology of the monitoring system to enable placement of transformations based on overhead concerns, while still enabling low-latency exposure on the node. Our design internally supports and exposes the raw and transformed data uniformly for both node-level and off-system consumers. We show the viability of our implementation for a case with production relevance: run-time determination of relative per-node file system demands.
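As a toy illustration of the kind of in-transport streaming transformation described here, the sketch below places a transform at an aggregation point that turns a raw per-node file system counter into a rate as samples flow through, while keeping the raw values available downstream. The field names and sample layout are assumptions, not the monitoring system's actual schema.

```python
# Toy in-stream transform: convert a cumulative per-node I/O counter into a rate
# at an aggregation point, forwarding both raw and derived values downstream.
# Field names ("node", "fs_bytes_written", ...) are illustrative assumptions.
class RateTransform:
    def __init__(self, counter_key):
        self.counter_key = counter_key
        self.last = {}                 # node -> (timestamp, counter value)

    def process(self, sample):
        """sample: dict with 'node', 'time', and raw counters; returns an enriched copy."""
        node, now, value = sample["node"], sample["time"], sample[self.counter_key]
        out = dict(sample)             # raw data stays available to downstream consumers
        if node in self.last:
            t0, v0 = self.last[node]
            if now > t0:
                out[self.counter_key + "_rate"] = (value - v0) / (now - t0)
        self.last[node] = (now, value)
        return out

transform = RateTransform("fs_bytes_written")
stream = [
    {"node": "nid00012", "time": 0.0, "fs_bytes_written": 1000000},
    {"node": "nid00012", "time": 10.0, "fs_bytes_written": 6000000},
]
for sample in stream:
    print(transform.process(sample))
```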