Narate Taerat
Louisiana Tech University
Publication
Featured research published by Narate Taerat.
Availability, Reliability and Security | 2009
Narate Taerat; Nichamon Naksinehaboon; Clayton Chandler; James John Elliott; Chokchai Leangsuksun; George Ostrouchov; Stephen L. Scott; Christian Engelmann
System- and application-level failures can be characterized by analyzing relevant log files. The resulting data can then be used in numerous studies on, and future developments for, mission-critical and large-scale computational architectures, including fields such as failure prediction, reliability modeling, performance modeling and power awareness. In this paper, system logs covering a six-month period of the Blue Gene/L supercomputer were obtained and analyzed. Temporal filtering was applied to remove duplicated log messages. Optimistic and pessimistic perspectives were applied to the filtered log information to observe failure behavior within the system. Further, various time-to-repair factors were applied to obtain application time to interrupt, which will be exploited in further resilience modeling research.
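A minimal Python sketch of the temporal-filtering step described above; the tuple layout, field names, and 60-second window are illustrative assumptions, not the paper's actual filtering rules:

```python
from datetime import datetime, timedelta

def temporal_filter(events, window=timedelta(seconds=60)):
    """Drop messages that repeat the same (location, text) pair within a
    time window. `events` is an iterable of (timestamp, location, message)
    tuples, assumed sorted by timestamp. Because the last-seen time is
    updated even for dropped messages, a burst of duplicates is coalesced
    into one kept message as long as the gaps stay within the window.
    """
    last_seen = {}   # (location, message) -> timestamp of last occurrence
    filtered = []
    for ts, loc, msg in events:
        key = (loc, msg)
        prev = last_seen.get(key)
        if prev is None or ts - prev > window:
            filtered.append((ts, loc, msg))
        last_seen[key] = ts
    return filtered

# Hypothetical log entries:
events = [
    (datetime(2006, 1, 1, 0, 0, 0), "R23-M1", "L3 ECC error"),
    (datetime(2006, 1, 1, 0, 0, 5), "R23-M1", "L3 ECC error"),  # duplicate, dropped
    (datetime(2006, 1, 1, 0, 2, 0), "R23-M1", "L3 ECC error"),  # outside window, kept
]
print(temporal_filter(events))
```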
International Symposium on Parallel and Distributed Processing and Applications | 2010
Nichamon Naksinehaboon; Narate Taerat; Chokchai Leangsuksun; Clayton Chandler; Stephen L. Scott
Rejuvenation is a technique expected to mitigate failures in HPC systems by replacing, repairing, or resetting system components. Because of the small overhead required by software rejuvenation, we primarily focus on OS/kernel rejuvenation. In this paper, we propose three rejuvenation scheduling techniques. Moreover, we investigate the claim that software rejuvenation prolongs failure times in HPC systems. Also, we compare the lost computing times of the checkpoint/restart mechanism with and without rejuvenation after each checkpoint.
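To make the lost-computing-time comparison concrete, here is a toy Python accounting of periodic checkpoint/restart with and without a rejuvenation step after each checkpoint. The cost model and all numbers are invented for illustration; this is not one of the scheduling techniques proposed in the paper:

```python
def lost_fraction(mtbf, interval, ckpt_overhead, restart, rejuv=0.0):
    """Rough fraction of wall-clock time lost per hour of execution
    (toy model). Overheads are paid once per checkpoint interval;
    a failure lands mid-interval on average, costing ~interval/2 of
    rework plus a restart. All arguments are in hours.
    """
    overhead_rate = (ckpt_overhead + rejuv) / interval
    failure_rate = 1.0 / mtbf
    rework_per_failure = interval / 2.0 + restart
    return overhead_rate + failure_rate * rework_per_failure

# Hypothetical trade-off: rejuvenation adds overhead but prolongs the MTBF.
base = lost_fraction(mtbf=20.0, interval=1.0, ckpt_overhead=0.05, restart=0.1)
rej  = lost_fraction(mtbf=30.0, interval=1.0, ckpt_overhead=0.05, restart=0.1,
                     rejuv=0.02)
print(f"lost fraction without rejuvenation: {base:.3f}")
print(f"lost fraction with rejuvenation:    {rej:.3f}")
```

Under these made-up numbers the extra rejuvenation overhead is repaid by the longer time between failures, which is exactly the kind of claim the paper sets out to test.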
International Conference on Cluster Computing | 2007
Narasimha Raju Gottumukkala; Chokchai Leangsuksun; Narate Taerat; Raja Nassar; Stephen L. Scott
Failures and downtimes severely impact the performance of parallel programs in a large-scale High Performance Computing (HPC) environment. There have been several research efforts to understand the failure behavior of computing systems. However, the multitude of hardware and software components required for uninterrupted operation of parallel programs makes failure and reliability prediction a challenging problem. HPC run-time systems such as checkpoint frameworks and resource managers rely on reliability knowledge of resources to minimize the performance loss due to unexpected failures. In this paper, we first analyze the Time Between Failure (TBF) distribution of individual nodes from a 512-node HPC system. Time-varying distributions such as Weibull, lognormal and gamma are observed to have better goodness-of-fit than the exponential distribution. We then present a reliability-aware resource allocation model for parallel programs based on one of the time-varying distributions, and present reliability-aware resource allocation algorithms to minimize the performance loss due to failures. We show the effectiveness of these algorithms using the actual failure logs of the 512-node system and parallel workloads obtained from LANL and SDSC. The simulation results indicate that applying reliability-aware resource allocation techniques reduces the overall waste time of parallel jobs by as much as 30%. A further 15% improvement in waste time is possible by considering job run lengths in reliability-aware scheduling.
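As a sketch of the goodness-of-fit comparison the paper performs, the snippet below fits exponential, Weibull, lognormal and gamma distributions to time-between-failure samples and compares them with a Kolmogorov-Smirnov statistic. The data here are synthetic; the paper uses per-node failure logs from the 512-node system:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical TBF samples (hours), drawn with a decreasing hazard rate
# so the time-varying fits should win, as observed in the paper.
tbf = rng.weibull(0.7, size=200) * 500.0

for name, dist in [("exponential", stats.expon),
                   ("weibull",     stats.weibull_min),
                   ("lognormal",   stats.lognorm),
                   ("gamma",       stats.gamma)]:
    params = dist.fit(tbf, floc=0)            # pin location at 0 for lifetimes
    ks = stats.kstest(tbf, dist.cdf, args=params)
    print(f"{name:>11}: KS statistic = {ks.statistic:.3f}")
```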
Computer Science - Research and Development | 2011
Narate Taerat; Jim M. Brandt; Ann C. Gentile; Matthew H. Wong; Chokchai Leangsuksun
The rate of failures in HPC systems continues to increase as the number of components comprising the systems grows. System logs are one of the valuable information sources that can be used to analyze system failures and their root causes. However, system log files are usually too large and complex to analyze manually. Some existing log clustering tools seek to help analysts explore these logs; however, they fail to satisfy our needs with respect to scalability, usability and quality of results. Thus, we have developed a log clustering tool to better address these needs. In this paper we present our novel approach and initial experimental results.
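The paper's clustering algorithm is not reproduced here, but a common baseline for log clustering is to mask variable tokens so that structurally identical messages collapse to one template. A minimal Python sketch of that idea:

```python
import re
from collections import defaultdict

def template(msg):
    """Replace variable tokens (hex addresses, numbers) with placeholders
    so messages that differ only in those fields share a cluster key.
    This is a simplification, not the algorithm from the paper."""
    msg = re.sub(r"0x[0-9a-fA-F]+", "<hex>", msg)
    msg = re.sub(r"\d+", "<num>", msg)
    return msg

def cluster(lines):
    clusters = defaultdict(list)
    for line in lines:
        clusters[template(line)].append(line)
    return clusters

# Hypothetical log lines:
logs = [
    "node 17: machine check at 0xdeadbeef",
    "node 242: machine check at 0x00ff00ff",
    "link retrain on port 3",
]
for key, members in cluster(logs).items():
    print(f"{len(members):3d}  {key}")
```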
International Journal of Foundations of Computer Science | 2010
Mihaela Paun; Nichamon Naksinehaboon; Raja Nassar; Chokchai Leangsuksun; Stephen L. Scott; Narate Taerat
The incremental checkpoint mechanism was introduced to reduce the high overhead of regular (full) checkpointing, especially in high-performance computing systems. To gain an extra advantage from the incremental checkpoint technique, we propose an optimal checkpoint frequency function that globally minimizes the expected wasted time of the incremental checkpoint mechanism. We also derive the re-computing time coefficient used to approximate the re-computing time. Moreover, to reduce the complexity of the recovery state, full checkpoints are performed from time to time. In this paper we present an approach to determine the appropriate constant number of incremental checkpoints between two consecutive full checkpoints. Although the number of incremental checkpoints is constant, the checkpoint interval derived from the proposed model varies with the failure rate of the system. The checkpoint time is illustrated for a Weibull distribution and can easily be simplified to the exponential case.
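For orientation only: the classic first-order result for full checkpointing (Young's approximation) picks a constant interval that balances checkpoint overhead against expected rework. The paper's model generalizes beyond this to incremental checkpoints and a Weibull (time-varying) failure rate, where the interval is no longer constant. A sketch of the baseline:

```python
import math

def young_interval(ckpt_overhead, mtbf):
    """Young's first-order optimal interval for full checkpointing under
    an exponential failure assumption: t = sqrt(2 * C * MTBF). Shown as a
    point of reference; it is not the model derived in the paper."""
    return math.sqrt(2.0 * ckpt_overhead * mtbf)

# Hypothetical values in hours: 6-minute checkpoints, 24-hour MTBF.
print(f"checkpoint every {young_interval(0.1, 24.0):.2f} h")
```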
International Conference on Parallel Processing | 2011
Jim M. Brandt; Frank Xiaoxiao Chen; Ann C. Gentile; Chokchai Leangsuksun; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; Narate Taerat; David C. Thompson; Matthew H. Wong
Building the effective HPC resilience mechanisms required for the viability of next-generation supercomputers will require in-depth understanding of system and component behaviors. Our goal is to build an integrated framework for high-fidelity long-term information storage, historic and run-time analysis, and algorithmic and visual information exploration to enable system understanding, timely failure detection/prediction, and triggering of appropriate responses to failure situations. Since it is unknown which information is relevant, and since potentially relevant data may be expressed in a variety of forms (e.g., numeric, textual), this framework must provide capabilities to process different forms of data and also support the integration of new data, data sources, and analysis capabilities. Further, to ensure ease of use as capabilities and data sources expand, it must also provide interactivity between its elements. This paper describes our integration of the capabilities mentioned above into our OVIS tool.
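One way to read the extensibility requirement above is as a plug-in pipeline in which numeric and textual samples share a common record type, so new sources and analyses can be registered without touching the core. This is an illustrative sketch under that reading, not OVIS's actual architecture or schema:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Sample:
    source: str   # e.g. a metric collector or a log parser (hypothetical)
    time: float
    kind: str     # "numeric" or "text"
    value: object

def run(sources: Iterable[Callable[[], Iterable[Sample]]],
        analyses: Iterable[Callable[[Sample], None]]) -> None:
    """Fan every sample from every registered source out to every
    registered analysis; adding a data source or an analysis capability
    is just registering another callable."""
    for source in sources:
        for sample in source():
            for analyze in analyses:
                analyze(sample)

# Hypothetical usage: one numeric source, one textual source, one analysis.
cpu = lambda: [Sample("cpu", 0.0, "numeric", 0.93)]
log = lambda: [Sample("syslog", 0.1, "text", "machine check")]
run([cpu, log], [lambda s: print(s.kind, s.source, s.value)])
```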
International Symposium on Parallel and Distributed Processing and Applications | 2010
Narate Taerat; Chokchai Leangsuksun; Clayton Chandler; Nichamon Naksinehaboon
The number of failures occurring in large-scale high performance computing (HPC) systems is increasing significantly due to the large number of physical components in the system. Fault tolerance (FT) mechanisms help parallel applications mitigate the impact of failures. However, using such mechanisms incurs additional overhead. As such, failure prediction is needed in order to utilize FT mechanisms intelligently, and the proficiency of a failure predictor determines the efficiency of FT mechanism utilization. The proficiency of a failure predictor in HPC is usually assessed with well-known error measurements, e.g. MSE, MAD, precision and recall, where lower error implies greater proficiency. In this manuscript, we propose to view prediction proficiency from another aspect: lost computing time. We then discuss the insufficiency of error measurements as HPC failure prediction proficiency metrics from the perspective of lost computing time, and propose novel metrics that address these issues.
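A small illustration of why error measurements alone can mislead: the two hypothetical predictors below have identical recall, yet very different costs once false alarms are priced in time. The cost model here is invented for illustration and is not the metric proposed in the paper:

```python
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def lost_compute_time(tp, fp, fn, action_cost, rework):
    """Toy time-based view: every prediction (true or false) triggers one
    proactive action, and every missed failure costs full rework. Units
    are hours; the weights are hypothetical."""
    return (tp + fp) * action_cost + fn * rework

# Two hypothetical predictors with the same recall:
for name, (tp, fp, fn) in {"A": (8, 2, 2), "B": (8, 40, 2)}.items():
    p, r = precision_recall(tp, fp, fn)
    t = lost_compute_time(tp, fp, fn, action_cost=0.05, rework=1.5)
    print(f"{name}: precision={p:.2f} recall={r:.2f} lost_time={t:.2f} h")
```

Predictor B matches A on recall but loses noticeably more computing time, which is the gap the proposed metrics are meant to expose.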
International Parallel and Distributed Processing Symposium | 2016
Benjamin A. Allan; Jim M. Brandt; Ann C. Gentile; Cory Lueninghoener; Nichamon Naksinehaboon; Boyana Norris; Narate Taerat
One of the hallmarks of previous HPCMASPA workshops has been their ability to bridge many aspects of monitoring large scale HPC resources, presenting papers about applications and systems, tools and theory, and everything in between. This year is no exception. The seven papers accepted to the 2016 workshop cover application profiling, automatic application instrumentation, and runtime instrumentation, as well as scalable monitoring system design, and discussions of system-level monitoring on some of the largest HPC platforms in the world. As a whole, these papers give deep insight into a broad range of topics that HPC researchers and practitioners regularly need to deal with.
International Conference on Cluster Computing | 2015
James M. Brandt; Ann C. Gentile; Cindy Martin; Jason Repik; Narate Taerat
Disentangling significant and important log messages from those that are routine and unimportant can be a difficult task. Further, on a new system, understanding correlations between significant, and possibly new, types of messages and the conditions that cause them can require significant effort and time. The initial standup of a machine provides opportunities for investigating the parameter space of events and operations and thus for gaining insight into the events of interest. In particular, failure inducement and investigation of corner-case conditions can provide knowledge of system behavior for significant issues, enabling easier diagnosis and mitigation of such issues should they occur during the platform's lifetime. In this work, we describe the testing process and monitoring results from a testbed system in preparation for the ACES Trinity system. We describe how events in the initial standup, including changes in configuration and software and corner-case testing, have provided insights that can inform future monitoring and operating conditions, both of our test systems and of the eventual large-scale Trinity system.
International Conference on Cluster Computing | 2009
Narate Taerat; Chokchai Leangsuksun
Transient failures in large-scale HPC systems are increasing significantly due to the large number of components. Fault tolerance mechanisms exist, but each invocation imposes additional overhead on the application. Thus, failure prediction is needed in order to mitigate such events gracefully and to minimize use of these mechanisms. However, the proficiency metrics used for HPC failure prediction are borrowed from related fields, mainly statistics, data mining and information theory. Some of them fit well in some respects, but none of them consider the computing time lost due to prediction error. Thus, we present a study of the shortcomings of existing metrics and introduce additional metrics that capture the lost-computing-time perspective, to be used together with existing metrics in judging HPC failure prediction proficiency.