Chokchai Leangsuksun | Researchain

Publication

Featured research published by Chokchai Leangsuksun.

IEEE Computer | 1994

ASC: an associative-computing paradigm

Jerry L. Potter; Johnnie W. Baker; Stephen L. Scott; Arvind K. Bansal; Chokchai Leangsuksun; Chandra R. Asthagiri

Today's increased computing speeds allow conventional sequential machines to effectively emulate associative computing techniques. We present a parallel programming paradigm called ASC (ASsociative Computing), designed for a wide range of computing engines. Our paradigm has an efficient associative-based, dynamic memory-allocation mechanism that does not use pointers. It incorporates data parallelism at the base level, so that programmers do not have to specify low-level sequential tasks such as sorting, looping and parallelization. Our paradigm supports all of the standard data-parallel and massively parallel computing algorithms. It combines numerical computation (such as convolution, matrix multiplication, and graphics) with nonnumerical computing (such as compilation, graph algorithms, rule-based systems, and language interpreters). This article focuses on the nonnumerical aspects of ASC.

International Parallel and Distributed Processing Symposium | 2008

An optimal checkpoint/restart model for a large scale high performance computing system

Yudan Liu; Raja Nassar; Chokchai Leangsuksun; Nichamon Naksinehaboon; Mihaela Paun; Stephen L. Scott

The increase in the physical size of high performance computing (HPC) platforms makes system reliability more challenging. In order to minimize the performance loss (rollback and checkpoint overheads) due to unexpected failures or unnecessary overhead of fault tolerant mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy. Our scheme aims at addressing the fault tolerance challenge, especially in a large-scale HPC system, by providing optimal checkpoint placement techniques that are derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failures and a constant checkpoint interval, our model can deal with a varying checkpoint interval and with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.
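
For context on the "existing checkpoint models" with Poisson failures and a constant interval that the abstract contrasts against, one widely cited first-order result is Young's approximation, T ≈ sqrt(2 · C · M), where C is the cost of one checkpoint and M is the mean time between failures. The sketch below illustrates only that baseline and is not the reliability-aware model of the paper; the function names and sample values are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order constant checkpoint interval.

    Assumes exponentially distributed failures (a Poisson failure process)
    and a checkpoint cost that is small relative to the MTBF.
    Both arguments must use the same time unit.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def waste_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time lost for a given constant interval.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : expected rework after a failure, which on
                                 average strikes halfway through an interval
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # hypothetical values, in seconds
    best = young_interval(cost, mtbf)
    print(f"interval = {best:.0f} s, waste = {waste_fraction(best, cost, mtbf):.3f}")
```

Under this first-order waste estimate, intervals shorter or longer than the returned value both lose more time, which is the trade-off between checkpoint overhead and rollback time that the abstract refers to.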

Cluster Computing and the Grid | 2008

Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

Nichamon Naksinehaboon; Yudan Liu; Chokchai Leangsuksun; Raja Nassar; Mihaela Paun; Stephen L. Scott

For a full checkpoint on a large-scale HPC system, huge memory contexts must potentially be transferred through the network and saved in reliable storage. As such, the time taken to checkpoint becomes a critical issue that directly impacts the total execution time. Therefore, incremental checkpointing, as a less intrusive method to reduce the wasted time, has been gaining significant attention in the HPC community. In this paper, we built a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints. Moreover, a method to find the number of those incremental checkpoints is given. Furthermore, most of the comparison results between the incremental checkpoint model and the full checkpoint model (Liu et al., 2007) on the same failure data set show that the total wasted time in the incremental checkpoint model is significantly smaller than the wasted time in the full checkpoint model.

Availability, Reliability and Security | 2008

A Framework for Proactive Fault Tolerance

Geoffroy Vallée; Christian Engelmann; Anand Tikotekar; Thomas Naughton; K. Charoenpornwattana; Chokchai Leangsuksun; Stephen L. Scott

Fault tolerance is a major concern in guaranteeing the availability of critical services as well as application execution. Traditional approaches for fault tolerance include checkpoint/restart or duplication. However, it is also possible to anticipate failures and proactively take action before failures occur in order to minimize failure impact on the system and application execution. This document presents a proactive fault tolerance framework. This framework can use different proactive fault tolerance mechanisms, i.e., migration and pause/un-pause. The framework also allows the implementation of new proactive fault tolerance policies thanks to a modular architecture. A first proactive fault tolerance policy has been implemented, and preliminary experiments have been conducted based on system-level virtualization and compared with results obtained by simulation.

Availability, Reliability and Security | 2006

Availability modeling and analysis on high performance cluster computing systems

Hertong Song; Chokchai Leangsuksun; Raja Nassar; Narasimha Raju Gottumukkala; Stephen L. Scott

Cluster computing has been attracting more and more attention from both industry and academia for its enormous computing power, cost effectiveness, and scalability. Availability is a key system attribute that needs to be considered at the system design stage and must reflect actual system behavior. System monitoring and logging enable identifying unplanned events to reflect the actual system's availability. This paper proposes a single framework that coordinates event monitoring, filtering, data analysis and dynamic availability modeling. The availability model is abstracted and categorized based on functionality. We describe the proposed architecture and a sample analysis of real-time event logs from a 512-node cluster at Lawrence Livermore National Laboratory.
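
To make the link between logged events and availability concrete, the sketch below computes observed availability as uptime divided by total observation time, the standard empirical counterpart of MTTF / (MTTF + MTTR). It is only an illustration: the outage-interval format and function name are hypothetical and do not come from the paper's framework, and outages are assumed not to overlap.

```python
from typing import Iterable, Tuple


def observed_availability(
    window_start: float,
    window_end: float,
    outages: Iterable[Tuple[float, float]],
) -> float:
    """Fraction of the observation window during which the system was up.

    `outages` is a hypothetical list of (down_start, down_end) timestamps,
    e.g. extracted from filtered event logs. Intervals are assumed to be
    non-overlapping; each one is clipped to the observation window.
    """
    total = window_end - window_start
    if total <= 0:
        raise ValueError("window_end must be later than window_start")

    downtime = 0.0
    for down_start, down_end in outages:
        start = max(down_start, window_start)
        end = min(down_end, window_end)
        if end > start:  # ignore outages entirely outside the window
            downtime += end - start

    return (total - downtime) / total


if __name__ == "__main__":
    # Hypothetical 1000-hour window with two outages; the second one
    # extends past the end of the window and is clipped.
    log = [(100.0, 150.0), (900.0, 1100.0)]
    print(f"availability = {observed_availability(0.0, 1000.0, log):.3f}")
```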

Journal of Computers | 2006

Symmetric Active/Active High Availability for High-Performance Computing System Services

Christian Engelmann; Stephen L. Scott; Chokchai Leangsuksun; Xubin He

This work aims to pave the way for high availability in high-performance computing (HPC) by focusing on efficient redundancy strategies for head and service nodes. These nodes represent single points of failure and control for an entire HPC system as they render it inaccessible and unmanageable in case of a failure until repair. The presented approach introduces two distinct replication methods, internal and external, for providing symmetric active/active high availability for multiple redundant head and service nodes running in virtual synchrony utilizing an existing process group communication system for service group membership management and reliable, totally ordered message delivery. Presented results of a prototype implementation that offers symmetric active/active replication for HPC job and resource management using external replication show that the highest level of availability can be provided with an acceptable performance trade-off.

IEEE Transactions on Reliability | 2010

Reliability of a System of k Nodes for High Performance Computing Applications

Narasimha Raju Gottumukkala; Raja Nassar; Mihaela Paun; Chokchai Leangsuksun; Stephen L. Scott

Reliability estimation of High Performance Computing (HPC) systems enables resource allocation and fault tolerance frameworks to minimize the performance loss due to unexpected failures. Recent studies have shown that compute nodes in HPC systems follow a time-varying failure rate distribution such as Weibull, instead of the exponential distribution. In this paper, we propose a model for the Time to Failure (TTF) distribution of a system of k s-independent nodes when individual nodes exhibit time-varying failure rates. We also present the system reliability, failure rates, Mean Time to Failure (MTTF), and derivations of the proposed system TTF model. The model is validated using observed data on time to failure.
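
As a rough illustration of the quantities named above, the sketch below uses the textbook series-system result: if the system fails as soon as any one of k independent nodes fails, its reliability is the product of the node reliabilities and its failure rate is the sum of the node failure rates. The series assumption, the Weibull (shape, scale) parameterization, and all function names are assumptions made here, not details taken from the paper.

```python
import math
from typing import Sequence, Tuple

# Each node is described by hypothetical Weibull parameters (shape, scale),
# with node reliability R(t) = exp(-(t / scale) ** shape).
Node = Tuple[float, float]


def system_reliability(t: float, nodes: Sequence[Node]) -> float:
    """Probability that all independent nodes survive past time t >= 0."""
    return math.exp(-sum((t / scale) ** shape for shape, scale in nodes))


def system_failure_rate(t: float, nodes: Sequence[Node]) -> float:
    """Sum of the node hazard rates (shape/scale) * (t/scale)**(shape-1)."""
    if t <= 0:
        raise ValueError("t must be positive")
    return sum((shape / scale) * (t / scale) ** (shape - 1.0) for shape, scale in nodes)


def identical_nodes_mttf(k: int, shape: float, scale: float) -> float:
    """MTTF for k identical nodes.

    The minimum of k i.i.d. Weibull(shape, scale) lifetimes is again Weibull
    with the same shape and scale * k ** (-1 / shape), so its mean is that
    scale times Gamma(1 + 1 / shape).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return scale * k ** (-1.0 / shape) * math.gamma(1.0 + 1.0 / shape)


if __name__ == "__main__":
    nodes = [(0.8, 5000.0)] * 16  # hypothetical: 16 identical nodes, hours
    print(system_reliability(24.0, nodes))
    print(system_failure_rate(24.0, nodes))
    print(identical_nodes_mttf(16, 0.8, 5000.0))
```

With shape equal to 1 the Weibull reduces to the exponential distribution, so the same functions also cover the constant-failure-rate case that the abstract contrasts against.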

International Conference on Cluster Computing | 2007

A reliability-aware approach for an optimal checkpoint/restart model in HPC environments

Yudan Liu; Raja Nassar; Chokchai Leangsuksun; Nichamon Naksinehaboon; Mihaela Paun; Stephen L. Scott

The increase in the physical size of high performance computing (HPC) platforms makes system reliability more challenging. In order to minimize the performance loss due to unexpected failures or unnecessary overhead of fault tolerant mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy aimed at minimizing rollback and checkpoint overheads. Our scheme aims to address the fault tolerance challenge, especially in a large-scale HPC system, by providing optimal checkpoint placement techniques that are derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failures and a constant checkpoint interval, our model can accommodate a varying checkpoint interval and deal with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.

International Conference on Cluster Computing | 2007

Evaluation of fault-tolerant policies using simulation

Anand Tikotekar; Geoffroy Vallée; Thomas Naughton; Stephen L. Scott; Chokchai Leangsuksun

Various mechanisms for fault tolerance (FT) are used today in order to reduce the impact of failures on application execution. In the case of system failure, standard FT mechanisms are checkpoint/restart (for reactive FT) and migration (for pro-active FT). However, each of these mechanisms creates an overhead on application execution, an overhead that becomes critical on large-scale systems, where previous studies have shown that applications may spend more time checkpointing state than performing useful work. In order to decrease this overhead, researchers try both to optimize existing FT mechanisms and to implement new FT policies, for instance by combining reactive and pro-active approaches in order to decrease the number of checkpoints that must be performed during the application's execution. However, no solutions currently exist that enable the evaluation of these FT approaches through simulation; instead, experiments must be done using real platforms. This increases complexity and limits experimentation with alternative solutions. This paper presents a simulation framework that evaluates different FT mechanisms and policies. The framework uses system failure logs for the simulation, with a default behavior based on logs taken from the ASCI White at Lawrence Livermore National Laboratory. We evaluate the accuracy of our simulator by comparing simulated results with those taken from experiments done on a 32-node compute cluster. Such a simulator can therefore be used to develop new FT policies and/or to tune existing policies.

Cluster Computing and the Grid | 2006

Policy-Based Access Control Framework for Grid Computing

Jin Wu; Chokchai Leangsuksun; Vishal Rampure; Hong Ong

Grid technology enables access and sharing of data and computational resources across administrative domains. Thus, it is important to provide a uniform access and management mechanism coupled with fine-grain usage policies for enforcing authorization. In this paper, we describe our work on enabling fine-grain access control for resource usage and management. We describe the prototype as well as the policy mark-up language that we designed to express fine-grain security policies. We then present our experimental results and discuss our plans for future work.
