Publication


Featured research published by Mihaela Paun.


International Parallel and Distributed Processing Symposium | 2008

An optimal checkpoint/restart model for a large scale high performance computing system

Yudan Liu; Raja Nassar; Chokchai Leangsuksun; Nichamon Naksinehaboon; Mihaela Paun; Stephen L. Scott

The increase in the physical size of high performance computing (HPC) platforms makes system reliability more challenging. In order to minimize the performance loss (rollback and checkpoint overheads) due to unexpected failures or the unnecessary overhead of fault tolerance mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy. Our scheme aims to address the fault tolerance challenge, especially in a large-scale HPC system, by providing optimal checkpoint placement techniques that are derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failures and a constant checkpoint interval, our model can deal with a varying checkpoint interval and with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.
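
As a constant-interval point of reference for the setting this model generalizes, Young's classical approximation places the optimal checkpoint interval near sqrt(2 * C * M), where C is the checkpoint overhead and M the mean time between failures. The sketch below evaluates only that baseline with hypothetical numbers; it is not the reliability-aware placement derived in the paper.

    import math

    def young_interval(checkpoint_overhead_s, mtbf_s):
        """First-order optimal constant checkpoint interval (Young's approximation)."""
        return math.sqrt(2.0 * checkpoint_overhead_s * mtbf_s)

    # Hypothetical values: 5-minute checkpoint overhead, 24-hour system MTBF.
    tau = young_interval(5 * 60.0, 24 * 3600.0)
    print(f"baseline constant interval: {tau / 60:.1f} minutes")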


Cluster Computing and the Grid | 2008

Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

Nichamon Naksinehaboon; Yudan Liu; Chokchai Leangsuksun; Raja Nassar; Mihaela Paun; Stephen L. Scott

For a full checkpoint on a large-scale HPC system, huge memory contexts must potentially be transferred through the network and saved in reliable storage. As such, the time taken to checkpoint becomes a critical issue that directly impacts the total execution time. Therefore, incremental checkpointing, as a less intrusive method to reduce the wasted time, has been gaining significant attention in the HPC community. In this paper, we build a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints. Moreover, a method to find the number of those incremental checkpoints is given. Furthermore, most of the comparison results between the incremental checkpoint model and the full checkpoint model (Liu et al., 2007) on the same failure data set show that the total wasted time in the incremental checkpoint model is significantly smaller than that in the full checkpoint model.
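
As a rough picture of the trade-off this model formalizes, placing m incremental checkpoints (overhead c_incr each) between two consecutive full checkpoints (overhead c_full) raises the per-cycle checkpoint cost to c_full + m * c_incr while shrinking the work lost to a rollback. The sketch below searches for m under a deliberately crude rollback proxy; the cost terms, the proxy, and all numbers are hypothetical placeholders rather than the paper's model.

    def cycle_overhead(m, c_full, c_incr):
        """Checkpoint overhead spent in one full-to-full cycle with m incremental checkpoints."""
        return c_full + m * c_incr

    def expected_rollback(m, cycle_len, failure_prob):
        """Crude rollback proxy: on failure, lose on average half of one sub-interval."""
        return failure_prob * (cycle_len / (m + 1)) / 2.0

    def best_m(cycle_len, c_full, c_incr, failure_prob, m_max=50):
        """Pick the m that minimizes overhead plus expected rollback (illustrative only)."""
        return min(range(m_max + 1),
                   key=lambda m: cycle_overhead(m, c_full, c_incr)
                   + expected_rollback(m, cycle_len, failure_prob))

    # Hypothetical numbers: 1-hour cycle, 300 s full checkpoint, 30 s incremental checkpoint,
    # and a 10% chance of a failure somewhere in the cycle.
    print(best_m(cycle_len=3600.0, c_full=300.0, c_incr=30.0, failure_prob=0.1))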


IEEE Transactions on Reliability | 2010

Reliability of a System of k Nodes for High Performance Computing Applications

Narasimha Raju Gottumukkala; Raja Nassar; Mihaela Paun; Chokchai Leangsuksun; Stephen L. Scott

Reliability estimation of High Performance Computing (HPC) systems enables resource allocation and fault tolerance frameworks to minimize the performance loss due to unexpected failures. Recent studies have shown that compute nodes in HPC systems follow a time-varying failure rate distribution, such as the Weibull, instead of the exponential distribution. In this paper, we propose a model for the Time to Failure (TTF) distribution of a system of k s-independent nodes when individual nodes exhibit time-varying failure rates. We also present the system reliability, failure rates, Mean Time to Failure (MTTF), and derivations of the proposed system TTF model. The model is validated using observed data on time to failure.
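
For the special case of k statistically identical, s-independent nodes whose individual times to failure are Weibull with shape a and scale b, the first failure among the k nodes is again Weibull distributed, with R_sys(t) = exp(-k (t/b)^a) and MTTF_sys = b * k^(-1/a) * Gamma(1 + 1/a). The snippet below evaluates this textbook special case with made-up parameters as a sanity check on the kind of quantity modeled here; the paper's derivation is more general.

    import math

    def system_mttf(k, shape, scale):
        """MTTF of the first failure among k i.i.d. Weibull(shape, scale) nodes."""
        return scale * k ** (-1.0 / shape) * math.gamma(1.0 + 1.0 / shape)

    def system_reliability(t, k, shape, scale):
        """Probability that none of the k nodes has failed by time t."""
        return math.exp(-k * (t / scale) ** shape)

    # Hypothetical parameters: 512 nodes, shape 0.7 (decreasing failure rate), scale 2000 hours.
    print(system_mttf(512, 0.7, 2000.0))
    print(system_reliability(24.0, 512, 0.7, 2000.0))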


International Conference on Cluster Computing | 2007

A reliability-aware approach for an optimal checkpoint/restart model in HPC environments

Yudan Liu; Raja Nassar; Chokchai Leangsuksun; Nichamon Naksinehaboon; Mihaela Paun; Stephen L. Scott

The increase in the physical size of high performance computing (HPC) platforms makes system reliability more challenging. In order to minimize the performance loss due to unexpected failures or the unnecessary overhead of fault tolerance mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy aimed at minimizing rollback and checkpoint overheads. Our scheme addresses the fault tolerance challenge, especially in a large-scale HPC system, by providing optimal checkpoint placement techniques that are derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failures and a constant checkpoint interval, our model can handle a varying checkpoint interval and deal with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.
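
Since this is essentially the conference version of the model above, the sketch here covers only the evaluation side: a small Monte Carlo estimate of the wasted time (checkpoint overhead plus rollback) incurred by a fixed constant interval under Weibull node failures, which is useful for comparing interval choices. The simulation structure and every parameter are hypothetical; this is not the optimization procedure from the paper.

    import random

    def average_waste(interval, overhead, work, shape, scale, trials=2000):
        """Monte Carlo estimate of the time lost to checkpoint overhead and rollback
        while completing `work` seconds of useful computation (all units in seconds)."""
        total_waste = 0.0
        for _ in range(trials):
            remaining, waste, age = work, 0.0, 0.0
            ttf = random.weibullvariate(scale, shape)   # next failure, measured from node restart
            while remaining > 0.0:
                segment = min(interval, remaining) + overhead
                if age + segment <= ttf:                # segment and its checkpoint complete
                    remaining -= min(interval, remaining)
                    waste += overhead
                    age += segment
                else:                                   # failure: lose everything since the last checkpoint
                    waste += ttf - age
                    age = 0.0
                    ttf = random.weibullvariate(scale, shape)
            total_waste += waste
        return total_waste / trials

    # Hypothetical workload: 24 h of work, 5-minute checkpoints, Weibull(shape 0.7, scale 50 h) failures.
    for tau in (1800.0, 3600.0, 7200.0):
        print(tau, average_waste(tau, 300.0, 24 * 3600.0, 0.7, 50 * 3600.0))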


Availability, Reliability and Security | 2012

An Economic Model for Maximizing Profit of a Cloud Service Provider

Thanadech Thanakornworakij; Raja Nassar; Chokchai Leangsuksun; Mihaela Paun

For Infrastructure-as-a-Service, cloud service providers such as Amazon EC2 and Rackspace allow users to lease their computing resources over the Internet while investing their money in developing and maintaining the infrastructure. Hence, maximizing profit, right pricing, and rightsizing are vital elements of their business. To address these issues, we propose in this article an economic model for cloud service providers that can be used to maximize profit based on right pricing and rightsizing in the Cloud data centre. Total cost is a key element in the model, and it is analyzed by considering the Total Cost of Ownership (TCO) of the Cloud.
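
A minimal sketch of the kind of profit calculation such a model supports: revenue from leased VM-hours minus a TCO built from amortized capital and recurring operating costs. Every cost category and figure below is a hypothetical placeholder; the paper's TCO analysis is considerably more detailed.

    def monthly_tco(servers, server_price, amortization_months, power_cooling_per_server,
                    staff_and_facility):
        """Total cost of ownership per month: amortized hardware plus operating expenses."""
        capex = servers * server_price / amortization_months
        opex = servers * power_cooling_per_server + staff_and_facility
        return capex + opex

    def monthly_profit(servers, vms_per_server, price_per_vm_hour, utilization, tco):
        """Revenue from leased VM-hours minus the total cost of ownership."""
        hours = 30 * 24
        revenue = servers * vms_per_server * utilization * price_per_vm_hour * hours
        return revenue - tco

    # Hypothetical provider: 1000 servers, 8 VMs each, $0.10 per VM-hour, 60% utilization.
    tco = monthly_tco(1000, 6000.0, 36, 150.0, 80000.0)
    print(monthly_profit(1000, 8, 0.10, 0.6, tco))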


International Journal of Foundations of Computer Science | 2010

Incremental Checkpoint Schemes for Weibull Failure Distribution

Mihaela Paun; Nichamon Naksinehaboon; Raja Nassar; Chokchai Leangsuksun; Stephen L. Scott; Narate Taerat

The incremental checkpoint mechanism was introduced to reduce the high checkpoint overhead of regular (full) checkpointing, especially in high-performance computing systems. To gain an extra advantage from the incremental checkpoint technique, we propose an optimal checkpoint frequency function that globally minimizes the expected wasted time of the incremental checkpoint mechanism. Also, the re-computing time coefficient used to approximate the re-computing time is derived. Moreover, to reduce the complexity of the recovery state, full checkpoints are performed from time to time. In this paper we present an approach to evaluate the appropriate constant number of incremental checkpoints between two consecutive full checkpoints. Although the number of incremental checkpoints is constant, the checkpoint interval derived from the proposed model varies depending on the failure rate of the system. The checkpoint time is illustrated in the case of a Weibull distribution and can easily be simplified to the exponential case.
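
One way to picture a checkpoint schedule whose frequency follows the failure rate, as in the varying-interval behavior described above, is to space checkpoints evenly in the cumulative Weibull hazard H(t) = (t/b)^a, so that each interval carries the same expected number of failures. This is a hedged illustration of a hazard-driven schedule, not the optimal frequency function derived in the paper; the parameters are invented.

    def hazard_spaced_checkpoints(n, horizon, shape, scale):
        """Checkpoint times that split the cumulative Weibull hazard H(t) = (t/scale)**shape
        into n equal slices over [0, horizon]."""
        total_hazard = (horizon / scale) ** shape
        return [scale * (i * total_hazard / n) ** (1.0 / shape) for i in range(1, n + 1)]

    # Hypothetical: 10 checkpoints over 24 h, Weibull shape 0.7 (front-loaded failure risk).
    times = hazard_spaced_checkpoints(10, 24 * 3600.0, 0.7, 2000 * 3600.0)
    print([round(t / 3600, 2) for t in times])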


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2009

A tunable holistic resiliency approach for high-performance computing systems

Stephen L. Scott; Christian Engelmann; Geoffroy Vallée; Thomas Naughton; Anand Tikotekar; George Ostrouchov; Chokchai Leangsuksun; Nichamon Naksinehaboon; Raja Nassar; Mihaela Paun; Frank Mueller; Chao Wang; Arun Babu Nagarajan; Jyothish Varma

In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.


IEEE International Conference on High Performance Computing, Data and Analytics | 2013

Reliability model of a system of k nodes with simultaneous failures for high-performance computing applications

Thanadech Thanakornworakij; Raja Nassar; Chokchai Leangsuksun; Mihaela Paun

A high-performance computing (HPC) system, which is composed of a large number of components, is prone to failure. To maximize HPC system utilization, one should understand the failure behavior and the reliability of the system. Studies in the literature show that the time to failure of a node is best described by a Weibull distribution. In this study, we consider, without loss of generality, the Weibull as the distribution of time to failure and develop a reliability model for a system of k nodes where nodes can fail simultaneously. From this model, we develop expressions for the probability of failure of the system at any time t, for the failure rate, and for the mean time to failure. Also, we validate the model by using failure data from the Blue Gene/L logs obtained from the Lawrence Livermore National Laboratory. Results show that if failures of the components (nodes) in the system possess a degree of dependency, the system becomes less reliable, which means that the failure rate increases and the mean time to failure decreases. Also, an increase in the number of nodes decreases the reliability of the system.
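
A small Monte Carlo harness in the spirit of this kind of validation: estimate the MTTF of a k-node system as the mean of the earliest node failure produced by a joint sampler of node lifetimes. The independent Weibull sampler below, with made-up parameters, is only the baseline case; a dependent sampler can be substituted to study how simultaneous failures change the estimate, which is the situation the paper models analytically.

    import random

    def system_mttf(joint_sampler, trials=20000):
        """Monte Carlo MTTF of a system that needs every node: the system fails at the
        earliest node failure produced by the joint lifetime sampler."""
        return sum(min(joint_sampler()) for _ in range(trials)) / trials

    # Independent baseline with made-up parameters: 64 nodes, each Weibull(shape 0.8, scale 3000 h).
    independent = lambda: [random.weibullvariate(3000.0, 0.8) for _ in range(64)]
    print(system_mttf(independent))
    # A sampler that injects simultaneous failures across nodes can be plugged in here
    # to see how dependency moves the estimate.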


IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications Workshops | 2011

The Effect of Correlated Failure on the Reliability of HPC Systems

Thanadech Thanakornworakij; Raja Nassar; Chokchai Leangsuksun; Mihaela Paun

High Performance Computing (HPC) system utilization can be maximized and sustained if one understands the failure behavior. The Time to Failure (TTF) of HPC systems has long been studied, and the Weibull distribution has been shown to give the best fit. In addition, in many cases, the TTF of such systems exhibits correlations. In our previous study, we developed a reliability model of an HPC system where failures among nodes are independent. However, some studies have clearly shown that in some cases nodes do not fail independently of one another. Therefore, it is important to develop a reliability model for an HPC system based on the occurrence of simultaneous failures. In this paper, we develop such a model and derive expressions for the probability density function of time to failure, system reliability, system failure rate, and mean time to failure (MTTF). Results show that if the failure of the components (nodes) in the system possesses a degree of dependency, the system reliability decreases.


The Journal of Supercomputing | 2014

Reliability-aware performance model for optimal GPU-enabled cluster environment

Supada Laosooksathit; Raja Nassar; Chokchai Leangsuksun; Mihaela Paun

Given that the reliability of a very large-scale system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs, and these may cause a significant performance drop if they are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. This work focuses on a checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted. Furthermore, the effect of a checkpoint/restart mechanism on application performance is thoroughly studied and discussed.
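
One concrete cost component in the GPU setting is that checkpointing device state first requires draining it over PCIe to host memory before it can be written to stable storage. The sketch below builds a per-checkpoint overhead from hypothetical transfer bandwidths and feeds it into the classical constant-interval approximation; it illustrates the kind of trade-off studied here and is not the paper's scheduling model.

    import math

    def gpu_checkpoint_overhead(device_state_gb, pcie_gb_per_s, storage_gb_per_s):
        """Seconds to copy GPU state to host over PCIe and then write it to stable storage."""
        return device_state_gb / pcie_gb_per_s + device_state_gb / storage_gb_per_s

    def constant_interval(overhead_s, mtbf_s):
        """Young-style first-order optimal constant checkpoint interval."""
        return math.sqrt(2.0 * overhead_s * mtbf_s)

    # Hypothetical: 16 GB of device state, 12 GB/s effective PCIe, 1 GB/s storage, 12 h node MTBF.
    overhead = gpu_checkpoint_overhead(16.0, 12.0, 1.0)
    print(overhead, constant_interval(overhead, 12 * 3600.0))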

Collaboration


Dive into Mihaela Paun's collaboration.

Top Co-Authors

Raja Nassar, Louisiana Tech University

Stephen L. Scott, Oak Ridge National Laboratory

Yudan Liu, Louisiana Tech University

Andrei Paun, Louisiana Tech University

Anand Tikotekar, Oak Ridge National Laboratory

Arun Babu Nagarajan, North Carolina State University