Omer Subasi
Polytechnic University of Catalonia
Publications
Featured research published by Omer Subasi.
Computing Frontiers | 2015
Omer Subasi; Javier Arias; Osman S. Unsal; Jesús Labarta; Adrián Cristal
In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. As complete replication of all application tasks can be prohibitive due to resource costs, we introduce a programmer-directed selective replication mechanism that provides fault tolerance while decreasing costs. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.
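As a rough illustration of the mechanism (a minimal sketch, not the paper's OmpSs implementation), the fragment below runs a selected task twice on a checkpointed copy of its inputs, detects an SDC by comparing the two results, and corrects it by re-executing from the checkpoint; the task granularity, comparison, and retry policy are assumptions made for the example.

```python
# Minimal sketch of selective task replication with a task-local checkpoint.
# Granularity, result comparison, and single-retry policy are illustrative assumptions.
import copy

def run_with_sdc_protection(task, inputs):
    checkpoint = copy.deepcopy(inputs)          # task-local checkpoint of the inputs
    first = task(copy.deepcopy(checkpoint))
    second = task(copy.deepcopy(checkpoint))    # replica execution of the same task
    if first == second:
        return first                            # replicas agree: accept the result
    return task(copy.deepcopy(checkpoint))      # replicas disagree: recover by re-execution

# Example: protect one (e.g. programmer-marked) task.
result = run_with_sdc_protection(lambda xs: sum(x * x for x in xs), [1.0, 2.0, 3.0])
print(result)
```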
Cluster Computing and the Grid | 2016
Omer Subasi; Sheng Di; Leonardo Bautista-Gomez; Prasanna Balaprakash; Osman S. Unsal; Jesús Labarta; Adrian Cristal; Franck Cappello
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budgets introduces significant reliability challenges. Silent data corruptions (SDCs), or silent errors, are one of the major sources of corrupted execution results in HPC applications that go undetected. In this work, we explore a low-memory-overhead SDC detector that leverages epsilon-insensitive support vector machine regression to detect SDCs occurring in HPC applications whose impact can be characterized by an error bound. The key contributions are threefold. (1) Our design takes spatial features (i.e., the neighbouring data values of each data point in a snapshot) as training data, so that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study of the detection ability and performance under different parameters, and we carefully optimize the detection range. (3) Experiments with eight real-world HPC applications show that our detector achieves detection sensitivity (i.e., recall) of up to 99% while incurring a false positive rate of less than 1% in most cases. Our detector incurs low performance overhead, 5% on average, across all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best tradeoff between detection ability and overheads.
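To make the idea concrete (a hedged sketch, not the authors' implementation), the example below trains an epsilon-insensitive support vector regressor on the spatial neighbours of each point and flags a value as a suspected SDC when it deviates from the prediction by more than a tolerance; the stencil width, tolerance, and synthetic data are illustrative assumptions.

```python
# Sketch of a spatial-feature SDC detector; window size, tolerance, and the
# synthetic sine field below are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.svm import SVR

def spatial_features(field, i):
    # Neighbouring values of point i (1-D stencil of width 2, excluding the point itself).
    return [field[i - 2], field[i - 1], field[i + 1], field[i + 2]]

field = np.sin(np.linspace(0, 4 * np.pi, 1000))            # stand-in for a data snapshot
X = np.array([spatial_features(field, i) for i in range(2, len(field) - 2)])
y = field[2:-2]

model = SVR(kernel="rbf", epsilon=0.01).fit(X, y)           # epsilon-insensitive regression

def is_suspected_sdc(field, i, tolerance=0.05):
    """Flag point i if it deviates from its spatial prediction by more than tolerance."""
    pred = model.predict([spatial_features(field, i)])[0]
    return abs(field[i] - pred) > tolerance

corrupted = field.copy()
corrupted[500] += 1.0                                        # inject a silent error
print(is_suspected_sdc(corrupted, 500))                      # expected: True
```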
International Conference on Cluster Computing | 2015
Tatiana V. Martsinkevich; Omer Subasi; Osman S. Unsal; Franck Cappello; Jesús Labarta
We present a fault-tolerant protocol for task-parallel message-passing applications to mitigate transient errors. The protocol requires the restart of only the task that experienced the error and transparently handles any MPI calls inside the task. The protocol is implemented in Nanos, a dataflow runtime for the task-based OmpSs programming model, and in the PMPI profiling layer to fully support hybrid OmpSs+MPI applications. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks.
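The core idea, message logging so that only the failed task needs to replay, can be sketched as follows (a simplified, self-contained illustration; the in-memory log, task identifiers, and failure injection are assumptions, not the actual Nanos/PMPI implementation).

```python
# Sketch of task-level message logging and replay. The log structure and restart
# trigger are illustrative assumptions, not the framework's protocol.
class LoggedChannel:
    def __init__(self):
        self.log = {}          # (task_id, seq) -> payload received so far
        self.seq = {}          # task_id -> index of the next expected message

    def recv(self, task_id, do_recv):
        """Return the next message for task_id, replaying from the log after a restart."""
        i = self.seq.get(task_id, 0)
        if (task_id, i) not in self.log:
            self.log[(task_id, i)] = do_recv()   # first execution: really receive and log
        self.seq[task_id] = i + 1
        return self.log[(task_id, i)]            # on replay: serve the logged payload

    def restart(self, task_id):
        self.seq[task_id] = 0                    # only the failed task rewinds and replays

channel = LoggedChannel()
incoming = iter([10, 20, 30])

def task():
    return channel.recv("t1", lambda: next(incoming)) + channel.recv("t1", lambda: next(incoming))

first_try = task()           # consumes and logs messages 10 and 20
channel.restart("t1")        # emulate a transient error: restart only this task
print(task() == first_try)   # replayed messages give the same result: expected True
```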
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2017
Omer Subasi; Gulay Yalcin; Ferad Zyulkyarov; Osman S. Unsal; Jesús Labarta
Fail-stop errors and silent data corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs; however, few studies address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications that targets both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user.
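One way to see why selective replication can meet a reliability target (a simplified independence model assumed here for illustration, not necessarily the paper's analysis): if task $i$ suffers an SDC with probability $p_i$ and replicated tasks detect and correct their own errors, then replicating a subset $R$ of tasks leaves an application-level SDC probability of roughly

\[
P_{\mathrm{SDC}} \;\approx\; 1 - \prod_{i \notin R} (1 - p_i),
\]

so the runtime only needs to grow $R$ until this probability falls below the target, rather than replicating every task.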
High Performance Computing and Communications | 2015
Omer Subasi; Ferad Zyulkyarov; Osman S. Unsal; Jesús Labarta
State-of-the-art checkpointing techniques are projected to be prohibitively expensive in the Exascale era. These techniques are most often holistic in nature, which prevents them from leveraging programming-model- and paradigm-specific advantages that could keep them viable for the Exascale era. In this work, we present a unified, non-hierarchical model that combines uncoordinated checkpointing with coordinated system-wide checkpointing to capitalize on programming-model-specific advantages. In our analytical assessment we develop closed-form formulas for the performance improvement and the optimal checkpoint interval of the unified model. As an instantiation of our model, we propose to unify task-level checkpointing with a system-wide checkpointing scheme for task-parallel HPC applications. This instantiation has three distinct advantages: first, it reduces performance overheads by decreasing the frequency of checkpoints in the unified system; second, it features fast failure recovery by using in-memory task-local checkpoints instead of on-disk global checkpoints; and third, it does not compromise the high failure coverage typical of system-wide checkpointing.
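For context, the classical single-level result that analyses of this kind build on (a well-known approximation, not the paper's unified formula) is the Young/Daly optimal checkpoint interval for a constant checkpoint cost $C$ and mean time between failures $M$:

\[
\tau_{\mathrm{opt}} \;\approx\; \sqrt{2\,C\,M},
\]

and the unified model extends this style of closed-form analysis to the combination of task-level and system-wide checkpoints.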
International Journal of High Performance Computing Applications | 2018
Omer Subasi; Tatiana V. Martsinkevich; Ferad Zyulkyarov; Osman S. Unsal; Jesús Labarta; Franck Cappello
We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that requires the restart of only the task that experienced the error and transparently handles any MPI calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Second, we develop a mathematical model that unifies task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed-form formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.
International Conference on Cluster Computing | 2016
Omer Subasi; Gulay Yalcin; Ferad Zyulkyarov; Osman S. Unsal; Jesús Labarta
In this paper we propose a runtime-based selective task replication technique for task-parallel high-performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, the compiler, or the application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, the results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
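The selection logic below is a minimal sketch of a reliability-target-driven heuristic in the spirit of App_FIT; the FIT accounting, cost metric, and greedy ordering are illustrative assumptions rather than the published algorithm.

```python
# Sketch of reliability-target-driven task selection. The per-task FIT contributions,
# replication costs, and greedy ranking are illustrative assumptions, not the paper's heuristic.
def select_tasks_to_replicate(tasks, target_fit):
    """tasks: list of (task_id, fit_contribution, replication_cost).
    Greedily replicate the tasks that remove the most FIT per unit cost
    until the application's residual FIT meets the target."""
    residual_fit = sum(fit for _, fit, _ in tasks)
    replicated = set()
    # Highest FIT reduction per unit replication cost first.
    for task_id, fit, cost in sorted(tasks, key=lambda t: t[1] / t[2], reverse=True):
        if residual_fit <= target_fit:
            break
        replicated.add(task_id)
        residual_fit -= fit               # a replicated task's errors are detected/corrected
    return replicated

tasks = [("dgemm", 40.0, 2.0), ("reduce", 5.0, 0.5), ("halo", 15.0, 1.0)]
print(select_tasks_to_replicate(tasks, target_fit=10.0))   # expected: dgemm and halo
```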
International Parallel and Distributed Processing Symposium | 2016
Omer Subasi; Osman S. Unsal; Jesús Labarta; Gulay Yalcin; Adrian Cristal
Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and the number of memory devices in Exascale systems. For memory systems, error-correcting codes (ECCs) are the most commonly used protection mechanism. However, state-of-the-art hardware ECCs will not provide sufficient error coverage for future computing systems, and stronger hardware ECCs that provide more coverage have prohibitive costs in terms of area, power, and latency. Software-based solutions are needed to cooperate with the hardware. In this work, we propose a software mechanism based on cyclic redundancy checks (CRCs) for task-parallel HPC applications. Our mechanism incurs only 1.7% performance overhead with hardware acceleration while remaining highly scalable. Our mathematical analysis demonstrates the effectiveness of our scheme and its error coverage. Results show that our CRC-based mechanism reduces memory vulnerability by 87% on average, with up to 32-bit burst (consecutive) and 5-bit arbitrary error correction capability.
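The detection side of such a scheme can be sketched as follows (checksum placement and granularity are illustrative assumptions, and zlib's CRC-32 stands in for a hardware-accelerated CRC; the correction capability described in the paper requires machinery beyond this sketch).

```python
# Sketch of CRC protection over a task's memory block; granularity and the
# bit-flip injection are illustrative assumptions.
import zlib
import numpy as np

def protect(block: np.ndarray) -> int:
    """Compute a CRC over a task's memory block before it goes idle."""
    return zlib.crc32(block.tobytes())

def verify(block: np.ndarray, crc: int) -> bool:
    """Re-check the CRC before the task's data is consumed again."""
    return zlib.crc32(block.tobytes()) == crc

data = np.arange(1024, dtype=np.float64)
crc = protect(data)

buf = bytearray(data.tobytes())
buf[800] ^= 0x08                              # flip one bit to emulate a memory error
corrupted = np.frombuffer(bytes(buf), dtype=np.float64)
print(verify(corrupted, crc))                 # expected: False -- corruption detected
```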
International Conference on Cluster Computing | 2017
Omer Subasi; Sriram Krishnamoorthy
In this paper, we present a non-parametric, data-analytic soft-error detector. Our detector uses key properties of Gaussian process regression. First, because Gaussian process regression provides confidence on its predictions, this confidence can be used to automate construction of the detection range. Second, because the correlation model of a Gaussian process captures the similarity among neighboring point values, only one-time online training is needed. This leads to very low online performance overheads. Finally, Gaussian process regression localizes the detection-range computation, thereby avoiding communication costs. We compare our detector with the adaptive impact-driven (AID) and spatial support-vector-machine (SSD) detectors, two effective detectors based on observation of the temporal and spatial evolution of data, respectively. Experiments with five failure distributions and six real-world high-performance computing applications reveal that the Gaussian-process-based detector achieves a low false positive rate and high recall while incurring less than 0.1% performance and memory overheads. Considering detection performance and overheads, our Gaussian process detector provides the best trade-off.
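The sketch below shows how a Gaussian process's predictive confidence can define a detection range (kernel choice, confidence multiplier, tolerance floor, and the one-dimensional data are illustrative assumptions, not the paper's configuration).

```python
# Sketch of a Gaussian-process-based detection range; kernel, multiplier k, the
# small tolerance floor, and the 1-D sine data are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

x = np.linspace(0, 10, 200).reshape(-1, 1)
field = np.sin(x).ravel()

# One-time training on an uncorrupted snapshot (spatial correlation model).
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(x, field)

def detection_range(xi, k=4.0, floor=1e-3):
    """Predicted value +/- (k standard deviations + floor) defines the admissible range."""
    mean, std = gp.predict(np.array([[xi]]), return_std=True)
    half_width = k * std[0] + floor        # floor avoids a degenerate zero-width range
    return mean[0] - half_width, mean[0] + half_width

lo, hi = detection_range(5.0)
print(lo <= np.sin(5.0) <= hi)             # expected: True  -- a clean value is in range
print(lo <= np.sin(5.0) + 1.0 <= hi)       # expected: False -- a corrupted value is flagged
```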
International Conference on Cluster Computing | 2017
Omer Subasi; Gokcen Kestor; Sriram Krishnamoorthy
Checkpoint/restart has been widely used to cope with fail-stop errors. The checkpointing frequency is most often optimized by assuming an exponential failure distribution. However, field studies show that failures usually do not follow an exponential distribution with a constant failure rate. Therefore, the optimal checkpointing frequency should be computed and tuned for the distributions that failures actually follow. Moreover, due to operating system and input/output jitter, and due to hybrid solutions that combine checkpointing with other techniques such as data compression, checkpointing time can no longer be assumed constant. Thus, time-varying checkpointing time should be accounted for to realistically model application execution. In this study, we develop a mathematical theory and model to optimize the checkpointing frequency with respect to arbitrary failure distributions while capturing time-dependent, non-constant checkpointing time. We show that we can provide closed-form formulas for important failure distributions in most cases. By instantiating our model, we study and analyze 10 important failure distributions to obtain the optimal checkpointing frequency for each of them. Experimental evaluation shows that our model is highly accurate and deviates from simulation results by less than 1% on average.
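As a reference for the shape of such an optimization (a standard renewal-style formulation assumed here, not quoted from the paper), the checkpoint interval $\tau$ is chosen to minimize the expected wall-clock time spent per unit of useful work,

\[
\tau_{\mathrm{opt}} \;=\; \arg\min_{\tau} \; \frac{\mathbb{E}\left[T(\tau)\right]}{\tau},
\]

where $T(\tau)$ is the time to successfully complete a segment of $\tau$ units of work plus its checkpoint of possibly time-varying cost, and the expectation is taken over the failure distribution; the paper reports closed-form solutions of this kind for most of the ten distributions it studies.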