
Publication


Featured research published by Ferad Zyulkyarov.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2009

Atomic quake: using transactional memory in an interactive multiplayer game server

Ferad Zyulkyarov; Vladimir Gajinov; Osman S. Unsal; Adrián Cristal; Eduard Ayguadé; Tim Harris; Mateo Valero

Transactional Memory (TM) is being studied widely as a new technique for synchronizing concurrent accesses to shared memory data structures for use in multi-core systems. Much of the initial work on TM has been evaluated using microbenchmarks and application kernels; it is not clear whether conclusions drawn from these workloads will apply to larger systems. In this work we make the first attempt to develop a large, complex application that uses TM for all of its synchronization. We describe how we have taken an existing parallel implementation of the Quake game server and restructured it to use transactions. In doing so we have encountered examples where transactions simplify the structure of the program. We have also encountered cases where using transactions obscures the structure of the existing code. Compared with existing TM benchmarks, our workload exhibits non-block-structured transactions within which there are I/O operations and system call invocations. There are long and short running transactions (200 to 1.3M cycles) with small and large read and write sets (a few bytes to 1.5 MB). There are nested transactions reaching up to 9 levels at runtime. There are examples where error handling and recovery occur inside transactions. There are also examples where data changes between being accessed transactionally and accessed non-transactionally. However, we did not see examples where the kind of access to one piece of data depended on the value of another.
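
The restructuring described above can be pictured with a minimal sketch (not the authors' code): a coarse-grained lock around an update to shared game state is replaced by an atomic block, here written with GCC's experimental TM support (compile with -fgnu-tm); the Player/World types and the update itself are hypothetical placeholders.

```cpp
#include <mutex>

struct Player { int x = 0, y = 0, health = 100; };
struct World  { Player players[64]; };

static World world;
static std::mutex world_lock;

// Lock-based version: one coarse lock serializes all updates to the world.
void move_player_locked(int id, int dx, int dy) {
    std::lock_guard<std::mutex> g(world_lock);
    world.players[id].x += dx;
    world.players[id].y += dy;
}

// Transactional version: the critical section becomes an atomic block and
// the TM runtime detects conflicts between concurrent updates.
void move_player_tm(int id, int dx, int dy) {
    __transaction_atomic {
        world.players[id].x += dx;
        world.players[id].y += dy;
    }
}
```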


International Conference on Parallel Architectures and Compilation Techniques | 2010

Discovering and understanding performance bottlenecks in transactional applications

Ferad Zyulkyarov; Srdjan Stipic; Tim Harris; Osman S. Unsal; Adrián Cristal; Ibrahim Hur; Mateo Valero

Many researchers have developed applications using transactional memory (TM) with the purpose of benchmarking different implementations, and studying whether or not TM is easy to use. However, comparatively little has been done to provide general-purpose tools for profiling and tuning programs which use transactions.


Memory Performance: Dealing with Applications, Systems and Architecture | 2008

WormBench: a configurable workload for evaluating transactional memory systems

Ferad Zyulkyarov; Adrián Cristal; Sanja Cvijic; Eduard Ayguadé; Mateo Valero; Osman S. Unsal; Tim Harris

Transactional Memory (TM) is a promising new technology that makes it easier to write multi-threaded applications. Many different TM implementations exist; unfortunately, most of these TM systems are currently evaluated using workloads that are (1) tightly coupled to the interface of a particular TM implementation, (2) small and unable to capture the common concurrency problems that exist in real multi-threaded applications, and (3) unable to evaluate the overall behavior of the transactional memory system across the complete software stack. WormBench is a parameterized workload designed from the ground up to evaluate TM systems in terms of robustness and performance. Its goal is to provide a unified solution to the problems stated above. The critical sections in the code are marked with atomic statements, thus providing a framework to test the compiler's ability to translate them properly and efficiently into the appropriate TM system interface. Its design considers the common synchronization problems that exist in multi-threaded applications. The behavior of WormBench can be changed through run configurations, which make it possible to reproduce the runtime behavior observed in a typical multi-threaded application or a behavior that stresses a particular aspect of the TM system, such as abort handling. In this paper, we analyze the transactional characteristics of WormBench by studying different run configurations and demonstrate how WormBench can be configured to model the transactional behavior of an application from the STAMP benchmark suite.
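
As an illustration of the kind of atomic-marked critical section such a workload exposes to the compiler and TM system, here is a hedged sketch (not the WormBench source): a "worm" updates cells of a shared grid inside an atomic block. The grid size, the step operation and the shared-state layout are hypothetical stand-ins for WormBench's run-configuration parameters; the atomic block uses GCC's -fgnu-tm syntax.

```cpp
#include <cstddef>

constexpr std::size_t kWorldSize = 256;    // hypothetical run-config parameter
static int world[kWorldSize][kWorldSize];  // shared state touched by all worms

// One worm step: a read-modify-write of shared cells, marked as the atomic
// critical section the TM system must isolate from concurrent worms.
void worm_step(std::size_t x, std::size_t y, int value) {
    __transaction_atomic {
        world[y][x] += value;
        world[(y + 1) % kWorldSize][x] = world[y][x];
    }
}
```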


IEEE International Conference on High Performance Computing, Data and Analytics | 2016

Unprotected computing: a large-scale study of DRAM raw error rate on a supercomputer

Leonardo Bautista-Gomez; Ferad Zyulkyarov; Osman S. Unsal; Simon N. McIntosh-Smith

Supercomputers offer new opportunities for scientific computing as they grow in size. However, their growth also poses new challenges. Resilience has been recognized as one of the most pressing issues to solve for extreme scale computing. Transistor scaling in the single-digit nanometer era and power constraints might dramatically increase the failure rate of next generation machines. DRAM errors have been analyzed in the past for different supercomputers but those studies are usually based on job scheduler logs and counters produced by hardware-level error correcting codes. Consequently, little is known about errors escaping hardware checks, which lead to silent data corruption. This work attempts to fill that gap by analyzing memory errors for over a year on a cluster with about 1000 nodes featuring low-power memory without error correction. The study gathered millions of events recording detailed information of thousands of memory errors, many of them corrupting multiple bits. Several factors are analyzed, such as temporal and spatial correlation between errors, but also the influence of temperature and even the position of the sun in the sky. The study showed that most multi-bit errors corrupted non-adjacent bits in the memory word and that most errors flipped memory bits from 1 to 0. In addition, we observed thousands of cases of multiple single-bit errors occurring simultaneously in different regions of the memory. These new observations would not be possible by simply analyzing error correction counters on classical systems. We propose several directions in which the findings of this study can help the design of more reliable systems in the future.
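
A minimal sketch of the per-word analysis such a study performs (not code from the paper): given the pattern a memory word was expected to hold and the value actually read back, count the corrupted bits and classify each flip as 1-to-0 or 0-to-1, which is how observations like "most errors flipped bits from 1 to 0" and "most multi-bit errors hit non-adjacent bits" are derived.

```cpp
#include <bitset>
#include <cstdint>

struct FlipStats { int ones_to_zeros = 0; int zeros_to_ones = 0; };

// Compare the expected pattern with the observed value and classify the flips.
FlipStats classify_flips(std::uint64_t expected, std::uint64_t observed) {
    FlipStats s;
    const std::uint64_t diff = expected ^ observed;  // bits that changed
    s.ones_to_zeros = static_cast<int>(std::bitset<64>(diff & expected).count());  // 1 -> 0
    s.zeros_to_ones = static_cast<int>(std::bitset<64>(diff & observed).count());  // 0 -> 1
    return s;
}

// A multi-bit error corrupts more than one bit of the same word.
bool is_multi_bit(std::uint64_t expected, std::uint64_t observed) {
    return std::bitset<64>(expected ^ observed).count() > 1;
}
```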


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2017

Designing and Modelling Selective Replication for Fault-tolerant HPC Applications

Omer Subasi; Gulay Yalcin; Ferad Zyulkyarov; Osman S. Unsal; Jesús Labarta

Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs. However, few studies address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications that covers both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user.
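
The core idea can be sketched as follows (an illustration, not the authors' runtime): a task selected for replication is executed twice and its outputs compared, so a silent data corruption in either copy is detected, while unselected tasks run once; the task and data types are hypothetical placeholders.

```cpp
#include <functional>
#include <stdexcept>

using Task = std::function<double(double)>;  // hypothetical task signature

// Execute a task once (unprotected) or twice with output comparison
// (replicated), detecting an SDC when the two copies disagree.
double run_task(const Task& task, double input, bool replicate) {
    if (!replicate)
        return task(input);   // unprotected execution
    double a = task(input);   // primary copy
    double b = task(input);   // replica
    if (a != b)               // outputs disagree: a silent data corruption
        throw std::runtime_error("silent data corruption detected");
    return a;
}
```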


International Journal of Parallel Programming | 2012

Profiling and Optimizing Transactional Memory Applications

Ferad Zyulkyarov; Srdjan Stipic; Tim Harris; Osman S. Unsal; Adrián Cristal; Ibrahim Hur; Mateo Valero

Many researchers have developed applications using transactional memory (TM) with the purpose of benchmarking different implementations, and studying whether or not TM is easy to use. However, comparatively little has been done to provide general-purpose tools for profiling and optimizing programs which use transactions. In this paper we introduce a series of profiling and optimization techniques for TM applications. The profiling techniques are of three types: (i) techniques to identify multiple potential conflicts from a single program run, (ii) techniques to identify the data structures involved in conflicts by using a symbolic path through the heap, rather than a machine address, and (iii) visualization techniques to summarize how threads spend their time and which of their transactions conflict most frequently. Altogether they provide in-depth and comprehensive information about the wasted work caused by aborting transactions. To reduce the contention between transactions we suggest several TM-specific optimizations which leverage nested transactions, transaction checkpoints, early release, and so on. To examine the effectiveness of the profiling and optimization techniques, we provide a series of illustrations from the STAMP TM benchmark suite and from the synthetic WormBench workload. First we analyze the performance of TM applications using our profiling techniques and then we apply various optimizations to improve the performance of the Bayes, Labyrinth and Intruder applications. We discuss the design and implementation of the profiling techniques in the Bartok-STM system. We process data offline or during garbage collection, where possible, in order to minimize the probe effect introduced by profiling.
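
One conflict-reduction pattern in the spirit of these optimizations can be sketched as follows (an illustration, not code from the paper): keep expensive computation outside the transaction so that the transaction itself, and hence its conflict window, stays short. The names expensive_compute and shared_total are hypothetical, and the atomic blocks use GCC's -fgnu-tm syntax.

```cpp
static long shared_total = 0;

// Hypothetical pure function; marked transaction_safe so GCC allows calling
// it from inside an atomic transaction.
__attribute__((transaction_safe))
long expensive_compute(long x) { return (x * x) % 7919; }

// Before: the whole computation runs inside the transaction, so it conflicts
// with every other thread touching shared_total for its entire duration.
void update_slow(long x) {
    __transaction_atomic {
        shared_total += expensive_compute(x);
    }
}

// After: the computation runs outside; only the short read-modify-write of
// the shared variable is transactional, shrinking the conflict window.
void update_fast(long x) {
    long delta = expensive_compute(x);   // no shared state touched here
    __transaction_atomic {
        shared_total += delta;
    }
}
```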


High Performance Computing and Communications | 2015

Marriage Between Coordinated and Uncoordinated Checkpointing for the Exascale Era

Omer Subasi; Ferad Zyulkyarov; Osman S. Unsal; Jesús Labarta

State-of-the-art checkpointing techniques are projected to be prohibitively expensive in the Exascale era. These techniques are most often holistic in nature, which prevents them from leveraging programming model and paradigm-specific advantages and thus from remaining viable at Exascale. In this work, we present a unified non-hierarchical model that combines uncoordinated checkpointing with coordinated system-wide checkpointing to capitalize on programming model specific advantages. In our analytical assessment we develop closed-form formulas for the performance improvement and the optimal checkpoint interval of the unified model. As an instantiation of our model, we propose to unify task-level checkpointing with a system-wide checkpointing scheme for task-parallel HPC applications. This instantiation has three distinct advantages: first, it reduces performance overheads by decreasing the frequency of checkpoints in the unified system; second, it features fast failure recovery by using in-memory task-local checkpoints instead of on-disk global checkpoints; and third, it does not compromise the high failure coverage typical of system-wide checkpointing.
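
The unified model's own closed-form expressions are given in the paper; purely as a point of reference, the classical Young/Daly-style first-order approximation for the optimal interval of a single-level checkpointing scheme (a standard result, not the paper's formula) is shown below.

```latex
% Baseline reference only, not the unified model's formula:
% Young/Daly-style first-order approximation of the optimal checkpoint
% interval, where C is the time to write one checkpoint and M is the
% mean time between failures of the system.
\tau_{\mathrm{opt}} \approx \sqrt{2\,C\,M}
```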


Design, Automation and Test in Europe | 2012

TagTM - accelerating STMs with hardware tags for fast meta-data access

Srđan Stipić; Saša Tomić; Ferad Zyulkyarov; Adrián Cristal; Osman S. Unsal; Mateo Valero

In this paper we introduce TagTM, a Software Transactional Memory (STM) system augmented with a new hardware mechanism that we call GTags. GTags are new hardware cache coherent tags that are used for fast meta-data access. TagTM uses GTags to reduce the cost associated with accesses to the transactional data and corresponding metadata. For the evaluation of TagTM, we use the STAMP TM benchmark suite. In the average case, TagTM provides a speedup of 7-15% across all STAMP applications, and in the best case it shows up to a 52% speedup in committed transaction execution time (for the SSCA2 application).
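
To picture the metadata accesses GTags aim to accelerate, here is a hedged sketch of the software-only lookup a word-based STM (in the style of TL2) performs on every transactional access: the address is hashed into a table of ownership records whose version/lock word is read alongside the data. The table size and hash are illustrative choices, not TagTM's.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kOrecTableSize = 1 << 20;          // illustrative size
static std::atomic<std::uint64_t> orec_table[kOrecTableSize];

// Map a shared-memory address to its ownership record (metadata word).
std::atomic<std::uint64_t>& orec_for(const void* addr) {
    auto a = reinterpret_cast<std::uintptr_t>(addr);
    return orec_table[(a >> 3) % kOrecTableSize];        // per-8-byte granule
}

// On a transactional read the STM loads this metadata before (and typically
// after) the data access; that extra memory traffic is the cost TagTM
// attacks by keeping the metadata in hardware tags.
std::uint64_t metadata_version(const void* addr) {
    return orec_for(addr).load(std::memory_order_acquire);
}
```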


International Journal of High Performance Computing Applications | 2018

Unified fault-tolerance framework for hybrid task-parallel message-passing applications

Omer Subasi; Tatiana V. Martsinkevich; Ferad Zyulkyarov; Osman S. Unsal; Jesús Labarta; Franck Cappello

We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Second, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed-form formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.
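
A simplified sketch of the message-logging idea (not the framework's actual protocol): messages received by a task are logged during its first execution, so that after a transient error only that task is restarted and re-executed by replaying the logged receives instead of re-communicating. The types and the success/failure signalling are hypothetical placeholders, and real MPI handling is omitted.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct Message { int source; std::vector<char> payload; };

struct TaskLog {
    std::vector<Message> received;   // messages logged during first execution
    std::size_t replay_pos = 0;      // cursor used when re-executing the task

    void log(const Message& m)   { received.push_back(m); }
    const Message& replay()      { return received[replay_pos++]; }
    void rewind()                { replay_pos = 0; }
};

// Run a task; on a transient error, restart only this task and feed it the
// logged messages again.  The task reports success via its return value.
bool run_with_task_restart(const std::function<bool(TaskLog&)>& task, TaskLog& log) {
    if (task(log)) return true;      // first attempt succeeded
    log.rewind();                    // transient error: replay logged receives
    return task(log);                // restart just this task
}
```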


International Conference on Cluster Computing | 2016

A Runtime Heuristic to Selectively Replicate Tasks for Application-Specific Reliability Targets

Omer Subasi; Gulay Yalcin; Ferad Zyulkyarov; Osman S. Unsal; Jesús Labarta

In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, compiler or application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, the results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
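
A hedged sketch of a greedy selection heuristic in the spirit of App_FIT (not its actual implementation): tasks are considered in order of their estimated contribution to the failure probability and marked for replication until the application-specific reliability target is met; the per-task failure-probability estimate and the target are hypothetical inputs.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct TaskInfo { int id; double failure_prob; bool replicate = false; };

// Mark tasks for replication until the estimated unprotected failure
// probability drops below the application-specific target.
void select_tasks_to_replicate(std::vector<TaskInfo>& tasks, double target) {
    // Consider the most vulnerable tasks first.
    std::sort(tasks.begin(), tasks.end(),
              [](const TaskInfo& a, const TaskInfo& b) {
                  return a.failure_prob > b.failure_prob;
              });
    double residual = std::accumulate(tasks.begin(), tasks.end(), 0.0,
                                      [](double s, const TaskInfo& t) {
                                          return s + t.failure_prob;
                                      });
    for (auto& t : tasks) {
        if (residual <= target) break;   // reliability target already met
        t.replicate = true;              // protect this task
        residual -= t.failure_prob;      // its failures are now covered
    }
}
```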

Collaboration


Dive into Ferad Zyulkyarov's collaboration.

Top co-authors and their affiliations:

Mateo Valero (Polytechnic University of Catalonia)
Adrián Cristal (Barcelona Supercomputing Center)
Jesús Labarta (Barcelona Supercomputing Center)
Omer Subasi (Polytechnic University of Catalonia)
Eduard Ayguadé (Barcelona Supercomputing Center)
Adrián Cristal Kestelman (Spanish National Research Council)
Hugo Meyer (Barcelona Supercomputing Center)