Ibrahim Hur
Microsoft
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Ibrahim Hur.
international conference on parallel architectures and compilation techniques | 2010
Ferad Zyulkyarov; Srdjan Stipic; Tim Harris; Osman S. Unsal; Adrian Cristal; Ibrahim Hur; Mateo Valero
Many researchers have developed applications using transactional memory (TM) with the purpose of benchmarking different implementations, and studying whether or not TM is easy to use. However, comparatively little has been done to provide general-purpose tools for profiling and tuning programs which use transactions.
international conference on performance engineering | 2011
Gokcen Kestor; Vasileios Karakostas; Osman S. Unsal; Adrian Cristal; Ibrahim Hur; Mateo Valero
Transactional Memory (TM) has been proposed as an alternative concurrency mechanism for the shared memory parallel programming model. Its main goal is to make parallel programming for Chip Multiprocessors (CMPs) easier than using the traditional lock synchronization constructs, without compromising the performance and the scalability. This topic has received substantial research attention and several TM designs have been proposed using various TM benchmarks. We believe that the evaluation of TM proposals would be more solid if it included realistic applications, that address on-going TM research issues, and that provide the potential for straightforward comparison against locks.n In this paper, we introduce RMS-TM, a Transactional Memory benchmark suite composed of seven real-world applications from the Recognition, Mining and Synthesis (RMS) domain. In addition to featuring current TM research issues such as nesting and I/O and system calls inside transactions, the RMS-TM applications also provide a mix of short and long transactions with small/large read and write sets with low/medium/high contention rates. These characteristics, as well as providing lock-based versions of the applications, make RMS-TM a useful TM tool. Current TM benchmarks do not explore all these features. In our evaluation with selected STM and HTM systems, we find that our benchmark suite is also scalable, which is useful for evaluating TM designs on high core counts.
international conference on parallel architectures and compilation techniques | 2011
Gokcen Kestor; Roberto Gioiosa; Tim Harris; Osman S. Unsal; Adrian Cristal; Ibrahim Hur; Mateo Valero
Extracting high performance from modern chip multithreading (CMT) processors is a complex task, especially for large CMT systems. Programmers must efficiently parallelize performance-critical software while avoiding deadlocks and race conditions. Transactional memory (TM) is a promising programming model that allows programmers to focus on parallelism rather than maintaining correctness and avoiding deadlock. Software-only implementations (STMs) are especially compelling because they run on commodity hardware, therefore providing high portability. Unfortunately, STM systems usually suffer from high overheads, which may limit their usage especially at scale. In this paper we present STM2, a novel parallel STM designed for high performance, aggressive multithreading systems. STM2 significantly lowers runtime overhead by offloading read-set validation, bookkeeping and conflict detection to auxiliary threads running on sibling hardware threads. Auxiliary threads perform STM operations in parallel with their paired application threads and absorb STM overhead, significantly improving performance. We exploit the fact that, on modern multi-core processors, sets of cores can share L1 or L2 caches. This lets us achieve closer coupling between the application thread and the auxiliary thread (when compared with a traditional multi-processor systems). Our results, performed on an IBM POWER7 machine, a state-of-the-art, aggressive multi-threaded system, show that our approach outperforms several well-known STM implementations. In particular, STM2 shows speedups between 1.8x and 5.2x over the tested STM systems, on average, with peaks up to 12.8x.
international conference on parallel architectures and compilation techniques | 2011
Adrià Armejach; Azam Seyedi; Ruben Titos-Gil; Ibrahim Hur; A. Cristal; Osman S. Unsal; Mateo Valero
Transactional Memory (TM) potentially simplifies parallel programming by providing atomicity and isolation for executed transactions. One of the key mechanisms to provide such properties is version management, which defines where and how transactional updates (new values) are stored. Version management can be implemented either eagerly or lazily. In Hardware Transactional Memory (HTM) implementations, eager version management puts new values in-place and old values are kept in a software log, while lazy version management stores new values in hardware buffers keeping old values in-place. Current HTM implementations, for both eager and lazy version management schemes, suffer from performance penalties due to the inability to handle two versions of the same logical data efficiently. In this paper, we introduce a reconfigurable L1 data cache architecture that has two execution modes: a 64KB general purpose mode and a 32KB TM mode which is able to manage two versions of the same logical data. The latter allows to handle old and new transactional values within the cache simultaneously when executing transactional workloads. We explain in detail the architectural design and internals of this Reconfigurable Data Cache (RDC), as well as the supported operations that allow to efficiently solve existing version management problems. We describe how the RDC can support both eager and lazy HTM systems, and we present two RDC-HTM designs. Our evaluation shows that the Eager-RDC-HTM and Lazy-RDC-HTM systems achieve 1.36x and 1.18x speedup, respectively, over state-of-the-art proposals. We also evaluate the area and energy effects of our proposal, and we find that RDC designs are 1.92x and 1.38x more energy-delay efficient compared to baseline HTM systems, with less than 0.3% area impact on modern processors.
international conference on parallel architectures and compilation techniques | 2011
Gulay Yalcin; Osman S. Unsal; Adrian Cristal; Ibrahim Hur; Mateo Valero
Fault-tolerance has become an essential concern for processor designers due to increasing transient and permanent fault rates. In this study we propose Symptom TM, a symptom-based error detection technique that recovers from errors by leveraging the abort mechanism of Transactional Memory (TM). To the best of our knowledge, this is the first architectural fault-tolerance proposal using Hardware Transactional Memory (HTM). Symptom TM can recover from 86% and 65% of catastrophic failures caused by transient and permanent errors respectively with no performance overhead in error-free executions.
great lakes symposium on vlsi | 2011
Azam Seyedi; Adrià Armejach; Adrián Cristal; Osman S. Unsal; Ibrahim Hur; Mateo Valero
This paper proposes a novel L1 data cache design with dual-versioning SRAM cells (dvSRAM) for chip multi-processors (CMP) that implement optimistic concurrency proposals. In this new cache architecture, each dvSRAM cell has two cells, a main cell and a secondary cell, which keep two versions of the same data. These values can be accessed, modified, moved back and forth between the main and secondary cells within the access time of the cache. We design and simulate a 32-KB dual-versioning L1 data cache with 45-nm CMOS technology at 2GHz processor frequency and 1V supply voltage, which we describe in detail. We also introduce three well-known use cases that make use of optimistic concurrency execution and that can benefit from our proposed design. Moreover, we evaluate one of the use cases to show the impact of the dual-versioning cell in both performance and energy consumption. Our experiments show that large speedups can be achieved with acceptable overall energy dissipation.
International Journal of Parallel Programming | 2012
Ferad Zyulkyarov; Srdjan Stipic; Tim Harris; Osman S. Unsal; Adrián Cristal; Ibrahim Hur; Mateo Valero
Many researchers have developed applications using transactional memory (TM) with the purpose of benchmarking different implementations, and studying whether or not TM is easy to use. However, comparatively little has been done to provide general-purpose tools for profiling and optimizing programs which use transactions. In this paper we introduce a series of profiling and optimization techniques for TM applications. The profiling techniques are of three types: (i) techniques to identify multiple potential conflicts from a single program run, (ii) techniques to identify the data structures involved in conflicts by using a symbolic path through the heap, rather than a machine address, and (iii) visualization techniques to summarize how threads spend their time and which of their transactions conflict most frequently. Altogether they provide in-depth and comprehensive information about the wasted work caused by aborting transactions. To reduce the contention between transactions we suggest several TM specific optimizations which leverage nested transactions, transaction checkpoints, early release and etc. To examine the effectiveness of the profiling and optimization techniques, we provide a series of illustrations from the STAMP TM benchmark suite and from the synthetic WormBench workload. First we analyze the performance of TM applications using our profiling techniques and then we apply various optimizations to improve the performance of the Bayes, Labyrinth and Intruder applications. We discuss the design and implementation of the profiling techniques in the Bartok-STM system. We process data offline or during garbage collection, where possible, in order to minimize the probe effect introduced by profiling.
applied reconfigurable computing | 2011
Nehir Sonmez; Oriol Arcas; Gokhan Sayilar; Osman S. Unsal; Adrian Cristal; Ibrahim Hur; Satnam Singh; Mateo Valero
In this paper, we take a MIPS-based open-source uniprocessor soft core, Plasma, and extend it to obtain the Beefarm infrastructure for FPGA-based multiprocessor emulation, a popular research topic of the last few years both in the FPGA and the computer architecture communities. We discuss various design tradeoffs and we demonstrate superior scalability through experimental results compared to traditional software instruction set simulators. Based on our experience of designing and building a complete FPGA-based multiprocessor emulation system that supports run-time and compiler infrastructure and on the actual executions of our experiments running Software Transactional Memory (STM) benchmarks, we comment on the pros, cons and future trends of using hardware-based emulation for research.
field-programmable custom computing machines | 2011
Nehir Sonmez; Oriol Arcas; Otto Pflucker; Osman S. Unsal; A. Cristal; Ibrahim Hur; Satnam Singh; Mateo Valero
In this paper we present the design and implementation of TM box: An MPSoC built to explore trade-offs in multicore design space and to evaluate parallel programming proposals such as Transactional Memory (TM). Our flexible system, comprised of MIPS R3000-compatible cores is easily modifiable to study different architecture, library and operating system extensions. For this paper we evaluate a 16-core Hybrid Transactional Memory implementation based on the Tiny STM-ASF proposal on a Virtex-5 FPGA and we accelerate three benchmarks written to investigate TM.
Integration | 2012
Azam Seyedi; Adrií Armejach; Adrian Cristal; Osman S. Unsal; Ibrahim Hur; Mateo Valero
This paper proposes a novel L1 data cache design with dual-versioning SRAM cells (dvSRAM) for chip multi-processors that implement optimistic concurrency proposals. In this cache architecture, each dvSRAM cell has two cells, a main cell and a secondary cell, which keep two versions of the same logical data. These values can be accessed, modified, moved back and forth between the main and secondary cells within the access time of the cache. We design and simulate a 32KB dual-versioning L1 data cache and introduce three well-known use cases that make use of optimistic concurrency execution that can benefit from our proposed design.