Ruben Titos-Gil
Chalmers University of Technology
Publication
Featured research published by Ruben Titos-Gil.
international conference on parallel architectures and compilation techniques | 2011
Anurag Negi; Per Stenström; Ruben Titos-Gil; Manuel E. Acacio; José M. García
Lazy hardware transactional memory (HTM) allows better utilization of available concurrency in transactional workloads than eager HTM, but poses challenges at commit time due to the requirement of en-masse publication of speculative updates to global system state. Early conflict detection can be employed in lazy HTM designs to allow non-conflicting transactions to commit in parallel. Though this has the potential to improve performance, it has not been utilized effectively so far. Prior work in the area burdens common-case transactional execution severely to avoid some relatively uncommon correctness concerns. In this work we investigate this problem and introduce a novel design, p-TM, which eliminates it. p-TM uses modest extensions to existing directory-based cache coherence protocols to keep a record of conflicting cache lines as a transaction executes. This information allows a consistent cache state to be maintained when transactions commit or abort. We observe that contention is typically seen only on a small fraction of shared data accessed by coarse-grained transactions. In p-TM, early conflict detection mechanisms imply additional work only when such contention actually exists. Thus, the design is able to avoid expensive core-to-core and core-to-directory communication for a large part of transactionally accessed data. Our evaluation shows major performance gains when compared to other HTM designs in this class and competitive performance when compared to more complex lazy commit schemes.
international parallel and distributed processing symposium | 2014
Bhavishya Goel; Ruben Titos-Gil; Anurag Negi; Sally A. McKee; Per Stenström
Hardware transactional memory implementations are becoming increasingly available. For instance, the Intel Core i7 4770 implements Restricted Transactional Memory (RTM) support for Intel Transactional Synchronization Extensions (TSX). In this paper, we present a detailed evaluation of RTM performance and energy expenditure. We compare RTM behavior to that of the TinySTM software transactional memory system, first by running microbenchmarks, and then by running the STAMP benchmark suite. We find that which system performs better depends heavily on the workload characteristics. We then conduct a case study of two STAMP applications to assess the impact of programming style on RTM performance and to investigate what kinds of software optimizations can help overcome RTM's hardware limitations.
international conference on parallel architectures and compilation techniques | 2011
Adrià Armejach; Azam Seyedi; Ruben Titos-Gil; Ibrahim Hur; A. Cristal; Osman S. Unsal; Mateo Valero
Transactional Memory (TM) potentially simplifies parallel programming by providing atomicity and isolation for executed transactions. One of the key mechanisms to provide such properties is version management, which defines where and how transactional updates (new values) are stored. Version management can be implemented either eagerly or lazily. In Hardware Transactional Memory (HTM) implementations, eager version management puts new values in place and keeps old values in a software log, while lazy version management stores new values in hardware buffers, keeping old values in place. Current HTM implementations, for both eager and lazy version management schemes, suffer from performance penalties due to the inability to handle two versions of the same logical data efficiently. In this paper, we introduce a reconfigurable L1 data cache architecture that has two execution modes: a 64KB general-purpose mode and a 32KB TM mode which is able to manage two versions of the same logical data. The latter allows old and new transactional values to be handled within the cache simultaneously when executing transactional workloads. We explain in detail the architectural design and internals of this Reconfigurable Data Cache (RDC), as well as the supported operations that make it possible to solve existing version management problems efficiently. We describe how the RDC can support both eager and lazy HTM systems, and we present two RDC-HTM designs. Our evaluation shows that the Eager-RDC-HTM and Lazy-RDC-HTM systems achieve 1.36x and 1.18x speedup, respectively, over state-of-the-art proposals. We also evaluate the area and energy effects of our proposal, and we find that RDC designs are 1.92x and 1.38x more energy-delay efficient compared to baseline HTM systems, with less than 0.3% area impact on modern processors.
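The difference between eager and lazy version management described in this abstract can be illustrated with a minimal software sketch. This is only a toy model, not the RDC design: the class names, the dictionary standing in for memory, and the method names are all assumptions made for illustration, and every address is assumed to already exist in memory.

```python
class EagerVersioning:
    """Toy model: new values go in place, old values go to an undo log."""

    def __init__(self, memory):
        self.memory = memory      # shared dict standing in for memory (assumption)
        self.undo_log = []        # list of (address, old_value) pairs

    def write(self, addr, value):
        # Save the old value first, then update memory in place.
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value

    def read(self, addr):
        return self.memory[addr]

    def commit(self):
        # New values are already in place, so commit only discards the log.
        self.undo_log.clear()

    def abort(self):
        # Restore old values in reverse order of the writes.
        while self.undo_log:
            addr, old = self.undo_log.pop()
            self.memory[addr] = old


class LazyVersioning:
    """Toy model: new values are buffered, old values stay in place."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}    # address -> speculative new value

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def read(self, addr):
        # A transaction must see its own speculative writes.
        return self.write_buffer.get(addr, self.memory[addr])

    def commit(self):
        # Publish all buffered values to memory at commit time.
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        # Memory was never touched, so abort only drops the buffer.
        self.write_buffer.clear()
```

In this model, eager versioning makes commit cheap and abort expensive, while lazy versioning does the opposite. Both need somewhere to hold a second version of the same logical data, which is the cost the abstract refers to.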
IEEE Transactions on Parallel and Distributed Systems | 2013
Ruben Titos-Gil; Anurag Negi; Manuel E. Acacio; José M. García; Per Stenström
Hardware transactional memory (HTM) designs are very sensitive to the manner in which speculative updates from transactions are handled in the system. This study highlights how the lack of effective techniques for store management results in a quick degradation in the performance of eager HTM systems with increasing contention and, thus, lends credence to the belief that eager designs do not perform as well as their lazy counterparts when conflicts abound. In this work, we present two simple ways to improve handling of speculative stores: a way to effectively manage lines that exhibit migratory sharing and a way to hide store latency, particularly for those stores that target contended cache lines owned by other concurrent transactions. These two mechanisms yield substantial improvements in execution time when running applications with high contention, allowing eager designs to exceed the performance of lazy ones. Interestingly, the benefits that accrue from these enhancements can be on par with those achieved using more complex system-wide HTM techniques. Coupled with the fact that eager designs are easier to integrate into cache coherent architectures than lazy ones, we claim that with judicious management of stores they represent a more compelling design alternative.
ACM Transactions on Architecture and Code Optimization | 2013
Adrià Armejach; Ruben Titos-Gil; Anurag Negi; Osman S. Unsal; Adrián Cristal
The simplicity of requester-wins Hardware Transactional Memory (HTM) makes it easy to incorporate in existing chip multiprocessors. Hence, such systems are expected to be widely available in the near future. Unfortunately, these implementations are prone to suffer severe performance degradation due to transient and persistent livelock conditions. This article shows that existing techniques are unable to mitigate this degradation effectively. It then proposes and evaluates four novel techniques—two software-based that employ information provided by the hardware and two that require simple core-local hardware additions—which have the potential to boost the performance of requester-wins HTM designs substantially.
international conference on parallel processing | 2011
Anurag Negi; Ruben Titos-Gil; Manuel E. Acacio; José M. García; Per Stenström
Hardware transactional memory (HTM) systems have been studied extensively along the dimensions of speculative versioning and contention management policies. The relative performance of several design policies has been discussed at length in prior work within the framework of scalable chip-multiprocessing systems. Yet, the impact of simple structural optimizations like write-buffering has not been investigated, and performance deviations due to the presence or absence of these optimizations remain unclear. This lack of insight into the effective use and impact of these interfacial structures between the processor core and the coherent memory hierarchy forms the crux of the problem we study in this paper. Through detailed modeling of various write-buffering configurations we show that they play a major role in determining the overall performance of a practical HTM system. Our study of both eager and lazy conflict resolution mechanisms in a scalable parallel architecture notes a remarkable convergence of the performance of these two diametrically opposite design points when write buffers are introduced and used well to support the common case. Mitigation of redundant actions, fewer invalidations on abort, latency-hiding and prefetch effects contribute towards reducing execution times for transactions. Shorter transaction durations also imply a lower contention probability, thereby amplifying gains even further. The insights contained in this paper, related to the interplay between buffering mechanisms, system policies and workload characteristics, clearly distinguish gains in performance to be had from write-buffering from those that can be ascribed to HTM policy. We believe that this information would facilitate sound design decisions when incorporating HTMs into parallel architectures.
parallel, distributed and network-based processing | 2012
Epifanio Gaona; Ruben Titos-Gil; Manuel E. Acacio; Juan Fernández
In the search for new paradigms to simplify multithreaded programming, Transactional Memory (TM) is currently being advocated as a promising alternative to deadlock-prone lock-based synchronization. As a result, future many-core CMP architectures may need to provide hardware support for TM. On the other hand, power dissipation constitutes a first-class consideration in multicore processor designs. In this work, we propose Dynamic Serialization (DS) as a new technique to improve energy consumption without degrading performance in applications with conflicting transactions. Our proposal, which is implemented on top of a hardware transactional memory system with an eager conflict management policy, detects and serializes conflicting transactions dynamically. In particular, in case of conflict one transaction is allowed to continue whilst the rest are completely stalled. Once the executing transaction has finished, it wakes up several of the stalled transactions. This brings important benefits in terms of energy consumption due to the reduction in the amount of wasted work that DS implies. Results for a 16-core CMP show that Dynamic Serialization obtains reductions of 10% on average in energy consumption (more than 20% in high contention scenarios) without affecting, on average, execution time.
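The stall-and-wake behaviour described in this abstract can be sketched as a small bookkeeping model. This is a hypothetical illustration rather than the hardware mechanism from the paper: the class name, the choice of the first transaction as the one that continues, the FIFO wake-up order, and the `wake_count` parameter (the abstract only says "several") are all assumptions.

```python
from collections import deque


class DynamicSerializer:
    """Toy model: on conflict one transaction runs, the others stall."""

    def __init__(self, wake_count=2):
        # Hypothetical parameter: the abstract does not quantify "several".
        self.wake_count = wake_count
        self.stalled = deque()

    def on_conflict(self, conflicting):
        """Let the first listed transaction continue and stall the rest."""
        winner, *losers = conflicting
        self.stalled.extend(losers)
        return winner

    def on_commit(self):
        """Wake up to wake_count stalled transactions (FIFO order assumed)."""
        woken = []
        while self.stalled and len(woken) < self.wake_count:
            woken.append(self.stalled.popleft())
        return woken
```

For example, after `on_conflict(["T0", "T1", "T2", "T3"])` only `T0` keeps executing, and a later `on_commit()` with `wake_count=2` releases `T1` and `T2` while `T3` stays stalled. Stalled transactions do no speculative work that might later be discarded, which is where the energy saving in the abstract comes from.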
international conference on parallel processing | 2015
Ruben Titos-Gil; Oscar Palomar; Osman S. Unsal; Adrian Cristal
Thanks to programming approaches like actor-based models, message passing is regaining popularity outside large-scale scientific computing for building scalable distributed applications in many-core processors. Unfortunately, the mismatch between message passing models and today's shared-memory hardware provided by commercial vendors results in suboptimal performance and loss of efficiency. This paper presents a set of architectural extensions to reduce the overheads incurred by message passing workloads running on shared memory multi-core architectures. It describes the instruction set extensions and the hardware implementation. In order to facilitate programmability, the proposed extensions are used by a message passing library, allowing programs to take advantage of them transparently. As a proof of concept, we use a modified MPICH library and MPI programs to evaluate the proposal. Experimental results show that, on average, our proposal spends 60% fewer cycles performing data transfers in MPI functions, and reduces the L1 data cache misses in said functions to a fourth.
IEEE Transactions on Parallel and Distributed Systems | 2014
Ruben Titos-Gil; Anurag Negi; Manuel E. Acacio; José M. García; Per Stenström
Transactional contention management policies show considerable variation in relative performance with changing workload characteristics. Consequently, incorporation of fixed-policy Transactional Memory (TM) in general purpose computing systems is suboptimal by design and renders such systems susceptible to pathologies. Of particular concern are Hardware TM (HTM) systems where traditional designs have hardwired policies in silicon. Adaptive HTMs hold promise, but pose major challenges in terms of design and verification costs. In this paper, we present the ZEBRA HTM design, which lays down a simple yet high-performance approach to implement adaptive contention management in hardware. Prior work in this area has associated contention with transactional code blocks. However, we discover that by associating contention with data (cache blocks) accessed by transactional code rather than the code block itself, we achieve a neat match in granularity with that of the cache coherence protocol. This leads to a design that is very simple and yet able to track closely or exceed the performance of the best performing policy for a given workload. ZEBRA, therefore, brings together the inherent benefits of traditional eager HTMs (parallel commits) and lazy HTMs (good optimistic concurrency without deadlock avoidance mechanisms), combining them into a low-complexity design.
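The idea of attaching contention to cache blocks rather than to code blocks can be sketched as follows. This is a hypothetical model, not the ZEBRA hardware: the abstract does not say how blocks are flagged or which policy each kind of block receives, so the rule used here (a block is flagged after its first recorded conflict, flagged blocks are handled lazily and all others eagerly), the 64-byte block size, and all names are assumptions.

```python
class ContentionTable:
    """Toy model: track contention per cache block, not per transaction."""

    def __init__(self, block_size=64):
        self.block_size = block_size   # assumed cache block size in bytes
        self.contended = set()         # block numbers flagged as contended

    def block_of(self, addr):
        # Map a byte address to its cache block number.
        return addr // self.block_size

    def record_conflict(self, addr):
        # Assumption: one observed conflict is enough to flag the block.
        self.contended.add(self.block_of(addr))

    def policy_for(self, addr):
        # Assumption: contended blocks are handled lazily, others eagerly.
        return "lazy" if self.block_of(addr) in self.contended else "eager"
```

Because the decision is keyed by block number, it has the same granularity as the coherence protocol, which is the match the abstract highlights. A transaction that touches mostly uncontended data pays the adaptive cost only on the few blocks that have actually seen conflicts.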
IEEE Transactions on Parallel and Distributed Systems | 2013
Ruben Titos-Gil; Manuel E. Acacio; José M. García
The efficient management of conflicts among concurrent transactions constitutes a key aspect that hardware transactional memory (HTM) systems must achieve. Scalable HTM proposals so far inherit the cache-based style of conflict detection typically found in bus-based systems, largely unaware of the interactions between transactions and directory coherence. In this paper, we demonstrate that the traditional approach of detecting conflicts at the private cache levels is inefficient when used in the context of a directory protocol. We find that the use of the directory as a mere router of coherence requests restricts the throughput of conflict detection, and show how it becomes a bottleneck under high contention. This paper proposes a scheme for conflict detection that decouples conflict detection from cache coherence in order to overcome pathological situations that degrade the performance of an eager HTM system. Our scheme places bookkeeping metadata at the directory, introducing it as a separate hardware module that leaves the coherence protocol unmodified. In comparison to a state-of-the-art eager HTM system, our design handles contention more efficiently, minimizes the performance degradation of false positives for signatures of similar hardware cost, and reduces the network traffic generated.