
Publication


Featured research published by Anurag Negi.


International Conference on Supercomputing | 2011

ZEBRA: a data-centric, hybrid-policy hardware transactional memory design

Ruben Titos-Gil; Anurag Negi; Manuel E. Acacio; José M. García; Per Stenström

Hardware Transactional Memory (HTM) systems in prior research have either fixed conflict-resolution and data-versioning policies for the entire system or allowed a degree of flexibility at the level of transactions. Unfortunately, this results in susceptibility to pathologies, lower average performance across diverse workload characteristics, or high design complexity. In this work we explore a new dimension along which flexibility in policy can be introduced. Recognizing that contention is a property of data more than of an atomic code block, we develop an HTM system that allows selection of versioning and conflict-resolution policies at the granularity of cache lines. We discover that this neat match in granularity with that of the cache coherence protocol results in a design that is very simple and yet able to track closely, or exceed, the performance of the best-performing policy for a given workload. It also brings together the benefits of parallel commits (inherent in traditional eager HTMs) and good optimistic concurrency without deadlock-avoidance mechanisms (inherent in lazy HTMs), with little increase in complexity.
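The central idea of ZEBRA, attaching policy to data rather than to transactions, can be sketched as a toy software model (the class, threshold, and addresses below are illustrative, not part of the actual hardware design):

```python
# Toy model of data-centric policy selection in the spirit of ZEBRA:
# each cache line carries its own versioning/conflict-resolution mode,
# chosen from observed contention, instead of one global HTM policy.

class CacheLine:
    def __init__(self, addr):
        self.addr = addr
        self.conflicts = 0      # conflicts observed on this line
        self.policy = "lazy"    # start optimistic (lazy versioning)

    def record_conflict(self):
        self.conflicts += 1
        # Contended lines switch to eager handling; quiet lines stay lazy.
        if self.conflicts >= 2:
            self.policy = "eager"

lines = {a: CacheLine(a) for a in (0x100, 0x140, 0x180)}
for _ in range(3):
    lines[0x140].record_conflict()   # one hot, contended line

print({hex(a): l.policy for a, l in lines.items()})
# {'0x100': 'lazy', '0x140': 'eager', '0x180': 'lazy'}
```

A real implementation would keep this state in cache-line metadata alongside the coherence protocol; the point of the sketch is only that the policy decision keys on the data, not on the code block.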


International Conference on Parallel Architectures and Compilation Techniques | 2011

Pi-TM: Pessimistic Invalidation for Scalable Lazy Hardware Transactional Memory

Anurag Negi; Per Stenström; Ruben Titos-Gil; Manuel E. Acacio; José M. García

Lazy hardware transactional memory (HTM) allows better utilization of available concurrency in transactional workloads than eager HTM, but poses challenges at commit time due to the requirement of en-masse publication of speculative updates to global system state. Early conflict detection can be employed in lazy HTM designs to allow non-conflicting transactions to commit in parallel. Though this has the potential to improve performance, it has not been utilized effectively so far: prior work in the area burdens common-case transactional execution severely to avoid some relatively uncommon correctness concerns. In this work we investigate this problem and introduce a novel design, Pi-TM, which eliminates it. Pi-TM uses modest extensions to existing directory-based cache coherence protocols to keep a record of conflicting cache lines as a transaction executes. This information allows a consistent cache state to be maintained when transactions commit or abort. We observe that contention is typically seen on only a small fraction of the shared data accessed by coarse-grained transactions. In Pi-TM, early conflict detection mechanisms imply additional work only when such contention actually exists. Thus, the design is able to avoid expensive core-to-core and core-to-directory communication for a large part of transactionally accessed data. Our evaluation shows major performance gains when compared to other HTM designs in this class and competitive performance when compared to more complex lazy commit schemes.


International Parallel and Distributed Processing Symposium | 2014

Performance and Energy Analysis of the Restricted Transactional Memory Implementation on Haswell

Bhavishya Goel; Ruben Titos-Gil; Anurag Negi; Sally A. McKee; Per Stenström

Hardware transactional memory implementations are becoming increasingly available. For instance, the Intel Core i7 4770 implements Restricted Transactional Memory (RTM) support for Intel Transactional Synchronization Extensions (TSX). In this paper, we present a detailed evaluation of RTM performance and energy expenditure. We compare RTM behavior to that of the TinySTM software transactional memory system, first by running microbenchmarks, and then by running the STAMP benchmark suite. We find that which system performs better depends heavily on the workload characteristics. We then conduct a case study of two STAMP applications to assess the impact of programming style on RTM performance and to investigate what kinds of software optimizations can help overcome RTM's hardware limitations.
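For context, RTM is typically used in a retry-then-fallback pattern: attempt the transaction a few times, then take a global lock. The sketch below models only that control flow in plain Python; real code would use the TSX intrinsics `_xbegin()`/`_xend()`, and `try_htm`, `MAX_RETRIES`, and the lock here are illustrative assumptions:

```python
import threading

fallback_lock = threading.Lock()
MAX_RETRIES = 3

def run_transactional(work, try_htm):
    """try_htm simulates one RTM attempt: returns True on commit."""
    for _ in range(MAX_RETRIES):
        if try_htm(work):
            return "htm"           # committed speculatively
    with fallback_lock:            # non-speculative path, always succeeds
        work()
    return "lock"

counter = [0]
def work():
    counter[0] += 1

# An attempt that always aborts forces the lock fallback:
print(run_transactional(work, lambda w: False))           # lock
# An attempt that runs the work and commits stays on the HTM path:
print(run_transactional(work, lambda w: (w() or True)))   # htm
```

The fallback lock is essential in real RTM code as well, since TSX gives no guarantee that a transaction will ever commit.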


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2010

LV∗: A low complexity lazy versioning HTM infrastructure

Anurag Negi; Mridha Mohammad Waliullah; Per Stenström

Transactional memory (TM) promises to unlock parallelism in software in a safer and easier way than lock-based approaches, but the path to deployment is unclear for several reasons. First of all, since TM has not yet been deployed in any machine, experience of using it is limited. While software transactional memory implementations exist, they are too slow to provide useful experience. Existing hardware transactional memory implementations, on the other hand, can provide the efficiency required, but they demand significant effort to integrate into cache coherence infrastructures or freeze critical policy parameters. This paper proposes the LV∗ (lazy versioning and eager/lazy conflict resolution) class of hardware transactional memory protocols. This class of protocols has been designed with ease of deployment in mind. LV∗ can be integrated with low additional complexity into standard snoopy-cache MESI protocols and can be accommodated in a directory-based cache coherence infrastructure. Since the optimal conflict resolution policy (lazy or eager) depends on the transactional characteristics of workloads, LV∗ supports a set of conflict resolution policies that range from LazEr, a family of Lazy versioning Eager conflict resolution protocols, to LL-MESI, which provides lazy resolution. We show that LV∗ can be hosted in a MESI protocol through straightforward extensions and that the flexibility in the choice of conflict resolution strategy has a significant impact on performance.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

HARP: Adaptive abort recurrence prediction for Hardware Transactional Memory

Adrià Armejach; Anurag Negi; Adrian Cristal; Osman S. Unsal; Per Stenström; Tim Harris

Hardware Transactional Memory (HTM) exposes parallelism by allowing possibly conflicting sections of code, called transactions, to execute concurrently in multithreaded applications. However, conflicts among concurrent transactions result in wasted computation and expensive rollbacks. Under high contention HTM protocol overheads can, in many cases, amount to several times the useful work done. Blindly scheduling transactions in the presence of contention is therefore clearly suboptimal from a resource utilization standpoint, especially in situations where several scheduling options exist. This paper presents HARP (Hardware Abort Recurrence Predictor), a hardware-only mechanism to avoid speculation when it is likely to fail. Inspired by branch prediction strategies and prior work on contention management and scheduling in HTM, HARP uses past behavior of transactions and locality in conflicting memory references to accurately predict conflicts. The prediction mechanism adapts to variations in workload characteristics and enables better utilization of computational resources. We show that an HTM protocol that integrates HARP exhibits reductions in both wasted execution time and serialization overheads when compared to prior work, leading to a significant increase in throughput (~30%) in both single-application and multi-application scenarios.
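HARP draws on branch prediction strategies; a minimal software sketch of that idea, using a per-transaction saturating counter, might look like the following (the counter width and threshold are illustrative assumptions, not the paper's parameters):

```python
# Sketch of abort-recurrence prediction in the spirit of HARP: a
# saturating counter per transaction site decides whether to speculate
# or serialize, as a bimodal branch predictor decides taken/not-taken.

class AbortPredictor:
    def __init__(self, bits=2):
        self.max = (1 << bits) - 1
        self.counters = {}           # transaction PC -> saturating counter

    def should_speculate(self, tx_pc):
        # Predict recurring failure once the counter passes the midpoint.
        return self.counters.get(tx_pc, 0) <= self.max // 2

    def update(self, tx_pc, aborted):
        c = self.counters.get(tx_pc, 0)
        self.counters[tx_pc] = min(c + 1, self.max) if aborted else max(c - 1, 0)

p = AbortPredictor()
for _ in range(2):
    p.update(0x400, aborted=True)    # transaction at 0x400 keeps aborting
print(p.should_speculate(0x400))     # False: serialize instead of retrying
print(p.should_speculate(0x500))     # True: no abort history, speculate
```

The actual mechanism also exploits locality in conflicting memory references; the counter alone captures only the "past behavior of transactions" half of the prediction.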


International Conference on Parallel Processing | 2013

Efficient Forwarding of Producer-Consumer Data in Task-Based Programs

Madhavan Manivannan; Anurag Negi; Per Stenström

Task-based programming models are increasingly being adopted due to their ability to express parallelism intuitively. This paper focuses on techniques to optimize producer-consumer sharing in task-based programs. As the set of producer and consumer tasks can often be statically determined, coherence prediction techniques are expected to successfully optimize producer-consumer sharing. We show, however, that they are ineffective because the mapping of tasks to cores changes based on run-time conditions. This paper contributes a technique that forwards produced and spatially close blocks to a consumer in a single transaction when that consumer requests a first block. In comparison with prefetching approaches, such as stride prefetching, our proposed technique is a robust alternative to reduce communication overhead in fine-grained task-based applications.
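The forwarding idea can be sketched as follows: when a consumer requests the first produced block, reply with the spatially close produced blocks in the same transaction. The block size and forwarding window below are illustrative assumptions, not the paper's hardware parameters:

```python
BLOCK = 64  # assumed bytes per cache block

def forward_bundle(produced_addrs, requested, window=4):
    """Return produced blocks within `window` blocks of the request,
    bundled into the single reply to the consumer's first miss."""
    lo, hi = requested, requested + window * BLOCK
    return sorted(a for a in produced_addrs if lo <= a < hi)

produced = {0x1000, 0x1040, 0x1080, 0x2000}
# One request for 0x1000 pulls the whole nearby run, not just one block:
print([hex(a) for a in forward_bundle(produced, 0x1000)])
# ['0x1000', '0x1040', '0x1080']
```

Bundling avoids the per-block request/reply round trips that make fine-grained producer-consumer communication expensive; the distant block at `0x2000` is left for its own request.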


IEEE Transactions on Parallel and Distributed Systems | 2013

Eager Beats Lazy: Improving Store Management in Eager Hardware Transactional Memory

Ruben Titos-Gil; Anurag Negi; Manuel E. Acacio; José M. García; Per Stenström

Hardware transactional memory (HTM) designs are very sensitive to the manner in which speculative updates from transactions are handled in the system. This study highlights how the lack of effective techniques for store management results in a quick degradation in the performance of eager HTM systems with increasing contention and, thus, lends credence to the belief that eager designs do not perform as well as their lazy counterparts when conflicts abound. In this work, we present two simple ways to improve handling of speculative stores: a way to effectively manage lines that exhibit migratory sharing, and a way to hide store latency, particularly for those stores that target contended cache lines owned by other concurrent transactions. These two mechanisms yield substantial improvements in execution time when running applications with high contention, allowing eager designs to exceed the performance of lazy ones. Interestingly, the benefits that accrue from these enhancements can be on par with those achieved using more complex system-wide HTM techniques. Coupled with the fact that eager designs are easier to integrate into cache coherent architectures than lazy ones, we claim that with judicious management of stores they represent a more compelling design alternative.


ACM Transactions on Architecture and Code Optimization | 2013

Techniques to improve performance in requester-wins hardware transactional memory

Adrià Armejach; Ruben Titos-Gil; Anurag Negi; Osman S. Unsal; Adrián Cristal

The simplicity of requester-wins Hardware Transactional Memory (HTM) makes it easy to incorporate in existing chip multiprocessors. Hence, such systems are expected to be widely available in the near future. Unfortunately, these implementations are prone to suffer severe performance degradation due to transient and persistent livelock conditions. This article shows that existing techniques are unable to mitigate this degradation effectively. It then proposes and evaluates four novel techniques—two software-based that employ information provided by the hardware and two that require simple core-local hardware additions—which have the potential to boost the performance of requester-wins HTM designs substantially.


International Conference on Parallel Processing | 2011

Eager Meets Lazy: The Impact of Write-Buffering on Hardware Transactional Memory

Anurag Negi; Ruben Titos-Gil; Manuel E. Acacio; José M. García; Per Stenström

Hardware transactional memory (HTM) systems have been studied extensively along the dimensions of speculative versioning and contention management policies. The relative performance of several design policies has been discussed at length in prior work within the framework of scalable chip-multiprocessing systems. Yet, the impact of simple structural optimizations like write-buffering has not been investigated, and performance deviations due to the presence or absence of these optimizations remain unclear. This lack of insight into the effective use and impact of these interfacial structures between the processor core and the coherent memory hierarchy forms the crux of the problem we study in this paper. Through detailed modeling of various write-buffering configurations we show that they play a major role in determining the overall performance of a practical HTM system. Our study of both eager and lazy conflict resolution mechanisms in a scalable parallel architecture notes a remarkable convergence of the performance of these two diametrically opposite design points when write buffers are introduced and used well to support the common case. Mitigation of redundant actions, fewer invalidations on abort, latency-hiding and prefetch effects contribute towards reducing execution times for transactions. Shorter transaction durations also imply a lower contention probability, thereby amplifying gains even further. The insights contained in this paper, related to the interplay between buffering mechanisms, system policies and workload characteristics, clearly distinguish performance gains to be had from write-buffering from those that can be ascribed to HTM policy. We believe that this information will facilitate sound design decisions when incorporating HTMs into parallel architectures.


International Conference on Parallel Architectures and Compilation Techniques | 2012

Transactional prefetching: narrowing the window of contention in hardware transactional memory

Anurag Negi; Adrià Armejach; Adrián Cristal; Osman S. Unsal; Per Stenström

Memory access latency is the primary performance bottleneck in modern computer systems. Prefetching data before it is needed by a processing core allows substantial performance gains by overlapping significant portions of memory latency with useful work. Prior work has investigated this technique and measured potential benefits in a variety of scenarios. However, its use in speeding up Hardware Transactional Memory (HTM) has remained hitherto unexplored. In several HTM designs transactions invalidate speculatively updated cache lines when they abort. Such cache lines tend to have high locality and are likely to be accessed again when the transaction re-executes. Coarse grained transactions that update several cache lines are particularly susceptible to performance degradation even under moderate contention. However, such transactions show strong locality of reference, especially when contention is high. Prefetching cache lines with high locality can, therefore, improve overall concurrency by speeding up transactions and, thereby, narrowing the window of time in which such transactions persist and can cause contention. Such transactions are important since they are likely to form a common TM use-case. We note that traditional prefetch techniques may not be able to track such lines adequately or issue prefetches quickly enough. This paper investigates the use of prefetching in HTMs, proposing a simple design to identify and request prefetch candidates, and measures performance gains to be had for several representative TM workloads.
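The gist of the proposed mechanism, remembering the lines a transaction lost on abort and prefetching them when it re-executes, can be sketched in a few lines (the class and method names are illustrative, not the paper's design):

```python
# Toy sketch of transactional prefetching: an abort invalidates a
# transaction's speculatively updated lines, but those lines have high
# locality and will very likely be touched again on re-execution, so we
# record them and issue prefetches when the transaction restarts.

class TxPrefetcher:
    def __init__(self):
        self.candidates = {}          # tx id -> lines invalidated on abort

    def on_abort(self, tx, invalidated_lines):
        self.candidates[tx] = set(invalidated_lines)

    def on_restart(self, tx):
        # Prefetch candidates for the restarting transaction; fetching
        # them early shortens the transaction and so narrows the window
        # in which it can conflict with others.
        return sorted(self.candidates.pop(tx, set()))

pf = TxPrefetcher()
pf.on_abort("tx1", [0x200, 0x240])
print(pf.on_restart("tx1"))   # [512, 576]
```

Traditional stride or history prefetchers key on access patterns and may react too slowly here; tying the candidate set directly to the abort event is what makes the prefetches timely.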

Collaboration


Dive into Anurag Negi's collaborations.

Top Co-Authors

Per Stenström, Chalmers University of Technology
Ruben Titos-Gil, Chalmers University of Technology
Adrià Armejach, Polytechnic University of Catalonia
Osman S. Unsal, Barcelona Supercomputing Center
Madhavan Manivannan, Chalmers University of Technology
Mridha Mohammad Waliullah, Chalmers University of Technology
Adrian Cristal, Spanish National Research Council
Adrián Cristal, Barcelona Supercomputing Center