Yossi Lev | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Yossi Lev is active.

Explore More

Publication

Featured researches published by Yossi Lev.

architectural support for programming languages and operating systems | 2011

Hybrid NOrec: a case study in the effectiveness of best effort hardware transactional memory

Luke Dalessandro; François Carouge; Sean White; Yossi Lev; Mark S. Moir; Michael L. Scott; Michael F. Spear

Transactional memory (TM) is a promising synchronization mechanism for the next generation of multicore processors. Best-effort Hardware Transactional Memory (HTM) designs, such as Suns prototype Rock processor and AMDs proposed Advanced Synchronization Facility (ASF), can efficiently execute many transactions, but abort in some cases due to various limitations. Hybrid TM systems can use a compatible software TM (STM) in such cases.n We introduce a family of hybrid TMs built using the recent NOrec STM algorithm that, unlike existing hybrid approaches, provide both low overhead on hardware transactions and concurrent execution of hardware and software transactions. We evaluate implementations for Rock and ASF, exploring how the differing HTM designs affect optimization choices. Our investigation yields valuable input for designers of future best-effort HTMs.

acm sigplan symposium on principles and practice of parallel programming | 2013

NUMA-aware reader-writer locks

Irina Calciu; David Dice; Yossi Lev; Victor Luchangco; Virendra J. Marathe; Nir Shavit

Non-Uniform Memory Access (NUMA) architectures are gaining importance in mainstream computing systems due to the rapid growth of multi-core multi-chip machines. Extracting the best possible performance from these new machines will require us to revisit the design of the concurrent algorithms and synchronization primitives which form the building blocks of many of todays applications. This paper revisits one such critical synchronization primitive -- the reader-writer lock.n We present what is, to the best of our knowledge, the first family of reader-writer lock algorithms tailored to NUMA architectures. We present several variations which trade fairness between readers and writers for higher concurrency among readers and better back-to-back batching of writers from the same NUMA node. Our algorithms leverage the lock cohorting technique to manage synchronization between writers in a NUMA-friendly fashion, binary flags to coordinate readers and writers, and simple distributed reader counter implementations to enable NUMA-friendly concurrency among readers. The end result is a collection of surprisingly simple NUMA-aware algorithms that outperform the state-of-the-art reader-writer locks by up to a factor of 10 in our microbenchmark experiments. To evaluate our algorithms in a realistic setting we also present performance results of the kccachetest benchmark of the Kyoto-Cabinet distribution, an open-source database which makes heavy use of pthread reader-writer locks. Our locks boost the performance of kccachetest by up to 40% over the best prior alternatives.

acm symposium on parallel algorithms and architectures | 2010

Simplifying concurrent algorithms by exploiting hardware transactional memory

David Dice; Yossi Lev; Virendra J. Marathe; Mark S. Moir; Daniel S. Nussbaum; Marek Olszewski

We explore the potential of hardware transactional memory (HTM) to improve concurrent algorithms. We illustrate a number of use cases in which HTM enables significantly simpler code to achieve similar or better performance than existing algorithms for conventional architectures. We use Suns prototype multicore chip, code-named Rock, to experiment with these algorithms, and discuss ways in which its limitations prevent better results, or would prevent production use of algorithms even if they are successful. Our use cases include concurrent data structures such as double ended queues, work stealing queues and scalable non-zero indicators, as well as a scalable malloc implementation and a simulated annealing application. We believe that our paper makes a compelling case that HTM has substantial potential to make effective concurrent programming easier, and that we have made valuable contributions in guiding designers of future HTM features to exploit this potential.

principles of distributed computing | 2011

On the power of hardware transactional memory to simplify memory management

Aleksandar Dragojevic; Maurice Herlihy; Yossi Lev; Mark S. Moir

Dynamic memory management is a significant source of complexity in the design and implementation of practical concurrent data structures. We study how hardware transactional memory (HTM) can be used to simplify and streamline memory reclamation for such data structures. We propose and evaluate several new HTM-based algorithms for the Dynamic Collect problem that lies at the heart of many modern memory management algorithms. We demonstrate that HTM enables simpler and faster solutions, with better memory reclamation properties, than prior approaches. Despite recent theoretical arguments that HTM provides no worst-case advantages, our results support the claim that HTM can provide significantly better common-case performance, as well as reduced conceptual complexity.

acm symposium on parallel algorithms and architectures | 2014

Adaptive integration of hardware and software lock elision techniques

Dave Dice; Alex Kogan; Yossi Lev; Timothy Merrifield; Mark S. Moir

Transactional Lock Elision (TLE) and optimistic software execution can both improve scalability of lock-based programs. The former uses hardware transactional memory (HTM) without requiring code changes; the latter involves modest code changes but does not require special hardware support. Numerous factors affect the choice of technique, including: critical section code, calling context, workload characteristics, and hardware support for synchronization. The ALE library integrates these techniques, and collects detailed, fine-grained performance data, enabling policies that decide between them at runtime for each critical section execution. We describe an adaptive policy and present experiments on three platforms, two of which support HTM, showing that---without tuning for specific platforms or workload---the adaptive policy is competitive with and often significantly better than hand-tuned static policies.

acm symposium on parallel algorithms and architectures | 2013

Scalable statistics counters

Dave Dice; Yossi Lev; Mark Moir

Statistics counters are important for purposes such as detecting excessively high rates of various system events, or for mechanisms that adapt based on event frequency. As systems grow and become increasingly NUMA, commonly used naive counters impose scalability bottlenecks and/or such inaccuracy that they are not useful. We present both precise and statistical (probabilistic) counters that are nonblocking and provide dramatically better scalability and accuracy properties. Crucially, these counters are competitive with the naive ones even when contention is low.

acm sigplan symposium on principles and practice of parallel programming | 2016

Refined transactional lock elision

Dave Dice; Alex Kogan; Yossi Lev

Transactional lock elision (TLE) is a well-known technique that exploits hardware transactional memory (HTM) to introduce concurrency into lock-based software. It achieves that by attempting to execute a critical section protected by a lock in an atomic hardware transaction, reverting to the lock if these attempts fail. One significant drawback of TLE is that it disables hardware speculation once there is a thread running under lock. In this paper we present two algorithms that rely on existing compiler support for transactional programs and allow threads to speculate concurrently on HTM along with a thread holding the lock. We demonstrate the benefit of our algorithms over TLE and other related approaches with an in-depth analysis of a number of benchmarks and a wide range of workloads, including an AVL tree-based micro-benchmark and ccTSA, a real sequence assembler application.

acm symposium on parallel algorithms and architectures | 2016

Investigating the Performance of Hardware Transactions on a Multi-Socket Machine

Trevor Brown; Alex Kogan; Yossi Lev; Victor Luchangco

The introduction of hardware transactional memory (HTM) into commercial processors opens a door for designing and implementing scalable synchronization mechanisms. One example for such a mechanism is transactional lock elision (TLE), where lock-based critical sections are executed concurrently using hardware transactions. So far, the effectiveness of TLE and other HTM-based mechanisms has been assessed mostly on small, single-socket machines. This paper investigates the behavior of hardware transactions on a large two-socket machine. Using TLE as an example, we show that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance. We identify the reason for this phenomenon, and present a simple adaptive technique that overcomes this problem by throttling threads as necessary to optimize system performance. Using extensive evaluation of multiple microbenchmarks and real applications, we demonstrate that our technique achieves the full performance of the system for workloads that scale across sockets, and avoids the performance degradation that cripples TLE for workloads that do not.

Proceedings of the 4th International Workshop on Multicore Software Engineering | 2011

Lightweight parallel accumulators using C++ templates

Yossi Lev; Mark S. Moir

The Boost.Accumulators framework provides C++ template-based support for incremental computation of many important statistical functions, such as maximum, minimum, mean, count, variance, etc. Basic accumulators can be combined to build more sophisticated ones. We explore how this framework can be extended to implement lightweight parallel accumulators that allow multiple threads to Store sample data, and support concurrent GetResult operations that incrementally compute desired functions over the data. Our evaluation shows that our parallel accumulators are scalable and can effectively exploit programmer-supplied knowledge to achieve significant optimizations for some important cases.

acm sigplan symposium on principles and practice of parallel programming | 2013

Using hardware transactional memory to correct and simplify and readers-writer lock algorithm

David Dice; Yossi Lev; Yujie Liu; Victor Luchangco; Mark S. Moir

Designing correct synchronization algorithms is notoriously difficult, as evidenced by a bug we have identified that has apparently gone unnoticed in a well-known synchronization algorithm for nearly two decades. We use hardware transactional memory (HTM) to construct a corrected version of the algorithm. This version is significantly simpler than the original and furthermore improves on it by eliminating usage constraints and reducing space requirements. Performance of the HTM-based algorithm is competitive with the original in normal conditions, but it does suffer somewhat under heavy contention. We successfully apply some optimizations to help close this gap, but we also find that they are incompatible with known techniques for improving progress properties. We discuss ways in which future HTM implementations may address these issues. Finally, although our focus is on how effectively HTM can correct and simplify the algorithm, we also suggest bug fixes and workarounds that do not depend on HTM.

Explore More