Dave Dice
Oracle Corporation
Publication
Featured research published by Dave Dice.
ACM Symposium on Parallel Algorithms and Architectures | 2014
Dave Dice; Alex Kogan; Yossi Lev; Timothy Merrifield; Mark S. Moir
Transactional Lock Elision (TLE) and optimistic software execution can both improve scalability of lock-based programs. The former uses hardware transactional memory (HTM) without requiring code changes; the latter involves modest code changes but does not require special hardware support. Numerous factors affect the choice of technique, including critical section code, calling context, workload characteristics, and hardware support for synchronization. The ALE library integrates these techniques and collects detailed, fine-grained performance data, enabling policies that decide between them at runtime for each critical section execution. We describe an adaptive policy and present experiments on three platforms, two of which support HTM, showing that, without tuning for specific platforms or workloads, the adaptive policy is competitive with, and often significantly better than, hand-tuned static policies.
ACM Symposium on Parallel Algorithms and Architectures | 2013
Dave Dice; Yossi Lev; Mark Moir
Statistics counters are important for purposes such as detecting excessively high rates of various system events, or for mechanisms that adapt based on event frequency. As systems grow and become increasingly NUMA, commonly used naive counters impose scalability bottlenecks and/or such inaccuracy that they are not useful. We present both precise and statistical (probabilistic) counters that are nonblocking and provide dramatically better scalability and accuracy properties. Crucially, these counters are competitive with the naive ones even when contention is low.
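As a rough illustration of the statistical idea, the sketch below records each event only with a fixed probability and scales the stored count back up when read. It is a minimal single-threaded sketch, not the counters from the paper: the class name `SampledCounter`, the fixed `period`, and the seeded generator are all assumptions made for illustration, and the nonblocking concurrent updates the abstract describes are not modelled.

```python
import random


class SampledCounter:
    """Approximate event counter: records each event with probability 1/period."""

    def __init__(self, period=16, rng=None):
        self.period = period
        self.samples = 0                      # the only state that is ever written
        self._rng = rng or random.Random()

    def increment(self):
        # Most calls return without writing the stored count, which is what
        # would reduce write traffic on a contended shared counter.
        if self._rng.randrange(self.period) == 0:
            self.samples += 1

    def estimate(self):
        # Each recorded sample stands for `period` events on average.
        return self.samples * self.period


counter = SampledCounter(period=16, rng=random.Random(2013))
for _ in range(100_000):
    counter.increment()
print(counter.estimate())   # close to 100000, with a small statistical error
```

A larger `period` means fewer writes but a noisier estimate, which is the trade-off between scalability and accuracy that the abstract refers to.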
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2016
Dave Dice; Alex Kogan; Yossi Lev
Transactional lock elision (TLE) is a well-known technique that exploits hardware transactional memory (HTM) to introduce concurrency into lock-based software. It achieves that by attempting to execute a critical section protected by a lock in an atomic hardware transaction, reverting to the lock if these attempts fail. One significant drawback of TLE is that it disables hardware speculation once there is a thread running under lock. In this paper we present two algorithms that rely on existing compiler support for transactional programs and allow threads to speculate concurrently on HTM along with a thread holding the lock. We demonstrate the benefit of our algorithms over TLE and other related approaches with an in-depth analysis of a number of benchmarks and a wide range of workloads, including an AVL tree-based micro-benchmark and ccTSA, a real sequence assembler application.
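The attempt-then-fall-back control flow of plain TLE can be sketched in software. Python has no access to HTM, so in this sketch a version check under a short internal lock stands in for the hardware's atomic commit; `ElidedCounter`, `max_attempts`, and every other name here are hypothetical. It models baseline TLE only, including the drawback that speculation aborts while a thread holds the lock, and not the two algorithms the paper proposes.

```python
import threading


class ElidedCounter:
    """Counter whose critical section is tried speculatively before locking."""

    def __init__(self, max_attempts=3):
        self._lock = threading.Lock()     # the lock being elided
        self._commit = threading.Lock()   # stand-in for the hardware's atomic commit
        self._lock_held = False
        self._version = 0
        self.value = 0
        self.max_attempts = max_attempts
        self.speculative = 0              # critical sections committed speculatively
        self.fallbacks = 0                # critical sections run under the lock

    def _try_speculative(self):
        if self._lock_held:               # a lock holder forces speculation to abort
            return False
        seen = self._version              # read the version before the data
        new_value = self.value + 1
        with self._commit:
            if self._lock_held or self._version != seen:
                return False              # conflict detected: abort this attempt
            self.value = new_value
            self._version += 1
            self.speculative += 1
            return True

    def _locked(self):
        with self._lock:
            with self._commit:
                self._lock_held = True
            self.value += 1               # ordinary critical section under the lock
            with self._commit:
                self._version += 1
                self._lock_held = False
                self.fallbacks += 1

    def increment(self):
        for _ in range(self.max_attempts):
            if self._try_speculative():
                return
        self._locked()                    # revert to the lock after repeated aborts


def run_demo(n_threads=4, per_thread=500):
    counter = ElidedCounter()

    def worker():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Calling `run_demo()` returns a counter whose `value` equals the total number of increments, with `speculative` and `fallbacks` showing how many critical sections took each path.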
Concurrency and Computation: Practice and Experience | 2015
Peter A. Buhr; Dave Dice; Willem H. Hesselink
Software solutions for mutual exclusion developed over a 30‐year period, starting with complex ad hoc algorithms and progressing to simpler formal ones. While it is easy to dismiss software solutions for mutual exclusion, as this family of algorithms is antiquated and most platforms support atomic hardware instructions, there is still a need for these algorithms in threaded, embedded systems running on low‐cost processors lacking atomic instructions. While N‐thread solutions are usually short (10–25 lines of code), each is ingenious with exceptionally subtle aspects, often making it difficult to prove correctness or construct an implementation. This work examines correctness and performance of the implementations. An extensive survey of existing algorithms is presented, with explanations of the intuition behind the algorithms and how they work. Several errors were found and corrections made, as well as a few small improvements, in the existing algorithms; two new high‐performance algorithms were developed. Finally, a worst‐case high‐contention performance experiment is performed to compare the algorithms and contrast them with three common locks based on hardware atomic instructions. The results show our two new algorithms are highly competitive with an equivalent hardware lock (Mellor‐Crummey and Scott) over a range of 1–32 processors. Hence, threading is a viable alternative to event‐driven programming for complex embedded systems without atomic instructions.
International Symposium on Memory Management | 2016
Dave Dice; Maurice Herlihy; Alex Kogan
Current memory reclamation mechanisms for highly-concurrent data structures present an awkward trade-off. Techniques such as epoch-based reclamation perform well when all threads are running on dedicated processors, but the delay or failure of a single thread will prevent any other thread from reclaiming memory. Alternatives such as hazard pointers are highly robust, but they are expensive because they require a large number of memory barriers. This paper proposes three novel ways to alleviate the costs of the memory barriers associated with hazard pointers and related techniques. These new proposals are backward-compatible with existing code that uses hazard pointers. They move the cost of memory management from the principal code path to the infrequent memory reclamation procedure, significantly reducing or eliminating memory barriers executed on the principal code path. These proposals include (1) exploiting the operating system's memory protection ability, (2) exploiting certain x86 hardware features to trigger memory barriers only when needed, and (3) a novel hardware-assisted mechanism, called a hazard lookaside buffer (HLB), that allows a reclaiming thread to query whether there are hazardous pointers that need to be flushed to memory. We evaluate our proposals using a few fundamental data structures (linked lists and skiplists) and libcuckoo, a recent high-throughput hash-table library, and show significant improvements over the hazard pointer technique.
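For readers unfamiliar with the baseline technique, the bookkeeping behind hazard pointers can be modelled in a few lines. This is a single-threaded model under stated assumptions, not an implementation of the paper's proposals: `HazardDomain` and its methods are hypothetical names, "freeing" just moves a node to a list, and the memory barrier whose cost the abstract discusses appears only as a comment.

```python
class HazardDomain:
    """Bookkeeping model of hazard pointers: one published slot per thread."""

    def __init__(self, n_threads):
        self.slots = [None] * n_threads   # hazard pointers, one per thread
        self.retired = []                 # unlinked nodes awaiting reclamation
        self.freed = []                   # nodes the model has "freed"

    def protect(self, tid, node):
        # On real hardware this store is followed by a memory barrier and a
        # re-check that `node` is still reachable; that barrier sits on the
        # principal code path and is the cost the abstract refers to.
        self.slots[tid] = node

    def clear(self, tid):
        self.slots[tid] = None

    def retire(self, node):
        self.retired.append(node)

    def scan(self):
        # The infrequent reclamation procedure: free every retired node that
        # no thread currently publishes as hazardous.
        hazardous = {id(n) for n in self.slots if n is not None}
        still_retired = []
        for node in self.retired:
            if id(node) in hazardous:
                still_retired.append(node)   # a reader may still dereference it
            else:
                self.freed.append(node)      # safe to reclaim
        self.retired = still_retired


domain = HazardDomain(n_threads=2)
a, b = object(), object()
domain.protect(0, a)       # thread 0 is reading node a
domain.retire(a)
domain.retire(b)
domain.scan()              # b is reclaimed; a is held back by thread 0's slot
domain.clear(0)
domain.scan()              # a is now reclaimed as well
```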
Concurrency and Computation: Practice and Experience | 2016
Peter A. Buhr; Dave Dice; Willem H. Hesselink
Dekker's algorithm was thought to be safe in an environment without atomic reads or writes where bits flicker or scramble during simultaneous operations. A counter‐example is presented showing Dekker's algorithm is unsafe without atomic read. A modification to the original algorithm is presented making it RW‐safe, allowing threaded systems to be built on low cost/power hardware without atomic read/write. Correctness is verified by means of invariants and UNITY logic. A performance comparison is made for several two‐thread software mutual‐exclusion algorithms to see if the RW‐safe Dekker is competitive. A subset of the two‐thread solutions is then compared in two N‐thread tournament algorithms. The performance results show that the additional checks in the RW‐safe Dekker do not disadvantage the algorithm in comparison with other two‐thread algorithms. The RW‐safe N‐thread tournament algorithms are competitive with the hardware‐assisted Mellor‐Crummey and Scott algorithm.
ACM Symposium on Parallel Algorithms and Architectures | 2014
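For reference, the classic two-thread Dekker algorithm can be sketched as follows. This is the textbook form, not the RW-safe modification the paper presents, and it assumes exactly what the paper shows cannot be dropped: atomic reads and writes, which this sketch gets from CPython's global interpreter lock. The names `dekker_lock`, `dekker_unlock`, and `run_demo` are illustrative.

```python
import threading
import time

wants = [False, False]   # wants[i]: thread i wishes to enter its critical section
turn = 0                 # which thread has priority when both want to enter
shared_total = 0


def dekker_lock(i):
    other = 1 - i
    wants[i] = True
    while wants[other]:
        if turn != i:
            wants[i] = False          # defer to the thread that has priority
            while turn != i:
                time.sleep(0)         # yield while waiting for our turn
            wants[i] = True
        else:
            time.sleep(0)             # we have priority; wait for the other to defer


def dekker_unlock(i):
    global turn
    turn = 1 - i                      # hand priority to the other thread
    wants[i] = False


def worker(i, iterations):
    global shared_total
    for _ in range(iterations):
        dekker_lock(i)
        shared_total += 1             # critical section
        dekker_unlock(i)


def run_demo(iterations=500):
    global shared_total, turn
    shared_total, turn = 0, 0
    wants[0] = wants[1] = False
    threads = [threading.Thread(target=worker, args=(i, iterations)) for i in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_total
```

`run_demo(500)` returns 1000 when mutual exclusion holds, since every increment of `shared_total` happens inside the critical section.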
Dave Dice; Virendra J. Marathe; Nir Shavit
We describe a counter-intuitive performance phenomenon relevant to concurrency research. On a modern multicore system with a shared last-level cache, a set of concurrently running identical threads that loop, each accessing the same quantity of distinct thread-private data, can suffer significant relative progress imbalance. If one thread, or a small subset of the threads, manages to transiently enjoy higher cache residency than the other threads, that thread will tend to iterate faster and keep more of its data resident, thus increasing the odds that it will continue to run faster. This emergent behavior tends to be stable over surprisingly long periods.
European Conference on Computer Systems | 2017
Dave Dice
Applications running in modern multithreaded environments are sometimes overthreaded. The excess threads do not improve performance, and in fact may act to degrade performance via scalability collapse, which can manifest even when there are fewer ready threads than available cores. Often, such software also has highly contended locks. We leverage the existence of such locks by modifying the lock admission policy so as to intentionally limit the number of distinct threads circulating over the lock in a given period. Specifically, if there are more threads circulating than are necessary to keep the lock saturated (continuously held), our approach will selectively cull and passivate some of those excess threads. We borrow the concept of swapping from the field of memory management and impose concurrency restriction (CR) if a lock suffers from contention. The resultant admission order is unfair over the short term but we explicitly provide long-term fairness by periodically shifting threads between the set of passivated threads and those actively circulating. Our approach is palliative, but is often effective at avoiding or reducing scalability collapse, and in the worst case does no harm. Specifically, throughput is either unaffected or improved, and unfairness is bounded, relative to common test-and-set locks which allow unbounded bypass and starvation. By reducing competition for shared resources, such as pipelines, processors and caches, concurrency restriction may also reduce overall resource consumption and improve the overall load carrying capacity of a system.
Concurrency and Computation: Practice and Experience | 2014
Dave Dice; Danny Hendler; Ilya Mirsky
Many concurrent data‐structure implementations – both blocking and non‐blocking – use the well‐known compare‐and‐swap (CAS) operation, supported in hardware by most modern multiprocessor architectures, for inter‐thread synchronization. A key weakness of the CAS operation is its performance in the presence of memory contention. When multiple threads concurrently attempt to apply CAS operations to the same shared variable, at most a single thread will succeed in changing the shared variable's value and the CAS operations of all other threads will fail. Moreover, significant degradation in performance occurs when variables manipulated by CAS become contention ‘hot spots’, because failed CAS operations congest the interconnect and memory devices and slow down successful CAS operations. In this work, we study the following question: can software‐based contention management improve the efficiency of hardware‐provided CAS operations? In other words, can a software contention management layer, encapsulating invocations of hardware CAS instructions, improve the performance of CAS‐based concurrent data structures? To address this question, we conduct what is, to the best of our knowledge, the first study on the impact of contention management algorithms on the efficiency of the CAS operation. We implemented several Java classes that extend Java's AtomicReference class and encapsulate calls to the native CAS instruction with simple contention management mechanisms tuned for different hardware platforms. A key property of our algorithms is the support for an almost‐transparent interchange with Java's AtomicReference objects, used in implementations of concurrent data structures. We evaluate the impact of these algorithms on both a synthetic micro‐benchmark and on CAS‐based concurrent implementations of widely‐used data structures such as stacks and queues.
Our performance evaluation establishes that lightweight software‐based contention management support can greatly improve performance under medium and high contention levels while typically incurring only small overhead under low contention. In some cases, applying efficient contention management for CAS operations used by a simpler data‐structure implementation yields better results than highly optimized implementations of the same data structure that use native CAS operations directly.
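The idea of wrapping CAS in a contention-management layer can be sketched as a drop-in subclass that backs off after a failed attempt. The sketch below is an assumption-laden illustration rather than the paper's Java classes: Python has no native CAS, so the `AtomicReference` here is a lock-based stand-in that compares by equality, and `BackoffAtomicReference`, its randomized exponential backoff policy, and the delay parameters are hypothetical choices.

```python
import random
import threading
import time


class AtomicReference:
    """Lock-based stand-in for a hardware CAS cell (Python has no native CAS)."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def get(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


class BackoffAtomicReference(AtomicReference):
    """Same interface, but a failed CAS waits before the caller can retry."""

    def __init__(self, value, base_delay=1e-6, max_delay=1e-3):
        super().__init__(value)
        self._base, self._max = base_delay, max_delay
        self._local = threading.local()       # per-thread backoff state

    def compare_and_set(self, expected, new):
        if super().compare_and_set(expected, new):
            self._local.delay = self._base    # success: reset the backoff window
            return True
        delay = getattr(self._local, "delay", self._base)
        time.sleep(random.uniform(0, delay))  # back off to thin out retries
        self._local.delay = min(delay * 2, self._max)
        return False


def increment(ref):
    # A typical CAS retry loop; it is unchanged whichever class `ref` is.
    while True:
        current = ref.get()
        if ref.compare_and_set(current, current + 1):
            return current + 1


def run_demo(n_threads=4, per_thread=200):
    ref = BackoffAtomicReference(0)

    def worker():
        for _ in range(per_thread):
            increment(ref)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ref.get()
```

Because the subclass keeps the `compare_and_set` signature, existing retry loops such as `increment` need no changes, which mirrors the almost-transparent interchange the abstract describes.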
ACM Transactions on Architecture and Code Optimization | 2018
Dave Dice; Maurice Herlihy; Alex Kogan
Today’s hardware transactional memory (HTM) systems rely on existing coherence protocols, which implement a requester-wins strategy. This, in turn, leads to poor performance when transactions frequently conflict, causing them to resort to a non-speculative fallback path. Often, such a path severely limits parallelism. In this article, we propose very simple architectural changes to the existing requester-wins HTM implementations that enhance conflict resolution between hardware transactions and thus improve their parallelism. Our idea is compatible with existing HTM systems, requires no changes to target applications that employ traditional lock synchronization, and is shown to provide robust performance benefits.