Danny Hendler
Ben-Gurion University of the Negev
Publications
Featured research published by Danny Hendler.
acm symposium on parallel algorithms and architectures | 2010
Danny Hendler; Itai Incze; Nir Shavit; Moran Tzafrir
Traditional data structure designs, whether lock-based or lock-free, provide parallelism via fine-grained synchronization among threads. We introduce a new synchronization paradigm based on coarse locking, which we call flat combining. The cost of synchronization in flat combining is so low that having a single thread holding a lock perform the combined access requests of all others delivers, up to a certain non-negligible concurrency level, better performance than the most effective parallel finely synchronized implementations. We use flat combining to devise, among other structures, new linearizable stack, queue, and priority queue algorithms that greatly outperform all prior algorithms.
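A minimal sketch of the flat-combining idea, for illustration only: the per-thread slot array, the POP sentinel, and the spin-then-tryLock loop below are simplifying assumptions, not the paper's publication-list construction.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative flat-combining stack: threads publish requests in
// per-thread slots; whichever thread acquires the lock becomes the
// combiner and serves everyone's requests against a plain sequential
// stack. Values are assumed non-null.
public class FlatCombiningStack<T> {
    private static final Object POP = new Object();      // sentinel marking a pop request
    private final Deque<T> stack = new ArrayDeque<>();   // the sequential data structure
    private final ReentrantLock lock = new ReentrantLock();
    private final AtomicReferenceArray<Object> requests; // one publication slot per thread
    private final AtomicReferenceArray<Object> results;

    public FlatCombiningStack(int maxThreads) {
        requests = new AtomicReferenceArray<>(maxThreads);
        results = new AtomicReferenceArray<>(maxThreads);
    }

    public void push(int threadId, T value) { submit(threadId, value); }

    @SuppressWarnings("unchecked")
    public T pop(int threadId) { return (T) submit(threadId, POP); }

    private Object submit(int threadId, Object request) {
        requests.set(threadId, request);
        while (requests.get(threadId) != null) { // spin until a combiner serves us
            if (lock.tryLock()) {                // no current owner: become the combiner
                try { combine(); } finally { lock.unlock(); }
            }
        }
        return results.get(threadId);
    }

    @SuppressWarnings("unchecked")
    private void combine() {                     // the lock holder serves all slots
        for (int i = 0; i < requests.length(); i++) {
            Object request = requests.get(i);
            if (request == null) continue;
            if (request == POP) results.set(i, stack.pollFirst());
            else { stack.addFirst((T) request); results.set(i, null); }
            requests.set(i, null);               // publish completion to the waiter
        }
    }
}
```

Under contention, most threads merely write a request and spin locally while a single combiner batches the work, which is the source of the low synchronization cost described above.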
Journal of Parallel and Distributed Computing | 2010
Danny Hendler; Nir Shavit; Lena Yerushalmi
The literature describes two high-performance concurrent stack algorithms based on combining funnels and elimination trees. Unfortunately, the funnels are linearizable but blocking, and the elimination trees are non-blocking but not linearizable. Neither is used in practice since they perform well only at exceptionally high loads. The literature also describes a simple lock-free linearizable stack algorithm that works at low loads but does not scale as the load increases. The question of designing a stack algorithm that is non-blocking, linearizable, and scales well throughout the concurrency range has thus remained open. This paper presents such a concurrent stack algorithm. It is based on the following simple observation: a single elimination array used as a backoff scheme for a simple lock-free stack is lock-free, linearizable, and scalable. As our empirical results show, the resulting elimination-backoff stack performs as well as the simple stack at low loads, and increasingly outperforms all other methods (lock-based and non-blocking) as concurrency increases. We believe its simplicity and scalability make it a viable practical alternative to existing constructions for implementing concurrent stacks.
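The observation translates directly into code. Below is a hedged Java sketch in which a single java.util.concurrent.Exchanger stands in for the paper's elimination array: a push meeting a pop in the exchanger cancels out without ever touching the stack's top pointer.

```java
import java.util.concurrent.Exchanger;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch: a Treiber-style lock-free stack that backs off
// into an elimination slot when its CAS fails. Values are assumed
// non-null; pops offer null so that a push/pop pair can recognize each
// other. A production version would restore the interrupt flag.
public class EliminationBackoffStack<T> {
    private static final class Node<T> {
        final T value; Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();
    private final Exchanger<T> slot = new Exchanger<>(); // single-slot "elimination array"

    public void push(T v) {
        Node<T> n = new Node<>(v);
        while (true) {
            Node<T> t = top.get();
            n.next = t;
            if (top.compareAndSet(t, n)) return;   // fast path succeeded
            try {                                  // contention: try to eliminate
                if (slot.exchange(v, 1, TimeUnit.MILLISECONDS) == null) return; // met a pop
            } catch (TimeoutException | InterruptedException ignored) { }
        }
    }

    public T pop() {
        while (true) {
            Node<T> t = top.get();
            if (t == null) return null;            // empty stack
            if (top.compareAndSet(t, t.next)) return t.value;
            try {                                  // contention: try to eliminate
                T w = slot.exchange(null, 1, TimeUnit.MILLISECONDS);
                if (w != null) return w;           // met a concurrent push
            } catch (TimeoutException | InterruptedException ignored) { }
        }
    }
}
```

When the CAS on top fails under load, threads divert to the exchanger, so traffic on the central pointer falls exactly when concurrency rises.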
symposium on principles of programming languages | 2011
Hagit Attiya; Rachid Guerraoui; Danny Hendler; Maged M. Michael; Martin T. Vechev
Building correct and efficient concurrent algorithms is known to be a difficult problem of fundamental importance. To achieve efficiency, designers try to remove unnecessary and costly synchronization. However, not only is this manual trial-and-error process ad hoc, time-consuming, and error-prone, but it often leaves designers pondering the question: is it inherently impossible to eliminate certain synchronization, or was I simply unable to eliminate it on this attempt and should keep trying? In this paper we answer this question. We prove that it is impossible to build concurrent implementations of classic and ubiquitous specifications such as sets, queues, stacks, mutual exclusion, and read-modify-write operations that completely eliminate the use of expensive synchronization. We prove that one cannot avoid the use of either: i) read-after-write (RAW), where a write to shared variable A is followed by a read of a different shared variable B without a write to B in between, or ii) atomic write-after-read (AWAR), where an atomic operation reads and then writes to shared locations. Unfortunately, enforcing RAW or AWAR is expensive on all current mainstream processors. To enforce RAW, memory-ordering instructions, also called fences or barriers, must be used. To enforce AWAR, atomic instructions such as compare-and-swap are required. However, these instructions are typically substantially slower than regular instructions. Although algorithm designers frequently struggle to avoid RAW and AWAR, their attempts are often futile. Our result characterizes the cases where avoiding RAW and AWAR is impossible. On the flip side, our result can be used to guide designers towards new algorithms where RAW and AWAR can be eliminated.
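As a concrete illustration of pattern i), consider the following Dekker-style Java fragment (an invented example, not taken from the paper): each thread writes its own flag and then reads the other's, and it is the volatile declarations that force the store-load fence whose cost the paper analyzes.

```java
// Illustration of the RAW pattern: a write to shared variable A
// followed by a read of a different shared variable B. Without an
// intervening store-load fence both reads may return false and both
// threads could enter the critical section; in Java, declaring the
// flags volatile supplies the (expensive) ordering.
public class RawExample {
    static volatile boolean flagA = false, flagB = false;

    static void threadA() {
        flagA = true;          // write to shared variable A
        if (!flagB) {          // read of a different shared variable B: RAW
            // critical section (exclusion relies on the fence between the two)
        }
    }

    static void threadB() {
        flagB = true;          // symmetric write
        if (!flagA) {          // symmetric RAW read
            // critical section
        }
    }
}
```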
principles of distributed computing | 2002
Danny Hendler; Nir Shavit
The non-blocking work-stealing algorithm of Arora et al. has been gaining popularity as the multiprocessor load-balancing technology of choice in both industry and academia. At its core is an ingenious scheme for stealing a single item in a non-blocking manner from an array-based deque. In recent years, several researchers have argued that stealing more than a single item at a time allows for increased stability, greater overall balance, and improved performance. This paper presents StealHalf, a new generalization of the Arora et al. algorithm that allows processes, instead of stealing one item, to steal up to half of the items in a given queue at a time. The new algorithm preserves the key properties of the Arora et al. algorithm: it is non-blocking, and it minimizes the number of CAS operations that the local process needs to perform. We provide analysis proving that the new algorithm achieves better load distribution: the expected load of any process throughout the execution is less than a constant away from the overall system average.
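A deliberately simplified Java sketch of the steal-half step follows; it omits the ABA tag packed with top and the owner's popBottom, both of which the real algorithm must coordinate with, so it shows only how a single CAS can claim a whole batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Much-simplified sketch of steal-half: a thief reads the deque
// bounds, copies up to half of the items, and claims them all with one
// CAS on top. Capacity overflow and the owner's pop end are unhandled.
public class StealHalfDeque<T> {
    private final Object[] items;
    private final AtomicInteger top = new AtomicInteger(0); // next index thieves may take
    private volatile int bottom = 0;                        // owner-only push end

    public StealHalfDeque(int capacity) { items = new Object[capacity]; }

    public void pushBottom(T v) {     // called only by the owner
        items[bottom] = v;
        bottom = bottom + 1;          // volatile write publishes the new item
    }

    @SuppressWarnings("unchecked")
    public List<T> stealHalf() {      // called by thieves
        int t = top.get();
        int b = bottom;
        int n = b - t;
        if (n <= 0) return List.of(); // nothing to steal
        int k = (n + 1) / 2;          // up to half of the items, at least one
        List<T> stolen = new ArrayList<>(k);
        for (int i = t; i < t + k; i++) stolen.add((T) items[i]);
        // one CAS claims the whole batch; on failure another thief won the race
        return top.compareAndSet(t, t + k) ? stolen : List.of();
    }
}
```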
Distributed Computing | 2006
Danny Hendler; Yossi Lev; Mark Moir; Nir Shavit
The non-blocking work-stealing algorithm of Arora, Blumofe, and Plaxton (henceforth ABP work-stealing) is on its way to becoming the multiprocessor load-balancing technology of choice in both industry and academia. This highly efficient scheme is based on a collection of array-based double-ended queues (deques) with low-cost synchronization among local and stealing processes. Unfortunately, the algorithm's synchronization protocol is strongly based on the use of fixed-size arrays, which are prone to overflows, especially in the multiprogrammed environments for which they are designed. This is a significant drawback since, apart from memory inefficiency, it means that the size of the deque must be tailored to accommodate the effects of the hard-to-predict level of multiprogramming, and the implementation must include an expensive and application-specific overflow mechanism. This paper presents the first dynamic-memory work-stealing algorithm. It is based on a novel way of building non-blocking dynamic-sized work-stealing deques that detects synchronization conflicts based on "pointer-crossing" rather than "gaps between indexes" as in the original ABP algorithm. As we show, the new algorithm dramatically increases robustness and memory efficiency, while causing applications no observable performance penalty. We therefore believe it can replace array-based ABP work-stealing deques, eliminating the need for application-specific overflow mechanisms.
Journal of the ACM | 2009
Hagit Attiya; Rachid Guerraoui; Danny Hendler
Obstruction-free implementations of concurrent objects are optimized for the common case where there is no step contention, and were recently advocated as a solution to the costs associated with synchronization without locks. In this article, we study this claim, which requires precisely defining the notions of obstruction-freedom and step contention. We consider several classes of obstruction-free implementations, present corresponding generic object implementations, and prove lower bounds on their complexity. Viewed collectively, our results establish that the worst-case operation time complexity of obstruction-free implementations is high, even in the absence of step contention. We also show that lock-based implementations are not subject to some of the time-complexity lower bounds we present.
IEEE Transactions on Computers | 2012
Anat Bremler-Barr; Danny Hendler
Ternary content-addressable memories (TCAMs) are increasingly used for high-speed packet classification. TCAMs compare packet headers against all rules in a classification database in parallel and thus provide high throughput unparalleled by software-based solutions. TCAMs are not well suited, however, for representing rules that contain range fields. Such rules have to be represented by multiple TCAM entries. The resulting range expansion can dramatically reduce TCAM utilization. The majority of real-life database ranges are short. We present a novel algorithm called short range gray encoding (SRGE) for the efficient representation of short range rules. SRGE encodes range borders as binary reflected gray codes and then represents the resulting range by a minimal set of ternary strings. SRGE is database independent and does not use extra TCAM bits. For the small number of ranges whose expansion is not significantly reduced by SRGE, we use dependent encoding that exploits the extra bits available on today's TCAMs. Our comparative analysis establishes that this hybrid scheme utilizes TCAM more efficiently than previously published solutions. The SRGE algorithm has a worst-case expansion ratio of 2W-4, where W is the range-field length. We prove that any TCAM encoding scheme has a worst-case expansion ratio of W or more.
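A small, hand-derived demo of the core observation behind SRGE: values adjacent across a power-of-two boundary share no binary prefix, yet their binary reflected gray codes differ in a single bit, so a short range can collapse into one ternary entry. The [7,8] cover below is the classic illustrative case, not the paper's general minimal-cover algorithm.

```java
// Demonstrates why Gray-coding range borders can shrink range
// expansion: binary 0111 and 1000 share no prefix (two prefix entries),
// while their Gray codes 0100 and 1100 differ only in the leading bit,
// so the single ternary string *100 covers the whole range [7,8].
public class GrayRangeDemo {
    static int gray(int b) { return b ^ (b >>> 1); } // binary reflected Gray code

    static String bits(int v, int width) {
        StringBuilder s = new StringBuilder();
        for (int i = width - 1; i >= 0; i--) s.append((v >> i) & 1);
        return s.toString();
    }

    public static void main(String[] args) {
        System.out.println("binary 7 = " + bits(7, 4) + ", binary 8 = " + bits(8, 4));
        System.out.println("gray   7 = " + bits(gray(7), 4) + ", gray   8 = " + bits(gray(8), 4));
        System.out.println("ternary cover of [7,8] in Gray space: *100");
    }
}
```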
international conference on trust management | 2008
Nurit Gal-Oz; Ehud Gudes; Danny Hendler
Virtual communities become more and more heterogeneous as their scale increases. This implies that, rather than being a single, homogeneous community, they become a collection of knots (or sub-communities) of users. For the computation of a member's reputation to be useful, the system must therefore identify the community knot to which this member belongs and interpret its reputation data correctly. Unfortunately, to the best of our knowledge, existing trust-based reputation models treat a community as a single entity and do not explicitly address this issue. In this paper, we introduce the knot-aware trust-based reputation model for large-scale virtual communities. We define a knot as a group of community members having overall "strong" trust relations between them. Different knots typically represent different viewpoints and preferences. It is therefore plausible that the reputations assigned to the same member by different knots may differ significantly. Using our knot-aware approach, we can deal with heterogeneous communities where a member's reputation may be distributed in a multi-modal manner. As we show, an interesting and beneficial feature of our knot-aware model is that it naturally prevents malicious attempts to bias community members' reputations.
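One purely illustrative reading of the knot idea in code: treat trust scores above a threshold as edges, take connected components as knots via union-find, and count only ratings that originate inside the target's own knot. The threshold and the plain averaging rule are assumptions made for this sketch; the paper's actual knot construction and reputation aggregation are more elaborate.

```java
// Hypothetical knot sketch: strong-trust edges define knots
// (connected components), and a member's reputation is computed only
// from raters in the same knot.
public class KnotSketch {
    private final int[] parent;

    public KnotSketch(int members) {
        parent = new int[members];
        for (int i = 0; i < members; i++) parent[i] = i;
    }

    private int find(int x) {        // union-find with path compression
        return parent[x] == x ? x : (parent[x] = find(parent[x]));
    }

    // Merge two members into one knot if their mutual trust is strong.
    public void addTrust(int a, int b, double trust, double threshold) {
        if (trust >= threshold) parent[find(a)] = find(b);
    }

    // Average only the ratings whose rater shares the target's knot.
    public double knotReputation(int target, int[][] ratings) {
        double sum = 0; int count = 0;
        for (int[] r : ratings) {    // r = {raterId, score}
            if (find(r[0]) == find(target)) { sum += r[1]; count++; }
        }
        return count == 0 ? 0.0 : sum / count;
    }
}
```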
international conference on computer communications | 2009
Anat Bremler-Barr; David Hay; Danny Hendler
Ternary content-addressable memories (TCAMs) are increasingly used for high-speed packet classification. TCAMs compare packet headers against all rules in a classification database in parallel and thus provide high throughput. TCAMs are not well suited, however, for representing rules that contain range fields, and prior-art algorithms typically represent each such rule by multiple TCAM entries. The resulting range expansion can dramatically reduce TCAM utilization because it introduces a large number of redundant TCAM entries. This redundancy can be mitigated by making use of extra bits available in each TCAM entry. We present a scheme for constructing efficient representations of range rules, based on the simple observation that sets of disjoint ranges may be encoded much more efficiently than sets of overlapping ranges. Since the ranges in real-world classification databases are, in general, non-disjoint, the algorithms we present split the ranges between multiple layers, each of which consists of mutually disjoint ranges. Each layer is then coded independently and assigned its own set of extra bits. Our layering algorithms are based on approximations for specific variants of interval-graph coloring. We evaluate these algorithms by performing extensive comparative analysis on real-life classification databases. Our analysis establishes that our algorithms reduce the number of redundant TCAM entries caused by range rules by more than 60% compared with the best prior-art range-encoding schemes.
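To make the layering step concrete, here is a plain greedy interval-coloring sketch (an illustration, not the paper's approximation algorithms): ranges sorted by left endpoint are placed into the first layer they do not overlap, so every layer ends up mutually disjoint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Greedy layering of possibly-overlapping ranges into layers of
// mutually disjoint ranges. Since ranges are processed in order of
// left endpoint, the last range placed in a layer always has that
// layer's maximum right endpoint, so one value per layer suffices.
public class RangeLayering {
    // Each range is {lo, hi}, inclusive. Returns one list of ranges per layer.
    public static List<List<int[]>> layer(int[][] ranges) {
        Arrays.sort(ranges, (a, b) -> Integer.compare(a[0], b[0]));
        List<List<int[]>> layers = new ArrayList<>();
        List<Integer> rightEnds = new ArrayList<>(); // max right endpoint per layer
        for (int[] r : ranges) {
            int placed = -1;
            for (int i = 0; i < layers.size(); i++) {
                if (rightEnds.get(i) < r[0]) { placed = i; break; } // disjoint: fits here
            }
            if (placed < 0) {                        // overlaps every layer: open a new one
                layers.add(new ArrayList<>());
                rightEnds.add(Integer.MIN_VALUE);
                placed = layers.size() - 1;
            }
            layers.get(placed).add(r);
            rightEnds.set(placed, r[1]);
        }
        return layers;
    }

    public static void main(String[] args) {
        int[][] ranges = { {0, 10}, {5, 7}, {11, 20}, {6, 25} };
        List<List<int[]>> layers = layer(ranges);
        for (int i = 0; i < layers.size(); i++) {
            StringBuilder line = new StringBuilder("layer " + i + ":");
            for (int[] r : layers.get(i)) line.append(" [").append(r[0]).append(",").append(r[1]).append("]");
            System.out.println(line);
        }
    }
}
```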
foundations of computer science | 2005
Faith E. Fich; Danny Hendler; Nir Shavit
This paper proves Ω(n) lower bounds on the time to perform a single instance of an operation in any implementation of a large class of data structures shared by n processes. For standard data structures such as counters, stacks, and queues, the bound is tight. The implementations considered may apply any deterministic primitives to a base object. No bounds are assumed on either the number of base objects or their size. Time is measured as the number of steps a process performs on base objects and the number of stalls it incurs as a result of contention with other processes.