Maged M. Michael | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Maged M. Michael is active.

Explore More

Publication

Featured researches published by Maged M. Michael.

principles of distributed computing | 1996

Simple, fast, and practical non-blocking and blocking concurrent queue algorithms

Maged M. Michael; Michael L. Scott

Drawing ideas from previous authors, we present a new non-blocking concurrent queue algorithm and a new two-lock queue algorithm in which one enqueue and one dequeue can proceed concurrently. Both algorithms are simple, fast, and practical; we were surprised not to find them in the literature. Experiments on a 12-node SGI Challenge multiprocessor indicate that the new non-blocking queue consistently outperforms the best known alternatives; it is the clear algorithm of choice for machines that provide a universal atomic primitive (e.g., compare_and_swap or load_linked/store_conditional). The two-lock concurrent queue outperforms a single lock when several processes are competing simultaneously for access; it appears to be the algorithm of choice for busy queues on machines with non-universal atomic primitives (e.g., test_and_set). Since much of the motivation for non-blocking algorithms is rooted in their immunity to large, unpredictable delays in process execution, we report experimental results both for systems with dedicated processors and for systems with several processes multiprogrammed on each processor.

IEEE Transactions on Parallel and Distributed Systems | 2004

Hazard pointers: safe memory reclamation for lock-free objects

Maged M. Michael

Lock-free objects offer significant performance and reliability advantages over conventional lock-based objects. However, the lack of an efficient portable lock-free method for the reclamation of the memory occupied by dynamic nodes removed from such objects is a major obstacle to their wide use in practice. We present hazard pointers, a memory management methodology that allows memory reclamation for arbitrary reuse. It is very efficient, as demonstrated by our experimental results. It is suitable for user-level applications - as well as system programs - without dependence on special kernel or scheduler support. It is wait-free. It requires only single-word reads and writes for memory access in its core operations. It allows reclaimed memory to be returned to the operating system. In addition, it offers a lock-free solution for the ABA problem using only practical single-word instructions. Our experimental results on a multiprocessor system show that the new methodology offers equal and, more often, significantly better performance than other memory management methods, in addition to its qualitative advantages regarding memory reclamation and independence of special hardware support. We also show that lock-free implementations of important object types, using hazard pointers, offer comparable performance to that of efficient lock-based implementations under no contention and no multiprogramming, and outperform them by significant margins under moderate multiprogramming and/or contention, in addition to guaranteeing continuous progress and availability, even in the presence of thread failures and arbitrary delays.

acm symposium on parallel algorithms and architectures | 2002

High performance dynamic lock-free hash tables and list-based sets

Maged M. Michael

Lock-free (non-blocking) shared data structures promise more robust performance and reliability than conventional lock-based implementations. However, all prior lock-free algorithms for sets and hash tables suffer from serious drawbacks that prevent or limit their use in practice. These drawbacks include size inflexibility, dependence on atomic primitives not supported on any current processor architecture, and dependence on highly-inefficient or blocking memory management techniques.Building on the results of prior researchers, this paper presents the first CAS-based lock-free list-based set algorithm that is compatible with all lock-free memory management methods. We use it as a building block of an algorithm for lock-free hash tables. In addition to being lock-free, the new algorithm is dynamic, linearizable, and space-efficient.Our experimental results show that the new algorithm outperforms the best known lock-free as well as lock-based hash table implementations by significant margins, and indicate that it is the algorithm of choice for implementing shared hash tables.

Journal of Parallel and Distributed Computing | 1998

Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors

Maged M. Michael; Michael L. Scott

Most multiprocessors are multiprogrammed to achieve acceptable response time and to increase their utilization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two principal strategies for a concurrent, atomic update of shared data structures: (1)preemption-safe lockingand (2)nonblocking(lock-free)algorithms. Preemption-safe locking requires kernel support. Nonblocking algorithms generally require a universal atomic primitive such ascompare-and-swaporload-linked/store-conditionaland are widely regarded as inefficient. We evaluate the performance of preemption-safe lock-based and nonblocking implementations of important data structures?queues, stacks, heaps, and counters?including nonblocking and lock-based queue algorithms of our own, in microbenchmarks and real applications on a 12-processor SGI Challenge multiprocessor. Our results indicate that our nonblocking queue consistently outperforms the best known alternatives and that data-structure-specific nonblocking algorithms, which exist for queues, stacks, and counters, can work extremely well. Not only do they outperform preemption-safe lock-based algorithms on multiprogrammed machines, they also outperform ordinary locks on dedicated machines. At the same time, since general-purpose nonblocking techniques do not yet appear to be practical, preemption-safe locks remain the preferred alternative for complex data structures: they outperform conventional locks by significant margins on multiprogrammed systems.

programming language design and implementation | 2004

Scalable lock-free dynamic memory allocation

Maged M. Michael

Dynamic memory allocators (malloc/free) rely on mutual exclusion locks for protecting the consistency of their shared data structures under multithreading. The use of locking has many disadvantages with respect to performance, availability, robustness, and programming flexibility. A lock-free memory allocator guarantees progress regardless of whether some threads are delayed or even killed and regardless of scheduling policies. This paper presents a completely lock-free memory allocator. It uses only widely-available operating system support and hardware atomic instructions. It offers guaranteed availability even under arbitrary thread termination and crash-failure, and it is immune to deadlock regardless of scheduling policies, and hence it can be used even in interrupt handlers and real-time applications without requiring special scheduler support. Also, by leveraging some high-level structures from Hoard, our allocator is highly scalable, limits space blowup to a constant factor, and is capable of avoiding false sharing. In addition, our allocator allows finer concurrency and much lower latency than Hoard. We use PowerPC shared memory multiprocessor systems to compare the performance of our allocator with the default AIX 5.1 libc malloc, and two widely-used multithread allocators, Hoard and Ptmalloc. Our allocator outperforms the other allocators in virtually all cases and often by substantial margins, under various levels of parallelism and allocation patterns. Furthermore, our allocator also offers the lowest contention-free latency among the allocators by significant margins.

principles of distributed computing | 2002

Safe memory reclamation for dynamic lock-free objects using atomic reads and writes

Maged M. Michael

A major obstacle to the wide use of lock-free data structures, despite their many performance and reliability advantages, is the absence of a practical lock-free method for reclaiming the memory of dynamic nodes removed from dynamic lock-free objects for arbitrary reuse.The only prior lock-free memory reclamation method depends on the DCAS atomic primitive, which is not supported on any current processor architecture. Other memory management methods are blocking, require special operating system support, or do not allow arbitrary memory reuse.This paper presents the first lock-free memory management method for dynamic lock-free objects that allows arbitrary memory reuse, and does not require special operating system or hardware support. It guarantees an upper bound on the number of removed nodes not yet freed at any time, regardless of thread failures and delays. Furthermore, it is wait-free, it is only logarithmically contention-sensitive, and it uses only atomic reads and writes for its operations. In addition, it can be used to prevent the ABA problem for pointers to dynamic nodes in most algorithms, without requiring extra space per pointer or per node.

international conference on parallel architectures and compilation techniques | 2012

Evaluation of Blue Gene/Q hardware support for transactional memories

Amy Wang; Matthew Gaudet; Peng Wu; José Nelson Amaral; Martin Ohmacht; Christopher Barton; Raul Esteban Silvera; Maged M. Michael

This paper describes an end-to-end system implementation of the transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q (BG/Q) machine. The TM programming model supports most C/C++ programming constructs on top of a best-effort HTM with the help of a complete software stack including the compiler, the kernel, and the TM runtime. An extensive evaluation of the STAMP benchmarks on BG/Q is the first of its kind in understanding characteristics of running coarse-grained TM workloads on HTMs. The study reveals several interesting insights on the overhead and the scalability of BG/Q HTM with respect to sequential execution, coarse-grain locking, and software TM.

acm symposium on parallel algorithms and architectures | 2008

RingSTM: scalable transactions with a single atomic instruction

Michael F. Spear; Maged M. Michael; Christoph von Praun

Existing Software Transactional Memory (STM) designs attach metadata to ranges of shared memory; subsequent runtime instructions read and update this metadata in order to ensure that an in-flight transactions reads and writes remain correct. The overhead of metadata manipulation and inspection is linear in the number of reads and writes performed by a transaction, and involves expensive read-modify-write instructions, resulting in substantial overheads. We consider a novel approach to STM, in which transactions represent their read and write sets as Bloom filters, and transactions commit by enqueuing a Bloom filter onto a global list. Using this approach, our RingSTM system requires at most one read-modify-write operation for any transaction, and incurs validation overhead linear not in transaction size, but in the number of concurrent writers who commit. Furthermore, RingSTM is the first STM that is inherently livelock-free and privatization-safe while at the same time permitting parallel writeback by concurrent disjoint transactions.We evaluate three variants of the RingSTM algorithm, and find that it offers superior performance and/or stronger semantics than the state-of-the-art TL2 algorithm under a number of workloads.

international parallel and distributed processing symposium | 2007

Scale-up x Scale-out: A Case Study using Nutch/Lucene

Maged M. Michael; José E. Moreira; Doron Shiloach; Robert W. Wisniewski

Scale-up solutions in the form of large SMPs have represented the mainstream of commercial computing for the past several years. The major server vendors continue to provide increasingly larger and more powerful machines. More recently, scale-out solutions, in the form of clusters of smaller machines, have gained increased acceptance for commercial computing. Scale-out solutions are particularly effective in high-throughput Web-centric applications. In this paper, we investigate the behavior of two competing approaches to parallelism, scale-up and scale-out, in an emerging search application. Our conclusions show that a scale-out strategy can be the key to good performance even on a scale-up machine. Furthermore, scale-out solutions offer better price/performance, although at an increase in management complexity.

acm sigplan symposium on principles and practice of parallel programming | 2009

Idempotent work stealing

Maged M. Michael; Martin T. Vechev; Vijay A. Saraswat

Load balancing is a technique which allows efficient parallelization of irregular workloads, and a key component of many applications and parallelizing runtimes. Work-stealing is a popular technique for implementing load balancing, where each parallel thread maintains its own work set of items and occasionally steals items from the sets of other threads. The conventional semantics of work stealing guarantee that each inserted task is eventually extracted exactly once. However, correctness of a wide class of applications allows for relaxed semantics, because either: i) the application already explicitly checks that no work is repeated or ii) the application can tolerate repeated work. In this paper, we introduce idempotent work tealing, and present several new algorithms that exploit the relaxed semantics to deliver better performance. The semantics of the new algorithms guarantee that each inserted task is eventually extracted at least once-instead of exactly once. On mainstream processors, algorithms for conventional work stealing require special atomic instructions or store-load memory ordering fence instructions in the owners critical path operations. In general, these instructions are substantially slower than regular memory access instructions. By exploiting the relaxed semantics, our algorithms avoid these instructions in the owners operations. We evaluated our algorithms using common graph problems and micro-benchmarks and compared them to well-known conventional work stealing algorithms, the THE Cilk and Chase-Lev algorithms. We found that our best algorithm (with LIFO extraction) outperforms existing algorithms in nearly all cases, and often by significant margins.

Explore More