Paul E. McKenney
IBM
Publications
Featured research published by Paul E. McKenney.
IEEE Transactions on Parallel and Distributed Systems | 2012
Mathieu Desnoyers; Paul E. McKenney; Alan S. Stern; Michel Dagenais; Jonathan Walpole
Read-copy update (RCU) is a synchronization technique that often replaces reader-writer locking because RCU's read-side primitives are both wait-free and an order of magnitude faster than uncontended locking. Although RCU updates are relatively heavyweight, the importance of read-side performance is increasing as computing systems become more responsive to changes in their environments. RCU is heavily used in several kernel-level environments. Unfortunately, kernel-level implementations use facilities that are often unavailable to user applications. The few prior user-level RCU implementations either provided inefficient read-side primitives or restricted the application architecture. This paper fills this gap by describing efficient and flexible RCU implementations based on primitives commonly available to user-level applications. Finally, this paper compares these RCU implementations with each other and with standard locking, which enables choosing the best mechanism for a given workload. This work opens the door to widespread user-application use of RCU.
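For illustration, here is a minimal sketch of the user-level read/update pattern the paper targets, written against the liburcu (userspace RCU) API; the data structure and function names are hypothetical and error handling is omitted.

#include <urcu.h>        /* userspace RCU: rcu_read_lock(), synchronize_rcu(), ... */
#include <stdlib.h>

struct config { int threshold; };
static struct config *global_config;    /* RCU-protected pointer, assumed initialized */

int read_threshold(void)                /* wait-free read-side critical section */
{
        int t;

        rcu_read_lock();
        t = rcu_dereference(global_config)->threshold;
        rcu_read_unlock();
        return t;
}

void set_threshold(int t)               /* heavier-weight update; single updater assumed */
{
        struct config *newc = malloc(sizeof(*newc));
        struct config *oldc = global_config;

        newc->threshold = t;
        rcu_assign_pointer(global_config, newc);
        synchronize_rcu();              /* wait for pre-existing readers to finish */
        free(oldc);                     /* old version can now be reclaimed safely */
}

In liburcu, each reader thread must also call rcu_register_thread() before its first read-side critical section, and concurrent updaters would need additional serialization.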
Journal of Parallel and Distributed Computing | 2007
Thomas E. Hart; Paul E. McKenney; Angela Demke Brown; Jonathan Walpole
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims elements once they are no longer in use. The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes (quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation) using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect the memory reclamation performance. We discuss the consequences of our results for programmers and algorithm designers. Finally, we describe the use of one scheme, quiescent-state-based reclamation, in the context of an OS kernel-an execution environment which is well suited to this scheme.
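As a rough illustration of quiescent-state-based reclamation, here is a simplified sketch using C11 atomics; it assumes a fixed set of reader threads, ignores thread creation and exit, and is not the implementation evaluated in the paper.

#include <stdatomic.h>

#define NTHREADS 4                      /* hypothetical fixed reader-thread count */

static _Atomic unsigned long qs_count[NTHREADS];   /* per-thread quiescent-state counters */

/* Each reader thread calls this when it holds no references to shared nodes,
 * for example between lockless operations. */
void quiescent_state(int tid)
{
        atomic_fetch_add_explicit(&qs_count[tid], 1, memory_order_release);
}

/* Called by the reclaimer after unlinking nodes: once every thread has passed
 * through a quiescent state, no reader can still hold a reference, so nodes
 * removed before this call may be freed. */
void wait_for_grace_period(void)
{
        unsigned long snap[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
                snap[i] = atomic_load_explicit(&qs_count[i], memory_order_acquire);
        for (int i = 0; i < NTHREADS; i++)
                while (atomic_load_explicit(&qs_count[i], memory_order_acquire) == snap[i])
                        ;               /* real code would yield/sleep and skip offline threads */
}

Epoch-based reclamation amortizes a similar grace-period check across epochs, while hazard pointers instead have readers publish the specific nodes they are accessing.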
ACM Special Interest Group on Data Communication | 1992
Paul E. McKenney; Ken F. Dove
When a transport protocol segment arrives at a receiving system, the receiving system must determine which application is to receive the protocol segment. This decision is typically made by looking up a protocol control block (PCB) for the segment, based on information in the segment's header. PCB lookup (a form of demultiplexing) is typically one of the more expensive operations in handling inbound protocol segments [Fel90]. Many recent protocol optimizations for the Transmission Control Protocol (TCP) [Jac88] assume that a large component of TCP traffic is bulk-data transfers, which result in packet trains [JR86]. If packet trains are prevalent, there is a high likelihood that the next TCP segment is en route to the same application (i.e., uses the same PCB) as the previous TCP segment. In these environments a very simple one-PCB cache like those used in BSD systems yields very high cache hit rates. However, there are classes of applications that do not form packet trains, and these applications do not perform well with a one-PCB cache. Examples of such applications are quite common in the area of heads-down data entry into on-line transaction-processing (OLTP) systems. OLTP systems make heavy use of computer communications networks and have large aggregate packet rates, but are also characterized by large numbers of connections, low per-connection packet rates, and rather small packets. This combination of characteristics results in a very low incidence of packet trains. This paper uses a simple analytic approach to examine how different PCB lookup schemes perform with OLTP traffic. One scheme is shown to work an order of magnitude better for OLTP traffic than the one-PCB cache approach while still maintaining good performance for packet-train traffic.
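For illustration only, here is a sketch of hash-based PCB lookup fronted by a BSD-style one-PCB cache (not the paper's exact scheme); struct pcb and its field names are hypothetical.

#include <stdint.h>
#include <stddef.h>

#define PCB_HASH_SIZE 256   /* power of two for cheap masking */

struct pcb {
        uint32_t raddr;          /* remote IP address */
        uint16_t rport, lport;   /* remote and local TCP ports */
        struct pcb *next;        /* hash-chain link */
        /* ... connection state ... */
};

static struct pcb *pcb_hash[PCB_HASH_SIZE];
static struct pcb *pcb_last;     /* one-PCB cache, as in BSD */

static unsigned pcb_hashfn(uint32_t raddr, uint16_t rport, uint16_t lport)
{
        return (raddr ^ rport ^ lport) & (PCB_HASH_SIZE - 1);
}

/* Demultiplex an inbound segment to its connection.  The one-PCB cache pays
 * off only when segments arrive in trains; the hash lookup keeps the cost low
 * even for OLTP-style traffic with many low-rate connections. */
struct pcb *pcb_lookup(uint32_t raddr, uint16_t rport, uint16_t lport)
{
        struct pcb *p = pcb_last;

        if (p && p->raddr == raddr && p->rport == rport && p->lport == lport)
                return p;                        /* cache hit: packet train */

        for (p = pcb_hash[pcb_hashfn(raddr, rport, lport)]; p; p = p->next)
                if (p->raddr == raddr && p->rport == rport && p->lport == lport) {
                        pcb_last = p;            /* refresh the one-PCB cache */
                        return p;
                }
        return NULL;                             /* no matching connection */
}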
IBM Systems Journal | 2008
Dinakar Guniguntala; Paul E. McKenney; Josh Triplett; Jonathan Walpole
Read-copy update (RCU) is a synchronization mechanism in the Linux™ kernel that provides significant improvements in multiprocessor scalability by eliminating the writer-delay problem of reader-writer locking. RCU implementations to date, however, have had the side effect of expanding non-preemptible regions of code, thereby degrading real-time response. We present here a variant of RCU that allows preemption of read-side critical sections and thus is better suited for real-time applications. We summarize priority-inversion issues with locking, present an overview of the RCU mechanism, discuss our counter-based adaptation of RCU for real-time use, describe an additional adaptation of RCU that permits general blocking in read-side critical sections, and present performance results. We also discuss an approach for replacing reader-writer synchronization with RCU in existing implementations.
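A minimal sketch, in Linux-kernel style, of the kind of reader-writer-lock-to-RCU conversion discussed in the paper; struct item, item_list, and item_lock are hypothetical.

#include <linux/rculist.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/types.h>

struct item {
        int key;
        struct list_head link;
};

static LIST_HEAD(item_list);
static DEFINE_SPINLOCK(item_lock);      /* serializes updaters only */

/* Reader: read-side critical section; readers never delay writers and are
 * not delayed by them. */
bool item_exists(int key)
{
        struct item *it;
        bool found = false;

        rcu_read_lock();
        list_for_each_entry_rcu(it, &item_list, link)
                if (it->key == key) {
                        found = true;
                        break;
                }
        rcu_read_unlock();
        return found;
}

/* Updater: unlink the element, wait for a grace period, then free it. */
void item_delete(struct item *it)
{
        spin_lock(&item_lock);
        list_del_rcu(&it->link);
        spin_unlock(&item_lock);
        synchronize_rcu();              /* wait for pre-existing readers */
        kfree(it);
}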
Operating Systems Review | 2010
Josh Triplett; Paul E. McKenney; Jonathan Walpole
This paper presents a novel concurrent hash table implementation which supports wait-free, near-linearly scalable lookup, even in the presence of concurrent modifications. In particular, this hash table implementation supports concurrent moves of hash table elements between buckets, for purposes such as renames. Implementation of this algorithm in the Linux kernel demonstrates its performance and scalability. Benchmarks on a 64-way POWER system showed a 6x scalability improvement versus fine-grained locking, and a 1.5x improvement versus the current state of the art in Linux. To achieve these scalability improvements, the hash table implementation uses a new concurrent programming technique known as relativistic programming. This approach uses a copy-based update strategy to allow readers and writers to run concurrently without conflicts, avoiding many of the non-scalable costs of synchronization, inter-processor communication, and cache coherence. New techniques such as the proposed hash-table move algorithm allow readers to tolerate the resulting weak memory-ordering behavior that arises from allowing one version of a structure to be read concurrently with updates to a different version of the same structure. Relativistic programming techniques provide performance and scalability advantages over traditional synchronization, as demonstrated through several benchmarks.
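For context, here is a lookup-side sketch of an RCU-protected hash table in Linux-kernel style; the paper's key contribution, the algorithm for moving an element between buckets while lookups run concurrently, is not reproduced here, and struct entry and the table size are illustrative.

#include <linux/rculist.h>
#include <linux/hash.h>
#include <linux/types.h>

#define TBL_BITS 10

struct entry {
        unsigned long key;
        struct hlist_node node;
};

static struct hlist_head table[1 << TBL_BITS];

/* Wait-free lookup: no locks taken and no writes to shared memory. */
bool table_contains(unsigned long key)
{
        struct entry *e;
        bool found = false;

        rcu_read_lock();
        hlist_for_each_entry_rcu(e, &table[hash_long(key, TBL_BITS)], node)
                if (e->key == key) {
                        found = true;
                        break;
                }
        rcu_read_unlock();
        return found;
}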
International Parallel and Distributed Processing Symposium | 2006
Thomas E. Hart; Paul E. McKenney; Angela Demke Brown
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims nodes once they are no longer in use. The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes (quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation) using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect memory reclamation performance.
Operating Systems Review | 2008
Paul E. McKenney; Jonathan Walpole
There can be no doubt that a great many technologies have been added to Linux™ over the past ten years. What is less well-known is that it is often necessary to introduce a large amount of Linux into a given technology in order to successfully introduce that technology into Linux. This paper illustrates such an introduction of Linux into technology with Read-Copy Update (RCU). The evolution of the RCU API over time clearly shows that Linux's extremely diverse set of workloads and platforms has changed RCU to a far greater degree than RCU has changed Linux, and it is reasonable to expect that other technologies that might be proposed for inclusion into Linux would face similar challenges. In addition, this paper presents a summary of lessons learned and an attempt to foresee what additional challenges Linux might present to RCU.
Programming Languages and Operating Systems | 2007
Paul E. McKenney; Maged M. Michael; Jonathan Walpole
The advent of multi-core and multi-threaded processor architectures highlights the need to address the well-known shortcomings of the ubiquitous lock-based synchronization mechanisms. The emerging transactional-memory synchronization mechanism is viewed as a promising alternative to locking for high-concurrency environments, including operating systems. This paper presents a constructive critique of locking and transactional memory: their strengths, weaknesses, and challenges.
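As a concrete point of comparison, here is the same trivial update expressed with a lock and with GCC's experimental transactional-memory extension; this is only a sketch (compile with -fgnu-tm), and account_t and the variable names are hypothetical.

#include <pthread.h>

typedef struct { long balance; } account_t;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Lock-based: simple and predictable, but serializes all updaters and can
 * deadlock or invert priorities when composed with other locks. */
void deposit_locked(account_t *a, long amount)
{
        pthread_mutex_lock(&lock);
        a->balance += amount;
        pthread_mutex_unlock(&lock);
}

/* Transactional: conflicting transactions are detected and retried by the TM
 * runtime, at the cost of its bookkeeping overhead. */
void deposit_tm(account_t *a, long amount)
{
        __transaction_atomic {
                a->balance += amount;
        }
}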
Euromicro Conference on Real-Time Systems | 2012
Thomas Gleixner; Paul E. McKenney; Vincent Guittot
Linux's CPU-hotplug facility was originally designed to allow failing hardware to be removed from a running system. Hardware fails quite infrequently, so CPU-hotplug performance (much less real-time response) was not a major consideration. However, CPU hotplug is now used for energy management and (believe it or not!) real-time response, both of which have unsurprisingly exposed some shortcomings in CPU hotplug. This document reviews a number of these shortcomings, and then proposes an alternative CPU-hotplug approach that we believe will address these shortcomings.
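For reference, a sketch of how the CPU-hotplug facility is driven from user space through sysfs; the CPU number is only an example, error handling is minimal, and root privileges are assumed.

#include <stdio.h>

/* Write 0 or 1 to /sys/devices/system/cpu/cpuN/online to offline or online
 * the given CPU. */
static int set_cpu_online(int cpu, int online)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", online);
        return fclose(f);
}

int main(void)
{
        set_cpu_online(3, 0);   /* hot-remove CPU 3, e.g. for energy management */
        set_cpu_online(3, 1);   /* bring it back online */
        return 0;
}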
IBM Systems Journal | 2008
Robert Francis Berry; Paul E. McKenney; Francis Nicholas Parr
This paper introduces responsive systems: systems that are real-time, event-based, or time-dependent. There are a number of trends that are accelerating the adoption of responsive systems: timeliness requirements for business information systems are becoming more prevalent, embedded systems are increasingly integrated into soft real-time command-and-control systems, improved message-oriented middleware is facilitating growth in event-processing applications, and advances in service-oriented and component-based techniques are lowering the costs of developing and deploying responsive applications. The use of responsive systems is illustrated here in two application areas: the defense industry and online gaming. The papers in this special issue of the IBM Systems Journal are then introduced. The paper concludes with a discussion of the key remaining challenges in this area and ideas for further work.