Austen McDonald | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Austen McDonald is active.

Explore More

Publication

Featured researches published by Austen McDonald.

international symposium on computer architecture | 2007

An effective hybrid transactional memory system with strong isolation guarantees

Chi Cao Minh; Martin Trautmann; JaeWoong Chung; Austen McDonald; Nathan Grasso Bronson; Jared Casper; Christos Kozyrakis; Kunle Olukotun

We propose signature-accelerated transactional memory (SigTM), ahybrid TM system that reduces the overhead of software transactions. SigTM uses hardware signatures to track the read-set and write-set forpending transactions and perform conflict detection between concurrent threads. All other transactional functionality, including dataversioning, is implemented in software. Unlike previously proposed hybrid TM systems, SigTM requires no modifications to the hardware caches, which reduces hardware cost and simplifies support for nested transactions and multithreaded processor cores. SigTM is also the first hybrid TM system to provide strong isolation guarantees between transactional blocks and non-transactional accesses without additional read and write barriers in non-transactional code. Using a set of parallel programs that make frequent use of coarse-grain transactions, we show that SigTM accelerates software transactions by 30% to 280%. For certain workloads, SigTM can match the performance of a full-featured hardware TM system, while for workloads with large read-sets it can be up to two times slower. Overall, we show that SigTM combines the performance characteristics and strong isolation guarantees of hardware TM implementations with the low cost and flexibility of software TM systems.

international symposium on computer architecture | 2006

Architectural Semantics for Practical Transactional Memory

Austen McDonald; JaeWoong Chung; Brian D. Carlstrom; Chi Cao Minh; Hassan Chafi; Christos Kozyrakis; Kunle Olukotun

Transactional Memory (TM) simplifies parallel programming by allowing for parallel execution of atomic tasks. Thus far, TM systems have focused on implementing transactional state buffering and conflict resolution. Missing is a robust hardware/software interface, not limited to simplistic instructions defining transaction boundaries. Without rich semantics, current TM systems cannot support basic features of modern programming languages and operating systems such as transparent library calls, conditional synchronization, system calls, I/O, and runtime exceptions. This paper presents a comprehensive instruction set architecture (ISA) for TM systems. Our proposal introduces three key mechanisms: two-phase commit; support for software handlers on commit, violation, and abort; and full support for open- and closed-nested transactions with independent rollback. These mechanisms provide a flexible interface to implement programming language and operating system functionality. We also show that these mechanisms are practical to implement at the ISA and microarchitecture level for various TM systems. Using an execution-driven simulation, we demonstrate both the functionality (e.g., I/O and conditional scheduling within transactions) and performance potential (2.2� improvement for SPECjbb2000) of the proposed mechanisms. Overall, this paper establishes a rich and efficient interface to foster both hardware and software research on transactional memory.

programming language design and implementation | 2006

The Atomos transactional programming language

Brian D. Carlstrom; Austen McDonald; Hassan Chafi; JaeWoong Chung; Chi Cao Minh; Christoforos E. Kozyrakis; Kunle Olukotun

Atomos is the first programming language with implicit transactions, strong atomicity, and a scalable multiprocessor implementation. Atomos is derived from Java, but replaces its synchronization and conditional waiting constructs with simpler transactional alternatives.The Atomos watch statement allows programmers to specify fine-grained watch sets used with the Atomos retry conditional waiting statement for efficient transactional conflict-driven wakeup even in transactional memory systems with a limited number of transactional contexts. Atomos supports open-nested transactions, which are necessary for building both scalable application programs and virtual machine implementations.The implementation of the Atomos scheduler demonstrates the use of open nesting within the virtual machine and introduces the concept of transactional memory violation handlers that allow programs to recover from data dependency violations without rolling back.Atomos programming examples are given to demonstrate the usefulness of transactional programming primitives. Atomos and Java are compared through the use of several benchmarks. The results demonstrate both the improvements in parallel programming ease and parallel program performance provided by Atomos.

high-performance computer architecture | 2007

A Scalable, Non-blocking Approach to Transactional Memory

Hassan Chafi; Jared Casper; Brian D. Carlstrom; Austen McDonald; Chi Cao Minh; Woongki Baek; Christos Kozyrakis; Kunle Olukotun

Transactional memory (TM) provides mechanisms that promise to simplify parallel programming by eliminating the need for locks and their associated problems (deadlock, livelock, priority inversion, convoying). For TM to be adopted in the long term, not only does it need to deliver on these promises, but it needs to scale to a high number of processors. To date, proposals for scalable TM have relegated livelock issues to user-level contention managers. This paper presents the first scalable TM implementation for directory-based distributed shared memory systems that is livelock free without the need for user-level intervention. The design is a scalable implementation of optimistic concurrency control that supports parallel commits with a two-phase commit protocol, uses write-back caches, and filters coherence messages. The scalable design is based on transactional coherence and consistency (TCC), which supports continuous transactions and fault isolation. A performance evaluation of the design using both scientific and enterprise benchmarks demonstrates that the directory-based TCC design scales efficiently for NUMA systems up to 64 processors

high-performance computer architecture | 2006

The common case transactional behavior of multithreaded programs

JaeWoong Chung; Hassan Chafi; Chi Cao Minh; Austen McDonald; Brian D. Carlstrom; Christos Kozyrakis; Kunle Olukotun

Transactional memory (TM) provides an easy-to-use and high-performance parallel programming model for the upcoming chip-multiprocessor systems. Several researchers have proposed alternative hardware and software TM implementations. However, the lack of transaction-based programs makes it difficult to understand the merits of each proposal and to tune future TM implementations to the common case behavior of real application. This work addresses this problem by analyzing the common case transactional behavior for 35 multithreaded programs from a wide range of application domains. We identify transactions within the source code by mapping existing primitives for parallelism and synchronization management to transaction boundaries. The analysis covers basic characteristics such as transaction length, distribution of read-set and write-set size, and the frequency of nesting and I/O operations. The measured characteristics provide key insights into the design of efficient TM systems for both non-blocking synchronization and speculative parallelization.

architectural support for programming languages and operating systems | 2006

Tradeoffs in transactional memory virtualization

JaeWoong Chung; Chi Cao Minh; Austen McDonald; Travis Skare; Hassan Chafi; Brian D. Carlstrom; Christos Kozyrakis; Kunle Olukotun

For transactional memory (TM) to achieve widespread acceptance, transactions should not be limited to the physical resources of any specific hardware implementation. TM systems should guarantee correct execution even when transactions exceed scheduling quanta, overflow the capacity of hardware caches and physical memory, or include more independent nesting levels than what is supported in hardware. Existing proposals for TM virtualization are either incomplete or rely on complex hardware implementations, which are an overkill if virtualization is invoked infrequently in the common case.We present eXtended Transactional Memory (XTM), the first TM virtualization system that virtualizes all aspects of transactional execution (time, space, and nesting depth). XTM is implemented in software using virtual memory support. It operates at page granularity, using private copies of overflowed pages to buffer memory updates until the transaction commits and snapshots of pages to detect interference between transactions. We also describe two enhancements to XTM that use limited hardware support to address key performance bottlenecks.We compare XTM to hardwarebased virtualization using both real applications and synthetic microbenchmarks. We show that despite being software-based, XTM and its enhancements are competitive with hardware-based alternatives. Overall, we demonstrate that XTM provides a complete, flexible, and low-cost mechanism for practical TM virtualization.

international conference on parallel architectures and compilation techniques | 2005

Characterization of TCC on chip-multiprocessors

Austen McDonald; JaeWoong Chung; Hassan Chafi; Chi Cao Minh; Brian D. Carlstrom; Lance Hammond; Christos Kozyrakis; Kunle Olukotun

Transactional coherence and consistency (TCC) is a novel coherence scheme for shared memory multiprocessors that uses programmer-defined transactions as the fundamental unit of parallel work, synchronization, coherence, and consistency. TCC has the potential to simplify parallel program development and optimization by providing a smooth transition from sequential to parallel programs. In this paper, we study the implementation of TCC on chip-multiprocessors (CMPs). We explore design alternatives such as the granularity of state tracking, double-buffering, and write-update and write-invalidate protocols. Furthermore, we characterize the performance of TCC in comparison to conventional snoopy cache coherence (SCC) using parallel applications optimized for each scheme. We conclude that the two coherence schemes perform similarly, with each scheme having a slight advantage for some applications. The bandwidth requirements of TCC are slightly higher but well within the capabilities of CMP systems. Also, we find that overflow of speculative state can be effectively handled by a simple victim cache. Our results suggest TCC can provide its programming advantages without compromising the performance expected from well-tuned parallel applications.

acm sigplan symposium on principles and practice of parallel programming | 2007

Transactional collection classes

Brian D. Carlstrom; Austen McDonald; Michael Carbin; Christos Kozyrakis; Kunle Olukotun

While parallel programmers find it easier to reason about large atomic regions, the conventional mutual exclusion-based primitives for synchronization force them to interleave many small operations to achieve performance. Transactional memory promises that programmer scan use large atomic regions while achieving similar performance. However, these large transactions can conflict when operating on shared data structures, even for logically independent operations. Transactional collection classes address this problem by allowing long-running transactions to operate on shared data while eliminating unnecessary conflicts. Transactional collection classes wrap existing data structures, without the need for custom implementations or knowledge of data structure internals. Without transactional collection classes, access to shared datafrom within long-running transactions can suffer from data dependency conflicts that are logically unnecessary, but are artifacts of the data structure implementation such as hash table collisions or tree-balancing rotations. Our transactional collection classes use the concept of semantic concurrency control to eliminate these unnecessary data dependencies, replacing them with conflict detection based on the operations of the abstract data type. The design and behavior of these transactional collection classes is discussed with reference to the related work from the database community such as multi-level transactions and semantic concurrency control, as well as other concurrent data structures such as java.util.concurrent. The required transactional semantics needed for implementing transactional collection are enumerated, including open-nested transactions and commit and abort handlers. We also discuss how isolation can be reduced for greater concurrency. Finally, we provide guidelines on the construction of classes that preserve isolation and serializability. The performance of these classes is evaluated with a number of benchmarks including targeted micro-benchmarks and a version of SPECjbb2000 with increased contention. The results show that easier-to-use long transactions can still allow programs to deliver scalable performance by simply wrapping existing data structures with transactional collection classes.

international conference on parallel architectures and compilation techniques | 2006

Testing implementations of transactional memory

Chaiyasit Manovit; Sudheendra Hangal; Hassan Chafi; Austen McDonald; Christos Kozyrakis; Kunle Olukotun

Transactional memory is an attractive design concept for scalable multiprocessors because it offers efficient lock-free synchronization and greatly simplifies parallel software. Given the subtle issues involved with concurrency and atomicity, however, it is important that transactional memory systems be carefully designed and aggressively tested to ensure their correctness. In this paper, we propose an axiomatic framework to model the formal specification of a realistic transactional memory system which may contain a mix of transactional and non-transactional operations. Using this framework and extensions to analysis algorithms originally developed for checking traditional memory consistency, we show that the widely practiced pseudo-random testing methodology can be effectively applied to transactional memory systems. Our testing methodology was successful in finding previously unknown bugs in the implementation of TCC, a transactional memory system. We study two flavors of the underlying analysis algorithm, one incomplete and the other complete, and show that the complete algorithm while being theoretically intractable is very efficient in practice.

international conference on supercomputing | 2005

TAPE: a transactional application profiling environment

Hassan Chafi; Chi Cao Minh; Austen McDonald; Brian D. Carlstrom; JaeWoong Chung; Lance Hammond; Christos Kozyrakis; Kunle Olukotun

Transactional Coherence and Consistency (TCC) provides a new parallel programming model that uses transactions as the basic unit of parallel work and communication. TCC simplifies the development of correct parallel code because hardware provides transaction atomicity and ordering. Nevertheless, the programmer or a dynamic compiler must still optimize the parallel code for performance.This paper presents TAPE, a hardware and software infrastructure for profiling in TCC systems. TAPE extends the hardware for transactional execution to identify performance impediments such as dependence violations, buffer overflows, and work imbalance. It filters infrequent events to reduce resource requirements and allows the programmer to focus on the most important bottlenecks. We demonstrate that TAPE introduces minimal die area and performance overhead and can be used continuously, even for production runs. Moreover, we demonstrate how to leverage the profiling information to guide optimization for a set of parallel applications. TAPE accurately identifies the source code location and type of the most important bottlenecks, allowing a programmer to achieve maximum parallel speedup with a few profiling steps.

Explore More