
Publication


Featured research published by Christopher B. Colohan.


International Symposium on Computer Architecture | 2000

A scalable approach to thread-level speculation

J. Gregory Steffan; Christopher B. Colohan; Antonia Zhai; Todd C. Mowry

While architects understand how to build cost-effective parallel machines across a wide spectrum of machine sizes (ranging from within a single chip to large-scale servers), the real challenge is how to easily create parallel software to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this paper we propose and evaluate a design for supporting TLS that seamlessly scales to any machine size because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on both single-chip multiprocessors and on larger-scale machines where communication latencies are twenty times larger.
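
The key idea is that ordinary invalidation messages, once tagged with the sender's thread ordering, double as dependence-violation detectors. The toy Python model below is my own illustration of that mechanism, not the authors' hardware design: each cache remembers which addresses were speculatively loaded, and an invalidation arriving from a logically earlier epoch for such an address flags a violation.

```python
# Toy software model of how TLS violation detection rides on invalidation-
# based cache coherence (my illustration, not the paper's hardware design).

class SpecCache:
    def __init__(self, epoch):
        self.epoch = epoch           # logical order of this speculative thread
        self.spec_loaded = set()     # addresses speculatively loaded ("SL" bit)
        self.spec_dirty = set()      # addresses speculatively modified ("SM" bit)
        self.violated = False

    def spec_load(self, addr):
        self.spec_loaded.add(addr)   # remember we consumed a possibly-stale value

    def spec_store(self, addr):
        self.spec_dirty.add(addr)    # buffer the write; invisible until commit

    def recv_invalidation(self, addr, sender_epoch):
        # An invalidation from a logically earlier epoch for a line we already
        # speculatively loaded means a true dependence was violated.
        if sender_epoch < self.epoch and addr in self.spec_loaded:
            self.violated = True

def commit(caches, epoch):
    # Commit in epoch order: the committing thread's buffered writes become
    # visible via ordinary invalidations sent to logically later epochs.
    for addr in caches[epoch].spec_dirty:
        for other in caches.values():
            if other.epoch > epoch:
                other.recv_invalidation(addr, epoch)

# Epoch 2 loads address 0x40 too early; epoch 1 then writes and commits it.
caches = {1: SpecCache(1), 2: SpecCache(2)}
caches[2].spec_load(0x40)
caches[1].spec_store(0x40)
commit(caches, 1)
print("epoch 2 violated:", caches[2].violated)   # True -> epoch 2 must restart
```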


ACM Transactions on Computer Systems | 2005

The STAMPede approach to thread-level speculation

J. Gregory Steffan; Christopher B. Colohan; Antonia Zhai; Todd C. Mowry

Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is how to easily create parallel software to allow single programs to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that seamlessly scales both within a chip and beyond because it is a straightforward extension of write-back invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first level caches are either private or shared. For our private-cache design, program performance improves by 86% and 56% for two of the 13 general-purpose applications studied, by more than 8% for four others, and by 16% on average across all applications, confirming that TLS is a promising way to exploit the naturally multithreaded processing resources of future computer systems.
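
To make the execution model concrete, here is a minimal software analogy (mine, not the STAMPede hardware): loop iterations run optimistically against a snapshot, commit in program order, and any iteration that read a location written by an earlier, not-yet-committed iteration is squashed and re-executed.

```python
# Minimal software analogy of the TLS execution model (not the STAMPede
# hardware): iterations run optimistically against a snapshot, commit in
# program order, and any iteration that read a location written by an
# earlier iteration in the same window is squashed and re-executed.

def tls_like_execute(body, num_iters, memory):
    committed = dict(memory)
    i = 0
    while i < num_iters:
        window = range(i, min(i + 4, num_iters))     # 4 "processors"
        results = []
        for it in window:                            # optimistic pass
            reads, writes = set(), {}
            def load(addr):
                reads.add(addr)
                return writes.get(addr, committed[addr])
            def store(addr, val):
                writes[addr] = val
            body(it, load, store)
            results.append((it, reads, writes))
        written_so_far = set()
        for it, reads, writes in results:            # commit in order
            if reads & written_so_far:               # violated dependence
                i = it                               # squash: restart here
                break
            committed.update(writes)
            written_so_far |= writes.keys()
        else:
            i = window.stop
    return committed

# A loop whose iterations all depend on the previous one ("sum") still
# produces the sequential answer, just with frequent squashes.
mem = {"sum": 0}
mem.update({("a", k): k for k in range(8)})
def body(it, load, store):
    store("sum", load("sum") + load(("a", it)))
print(tls_like_execute(body, 8, mem)["sum"])   # 28, same as the serial loop
```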


Architectural Support for Programming Languages and Operating Systems | 2002

Compiler optimization of scalar value communication between speculative threads

Antonia Zhai; Christopher B. Colohan; J. Gregory Steffan; Todd C. Mowry

While there have been many recent proposals for hardware that supports Thread-Level Speculation (TLS), there has been relatively little work on compiler optimizations to fully exploit this potential for parallelizing programs optimistically. In this paper, we focus on one important limitation of program performance under TLS, which is stalls due to forwarding scalar values between threads that would otherwise cause frequent data dependences. We present and evaluate dataflow algorithms for three increasingly-aggressive instruction scheduling techniques that reduce the critical forwarding path introduced by the synchronization associated with this data forwarding. In addition, we contrast our compiler techniques with related hardware-only approaches. With our most aggressive compiler and hardware techniques, we improve performance under TLS by 6.2-28.5% for 6 of 14 applications, and by at least 2.7% for half of the other applications.
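
A back-of-the-envelope model, with made-up cycle counts, of why this scheduling matters: consecutive threads serialize on the point at which the forwarded scalar is signalled plus the forwarding latency, so hoisting the signal earlier, rather than shrinking the threads, is what restores parallelism.

```python
# Back-of-the-envelope model of the critical forwarding path under TLS
# (my illustration; the cycle counts below are made up).
# Each speculative thread runs for `thread_len` cycles, the scalar needed by
# the next thread is signalled `signal_at` cycles into the thread, and
# forwarding it costs `fwd_latency` cycles.

def exec_time(num_threads, thread_len, signal_at, fwd_latency):
    start = 0        # cycle at which the current thread begins
    finish = 0
    for _ in range(num_threads):
        finish = start + thread_len
        # The next thread cannot make progress until the forwarded value arrives.
        start = start + signal_at + fwd_latency
    return finish

T, N, L = 100, 8, 10
serial  = N * T                                          # no parallelism: 800
late    = exec_time(N, T, signal_at=90, fwd_latency=L)   # signal near the end: 800
hoisted = exec_time(N, T, signal_at=20, fwd_latency=L)   # compiler hoists the signal: 310
print(serial, late, hoisted)
```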


High-Performance Computer Architecture | 2002

Improving value communication for thread-level speculation

J.G. Steffan; Christopher B. Colohan; Antonia Zhai; Todd C. Mowry

Thread-level speculation (TLS) allows us to automatically parallelize general-purpose programs by supporting parallel execution of threads that might not actually be independent. In this paper, we show that the key to good performance lies in the three different ways to communicate a value between speculative threads: speculation, synchronization and prediction. The difficult part is deciding how and when to apply each method. This paper shows how we can apply value prediction, dynamic synchronization and hardware instruction prioritization to improve value communication and hence performance in several SPECint benchmarks that have been automatically transformed by our compiler to exploit TLS. We find that value prediction can be effective when properly throttled to avoid the high costs of mis-prediction, while most of the gains of value prediction can be more easily achieved by exploiting silent stores. We also show that dynamic synchronization is quite effective for most benchmarks, while hardware instruction prioritization is not. Overall, we find that these techniques have great potential for improving the performance of TLS.
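
As a rough sketch of the throttling idea (the thresholds and table structure below are my assumptions, not the paper's design), a last-value predictor can be gated by a saturating confidence counter, and a silent-store check catches many of the same cases without predicting at all.

```python
# Sketch of a confidence-throttled last-value predictor plus a silent-store
# check (illustrative; thresholds and structure are assumptions).

class ThrottledPredictor:
    def __init__(self, threshold=2, max_conf=3):
        self.last = {}        # addr -> last observed value
        self.conf = {}        # addr -> saturating confidence counter
        self.threshold = threshold
        self.max_conf = max_conf

    def predict(self, addr):
        # Only speculate on the value when confidence is high enough;
        # otherwise fall back to synchronizing on the real value.
        if self.conf.get(addr, 0) >= self.threshold:
            return self.last[addr]
        return None

    def update(self, addr, actual):
        if self.last.get(addr) == actual:
            self.conf[addr] = min(self.conf.get(addr, 0) + 1, self.max_conf)
        else:
            self.conf[addr] = 0
        self.last[addr] = actual

def is_silent_store(memory, addr, value):
    # A store that rewrites the value already in memory cannot change what a
    # later speculative load saw, so it need not trigger a violation.
    return memory.get(addr) == value

p = ThrottledPredictor()
for v in [5, 5, 5, 5, 7]:
    guess = p.predict("x")
    print("predict:", guess, "actual:", v, "hit:", guess == v)
    p.update("x", v)
```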


Symposium on Code Generation and Optimization | 2004

Compiler optimization of memory-resident value communication between speculative threads

Antonia Zhai; Christopher B. Colohan; J. Gregory Steffan; Todd C. Mowry

Efficient inter-thread value communication is essential for improving performance in thread-level speculation (TLS). Although several mechanisms for improving value communication using hardware support have been proposed, there is relatively little work on exploiting the potential of compiler optimization. Building on recent research on compiler optimization of scalar value communication between speculative threads, we propose compiler techniques for the optimization of memory-resident values. In TLS, data dependences through memory-resident values are tracked by the underlying hardware and preserved by re-executing any speculative thread that violates a dependence; however, re-execution incurs a large performance penalty and should be used only to resolve data dependences that are infrequent. In contrast, value communication for frequently-occurring data dependences must be very efficient. We propose using the compiler to first identify frequently-occurring memory-resident data dependences, then insert synchronization for communicating values to preserve these dependences. We find that by synchronizing frequently-occurring data dependences we can significantly improve the efficiency of parallel execution. A comparison between compiler-inserted and hardware-inserted memory synchronization reveals that the two techniques are complementary, with each technique benefitting different benchmarks.
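
A simplified sketch of the profile-then-synchronize decision, using an invented event-trace format: count how often a store in one iteration feeds a load in a later iteration, and only the dependences above a frequency threshold get compiler-inserted synchronization; the rest are left to speculation and occasional re-execution.

```python
# Sketch of profile-guided selection of memory-resident dependences to
# synchronize (my simplification; the trace format and threshold are invented).

from collections import Counter

def hot_dependences(profile, num_iters, threshold=0.2):
    """profile: list of (iteration, kind, addr) events in program order."""
    last_writer = {}          # addr -> iteration of most recent store
    counts = Counter()
    for it, kind, addr in profile:
        if kind == "store":
            last_writer[addr] = it
        elif kind == "load" and addr in last_writer and last_writer[addr] < it:
            counts[addr] += 1   # cross-iteration read-after-write dependence
    return {addr for addr, c in counts.items() if c / num_iters >= threshold}

# Example profile: 'head' is written and read almost every iteration (hot),
# 'stats' only once (cold).
events = []
for it in range(10):
    if it == 2:
        events.append((it, "store", "stats"))
    if it == 3:
        events.append((it, "load", "stats"))
    events += [(it, "load", "head"), (it, "store", "head")]
print(hot_dependences(events, num_iters=10))
# {'head'} -> insert wait/signal; leave 'stats' to speculation
```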


International Symposium on Computer Architecture | 2006

Tolerating Dependences Between Large Speculative Threads Via Sub-Threads

Christopher B. Colohan; Anastassia Ailamaki; J. Gregory Steffan; Todd C. Mowry

Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and have minimal dependences between them. Recent work has shown that TLS can offer compelling performance improvements for database workloads, but only when targeting much larger speculative threads of more than 50,000 dynamic instructions per thread, with many frequent data dependences between them. To support such large and dependent speculative threads, hardware must be able to buffer the additional speculative state, and must also address the more challenging problem of tolerating the resulting cross-thread data dependences. In this paper we present hardware support for large speculative threads that integrates several previous proposals for TLS hardware. We also introduce support for sub-threads: a mechanism for tolerating cross-thread data dependences by checkpointing speculative execution. When speculation fails due to a violated data dependence, with sub-threads the failed thread need only rewind to the checkpoint of the appropriate sub-thread rather than rewinding to the start of execution; this significantly reduces the cost of mis-speculation. We evaluate our hardware support for large and dependent speculative threads in the database domain and find that the transaction response time for three of the five transactions from TPC-C (on a simulated 4-processor chip multiprocessor) improves by a factor of 1.9 to 2.9.
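
The sub-thread idea can be sketched in a few lines of Python (an illustration with invented bookkeeping, not the proposed hardware): checkpoint every few operations, remember which sub-thread performed each speculative load, and on a violation rewind only to that sub-thread's checkpoint rather than to the start of the thread.

```python
# Toy model of sub-threads (invented bookkeeping, not the proposed hardware):
# checkpoint every `interval` operations, remember which sub-thread performed
# each speculative load, and on a violation rewind only to that checkpoint.

class SubThreadedEpoch:
    def __init__(self, interval=3):
        self.interval = interval
        self.ops = 0                 # dynamic operations executed so far
        self.checkpoints = [0]       # operation counts at sub-thread boundaries
        self.load_subthread = {}     # address -> sub-thread that loaded it

    def _maybe_checkpoint(self):
        if self.ops - self.checkpoints[-1] >= self.interval:
            self.checkpoints.append(self.ops)

    def step_load(self, addr):
        self._maybe_checkpoint()
        self.load_subthread[addr] = len(self.checkpoints) - 1
        self.ops += 1

    def step_other(self):
        self._maybe_checkpoint()
        self.ops += 1

    def rewind_point(self, addr):
        # A violation on `addr` rewinds to the start of the sub-thread that
        # performed the stale load, not to the start of the whole thread.
        return self.checkpoints[self.load_subthread.get(addr, 0)]

e = SubThreadedEpoch(interval=3)
for i in range(10):
    if i % 2 == 0:
        e.step_load(("row", i))
    else:
        e.step_other()
print("ops executed:", e.ops)                          # 10
print("rewind to op", e.rewind_point(("row", 8)))      # 6, instead of 0
```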


ACM Transactions on Architecture and Code Optimization | 2008

Compiler and hardware support for reducing the synchronization of speculative threads

Antonia Zhai; J. Gregory Steffan; Christopher B. Colohan; Todd C. Mowry

Thread-level speculation (TLS) allows us to automatically parallelize general-purpose programs by supporting parallel execution of threads that might not actually be independent. In this article, we focus on one important limitation of program performance under TLS: the stalls that result from synchronizing and forwarding scalar values between speculative threads, which would otherwise cause frequent data dependences and, hence, failed speculation. Using SPECint benchmarks that have been automatically transformed by our compiler to exploit TLS, we present, evaluate in detail, and compare both compiler and hardware techniques for improving the communication of scalar values. We find that through our dataflow algorithms for three increasingly aggressive instruction scheduling techniques, the compiler can drastically reduce the critical forwarding path introduced by the synchronization and forwarding of scalar values. We also show that hardware techniques for reducing synchronization can be complementary to compiler scheduling, but that the additional performance benefits are minimal and are generally not worth the cost.


IEEE Transactions on Parallel and Distributed Systems | 2007

CMP Support for Large and Dependent Speculative Threads

Christopher B. Colohan; A. C. Ailamaki; J. G. Steffan; T. C. Mowry

Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and which have minimal dependences between them. However, recent work has shown that TLS can offer compelling performance improvements when targeting much larger speculative threads of more than 50,000 dynamic instructions per thread with many frequent data dependences between them. To support such large and dependent speculative threads, the hardware must be able to buffer the additional speculative state and must also address the more challenging problem of tolerating the resulting cross-thread data dependences. In this paper, we present chip-multiprocessor (CMP) support for large speculative threads that integrates several previous proposals for TLS hardware. We also present support for subthreads: a mechanism for tolerating cross-thread data dependences by checkpointing speculative execution. Through an evaluation that exploits the proposed hardware support in the database domain, we find that the transaction response time for three of the five transactions from TPC-C (on a simulated four-processor chip multiprocessor) speeds up by a factor of 1.9 to 2.9.


ACM Transactions on Computer Systems | 2008

Incrementally parallelizing database transactions with thread-level speculation

Christopher B. Colohan; Anastassia Ailamaki; J. Gregory Steffan; Todd C. Mowry

With the advent of chip multiprocessors, exploiting intratransaction parallelism in database systems is an attractive way of improving transaction performance. However, exploiting intratransaction parallelism is difficult for two reasons: first, significant changes are required to avoid races or conflicts within the DBMS; and second, adding threads to transactions requires a high level of sophistication from transaction programmers. In this article we show how dividing a transaction into speculative threads solves both problems: it minimizes the changes required to the DBMS, and the details of parallelization are hidden from the transaction programmer. Our technique requires a limited number of small, localized changes to a subset of the low-level data structures in the DBMS. Through this method of incrementally parallelizing transactions, we can dramatically improve performance: on a simulated four-processor chip-multiprocessor, we improve the response time by 44-66% for three of the five TPC-C transactions, assuming the availability of idle processors.
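
A toy sketch of the resulting programming model, with invented names such as new_order_item and a pure-software stand-in for the TLS hardware: the per-item transaction logic is untouched, while a wrapper runs the items optimistically against a snapshot, commits them in order, and re-executes any item whose reads conflict with an earlier item's writes.

```python
# Toy sketch of incrementally parallelizing a transaction's per-item loop
# (names such as new_order_item are invented; the article uses hardware TLS
# inside a real DBMS, which this pure-Python analogy only mimics).

class TrackingView:
    """Wraps a backing dict, recording reads and buffering writes."""
    def __init__(self, base, reads, writes):
        self.base, self.reads, self.writes = base, reads, writes
    def __getitem__(self, key):
        self.reads.add(key)
        return self.writes.get(key, self.base[key])
    def __setitem__(self, key, value):
        self.writes[key] = value

def new_order_item(db, item):
    # Unchanged per-item transaction logic: read a stock row, decrement it.
    key = ("stock", item["id"])
    db[key] = db[key] - item["qty"]

def run_items_speculatively(db, items):
    snapshot = dict(db)
    runs = []
    for item in items:                       # optimistic pass against the snapshot
        reads, writes = set(), {}
        new_order_item(TrackingView(snapshot, reads, writes), item)
        runs.append((item, reads, writes))
    written = set()
    for item, reads, writes in runs:         # commit in original item order
        if reads & written:                  # conflict: re-run against committed state
            reads, writes = set(), {}
            new_order_item(TrackingView(db, reads, writes), item)
        db.update(writes)
        written |= writes.keys()

db = {("stock", 1): 10, ("stock", 2): 10}
items = [{"id": 1, "qty": 3}, {"id": 1, "qty": 2}, {"id": 2, "qty": 1}]
run_items_speculatively(db, items)
print(db)   # {('stock', 1): 5, ('stock', 2): 9}, same as the serial loop
```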


Archive | 2015

System and Method for Limiting the Impact of Stragglers in Large-Scale Parallel Data Processing

Grzegorz Malewicz; Marian Dvorsky; Christopher B. Colohan; Derek P. Thomson; Joshua L. Levenberg

Collaboration


Dive into Christopher B. Colohan's collaborations.

Top Co-Authors

Todd C. Mowry

Carnegie Mellon University

Antonia Zhai

University of Minnesota


J.G. Steffan

Carnegie Mellon University
