Lance Hammond | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Lance Hammond is active.

Explore More

Publication

Featured researches published by Lance Hammond.

architectural support for programming languages and operating systems | 1996

The case for a single-chip multiprocessor

Kunle Olukotun; Basem A. Nayfeh; Lance Hammond; Kenneth G. Wilson; Kunyung Chang

Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to use their capabilities effectively. This paper shows that in advanced technologies it is possible to implement a single-chip multiprocessor in the same area as a wide issue superscalar processor. We find that for applications with little parallelism the performance of the two microarchitectures is comparable. For applications with large amounts of parallelism at both the fine and coarse grained levels, the multiprocessor microarchitecture outperforms the superscalar architecture by a significant margin. Single-chip multiprocessor architectures have the advantage in that they offer localized implementation of a high-clock rate processor for inherently sequential applications and low latency interprocessor communication for parallel applications.

international symposium on computer architecture | 2004

Transactional Memory Coherence and Consistency

Lance Hammond; Vicky Wong; Michael K. Chen; Brian D. Carlstrom; John D. Davis; Ben Hertzberg; Manohar K. Prabhu; Honggo Wijaya; Christos Kozyrakis; Kunle Olukotun

In this paper, we propose a new shared memory model: transactional memory coherence and consistency (TCC). TCC provides a model in which atomic transactions are always the basic unit of parallel work, communication, memory coherence, and memory reference consistency. TCC greatly simplifies parallel software by eliminating the need for synchronization using conventional locks and semaphores, along with their complexities. TCC hardware must combine all writes from each transaction region in a program into a single packet and broadcast this packet to the permanent shared memory state atomically as a large block. This simplifies the coherence hardware because it reduces the need for small, low-latency messages and completely eliminates the need for conventional snoopy cache coherence protocols, as multiple speculatively written versions of a cache line may safely coexist within the system. Meanwhile, automatic, hardware-controlled rollback of speculative transactions resolves any correctness violations that may occur when several processors attempt to read and write the same data simultaneously. The cost of this simplified scheme is higher interprocessor bandwidth. To explore the costs and benefits of TCC, we study the characteristics of an optimal transaction-based memory system, and examine how different design parameters could affect the performance of real systems. Across a spectrum of applications, the TCC model itself did not limit available parallelism. Most applications are easily divided into transactions requiring only small write buffers, on the order of 4-8 KB. The broadcast requirements of TCC are high, but are well within the capabilities of CMPs and small-scale SMPs with high-speed interconnects.

architectural support for programming languages and operating systems | 1998

Data speculation support for a chip multiprocessor

Lance Hammond; Mark Willey; Kunle Olukotun

Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor. This paper describes the complete implementation of the support for threadlevel speculation on the Hydra chip multiprocessor (CMP). The support consists of a number of software speculation control handlers and modifications to the shared secondary cache memory system of the CMP This support is evaluated using five representative integer applications. Our results show that the speculative support is only able to improve performance when there is a substantial amount of medium--grained loop-level parallelism in the application. When the granularity of parallelism is too small or there is little inherent parallelism in the application, the overhead of the software handlers overwhelms any potential performance benefits from speculative-thread parallelism. Overall, thread-level speculation still appears to be a promising approach for expanding the class of applications that can be automatically parallelized, but more hardware intensive implementations for managing speculation control are required to achieve performance improvements on a wide class of integer applications.

IEEE Micro | 2000

The Stanford Hydra CMP

Lance Hammond; Benedict A. Hubbert; M. Siu; Manohar K. Prabhu; M. Chen; K. Olukolun

The Hydra chip multiprocessor (CMP) integrates four MIPS-based processors and their primary caches on a single chip together with a shared secondary cache. A standard CMP offers implementation and performance advantages compared to wide-issue superscalar designs. However, it must be programmed with a more complicated parallel programming model to obtain maximum performance. To simplify parallel programming, the Hydra CMP supports thread-level speculation and memory renaming, a paradigm that allows performance similar to a uniprocessor of comparable die area on integer programs. This article motivates the design of a CMP, describes the architecture of the Hydra design with a focus on its speculative thread support, and describes our prototype implementation. Chip multiprocessors offer an economical, scalable architecture for future microprocessors. Thread-level speculation support allows them to speed up past software.

ACM Queue | 2005

The Future of Microprocessors

Kunle Olukotun; Lance Hammond

The performance of microprocessors that power modern computers has continued to increase exponentially over the years for two main reasons. First, the transistors that are the heart of the circuits in all processors and memory chips have simply become faster over time on a course described by Moore’s law, and this directly affects the performance of processors built with those transistors. Moreover, actual processor performance has increased faster than Moore’s law would predict, because processor designers have been able to harness the increasing numbers of transistors available on modern chips to extract more parallelism from software.

architectural support for programming languages and operating systems | 2004

Programming with transactional coherence and consistency (TCC)

Lance Hammond; Brian D. Carlstrom; Vicky Wong; Ben Hertzberg; Michael K. Chen; Christos Kozyrakis; Kunle Olukotun

Transactional Coherence and Consistency (TCC) offers a way to simplify parallel programming by executing all code within transactions. In TCC systems, transactions serve as the fundamental unit of parallel work, communication and coherence. As each transaction completes, it writes all of its newly produced state to shared memory atomically, while restarting other processors that have speculatively read stale data. With this mechanism, a TCC-based system automatically handles data synchronization correctly, without programmer intervention. To gain the benefits of TCC, programs must be decomposed into transactions. We describe two basic programming language constructs for decomposing programs into transactions, a loop conversion syntax and a general transaction-forking mechanism. With these constructs, writing correct parallel programs requires only small, incremental changes to correct sequential programs. The performance of these programs may then easily be optimized, based on feedback from real program execution, using a few simple techniques.

international symposium on computer architecture | 1996

Evaluation of Design Alternatives for a Multiprocessor Microprocessor

Basem A. Nayfeh; Lance Hammond; Kunle Olukotun

In the future, advanced integrated circuit processing and packaging technology will allow for several design options for multiprocessor microprocessors. In this paper we consider three architectures: shared-primary cache, shared-secondary cache, and shared-memory. We evaluate these three architectures using a complete system simulation environment which models the CPU, memory hierarchy and I/O devices in sufficient detail to boot and run a commercial operating system. Within our simulation environment, we measure performance using representative hand and compiler generated parallel applications, and a multiprogramming workload. Our results show that when applications exhibit fine-grained sharing, both shared-primary and shared-secondary architectures perform similarly when the full costs of sharing the primary cache are included.

international conference on supercomputing | 1999

Improving the performance of speculatively parallel applications on the Hydra CMP

Kunle Olukotun; Lance Hammond; Mark Willey

Abstract : Hydra is a chip multiprocessor (CMP) with integrated support for thread-level speculation. Thread-level speculation provides a way to parallelize sequential programs without the need for data dependence analysis or synchronization. This makes it possible to parallelize applications for which static memory dependence analysis is difficult or impossible. While performance of the baseline Hydra system on applications with significant amounts of medium to large grain parallelism is good, the performance on integer applications with fine-grained parallelism can be poor. In this paper, we describe a collection of software and hardware techniques for improving speculation performance of the Hydra CMP. These techniques focus on reducing the overheads associated with speculation and improving the speculation behavior of the applications using code restructuring. When these techniques are applied to a set of ten integer, multimedia and floating-point bench-marks, significant performance improvements result.

international symposium on microarchitecture | 2004

Transactional coherence and consistency: simplifying parallel hardware and software

Lance Hammond; Brian D. Carlstrom; Vicky Wong; Michael K. Chen; C. Koryrakis; Kunle Olukotun

Transactional coherence and consistency (TCC) simplifies parallel hardware and software design by eliminating the need for conventional cache coherence and consistency models and letting programmers parallelize a wide range of applications with a simple, lock-free transactional model. TCC eases both parallel programming and parallel architecture design by relying on programmer-defined transactions as the basic unit of parallel work, communication, memory coherence, and memory consistency

international conference on parallel architectures and compilation techniques | 2005

Characterization of TCC on chip-multiprocessors

Austen McDonald; JaeWoong Chung; Hassan Chafi; Chi Cao Minh; Brian D. Carlstrom; Lance Hammond; Christos Kozyrakis; Kunle Olukotun

Transactional coherence and consistency (TCC) is a novel coherence scheme for shared memory multiprocessors that uses programmer-defined transactions as the fundamental unit of parallel work, synchronization, coherence, and consistency. TCC has the potential to simplify parallel program development and optimization by providing a smooth transition from sequential to parallel programs. In this paper, we study the implementation of TCC on chip-multiprocessors (CMPs). We explore design alternatives such as the granularity of state tracking, double-buffering, and write-update and write-invalidate protocols. Furthermore, we characterize the performance of TCC in comparison to conventional snoopy cache coherence (SCC) using parallel applications optimized for each scheme. We conclude that the two coherence schemes perform similarly, with each scheme having a slight advantage for some applications. The bandwidth requirements of TCC are slightly higher but well within the capabilities of CMP systems. Also, we find that overflow of speculative state can be effectively handled by a simple victim cache. Our results suggest TCC can provide its programming advantages without compromising the performance expected from well-tuned parallel applications.

Explore More