Lawrence Spracklen | Researchain

Publication

Featured research published by Lawrence Spracklen.

high-performance computer architecture | 2005

Chip multithreading: opportunities and challenges

Lawrence Spracklen; Santosh G. Abraham

Chip multi-threaded (CMT) processors provide support for many simultaneous hardware threads of execution in various ways, including simultaneous multithreading (SMT) and chip multiprocessing (CMP). CMT processors are especially suited to server workloads, which generally have high levels of thread-level parallelism (TLP). In this paper, we describe the evolution of CMT chips in industry and highlight the pervasiveness of CMT designs in upcoming general-purpose processors. The CMT design space accommodates a range of designs between the extremes represented by the SMT and CMP designs, and a variety of attractive design options are currently unexplored. Though there has been extensive research on utilizing multiple hardware threads to speed up single-threaded applications via speculative parallelization, there are many challenges in designing CMT processors, even when sufficient TLP is present. This paper describes some of these challenges, including hot sets, hot banks, speculative prefetching strategies, request prioritization and off-chip bandwidth reduction.

international conference on supercomputing | 2004

Effective stream-based and execution-based data prefetching

Sorin Iacobovici; Lawrence Spracklen; Sudarshan Kadambi; Yuan Chou; Santosh G. Abraham

With processor speeds continuing to outpace the memory subsystem, cache-missing memory operations continue to become increasingly important to application performance. In response to this continuing trend, most modern processors now support hardware (HW) prefetchers, which act to reduce the missing loads observed by an application. This paper analyzes the behavior of cache-missing loads in SPEC CPU2000 and highlights the inability of unit and single non-unit stride prefetchers to correctly prefetch for some commonly occurring streams. In response to this analysis, a novel multi-stride prefetcher that supports streams with up to four distinct strides is proposed. Performance analysis for SPEC CPU2000 illustrates that the proposed multi-stride prefetcher can outperform current stride prefetchers on several benchmarks, most notably on mcf, lucas and facerec, where it achieves an additional performance gain of up to 57%. Performance of the strided HW prefetchers is also contrasted with another recently proposed prefetch scheme, runahead execution (RAE), and the synergy between the schemes is investigated.
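To make the idea of a stream with several distinct strides concrete, the following is a minimal software sketch, not the hardware design from the paper. It assumes a stream is simply a list of miss addresses and that a stride pattern is trusted once it has repeated at least twice; the names `detect_stride_pattern` and `predict_next` are hypothetical and do not come from the paper.

```python
from typing import List, Optional

MAX_STRIDES = 4  # the abstract describes streams with up to four distinct strides


def detect_stride_pattern(addresses: List[int],
                          max_period: int = MAX_STRIDES) -> Optional[List[int]]:
    """Return the upcoming strides of the shortest repeating pattern, or None.

    Assumption for this sketch: a pattern is accepted only if the observed
    deltas cover at least two full repetitions of it.
    """
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    for period in range(1, max_period + 1):
        if len(deltas) < 2 * period:
            break  # not enough history to confirm a pattern this long
        if all(deltas[i] == deltas[i - period] for i in range(period, len(deltas))):
            pattern = deltas[:period]
            # Rotate so the first entry is the stride that follows the last address.
            start = len(deltas) % period
            return [pattern[(start + k) % period] for k in range(period)]
    return None


def predict_next(addresses: List[int], count: int) -> List[int]:
    """Predict the next `count` addresses of the stream, or [] if no pattern fits."""
    pattern = detect_stride_pattern(addresses)
    if pattern is None:
        return []
    predictions = []
    current = addresses[-1]
    for k in range(count):
        current += pattern[k % len(pattern)]
        predictions.append(current)
    return predictions
```

A single-stride detector would reject a stream such as 100, 108, 124, 132, 148, 156, whose deltas alternate between 8 and 16, whereas this sketch recognizes the two-stride pattern and continues it.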

high-performance computer architecture | 2005

Effective instruction prefetching in chip multiprocessors for modern commercial applications

Lawrence Spracklen; Yuan Chou; Santosh G. Abraham

In this paper, we study the instruction cache miss behavior of four modern commercial applications (a database workload, TPC-W, SPECjAppServer2002 and SPECweb99). These applications exhibit high instruction cache miss rates for both the L1 and L2 caches, and a sizable performance improvement can be achieved by eliminating these misses. We show that it is important to address not only sequential misses but also misses due to branches and function calls. As a result, we propose an efficient discontinuity prefetching scheme that can be effectively combined with traditional sequential prefetching to address all forms of instruction cache misses. Additionally, with the emergence of chip multiprocessors (CMPs), instruction prefetching schemes must take into account their effect on the shared L2 cache. Specifically, aggressive instruction cache prefetching can result in an increase in the number of L2 cache data misses. As a solution, we propose a scheme that does not install prefetches into the L2 cache unless they are proven to be useful. Overall, we demonstrate that the combination of our proposed schemes is successful in reducing the instruction miss rate to only 10%-16% of the original miss rate and results in a 1.08X-1.37X performance improvement for the applications studied.
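The combination of sequential and discontinuity prefetching can be illustrated with a small software model. This is a sketch under stated assumptions, not the scheme evaluated in the paper: it assumes an unbounded table that remembers, for each cache line, the target of the last non-sequential fetch miss that followed it. The class `DiscontinuityPrefetcher` and its parameter `sequential_depth` are hypothetical names, and the L2 install policy from the abstract is not modeled.

```python
from typing import Dict, List, Optional


class DiscontinuityPrefetcher:
    """Toy model: next-N-line prefetching plus a learned discontinuity table."""

    def __init__(self, sequential_depth: int = 2) -> None:
        self.sequential_depth = sequential_depth
        # Maps a source line to the non-sequential line fetched after it.
        self.targets: Dict[int, int] = {}
        self.last_line: Optional[int] = None

    def access(self, line: int, missed: bool) -> List[int]:
        """Record an instruction fetch of `line`; return lines to prefetch."""
        previous = self.last_line
        # Learn a discontinuity when a miss follows a non-sequential transition
        # (for example a taken branch or a function call).
        if missed and previous is not None and line not in (previous, previous + 1):
            self.targets[previous] = line
        self.last_line = line

        # Sequential candidates: the next few lines after the current one.
        candidates = [line + k for k in range(1, self.sequential_depth + 1)]
        # Discontinuity candidate: the remembered target for this line, if any.
        target = self.targets.get(line)
        if target is not None and target not in candidates:
            candidates.append(target)
        return candidates
```

In this model, once a miss on line 50 has been seen right after a fetch of line 10, a later fetch of line 10 yields prefetch candidates 11 and 12 (sequential) plus 50 (the learned discontinuity).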

international symposium on microarchitecture | 2005

Store Memory-Level Parallelism Optimizations for Commercial Applications

Yuan Chou; Lawrence Spracklen; Santosh G. Abraham

This paper studies the impact of off-chip store misses on processor performance for modern commercial applications. The performance impact of off-chip store misses is largely determined by the extent of their overlap with other off-chip cache misses. The epoch MLP model is used to explain and quantify how these overlaps are affected by various store handling optimizations and by the memory consistency model implemented by the processor. The extent of these overlaps is then translated to off-chip CPI. Experimental results show that store handling optimizations are crucial for mitigating the substantial performance impact of stores in commercial applications. While some previously proposed optimizations, such as store prefetching, are highly effective, they are unable to fully mitigate the performance impact of off-chip store misses and they also leave a performance gap between the stronger and weaker memory consistency models. New optimizations, such as the store miss accelerator, an optimization of hardware scout and a new application of speculative lock elision, are demonstrated to virtually eliminate the impact of off-chip store misses.

international symposium on microarchitecture | 2005

Accelerating next-generation public-key cryptosystems on general-purpose CPUs

Hans Eberle; Sheueling Chang Shantz; Vipul Gupta; Nils Gura; Leonard D. Rarick; Lawrence Spracklen

This article describes low-cost techniques for accelerating the ECC and RSA public-key cryptosystems on general-purpose processor architectures. We focus on hardware acceleration of public-key cryptosystems on 64-bit server machines. A prototype based on a Sparc CPU data path shows a clear performance advantage of ECC over RSA.

Proceedings of the 2009 ICSE Workshop on Multicore Software Engineering | 2009

Transparent multi-core cryptographic support on Niagara CMT Processors

James P. Hughes; Gary Morton; Jan Pechanec; Christoph L. Schuba; Lawrence Spracklen; Bhargava Yenduri

How cryptographic functionality has been implemented and made available in application scenarios has evolved over time. Pure software implementations were the obvious first choice, followed by dedicated hardware devices, be it co-processors or hardware accelerators accessible on the main bus.

ieee hot chips symposium | 2009

Sun's 3rd generation on-chip UltraSPARC security accelerator

Lawrence Spracklen

• RF continues UltraSPARC CMT tradition of providing on-chip accelerators
• RF includes Sun's 3rd generation on-chip security accelerator
• RF's accelerator introduces
  > Additional ciphers, chaining modes and secure hashes
  > Non-priv fast-path to accelerators
• Fast-path eliminates vast majority of overheads associated with offloads
  > Allows direct interaction between non-priv applications and the accelerators
  > Improves small object performance by up to 30X
• RF provides additional non-priv crypto instructions to help accelerate authenticated-encryption operations
• RF builds on the successes of the UltraSPARC T2 and significantly expands the application space which can benefit from the accelerators

modeling, analysis, and simulation on computer and telecommunication systems | 2005

Accurate modeling of aggressive speculation in modern microprocessor architectures

Harit Modi; Lawrence Spracklen; Yuan Chou; Santosh G. Abraham

Computer architects utilize cycle simulators to evaluate microprocessor chip design tradeoffs and estimate performance metrics. Traditionally, cycle simulators are either trace-driven or execution-driven. In this paper, we describe ValueSim, a software layer that is interposed between a cycle simulator and either a functional simulator or a value-enhanced trace. By writing to the ValueSim API, the cycle simulator can run in either trace-driven mode or execution-driven mode, allowing it to exploit the advantages of both approaches. The ValueSim API allows a cycle simulator to accurately model a complete range of aggressive speculative mechanisms developed by computer architects, even in the trace-driven mode. Using ValueSim, we illustrate, for three key commercial applications, the significant underestimation of off-chip bandwidth, queuing delays and cache pollution when modern speculative mechanisms are not accurately modeled, highlighting the importance of accurately modeling these mechanisms in chip multiprocessor designs.

asian solid state circuits conference | 2007

UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC

M. Shah; J. Barren; J. Brooks; Robert T. Golla; G. Grohoski; Nils Gura; R. Hetherington; P. Jordan; M. Luttrell; C. Olson; B. Sana; Denis Sheahan; Lawrence Spracklen; A. Wynn

Archive | 1999