James Laudon
Sun Microsystems
Publications
Featured research published by James Laudon.
International Symposium on Microarchitecture | 2006
Kyle J. Nesbit; Nidhi Aggarwal; James Laudon; James E. Smith
We propose and evaluate a multi-thread memory scheduler that targets high-performance CMPs. The proposed memory scheduler is based on concepts originally developed for network fair queuing scheduling algorithms. The memory scheduler is fair and provides quality of service (QoS) while improving system performance. On a four-processor CMP running workloads containing a mix of applications with a range of memory bandwidth demands, the proposed memory scheduler provides QoS to all of the threads in all of the workloads, improves system performance by an average of 14% (41% in the best case), and reduces the variance in the threads' target memory bandwidth utilization from 0.2 to 0.0058.
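The network fair queuing idea carries over directly: track a virtual finish time per thread and service the request from the thread whose finish time is earliest, so threads that exceed their share fall behind in priority. Below is a minimal Python sketch of that mechanism; the class name, request format, and fixed service times are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict, deque

class FairQueueMemoryScheduler:
    """Toy model of a fair-queuing memory scheduler: each thread is
    assigned a bandwidth share, and the pending request whose thread
    has the earliest virtual finish time is serviced next."""

    def __init__(self, shares):
        self.shares = shares                       # thread_id -> share of bandwidth
        self.vtime = defaultdict(float)            # per-thread virtual time
        self.queues = {t: deque() for t in shares}

    def enqueue(self, thread_id, request, service_time=1.0):
        self.queues[thread_id].append((request, service_time))

    def next_request(self):
        best = None
        for t, q in self.queues.items():
            if not q:
                continue
            _, service = q[0]
            # Virtual finish time grows more slowly for threads with
            # larger shares, so over-consuming threads lose priority.
            finish = self.vtime[t] + service / self.shares[t]
            if best is None or finish < best[0]:
                best = (finish, t)
        if best is None:
            return None
        finish, t = best
        request, _ = self.queues[t].popleft()
        self.vtime[t] = finish                     # charge the thread for its service
        return request

# Thread A is entitled to half the bandwidth, B and C to a quarter each.
sched = FairQueueMemoryScheduler({"A": 0.5, "B": 0.25, "C": 0.25})
for i in range(4):
    for t in ("A", "B", "C"):
        sched.enqueue(t, f"{t}-req{i}")
print([sched.next_request() for _ in range(12)])
```

Over a long run the service order converges to the 2:1:1 share ratio while any thread with pending requests keeps making progress, which is the fairness-plus-QoS property the abstract describes.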
International Symposium on Computer Architecture | 2007
Kyle J. Nesbit; James Laudon; James E. Smith
Virtual Private Machines (VPM) provide a framework for Quality of Service (QoS) in CMP-based computer systems. VPMs incorporate microarchitecture mechanisms that allow shares of hardware resources to be allocated to executing threads, thus providing applications with an upper bound on execution time regardless of other thread activity. Virtual Private Caches (VPCs) are an important element of VPMs. VPC hardware consists of two major components: the VPC Arbiter, which manages shared cache bandwidth, and the VPC Capacity Manager, which manages the cache storage. Both the VPC Arbiter and VPC Capacity Manager provide minimum service guarantees that, when combined, achieve QoS for the cache subsystem. Simulation-based evaluation shows that conventional cache bandwidth management policies allow concurrently executing threads to affect each other significantly in an uncontrollable manner. The evaluation targets cache bandwidth because the effects of cache capacity sharing have been studied elsewhere. In contrast with the conventional policies, the VPC Arbiter meets its QoS performance objectives on all workloads studied and over a range of allocated bandwidth levels. The VPC Arbiter's fairness policy, which distributes leftover bandwidth, mitigates the effects of cache preemption latencies, thus ensuring threads a high degree of performance isolation. Furthermore, the VPC Arbiter eliminates negative bandwidth interference, which can improve aggregate throughput and resource utilization.
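The fairness policy of distributing leftover bandwidth after minimum guarantees are met can be pictured as a two-phase allocation. The rough Python model below is a sketch under assumed inputs: the guarantee and demand figures are invented, and the paper's actual arbiter schedules individual cache accesses rather than computing a one-shot split.

```python
def allocate_bandwidth(guarantees, demands, total=1.0):
    """Two-phase allocation loosely modeled on a QoS arbiter:
    phase 1 grants each thread min(guarantee, demand); phase 2
    hands leftover bandwidth to still-unsatisfied threads in
    proportion to their guarantees."""
    grant = {t: min(guarantees[t], demands[t]) for t in guarantees}
    leftover = total - sum(grant.values())
    while leftover > 1e-9:
        hungry = {t: guarantees[t] for t in grant if grant[t] < demands[t]}
        if not hungry:
            break                      # everyone is satisfied; bandwidth goes spare
        weight = sum(hungry.values())
        progress = 0.0
        for t, w in hungry.items():
            extra = min(leftover * w / weight, demands[t] - grant[t])
            grant[t] += extra
            progress += extra
        leftover -= progress
        if progress < 1e-12:
            break
    return grant

# Example: thread A is guaranteed 50% but only needs 30%; the spare
# 20% flows to B and C in proportion to their guarantees.
print(allocate_bandwidth({"A": 0.5, "B": 0.3, "C": 0.2},
                         {"A": 0.3, "B": 0.6, "C": 0.5}))
```

The key property the sketch preserves is that no thread's grant ever drops below min(guarantee, demand), so redistribution can only help, never hurt, a thread's minimum service.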
International Conference on Parallel Architectures and Compilation Techniques | 2005
John D. Davis; James Laudon; Kunle Olukotun
In this paper we compare the performance of area-equivalent small-, medium-, and large-scale multithreaded chip multiprocessors (CMTs) using throughput-oriented applications. We examine CMTs with in-order scalar processor cores, 2-way or 4-way in-order superscalar cores, private primary instruction and data caches, and a shared secondary cache, using area models based on SPARC processors that incorporate these architectural features. We explore a large design space, ranging from processor-intensive to cache-intensive CMTs. We use SPEC JBB2000, TPC-C, TPC-W, and XML Test to demonstrate that the scalar simple-core CMTs do a better job of addressing the problems of low instruction-level parallelism and high cache miss rates that dominate Web service middleware and online transaction processing applications. For the best overall CMT performance, smaller, lower-performance cores, so-called "mediocre" cores, maximize the total number of cores on the CMT and outperform CMTs built from larger, higher-performance cores.
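The "mediocre cores" argument reduces to arithmetic: under a fixed area budget, many multithreaded scalar cores can hide memory stalls that leave a wide superscalar core idle. The back-of-the-envelope Python model below illustrates the effect; all area, issue-width, miss-rate, and latency numbers are invented for illustration, not the paper's measurements.

```python
def chip_throughput(n_cores, issue_width, threads_per_core,
                    miss_rate, miss_latency):
    """Crude throughput model: each thread's CPI includes memory
    stalls; multithreading overlaps stalls from several threads,
    but a core can never exceed its issue width."""
    cpi_thread = 1.0 / issue_width + miss_rate * miss_latency
    core_ipc = min(threads_per_core / cpi_thread, issue_width)
    return n_cores * core_ipc

AREA_BUDGET = 32                 # arbitrary area units
SCALAR_AREA, SUPER_AREA = 1, 8   # assumed relative core sizes

# Hypothetical design points on a cache-hostile server workload
# (2% miss rate, 200-cycle memory); all numbers are illustrative.
small = chip_throughput(AREA_BUDGET // SCALAR_AREA, 1.0, 4, 0.02, 200)
large = chip_throughput(AREA_BUDGET // SUPER_AREA, 4.0, 1, 0.02, 200)
print(f"32 scalar 4-thread cores: {small:.2f} aggregate IPC")
print(f"4 superscalar 1-thread cores: {large:.2f} aggregate IPC")
```

With miss stalls dominating CPI, the wide core's extra issue width is mostly wasted, while the scalar CMT's thread count keeps the memory system busy, which is the trade-off the abstract describes.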
ACM SIGARCH Computer Architecture News | 2005
James Laudon
Transaction processing has emerged as the killer application for commercial servers. Most servers are engaged in transactional workloads such as processing search requests, serving middleware, evaluating decisions, managing databases, and powering online commerce. Currently, commercial servers are built from one or more high-performance superscalar processors. However, commercial server applications exhibit high cache miss rates, large memory footprints, and low instruction-level parallelism (ILP), which leads to poor utilization on traditional ILP-focused superscalar processors [11]. In addition, these ILP-focused processors have been primarily optimized to deliver maximum performance by employing high clock rates and large amounts of speculation. As a result, we are now at the point where the performance/Watt of subsequent generations of traditional ILP-focused processors on server workloads has been flat [4] or even decreasing. The lack of increase in processor performance/Watt, coupled with the continued decrease in server hardware acquisition costs and likely increases in future power and cooling costs, is leading to a situation where the total cost of server ownership will soon be predominantly determined by power [4]. In this paper, we argue that attacking thread-level parallelism (TLP) via a large number of simple cores on a chip multiprocessor (CMP) leads to much better performance/Watt for server workloads. As a case study, we compare Sun's TLP-oriented Niagara processor against the ILP-oriented dual-core Pentium Extreme Edition from Intel, showing that the Niagara processor has a significant performance/Watt advantage for throughput-oriented server applications.
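The performance/Watt argument can be made concrete with a toy calculation; the throughput and power numbers below are invented placeholders, not the Niagara or Pentium Extreme Edition figures from the paper.

```python
# Invented placeholder numbers, not measurements from the paper:
# a TLP-oriented chip can win on throughput per watt even though
# each of its individual threads runs slower than on an
# ILP-oriented chip.
designs = {
    "many simple cores (TLP)": {"throughput": 9.0, "watts": 70.0},
    "few big cores (ILP)":     {"throughput": 6.0, "watts": 130.0},
}
for name, d in designs.items():
    print(f"{name}: {d['throughput'] / d['watts']:.3f} throughput/W")
```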
International Journal of Parallel Programming | 2007
James Laudon; Lawrence Spracklen
The performance of microprocessors has increased exponentially for over 35 years. However, process technology challenges, chip power constraints, and difficulty in extracting instruction-level parallelism are conspiring to limit the performance of future individual processors. To address these limits, the computer industry has embraced chip multiprocessing (CMP), predominantly in the form of multiple high-performance superscalar processors on the same die. We explore the trade-off between building CMPs from a few high-performance cores and building CMPs from a large number of lower-performance cores, and argue that CMPs built from a larger number of lower-performance cores can provide better performance and performance/Watt on many commercial workloads. We examine two multi-threaded CMPs built using a large number of processor cores: Sun's Niagara and Niagara 2 processors. We also explore the programming issues for CMPs with a large number of threads. The programming model for these CMPs is similar to the widely used programming model for symmetric multiprocessors (SMPs), but the greatly reduced cost of communicating data through the on-chip shared secondary cache allows more fine-grain parallelism to be effectively exploited by the CMP. Finally, we present performance comparisons between Sun's Niagara and more conventional dual-core processors built from large superscalar processor cores. For several key server workloads, Niagara shows significant performance and even more significant performance/Watt advantages over the CMPs built from traditional superscalar processors.
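The fine-grain parallelism point is concrete: when threads share an on-chip secondary cache, the break-even task size shrinks, so work can be split into much smaller units than on a traditional SMP while using the same programming model. A minimal Python sketch of that SMP-style model follows; the worker count and chunk size are arbitrary choices, not values from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

# Schematic of the SMP-style programming model described above: the
# only CMT-specific change is that cheap on-chip data sharing makes
# much smaller chunks worthwhile. Note that CPython's GIL makes this
# a sketch of the model, not a demonstration of hardware speedup.

def process_chunk(chunk):
    return sum(x * x for x in chunk)      # stand-in for real per-chunk work

def parallel_sum_squares(data, n_workers=32, chunk_size=256):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(process_chunk, chunks))

print(parallel_sum_squares(list(range(100_000))))
```

On a conventional SMP, a 256-element chunk would be swamped by inter-processor communication costs; the passage's claim is that on-chip sharing makes such small units profitable.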
ACM SIGARCH Computer Architecture News | 2005
John D. Davis; Cong Fu; James Laudon
We present RASE, a full-system, high-performance simulation methodology for simulating complex server applications and server-class chip multiprocessors enabled with fine-grain multithreading (CMTs). RASE combines application knowledge, operating system information, and data access patterns with an instruction stream from a highly-tuned, scalable steady-state benchmark [5] [22] to generate multiple representative instruction streams that can be mapped to a variety of CMT configurations. We use execution-driven simulation to generate instruction streams for M processors and store them as instruction trace files (several billion instructions per processor) that can be post-processed and augmented to simulate systems with more than M processors. We use SPEC JBB2000, TPC-C, and an XML server benchmark to compare the performance estimates of RASE against a reference prototype CMT system. By varying M, we find that our trace-driven simulation methodology predicts the instructions per cycle (IPC) of the reference hardware to within 5% for these applications. Without post-processing the traces, prediction accuracy degrades to 20-40% of the real IPC, even in the best cases, for instruction traces that require a high replication factor.
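The key move, turning traces recorded on M processors into inputs for a larger simulated system, can be pictured as replicating per-processor traces with address relocation so replicas do not artificially share private data. The loose Python sketch below illustrates the idea; the record format, `shared` flag, and stride are assumptions, and the paper's actual augmentation is more involved.

```python
import copy

def expand_traces(traces, target_n, addr_stride=1 << 30):
    """Very loose sketch of scaling M per-processor traces up to
    target_n streams: replica k of trace t reuses t's instruction
    stream but shifts private data addresses into its own region.
    The record format ({'pc', 'addr', 'shared'}) and the stride
    are illustrative assumptions."""
    m = len(traces)
    expanded = []
    for k in range(target_n):
        base = copy.deepcopy(traces[k % m])
        replica = k // m
        for rec in base:
            if rec.get("addr") is not None and not rec.get("shared", False):
                rec["addr"] += replica * addr_stride   # relocate private data
        expanded.append(base)
    return expanded

# Two recorded streams scaled to eight simulated processors.
traces = [[{"pc": 0x400000, "addr": 0x10000, "shared": False}],
          [{"pc": 0x400010, "addr": None}]]
print(len(expand_traces(traces, 8)))   # -> 8
```

The abstract's accuracy result suggests why some such relocation step matters: naive replication without post-processing is exactly the case where prediction error grows to 20-40% of the real IPC.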
Archive | 2005
Kathirgamar Aingaran; James Laudon
Archive | 2006
James Laudon; Adam R. Talcott; Sanjay Patel; Thirumalai S. Suresh
Archive | 2005
James Laudon; Curtis R. McAllister
Archive | 2006
James Laudon