
Publications


Featured research published by Kevin E. Moore.


ACM SIGARCH Computer Architecture News | 2005

Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

Milo M. K. Martin; Daniel J. Sorin; Bradford M. Beckmann; Michael R. Marty; Min Xu; Alaa R. Alameldeen; Kevin E. Moore; Mark D. Hill; David A. Wood

The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under GNU GPL [9].


International Symposium on Computer Architecture (ISCA) | 2007

Performance pathologies in hardware transactional memory

Jayaram Bobba; Kevin E. Moore; Haris Volos; Luke Yen; Mark D. Hill; Michael M. Swift; David A. Wood

Transactional memory is a promising approach to ease parallel programming. Hardware transactional memory system designs reflect choices along three key design dimensions: conflict detection, version management, and conflict resolution. The authors identify a set of performance pathologies that could degrade performance in proposed HTM designs. Improving conflict resolution could eliminate these pathologies so designers can build robust HTM systems.
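The conflict-detection dimension described above can be illustrated with a toy model: each transaction tracks a read set and a write set, and an eager-detection system flags a conflict as soon as one transaction's write overlaps another's accesses. This is a minimal sketch with illustrative names, not the paper's hardware mechanism (real HTMs track these sets in cache metadata).

```python
# Toy model of eager conflict detection between two transactions.
# Names and structure are illustrative only.

class Txn:
    def __init__(self, name):
        self.name = name
        self.read_set = set()   # cache-block addresses read
        self.write_set = set()  # cache-block addresses written

def conflicts(a, b):
    """Eager check: a write in one txn overlapping any access in the other."""
    return bool(a.write_set & (b.read_set | b.write_set) or
                b.write_set & a.read_set)

t1, t2 = Txn("t1"), Txn("t2")
t1.write_set.add(0x40)       # t1 writes cache block 0x40
t2.read_set.add(0x40)        # t2 reads the same block -> conflict
print(conflicts(t1, t2))     # True
```

How a design responds to such a conflict (stall, abort the requester, or abort the other transaction) is exactly the conflict-resolution policy the paper argues should be improved.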


International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) | 2006

Supporting nested transactional memory in logTM

Michelle J. Moravan; Jayaram Bobba; Kevin E. Moore; Luke Yen; Mark D. Hill; Ben Liblit; Michael M. Swift; David A. Wood

Nested transactional memory (TM) facilitates software composition by letting one module invoke another without either knowing whether the other uses transactions. Closed nested transactions extend isolation of an inner transaction until the top-level transaction commits. Implementations may flatten nested transactions into the top-level one, resulting in a complete abort on conflict, or allow partial abort of inner transactions. Open nested transactions allow a committing inner transaction to immediately release isolation, which increases parallelism and expressiveness at the cost of both software and hardware complexity. This paper extends the recently proposed flat Log-based Transactional Memory (LogTM) with nested transactions. Flat LogTM saves pre-transaction values in a log, detects conflicts with read (R) and write (W) bits per cache block, and, on abort, invokes a software handler to unroll the log. Nested LogTM supports nesting by segmenting the log into a stack of activation records and modestly replicating R/W bits. To facilitate composition with non-transactional code, such as language runtime and operating system services, we propose escape actions that allow trusted code to run outside the confines of the transactional memory system.
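The log-segmentation idea above can be sketched in a few lines: an undo log kept as a stack of activation records, one per nesting level, so an inner transaction can abort and unroll only its own segment while a committing inner transaction merges its record into its parent's (closed nesting). This is a software sketch under assumed simplifications; the paper's mechanism operates on cache blocks in hardware.

```python
# Sketch of LogTM-style version management with closed nesting:
# pre-transaction values go to an undo log segmented into a stack of
# activation records. Names are illustrative.

memory = {"x": 1, "y": 2}
log = []          # stack of activation records, one per nesting level

def begin():
    log.append([])                         # push a new activation record

def write(addr, value):
    log[-1].append((addr, memory[addr]))   # save old value at current level
    memory[addr] = value

def commit():
    record = log.pop()
    if log:                                # closed nesting: merge into parent
        log[-1].extend(record)

def abort():
    for addr, old in reversed(log.pop()):  # unroll only this level's segment
        memory[addr] = old

begin(); write("x", 10)        # outer transaction
begin(); write("y", 20)        # inner transaction
abort()                        # inner aborts: y restored, x still 10
commit()                       # outer commits
print(memory)                  # {'x': 10, 'y': 2}
```

Flattening, by contrast, would keep a single record for all levels, so the inner abort would also roll back the outer write to x.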


IEEE Computer | 2003

Simulating a $2M commercial server on a $2K PC

Alaa R. Alameldeen; Milo M. K. Martin; Carl J. Mauer; Kevin E. Moore; Min Xu; Mark D. Hill; David A. Wood; Daniel J. Sorin

As dependence on database management systems and Web servers increases, so does the need for them to run reliably and efficiently, goals that rigorous simulations can help achieve. Execution-driven simulation models system hardware. These simulations capture actual program behavior and detailed system interactions. The authors have developed a simulation methodology that uses multiple simulations, pays careful attention to the effects of scaling on workload behavior, and extends Virtutech AB's Simics full-system functional simulator with detailed timing models. The Wisconsin Commercial Workload Suite contains scaled and tuned benchmarks for multiprocessor servers, enabling full-system simulations to run on the PCs that are routinely available to researchers.


International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) | 2000

Timestamp snooping: an approach for extending SMPs

Milo M. K. Martin; Daniel J. Sorin; Anastassia Ailamaki; Alaa R. Alameldeen; Ross M. Dickson; Carl J. Mauer; Kevin E. Moore; Manoj Plakal; Mark D. Hill; David A. Wood

Symmetric multiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where the data is found in another processor's cache rather than in memory. SMPs are optimized for this case by using snooping protocols that broadcast address transactions to all processors. Conversely, directory-based shared-memory systems must indirectly locate the owner and sharers through a directory, resulting in larger average miss latencies. This paper proposes timestamp snooping, a technique that allows SMPs to i) utilize high-speed switched interconnection networks and ii) exploit physical locality by delivering address transactions to processors and memories without regard to order. Traditional snooping requires physical ordering of transactions. Timestamp snooping works by processing address transactions in a logical order. Logical time is maintained by adding a few bits per address transaction and having network switches perform a handshake to ensure on-time delivery. Processors and memories then reorder transactions based on their timestamps to establish a total order. We evaluate timestamp snooping with commercial workloads on a 16-processor SPARC system using the Simics full-system simulator. We simulate both an indirect (butterfly) and a direct (torus) network design. For OLTP, DSS, web serving, web searching, and one scientific application, timestamp snooping with the butterfly network runs 6-28% faster than directories, at a cost of 13-43% more link traffic. Similarly, with the torus network, timestamp snooping runs 6-29% faster for 17-37% more link traffic. Thus, timestamp snooping is worth considering when buying more interconnect bandwidth is easier than reducing interconnect latency.
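The logical-ordering step at the heart of the technique can be sketched as follows: address transactions may arrive at each node in different physical orders over the switched network, but every node sorts them by logical timestamp (with the sender id as a tiebreaker) before processing, so all nodes observe the same total order. This is a minimal illustration with made-up transactions, not the paper's switch-level handshake.

```python
# Sketch of timestamp snooping's logical ordering: nodes reorder
# address transactions by (logical_ts, sender_id) to agree on a
# total order despite out-of-order delivery. Illustrative only.

import heapq

def total_order(arrivals):
    """arrivals: list of (logical_ts, sender_id, address) in arrival order."""
    heap = list(arrivals)
    heapq.heapify(heap)                   # orders tuples by ts, then sender
    return [heapq.heappop(heap) for _ in range(len(heap))]

# Two nodes see the same transactions in different arrival orders...
node_a = [(2, 1, 0x80), (1, 0, 0x40), (2, 0, 0x80)]
node_b = [(1, 0, 0x40), (2, 0, 0x80), (2, 1, 0x80)]

# ...but both reconstruct the identical total order.
print(total_order(node_a) == total_order(node_b))   # True
```

The "few bits per address transaction" in the abstract carry the logical timestamp; the switch handshake guarantees a node has received every transaction with a smaller timestamp before it processes one, which is what makes this sort safe.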


International Symposium on High-Performance Computer Architecture (HPCA) | 2003

Memory system behavior of Java-based middleware

Martin Karlsson; Kevin E. Moore; Erik Hagersten; David A. Wood

In this paper, we present a detailed characterization of the memory system behavior of ECperf and SPECjbb using both commercial server hardware and Simics full-system simulation. We find that the memory footprint and primary working sets of these workloads are small compared to other commercial workloads (e.g., on-line transaction processing), and that a large fraction of the working sets are shared between processors. We observed two key differences between ECperf and SPECjbb that highlight the importance of isolating the behavior of the middle tier. First, ECperf has a larger instruction footprint, resulting in much higher miss rates for intermediate-size instruction caches. Second, SPECjbb's data set size increases linearly as the benchmark scales up, while ECperf's remains roughly constant. This difference can lead to opposite conclusions on the design of multiprocessor memory systems, such as the utility of moderate-sized (i.e., 1 MB) shared caches in a chip multiprocessor.


IEEE Micro | 2008

Performance pathologies in hardware transactional memory

Jayaram Bobba; Kevin E. Moore; Haris Volos; Luke Yen; Mark D. Hill; Michael M. Swift; David A. Wood

Transactional memory is a promising approach to ease parallel programming. Hardware transactional memory system designs reflect choices along three key design dimensions: conflict detection, version management, and conflict resolution. The authors identify a set of performance pathologies that could degrade performance in proposed HTM designs. Improving conflict resolution could eliminate these pathologies so designers can build robust HTM systems.


International Conference on Parallel Processing (ICPP) | 2005

Exploring processor design options for Java-based middleware

Martin Karlsson; Erik Hagersten; Kevin E. Moore; David A. Wood

Java-based middleware is a rapidly growing workload for high-end server processors, particularly chip multiprocessors (CMP). To help architects design future microprocessors to run this important new workload, we provide a detailed characterization of two popular Java server benchmarks, ECperf and SPECjbb2000. We first estimate the amount of instruction-level parallelism in these workloads by simulating a very wide issue processor with perfect caches and perfect branch predictors. We then identify performance bottlenecks for these workloads on a more realistic processor by selectively idealizing individual processor structures. Finally, we combine our findings on available ILP in Java middleware with results from previous papers that characterize the availability of TLP to investigate the optimal balance between ILP and TLP in CMPs. We find that, like other commercial workloads, Java middleware has only a small amount of instruction-level parallelism, even when run on very aggressive processors. When run on processors resembling currently available processors, the performance of Java middleware is limited by frequent traps, address translation, and stalls in the memory system. We find that SPECjbb2000 differs from ECperf in two meaningful ways: (1) the performance of ECperf is affected much more by cache and TLB misses during instruction fetch and (2) SPECjbb2000 has more memory-level parallelism.


International Symposium on High-Performance Computer Architecture (HPCA) | 2006

LogTM: log-based transactional memory

Kevin E. Moore; Jayaram Bobba; Michelle J. Moravan; Mark D. Hill; David A. Wood


International Symposium on High-Performance Computer Architecture (HPCA) | 2007

LogTM-SE: decoupling hardware transactional memory from caches

Luke Yen; Jayaram Bobba; Michael R. Marty; Kevin E. Moore; Haris Volos; Mark D. Hill; Michael M. Swift; David A. Wood

Collaboration


Kevin E. Moore's top co-authors:

David A. Wood (University of Wisconsin-Madison)
Mark D. Hill (University of Wisconsin-Madison)
Jayaram Bobba (University of Wisconsin-Madison)
Michael M. Swift (University of Wisconsin-Madison)
Luke Yen (University of Wisconsin-Madison)
Milo M. K. Martin (University of Pennsylvania)