Publication


Featured research published by Susan J. Eggers.


international symposium on computer architecture | 1995

Simultaneous multithreading: maximizing on-chip parallelism

Dean M. Tullsen; Susan J. Eggers; Henry M. Levy

This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate the tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources. While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design. We examine many of these complexities and evaluate alternative organizations in the design space.
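
To make the resource-sharing argument concrete, here is a minimal, purely illustrative issue-slot model, not the paper's simulator: each thread exposes a few ready instructions per cycle, and SMT fills leftover slots in the same cycle from other threads. All parameters (issue width, per-thread ILP range, cycle count) are assumptions chosen only to show the shape of the effect.

```python
# Toy issue-slot model contrasting a single-threaded superscalar with SMT.
# Hypothetical workload: each thread can issue a random number of ready
# instructions per cycle (its ILP), capped by the machine's issue width.
import random

ISSUE_WIDTH = 8
CYCLES = 10_000

def superscalar_throughput(ilp_per_cycle):
    """One thread: issue slots beyond the thread's ILP are wasted."""
    return sum(min(ilp_per_cycle(), ISSUE_WIDTH) for _ in range(CYCLES)) / CYCLES

def smt_throughput(ilp_per_cycle, n_threads):
    """SMT: leftover slots in a cycle are filled from other ready threads."""
    total = 0
    for _ in range(CYCLES):
        slots = ISSUE_WIDTH
        for _ in range(n_threads):
            issued = min(ilp_per_cycle(), slots)
            slots -= issued
            total += issued
            if slots == 0:
                break
    return total / CYCLES

# Assume each thread exposes 1-3 ready instructions per cycle.
ilp = lambda: random.randint(1, 3)
print(f"superscalar IPC: {superscalar_throughput(ilp):.2f}")
for n in (2, 4, 8):
    print(f"SMT with {n} threads IPC: {smt_throughput(ilp, n):.2f}")
```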


international symposium on microarchitecture | 1997

Simultaneous multithreading: a platform for next-generation processors

Susan J. Eggers; Joel S. Emer; Henry M. Levy; Jack L. Lo; Rebecca L. Stamm; Dean M. Tullsen

Simultaneous multithreading is a processor design that consumes both thread-level and instruction-level parallelism. In SMT processors, thread-level parallelism can come from either multithreaded, parallel programs or individual, independent programs in a multiprogramming workload. Instruction-level parallelism comes from each single program or thread. Because it successfully (and simultaneously) exploits both types of parallelism, an SMT processor uses resources more efficiently, and both instruction throughput and speedups are greater.
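
The efficiency claim is often framed in terms of wasted issue slots: entirely idle cycles (vertical waste, which fine-grain multithreading also attacks) versus partially filled cycles (horizontal waste, which only SMT recovers). A toy sketch of that accounting, over a made-up issue trace:

```python
# Classifying wasted issue slots from a per-cycle issue trace.
# Vertical waste: whole cycles in which nothing issues (e.g., a long stall);
# horizontal waste: leftover slots in cycles that do issue something.
ISSUE_WIDTH = 8

def waste_breakdown(issued_per_cycle):
    vertical = horizontal = 0
    for issued in issued_per_cycle:
        if issued == 0:
            vertical += ISSUE_WIDTH
        else:
            horizontal += ISSUE_WIDTH - issued
    total_slots = ISSUE_WIDTH * len(issued_per_cycle)
    return vertical / total_slots, horizontal / total_slots

# Hypothetical trace: bursts of issue separated by idle (stall) cycles.
trace = [3, 0, 0, 5, 2, 0, 8, 1, 0, 4]
v, h = waste_breakdown(trace)
print(f"vertical waste {v:.0%}, horizontal waste {h:.0%}")
```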


international symposium on computer architecture | 1985

Implementing a cache consistency protocol

Randy H. Katz; Susan J. Eggers; David A. Wood; C. L. Perkins; Robert G. Sheldon

We present an ownership-based multiprocessor cache consistency protocol, designed for implementation by a single-chip VLSI cache controller. The protocol and its VLSI realization are described in some detail, to emphasize the important implementation issues, in particular the controller critical sections and the inter- and intra-cache interlocks needed to maintain cache consistency. The design has been carried through to layout in a P-Well CMOS technology.
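
As a rough sketch of what an ownership-based write-invalidate protocol does, abstracting away the VLSI controller, critical sections, and interlocks the paper is actually about, consider a per-block state machine in which a write must first gain exclusive ownership by invalidating all other copies. The three-state machine below is an assumed simplification, not the protocol as published:

```python
# Minimal ownership-based write-invalidate state machine (illustrative).
# States per cache block: INVALID, SHARED (read-only copy), OWNED
# (exclusive, possibly dirty; the owner supplies data and writes back).
from enum import Enum, auto

class State(Enum):
    INVALID = auto()
    SHARED = auto()
    OWNED = auto()

class Cache:
    def __init__(self, name):
        self.name = name
        self.state = {}                 # block address -> State

    def read(self, addr, peers):
        if self.state.get(addr, State.INVALID) is State.INVALID:
            # Read miss: the owner, if any, supplies data and drops to SHARED.
            for p in peers:
                if p.state.get(addr) is State.OWNED:
                    p.state[addr] = State.SHARED
            self.state[addr] = State.SHARED
        return self.state[addr]

    def write(self, addr, peers):
        if self.state.get(addr, State.INVALID) is not State.OWNED:
            # Gain ownership: invalidate every other cached copy via the bus.
            for p in peers:
                p.state.pop(addr, None)
            self.state[addr] = State.OWNED
        return self.state[addr]

a, b = Cache("A"), Cache("B")
a.read(0x40, [b]); b.read(0x40, [a])    # both hold SHARED copies
a.write(0x40, [b])                      # A gains ownership, B is invalidated
print(a.state[0x40], b.state.get(0x40, State.INVALID))
```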


programming language design and implementation | 1996

Fast, effective dynamic compilation

Joel Auslander; Matthai Philipose; Craig Chambers; Susan J. Eggers; Brian N. Bershad

Dynamic compilation enables optimization based on the values of invariant data computed at run-time. Using the values of these run-time constants, a dynamic compiler can eliminate their memory loads, perform constant propagation and folding, remove branches they determine, and fully unroll loops they bound. However, the performance benefits of the more efficient, dynamically-compiled code are offset by the run-time cost of dynamic compilation. Our approach to dynamic compilation strives for both fast dynamic compilation and high-quality dynamically-compiled code: the programmer annotates regions of the program that should be compiled dynamically; a static, optimizing compiler automatically produces pre-optimized machine-code templates, using a pair of dataflow analyses that identify which variables will be constant at run-time; and a simple, dynamic compiler copies the templates, patching in the computed values of the run-time constants, to produce optimized, executable code. Our work targets general-purpose, imperative programming languages, initially C. Initial experiments applying dynamic compilation to C programs have produced speedups ranging from 1.2 to 1.8.
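
A toy analogue of the staging idea in a high-level language: once the run-time constants are known, emit code with those values folded in and compile it once, so later calls run the specialized version. The paper's system patches pre-optimized machine-code templates rather than generating source, so this sketch only illustrates the cost/benefit structure:

```python
# Specializing a routine once run-time constants are known: the coefficients
# become fixed, zero terms disappear, and the loop is fully unrolled into new
# source that is compiled a single time (the analogue of dynamic-compile cost).
def specialize_dot(coeffs):
    """Build dot(x) = sum(c * x[i]) with coeffs treated as run-time constants."""
    terms = " + ".join(f"{c} * x[{i}]" for i, c in enumerate(coeffs) if c != 0)
    src = f"def dot(x):\n    return {terms or 0}\n"
    ns = {}
    exec(compile(src, "<specialized>", "exec"), ns)  # one-time compile cost
    return ns["dot"]

dot = specialize_dot([2, 0, 5])     # coefficients fixed at run time
print(dot([10, 99, 1]))             # 2*10 + 5*1 = 25; x[1] never loaded
```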


architectural support for programming languages and operating systems | 1989

The effect of sharing on the cache and bus performance of parallel programs

Susan J. Eggers; Randy H. Katz

Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In this study, we use traces of parallel programs to evaluate the cache and bus performance of shared memory multiprocessors, in which coherency is maintained by a write-invalidate protocol. In particular, we analyze the effect of sharing overhead on cache miss ratio and bus utilization. Our studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of these metrics increases proportionally with both cache and block size, and for some cache configurations determines both their magnitude and trend. The amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit good per-processor locality perform better than those with fine-grain sharing. This suggests that parallel software writers and better compiler technology can improve program performance through better memory organization of shared data.
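
A deliberately tiny trace-driven model shows why fine-grain sharing inflates both metrics: under write-invalidate, every write by one processor discards the other processors' copies, so the next reference re-misses and re-uses the bus. Caches are infinite and block-granular here purely for brevity; this is a sketch of the methodology, not the paper's simulator:

```python
# Trace-driven model of write-invalidate sharing overhead: each processor has
# its own cache of block addresses, a write invalidates other copies, and
# every miss costs a bus transaction.
def run(trace):
    caches = {}                     # processor id -> set of cached blocks
    misses = bus = refs = 0
    for cpu, op, block in trace:
        cache = caches.setdefault(cpu, set())
        refs += 1
        if block not in cache:
            misses += 1
            bus += 1                # fetch over the bus
            cache.add(block)
        if op == "W":
            for other, c in caches.items():
                if other != cpu:
                    c.discard(block)    # invalidation: next read re-misses
    return misses / refs, bus

# Hypothetical fine-grain sharing: two CPUs ping-pong writes to one block.
trace = [(0, "W", 7), (1, "W", 7)] * 4
miss_ratio, bus_ops = run(trace)
print(f"miss ratio {miss_ratio:.0%}, bus transactions {bus_ops}")
```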


international symposium on computer architecture | 1989

Evaluating the performance of four snooping cache coherency protocols

Susan J. Eggers; Randy H. Katz

Write-invalidate and write-broadcast coherency protocols have been criticized for being unable to achieve good bus performance across all cache configurations. In particular, write-invalidate performance can suffer as block size increases; and large cache sizes will hurt write-broadcast. Read-broadcast and competitive snooping extensions to the protocols have been proposed to solve each problem. Our results indicate that the benefits of the extensions are limited. Read-broadcast reduces the number of invalidation misses, but at a high cost in processor lockout from the cache. The net effect can be an increase in total execution cycles. Competitive snooping benefits only those programs with high per-processor locality of reference to shared data. For programs characterized by inter-processor contention for shared addresses, competitive snooping can degrade performance by causing a slight increase in bus utilization and total execution time.
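
Competitive snooping bounds the cost of keeping an updated copy with a counter: remote updates decrement it, local references reset it, and at zero the copy self-invalidates, so a copy that is only ever updated and never read locally eventually stops consuming bus updates. A sketch with an assumed threshold of 4; the paper's contribution is evaluating this mechanism, not this particular constant:

```python
# Competitive snooping sketch under a write-broadcast protocol: each cached
# copy carries a credit counter. Remote updates drain it; a local reference
# (evidence of per-processor locality) refills it; at zero the copy drops out.
LIMIT = 4    # assumed threshold, for illustration only

class Copy:
    def __init__(self):
        self.valid, self.credit = True, LIMIT

    def local_ref(self):
        if self.valid:
            self.credit = LIMIT     # locality observed: keep the copy
        return self.valid

    def remote_update(self):
        if self.valid:
            self.credit -= 1
            if self.credit == 0:
                self.valid = False  # contended copy self-invalidates
        return self.valid

c = Copy()
for i in range(6):
    print(f"after remote update {i + 1}: valid={c.remote_update()}")
```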


international symposium on computer architecture | 1986

An in-cache address translation mechanism

David A. Wood; Susan J. Eggers; Garth A. Gibson; Mark D. Hill; J. M. Pendleton

In the design of SPUR, a high-performance multiprocessor workstation, the use of large caches and hardware-supported cache consistency suggests a new approach to virtual address translation. By performing translation in each processor's virtually-tagged cache, the need for separate translation lookaside buffers (TLBs) is eliminated. Eliminating the TLB substantially reduces the hardware cost and complexity of the translation mechanism and eliminates the translation consistency problem. Trace-driven simulations show that normal cache behavior is only minimally affected by caching page table entries, and that in many cases, using a separate device would actually reduce system performance.
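
The mechanism can be sketched as follows, heavily simplified to one page-table level with no protection checks: cache hits carry virtual tags and need no translation at all, and on a miss the page-table entry is itself fetched through the same cache before the physical access goes out. Names and structures here are illustrative, not SPUR's:

```python
# In-cache translation sketch: no TLB. The cache is virtually indexed and
# tagged; only misses pay for translation, and the PTE is cacheable data.
PAGE = 4096
page_table = {0: 7, 1: 3}           # virtual page -> physical page (assumed)

cache = {}                          # virtual block address -> data (toy)

def load(vaddr, memory):
    if vaddr in cache:              # hit: virtual tags, zero translation cost
        return cache[vaddr]
    vpn = vaddr // PAGE
    pte_vaddr = ("PT", vpn)         # the PTE is cacheable like any other line
    if pte_vaddr not in cache:
        cache[pte_vaddr] = page_table[vpn]
    paddr = cache[pte_vaddr] * PAGE + vaddr % PAGE
    cache[vaddr] = memory[paddr]
    return cache[vaddr]

memory = {7 * PAGE + 8: "hello"}
print(load(8, memory))              # miss: walks the cached PTE, then fills
print(load(8, memory))              # hit: no translation performed at all
```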


partial evaluation and semantic-based program manipulation | 1997

Annotation-directed run-time specialization in C

Brian Grant; Markus Mock; Matthai Philipose; Craig Chambers; Susan J. Eggers

We present the design of a dynamic compilation system for C. Directed by a few declarative user annotations specifying where and on what dynamic compilation is to take place, a binding-time analysis computes the set of run-time constants at each program point in each annotated procedure's control flow graph; the analysis supports program-point-specific polyvariant division and specialization. The analysis results guide the construction of a run-time specializer for each dynamically compiled region; the specializer supports various caching strategies for managing dynamically generated code and supports mixes of speculative and demand-driven specialization of dynamic branch successors. Most of the key cost/benefit trade-offs in the binding-time analysis and the run-time specializer are open to user control through declarative policy annotations. Our design is being implemented in the context of an existing optimizing compiler.
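
The core of a binding-time analysis is easy to sketch for straight-line code: starting from the inputs annotated as run-time constants, mark a computed value static exactly when all of its operands are static, so the specializer knows which expressions it may fold. The real analysis runs over a control-flow graph with polyvariant division; this toy handles neither:

```python
# Toy binding-time analysis over straight-line assignments. Operands are
# either variable names or integer literals; a target is static iff every
# operand is static (a literal counts as static).
def bta(stmts, static_inputs):
    static = set(static_inputs)
    for target, operands in stmts:
        if all(op in static or isinstance(op, int) for op in operands):
            static.add(target)      # foldable at specialization time
        else:
            static.discard(target)  # tainted by a dynamic operand
    return static

# x is annotated as a run-time constant, y is dynamic:
program = [("a", [2, "x"]), ("b", ["a", "y"]), ("c", ["a", 10])]
print(bta(program, {"x"}))          # a and c fold; b must stay dynamic
```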


architectural support for programming languages and operating systems | 1994

The effectiveness of multiple hardware contexts

Radhika Thekkath; Susan J. Eggers

Multithreaded processors are used to tolerate long memory latencies. By executing threads loaded in multiple hardware contexts, an otherwise idle processor can keep busy, thus increasing its utilization. However, the larger working set of multiple threads can increase cache conflict misses. In this paper we evaluate the two phenomena together, examining their combined effect on execution time. The usefulness of multiple hardware contexts depends on program data locality, cache organization, and the degree of multiprocessing. Multiple hardware contexts are most effective on programs that have been optimized for data locality. For these programs, execution time dropped with increasing contexts over widely varying architectures. With unoptimized applications, multiple contexts had limited value. The best performance was seen with only two contexts, and only on uniprocessors and small multiprocessors. The behavior of the unoptimized applications changed more noticeably with variations in cache associativity and cache hierarchy, unlike the optimized programs. As a mechanism for exploiting program parallelism, an additional processor is clearly better than another context. However, there were many configurations for which the addition of a few hardware contexts brought as much or more performance than a larger multiprocessor with fewer than the optimal number of contexts.
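
A first-order model captures the trade-off: if a thread runs R cycles between misses and each miss stalls for L cycles, n contexts can overlap up to n runs of useful work against one stall window, but each added context also shrinks the effective run length by polluting the cache. The interference factor below is an arbitrary assumption, for illustration only:

```python
# First-order utilization model for multiple hardware contexts: a thread runs
# R useful cycles between misses, each miss stalls L cycles, and a (free)
# context switch hides the stall behind other ready contexts.
def utilization(n_contexts, run_cycles, miss_latency):
    # Up to n runs of useful work can overlap one run-plus-stall window.
    return min(1.0, n_contexts * run_cycles / (run_cycles + miss_latency))

R, L = 20, 100                      # assumed: 20 useful cycles, 100-cycle miss
for n in (1, 2, 4, 8):
    # Crude interference model: each extra context shortens the run length.
    r_eff = R / (1 + 0.1 * (n - 1))
    print(f"{n} contexts: utilization {utilization(n, r_eff, L):.0%}")
```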


architectural support for programming languages and operating systems | 2000

An analysis of operating system behavior on a simultaneous multithreaded architecture

Joshua Redstone; Susan J. Eggers; Henry M. Levy

This paper presents the first analysis of operating system execution on a simultaneous multithreaded (SMT) processor. While SMT has been studied extensively over the past 6 years, previous research has focused entirely on user-mode execution. However, many of the applications most amenable to multithreading technologies spend a significant fraction of their time in kernel code. A full understanding of the behavior of such workloads therefore requires execution and measurement of the operating system, as well as the application itself. To carry out this study, we (1) modified the Digital Unix 4.0d operating system to run on an SMT CPU, and (2) integrated our SMT Alpha instruction set simulator into the SimOS simulator to provide an execution environment. For an OS-intensive workload, we ran the multithreaded Apache Web server on an 8-context SMT. We compared Apache's user- and kernel-mode behavior to a standard multiprogrammed SPECInt workload, and compared the SMT processor to an out-of-order superscalar running both workloads. Overall, our results demonstrate the microarchitectural impact of an OS-intensive workload on an SMT processor and provide insight into the OS demands of the Apache Web server. The synergy between the SMT processor and Web and OS software produced a greater throughput gain over superscalar execution than seen on any previously examined workloads, including commercial databases and explicitly parallel programs.

Collaboration


Dive into Susan J. Eggers's collaborations.

Top co-authors:

Henry M. Levy (University of Washington)
Jack L. Lo (University of Washington)
Randy H. Katz (University of California)
Craig Chambers (University of Washington)
David A. Wood (University of California)
Joel S. Emer (Massachusetts Institute of Technology)