Publication


Featured research published by Robert J. Stets.


International Symposium on Computer Architecture | 2000

Piranha: a scalable architecture based on single-chip multiprocessing

Luiz André Barroso; Kourosh Gharachorloo; Robert McNamara; Andreas Nowatzyk; Shaz Qadeer; Barton Sano; Scott Smith; Robert J. Stets; Ben Verghese

This paper describes the Piranha system, a research prototype being developed at Compaq that aggressively exploits chip multiprocessing by integrating eight simple Alpha processor cores along with a two-level cache hierarchy onto a single chip. Piranha also integrates further on-chip functionality to allow for scalable multiprocessor configurations to be built in a glueless and modular fashion. The use of simple processor cores combined with an industry-standard ASIC design methodology allows us to complete our prototype within a short time frame, with a team size and investment that are an order of magnitude smaller than those of a commercial microprocessor. Our detailed simulation results show that while each Piranha processor core is substantially slower than an aggressive next-generation processor, the integration of eight cores onto a single chip allows Piranha to outperform next-generation processors by up to 2.9 times (on a per chip basis) on important workloads such as OLTP. This performance advantage can approach a factor of five by using full-custom instead of ASIC logic. In addition to exploiting chip multiprocessing, the Piranha prototype incorporates several other unique design choices including a shared second-level cache with no inclusion, a highly optimized cache coherence protocol, and a novel I/O architecture.
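
The throughput trade-off described above can be made concrete with a back-of-the-envelope model. The sketch below is purely illustrative: the per-core speed, core count, and parallel-efficiency figures are hypothetical values chosen to show how eight slower cores can still win per chip on a throughput workload; they are not measurements from the paper.

```c
/* Toy per-chip throughput model for chip multiprocessing.
 * All numbers are hypothetical and only illustrate the trade-off:
 * each simple core is slower, but eight of them together can beat
 * one aggressive core on a throughput workload such as OLTP. */
#include <stdio.h>

int main(void) {
    double fast_core   = 1.0;   /* next-generation core, normalized */
    double simple_core = 0.45;  /* hypothetical speed of one simple core */
    int    cores       = 8;     /* cores per chip */
    double efficiency  = 0.80;  /* hypothetical parallel efficiency */

    double chip = simple_core * cores * efficiency;
    printf("per-chip throughput: %.2f vs next-gen %.2f (%.1fx)\n",
           chip, fast_core, chip / fast_core);
    return 0;
}
```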


Symposium on Operating Systems Principles | 1997

Cashmere-2L: software coherent shared memory on a clustered remote-write network

Robert J. Stets; Sandhya Dwarkadas; Nikos Hardavellas; Galen C. Hunt; Leonidas I. Kontothanassis; Srinivasan Parthasarathy; Michael L. Scott

Low-latency remote-write networks, such as DEC's Memory Channel, provide the possibility of transparent, inexpensive, large-scale shared-memory parallel computing on clusters of shared memory multiprocessors (SMPs). The challenge is to take advantage of hardware shared memory for sharing within an SMP, and to ensure that software overhead is incurred only when actively sharing data across SMPs in the cluster. In this paper, we describe a two-level software coherent shared memory system, Cashmere-2L, that meets this challenge. Cashmere-2L uses hardware to share memory within a node, while exploiting the Memory Channel's remote-write capabilities to implement moderately lazy release consistency with multiple concurrent writers, directories, home nodes, and page-size coherence blocks across nodes. Cashmere-2L employs a novel coherence protocol that allows a high level of asynchrony by eliminating global directory locks and the need for TLB shootdown. Remote interrupts are minimized by exploiting the remote-write capabilities of the Memory Channel network. Cashmere-2L currently runs on an 8-node, 32-processor DEC AlphaServer system. Speedups range from 8 to 31 on 32 processors for our benchmark suite, depending on the application's characteristics. We quantify the importance of our protocol optimizations by comparing performance to that of several alternative protocols that do not share memory in hardware within an SMP, and require more synchronization. In comparison to a one-level protocol that does not share memory in hardware within an SMP, Cashmere-2L improves performance by up to 46%.
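
Multi-writer release consistency of the kind described here is commonly built on page "twins" and diffs. The sketch below is a generic illustration of that classic technique, not Cashmere-2L's actual code; remote_write() is a hypothetical stand-in for the Memory Channel's remote-write primitive.

```c
/* Generic twin/diff sketch for multi-writer release consistency
 * (illustrative only). On the first write fault a pristine copy of
 * the page (the "twin") is saved; at a release, only words that
 * differ from the twin are pushed to the page's home node. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 8192  /* Alpha page size */

/* Hypothetical stand-in for a Memory Channel remote write. */
static void remote_write(int home, size_t off, uint32_t word) {
    (void)home; (void)off; (void)word;
}

typedef struct {
    uint8_t *working;   /* the page the application writes */
    uint8_t *twin;      /* pre-write copy, taken at the write fault */
} dsm_page;

/* Write-fault path: remember the page's pre-write contents. */
static void make_twin(dsm_page *p) {
    p->twin = malloc(PAGE_SIZE);
    memcpy(p->twin, p->working, PAGE_SIZE);
}

/* Release path: ship only modified words, so concurrent writers to
 * disjoint parts of a page never clobber each other's updates. */
static void flush_diffs(dsm_page *p, int home) {
    uint32_t *cur = (uint32_t *)(void *)p->working;
    uint32_t *old = (uint32_t *)(void *)p->twin;
    for (size_t i = 0; i < PAGE_SIZE / sizeof(uint32_t); i++)
        if (cur[i] != old[i])
            remote_write(home, i * sizeof(uint32_t), cur[i]);
    free(p->twin);
    p->twin = NULL;
}

int main(void) {
    dsm_page p = { calloc(1, PAGE_SIZE), NULL };
    make_twin(&p);
    p.working[100] = 7;    /* a local write */
    flush_diffs(&p, 0);    /* at release, one changed word travels */
    free(p.working);
    return 0;
}
```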


International Symposium on Computer Architecture | 1997

VM-based shared memory on low-latency, remote-memory-access networks

Leonidas I. Kontothanassis; Galen C. Hunt; Robert J. Stets; Nikolaos Hardavellas; Michal Cierniak; Srinivasan Parthasarathy; Wagner Meira; Sandhya Dwarkadas; Michael L. Scott

Recent technological advances have produced network interfaces that provide users with very low-latency access to the memory of remote machines. We examine the impact of such networks on the implementation and performance of software DSM. Specifically, we compare two DSM systems, Cashmere and TreadMarks, on a 32-processor DEC Alpha cluster connected by a Memory Channel network. Both Cashmere and TreadMarks use virtual memory to maintain coherence on pages, and both use lazy, multi-writer release consistency. The systems differ dramatically, however, in the mechanisms used to track sharing information and to collect and merge concurrent updates to a page, with the result that Cashmere communicates much more frequently, and at a much finer grain. Our principal conclusion is that low-latency networks make DSM based on fine-grain communication competitive with more coarse-grain approaches, but that further hardware improvements will be needed before such systems can provide consistently superior performance. In our experiments, Cashmere scales slightly better than TreadMarks for applications with false sharing. At the same time, it is severely constrained by limitations of the current Memory Channel hardware. In general, performance is better for TreadMarks.
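
Both systems rely on the same underlying virtual-memory trick: protect shared pages and service faults in user space. The sketch below is a minimal Linux-style illustration of that mechanism, not the actual Cashmere or TreadMarks source; fetch_page() is a hypothetical placeholder for the network transfer, and error handling is omitted.

```c
/* Minimal sketch of VM-page-based coherence. Shared pages start
 * inaccessible; the SIGSEGV handler "fetches" the page and opens
 * its protection, and the faulting access then retries. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

static void fetch_page(void *page) { (void)page; /* network fetch goes here */ }

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~((uintptr_t)page_size - 1));
    fetch_page(page);                                   /* get current contents */
    mprotect(page, page_size, PROT_READ | PROT_WRITE);  /* then allow the access */
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);
    uint8_t *shared = mmap(NULL, 16 * page_size, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
    shared[0] = 1;   /* faults once; handler opens the page; write retries */
    return 0;
}
```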


High-Performance Computer Architecture | 1999

Comparative evaluation of fine- and coarse-grain approaches for software distributed shared memory

Sandhya Dwarkadas; Kourosh Gharachorloo; Leonidas I. Kontothanassis; Daniel J. Scales; Michael L. Scott; Robert J. Stets

Symmetric multiprocessors (SMPs) connected with low-latency networks provide attractive building blocks for software distributed shared memory systems. Two distinct approaches have been used: the fine-grain approach that instruments application loads and stores to support a small coherence granularity, and the coarse-grain approach based on virtual memory hardware that provides coherence at a page granularity. Fine-grain systems offer a simple migration path for applications developed on hardware multiprocessors by supporting coherence protocols similar to those implemented in hardware. On the other hand, coarse-grain systems can potentially provide higher performance through more optimized protocols and larger transfer granularities, while avoiding instrumentation overheads. Numerous studies have examined each approach individually, but major differences in experimental platforms and applications make comparison of the approaches difficult. This paper presents a detailed comparison of two mature systems, Shasta and Cashmere, representing the fine- and coarse-grain approaches, respectively. Both systems are tuned to run on the same commercially available, state-of-the-art cluster of AlphaServer SMPs connected via a Memory Channel network. As expected, our results show that Shasta provides robust performance for applications tuned for hardware multiprocessors, and can better tolerate fine-grain synchronization. In contrast, Cashmere is highly sensitive to fine-grain synchronization, but provides a performance edge for applications with coarse-grain behavior. Interestingly, we found that the performance gap between the systems can often be bridged by program modifications that address coherence and synchronization granularity. In addition, our study reveals some unexpected results related to the interaction of current compiler technology with application instrumentation, and the ability of SMP-aware protocols to avoid certain performance disadvantages of coarse-grain approaches.
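
To make the fine-grain approach concrete: Shasta-style systems insert a state check before shared loads and stores. The sketch below shows what such an inline check might look like in C; it is illustrative only (Shasta actually rewrites Alpha binaries), and the 64-byte block size and table layout are hypothetical.

```c
/* Sketch of fine-grain instrumentation: each small coherence block
 * has a state entry, and an invalid block is fetched inline before
 * the original load proceeds. Illustrative, not Shasta's code. */
#include <stdint.h>

enum state { INVALID, SHARED, EXCLUSIVE };

#define BLOCK_SHIFT 6                 /* hypothetical 64-byte blocks */
#define TABLE_BITS  20

static uint8_t block_state[1 << TABLE_BITS];  /* one state per block */

static void fetch_block(uintptr_t block) { (void)block; /* miss handling */ }

/* The check the instrumentation inserts before each shared load. */
static inline int checked_load(const int *addr) {
    uintptr_t block = ((uintptr_t)addr >> BLOCK_SHIFT)
                      & ((1u << TABLE_BITS) - 1);
    if (block_state[block] == INVALID) {   /* inline miss check */
        fetch_block(block);
        block_state[block] = SHARED;
    }
    return *addr;                          /* the original load */
}

int main(void) {
    static int shared_data[16];
    return checked_load(&shared_data[3]);
}
```

The coarse-grain alternative avoids this per-access overhead entirely by letting the VM hardware detect misses at page granularity, which is exactly the trade-off the comparison quantifies.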


IEEE Computer | 1999

Component-based APIs for versioning and distributed applications

Robert J. Stets; Galen C. Hunt; Michael L. Scott

Operating system application programming interfaces (APIs) are typically monolithic procedural interfaces that address a single machine's requirements. This design limits evolutionary development and complicates application development for distributed systems. Current APIs tend to be large and rigid, and they focus on a single host machine. Component-based APIs could solve these problems through strong versioning capabilities and support for distributed applications. Ideally, obsolete API calls should be deleted, and calls with modified semantics (but unmodified parameters and return values) would remain the same. However, since the OS must continue to support legacy applications, obsolete calls cannot be deleted, and new call semantics are best introduced through new calls. In addition, typical OS APIs do not adequately address the needs of distributed applications: they have support for intermachine communication but lack high-level support for accessing remote OS resources. The primary omission is a uniform method for naming remote resources, such as windows, files, and synchronization objects.
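
A component-based API can version itself by exposing old and new interfaces side by side and letting each client request the version it was built against, loosely in the style of COM's QueryInterface. The sketch below illustrates that idea in C; all names (IFileV1, query_interface, and so on) are hypothetical.

```c
/* Versioning sketch: new semantics arrive as a new interface, while
 * legacy clients keep receiving the old one unchanged. */
#include <stdio.h>
#include <string.h>

typedef struct { int (*open)(const char *path); }            IFileV1;
typedef struct { int (*open)(const char *path, int flags); } IFileV2;

static int open_v1(const char *path) {
    printf("v1 open %s\n", path); return 0;
}
static int open_v2(const char *path, int flags) {
    printf("v2 open %s (flags %d)\n", path, flags); return 0;
}

static IFileV1 file_v1 = { open_v1 };
static IFileV2 file_v2 = { open_v2 };

/* Clients name the interface version they understand. */
static void *query_interface(const char *iid) {
    if (strcmp(iid, "IFileV1") == 0) return &file_v1;
    if (strcmp(iid, "IFileV2") == 0) return &file_v2;
    return NULL;   /* unknown version: caller can fall back cleanly */
}

int main(void) {
    IFileV2 *f = query_interface("IFileV2");
    return f ? f->open("/tmp/x", 1) : 1;
}
```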


International Parallel Processing Symposium | 1999

Cashmere-VLM: Remote memory paging for software distributed shared memory

Sandhya Dwarkadas; Nikolaos Hardavellas; Leonidas I. Kontothanassis; Rishiyur S. Nikhil; Robert J. Stets

Software distributed shared memory (DSM) systems have successfully provided the illusion of shared memory on distributed memory machines. However, most software DSM systems use the main memory of each machine as a level in a cache hierarchy, replicating copies of shared data in local memory. Since computer memories tend to be much larger than caches, DSM systems have largely ignored memory capacity issues, assuming there is always enough space in main memory in which to replicate data. Applications that access data that exceeds the capacity available in local memory will page to disk, resulting in reduced performance. We have developed a software DSM system based on Cashmere that takes advantage of system-wide memory resources in order to reduce or eliminate paging overhead. Experimental results on a 4-node, 16-processor AlphaServer system demonstrate the improvement in performance using the enhanced software DSM system for applications with large data sets.
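
The remote-paging idea can be sketched in a few lines: when local memory is exhausted, a victim page is pushed into a peer node's spare DRAM instead of being written to disk, since a network fetch is far cheaper than a disk read. The sketch below is a hypothetical illustration; remote_put()/remote_get() stand in for the transport, and eviction policy is omitted.

```c
/* Remote-memory paging sketch (illustrative, not Cashmere-VLM code):
 * evict to a peer's DRAM, remember who holds the page, fetch it back
 * on a fault instead of paging from disk. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 8192

/* Hypothetical transport stand-ins. */
static void remote_put(int node, uint64_t key, const void *pg) {
    (void)node; (void)key; (void)pg;
}
static void remote_get(int node, uint64_t key, void *pg) {
    (void)node; (void)key; (void)pg;
}

typedef struct { uint64_t key; int home; } pte_t;

/* Evict: push the page into a peer's memory and record its location. */
static void evict_page(pte_t *pte, void *page, int spare_node) {
    remote_put(spare_node, pte->key, page);
    pte->home = spare_node;   /* remember who holds it */
    free(page);               /* reclaim local memory */
}

/* Fault: fetch from the remembered node rather than from disk. */
static void *fault_in(pte_t *pte) {
    void *page = malloc(PAGE_SIZE);
    remote_get(pte->home, pte->key, page);
    return page;
}

int main(void) {
    pte_t pte = { 42, -1 };
    void *p = calloc(1, PAGE_SIZE);
    evict_page(&pte, p, 3);
    free(fault_in(&pte));
    return 0;
}
```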


High-Performance Computer Architecture | 2000

The effect of network total order, broadcast, and remote-write capability on network-based shared memory computing

Robert J. Stets; Sandhya Dwarkadas; Leonidas I. Kontothanassis; Umit Rencuzogullari; Michael L. Scott

Emerging system-area networks provide a variety of features that can dramatically reduce network communication overhead. In this paper, we evaluate the impact of such features on the implementation of Software Distributed Shared Memory (SDSM), and on the Cashmere system in particular. Cashmere has been implemented on the Compaq Memory Channel network, which supports low-latency messages, protected remote memory writes, inexpensive broadcast, and total ordering of network packets. Our evaluation is based on several Cashmere protocol variants, ranging from a protocol that fully leverages the Memory Channel's special features to one that uses the network only for fast messaging. We find that the special features improve performance by 18-44% for three of our applications, but less than 12% for our other seven applications. We also find that home node migration, an optimization available only in the message-based protocol, can improve performance by as much as 67%. These results suggest that for systems of modest size, low latency is much more important for SDSM performance than are remote writes, broadcast, or total ordering. At the same time, results on an emulated 32-node system indicate that broadcast based on remote writes of widely-shared data may improve performance by up to 51% for some applications. If hardware broadcast or multicast facilities can be made to scale, they can be beneficial in future system-area networks.
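
The broadcast trade-off measured here comes down to message counts. The toy model below makes that explicit under hypothetical assumptions: on a point-to-point network, each update to a widely shared datum costs one remote write per replica, whereas a broadcast-capable network pays once per update.

```c
/* Toy message-count model of point-to-point versus broadcast updates
 * of widely shared data. All counts are illustrative. */
#include <stdio.h>

int main(void) {
    int nodes   = 32;     /* the paper's emulated system size */
    int updates = 1000;   /* hypothetical updates to shared data */

    long p2p   = (long)updates * (nodes - 1); /* one write per other node */
    long bcast = updates;                     /* one broadcast per update */

    printf("point-to-point: %ld packets, broadcast: %ld packets (%.0fx)\n",
           p2p, bcast, (double)p2p / bcast);
    return 0;
}
```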


Scientific Programming | 1999

CRAUL: Compiler and run-time integration for adaptation under load

Sotiris Ioannidis; Umit Rencuzogullari; Robert J. Stets; Sandhya Dwarkadas

Clusters of workstations provide a cost-effective, high performance parallel computing environment. These environments, however, are often shared by multiple users, or may consist of heterogeneous machines. As a result, parallel applications executing in these environments must operate despite unequal computational resources. For maximum performance, applications should automatically adapt execution to maximize use of the available resources. Ideally, this adaptation should be transparent to the application programmer. In this paper, we present CRAUL (Compiler and Run-Time Integration for Adaptation Under Load), a system that dynamically balances computational load in a parallel application. Our target run-time is software-based distributed shared memory (SDSM). SDSM is a good target for parallelizing compilers since it reduces compile-time complexity by providing data caching and other support for dynamic load balancing. CRAUL combines compile-time support to identify data access patterns with a run-time system that uses the access information to intelligently distribute the parallel workload in loop-based programs. The distribution is chosen according to the relative power of the processors and so as to minimize SDSM overhead and maximize locality. We have evaluated the resulting load distribution in the presence of different types of load: computational, computational and memory intensive, and network load. CRAUL performs within 5-23% of ideal in the presence of load, and is able to improve on naive compiler-based work distribution that does not take locality into account even in the absence of load.
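
The core distribution step, partitioning loop iterations in proportion to each processor's measured power, can be sketched in a few lines. The example below is a minimal illustration under hypothetical power values, not CRAUL's actual algorithm, which also accounts for SDSM overhead and locality.

```c
/* Load-proportional loop partitioning sketch: each processor gets a
 * contiguous chunk of iterations sized to its relative power, which
 * also keeps its accesses contiguous and local. Power values are
 * hypothetical (e.g., two nodes slowed by external load). */
#include <stdio.h>

#define NPROCS 4

int main(void) {
    double power[NPROCS] = { 1.0, 1.0, 0.5, 0.25 };
    int iters = 1000, start = 0;
    double total = 0;

    for (int p = 0; p < NPROCS; p++) total += power[p];

    for (int p = 0; p < NPROCS; p++) {
        int count = (int)(iters * power[p] / total + 0.5);
        if (p == NPROCS - 1) count = iters - start;  /* absorb rounding */
        printf("proc %d: iterations [%d, %d)\n", p, start, start + count);
        start += count;
    }
    return 0;
}
```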


Archive | 2003

Controlling power consumption of at least one computer system

Keith I. Farkas; Gopalakrishnan Janakiraman; Robert J. Stets; Chandrakant D. Patel; Christopher C. Wanner


Archive | 2001

Multiprocessor cache coherence system and method in which processor nodes and input/output nodes are equal participants

Luiz A. Barroso; Kourosh Gharachorloo; Andreas Nowatzyk; Mosur K. Ravishankar; Robert J. Stets
