Publication


Featured research published by Mark R. Swanson.


high-performance computer architecture | 1999

Impulse: building a smarter memory controller

John B. Carter; Wilson C. Hsieh; Leigh Stoller; Mark R. Swanson; Lixin Zhang; Erik Brunvand; Al Davis; Chen-Chi Kuo; Ravindra Kuramkote; Michael A. Parker; Lambert Schaelicke; Terry Tateyama

Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. In this paper we describe the design of the Impulse architecture, and show how an Impulse memory system can be used to improve the performance of memory-bound programs. For the NAS conjugate gradient benchmark, Impulse improves performance by 67%. Because it requires no modification to processor, cache, or bus designs, Impulse can be adopted in conventional systems. In addition to scientific applications, we expect that Impulse will benefit regularly strided memory-bound applications of commercial importance, such as database and multimedia programs.
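
A rough way to picture the remapping feature: the memory controller performs, in hardware, the gather that software would otherwise do through an indirection vector, so the CPU sees dense, unit-stride accesses. The C++ sketch below only illustrates that effect (the shadow array is built in software here, whereas Impulse synthesizes it in the controller); it is not the Impulse programming interface.

```cpp
// Illustration of the effect of Impulse-style scatter/gather remapping.
// With a conventional memory system the CPU does the indirection itself;
// with Impulse, the controller is configured with the index vector and
// exposes a dense "shadow" region whose i-th element maps to data[index[i]].
#include <cstddef>
#include <vector>

// Conventional access: one sparse, potentially cache-missing load per element.
double gather_sum_conventional(const std::vector<double>& data,
                               const std::vector<std::size_t>& index) {
    double sum = 0.0;
    for (std::size_t i = 0; i < index.size(); ++i)
        sum += data[index[i]];
    return sum;
}

// What the CPU effectively sees after remapping: a dense, unit-stride array.
// (Here the shadow array is materialized in software purely for illustration.)
double gather_sum_remapped(const std::vector<double>& shadow) {
    double sum = 0.0;
    for (double v : shadow)
        sum += v;
    return sum;
}

int main() {
    std::vector<double> data(1 << 16, 1.0);
    std::vector<std::size_t> index = {5, 4000, 17, 60000, 9};
    std::vector<double> shadow;
    for (std::size_t i : index)
        shadow.push_back(data[i]);        // in Impulse, the controller does this
    return gather_sum_conventional(data, index) == gather_sum_remapped(shadow) ? 0 : 1;
}
```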


high level parallel programming models and supportive environments | 1998

Making distributed shared memory simple, yet efficient

Mark R. Swanson; Leigh Stoller; John B. Carter

Recent research on distributed shared memory (DSM) has focused on improving performance by reducing the communication overhead of DSM. Features added include lazy release consistency based coherence protocols and new interfaces that give programmers the ability to hand-tune communication. These features have increased DSM performance at the expense of requiring increasingly complex DSM systems or increasingly cumbersome programming. They have also increased the computation overhead of DSM, which has partially offset the communication-related performance gains. We chose to implement a simple DSM system, Quarks, with an eye towards hiding most computation overhead while using a very low latency transport layer to reduce the effect of communication overhead. The resulting performance is comparable to that of far more complex DSM systems, such as TreadMarks and Cashmere.
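
For readers unfamiliar with DSM internals, the sketch below models the idea that the coherence protocols mentioned above rely on: writes are applied to a local copy of shared data and propagated in a batch at a synchronization point rather than eagerly. It is a single-process illustration with invented names; it is not the Quarks protocol or API (Quarks, like the systems it is compared with, works at page granularity through virtual-memory faults).

```cpp
// Minimal model of update propagation at synchronization points, the idea
// behind relaxed-consistency DSM protocols. Names and granularity are invented.
#include <cstring>
#include <iostream>
#include <set>
#include <vector>

struct Node {
    std::vector<char> pages;   // this node's copy of the shared pages
    std::set<int> dirty;       // pages written since the last synchronization
};

// Writes are applied locally and merely recorded; no message is sent yet.
void dsm_write(Node& n, int page, int offset, char value, int page_size) {
    n.pages[page * page_size + offset] = value;
    n.dirty.insert(page);
}

// At a synchronization point (e.g. a lock release) the modified pages are
// pushed to the other copy in one batch, which is what keeps the message
// count low under relaxed consistency.
void dsm_sync(Node& from, Node& to, int page_size) {
    for (int page : from.dirty)
        std::memcpy(&to.pages[page * page_size],
                    &from.pages[page * page_size], page_size);
    from.dirty.clear();
}

int main() {
    const int page_size = 8;
    Node a{std::vector<char>(2 * page_size, '.'), {}};
    Node b = a;
    dsm_write(a, 1, 3, 'x', page_size);          // not yet visible at b
    std::cout << b.pages[1 * page_size + 3];     // prints '.'
    dsm_sync(a, b, page_size);
    std::cout << b.pages[1 * page_size + 3] << "\n";  // prints 'x'
    return 0;
}
```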


workshop on hot topics in operating systems | 1993

FLEX: a tool for building efficient and flexible systems

John B. Carter; Bryan Ford; Mike Hibler; Ravindra Kuramkote; Jeffrey Law; Jay Lepreau; Douglas B. Orr; Leigh Stoller; Mark R. Swanson

Modern operating systems must support a wide variety of services for a diverse set of users. Designers of these systems face a tradeoff between functionality and performance. Systems like Mach provide a set of general abstractions and attempt to handle every situation, which can lead to poor performance for common cases. Other systems, such as Unix, provide a small set of abstractions that can be made very efficient, at the expense of functionality. We are implementing a flexible system-building tool, FLEX, that allows us to support a powerful operating system interface efficiently by constructing specialized module implementations at runtime. FLEX improves the performance of existing systems by optimizing interprocess communication paths and relocating servers and clients to reduce communication overhead. These facilities improve the performance of Unix system calls on Mach by 20-400%. Furthermore, FLEX can dynamically extend the kernel in a controlled fashion, which gives user programs access to privileged data and devices not envisioned by the original operating system implementor.


international conference on functional programming | 1988

An implementation of portable standard LISP on the BBN butterfly

Mark R. Swanson; Robert R. Kessler; Gary Lindstrom

An implementation of the Portable Standard Lisp (PSL) on the BBN Butterfly is described. Butterfly PSL is identical, syntactically and semantically, to implementations of PSL currently available on the VAX, Gould, and many 68000-based machines, except for the differences discussed in this paper. The differences include the addition of the future and touch constructs for explicit parallelism and an extension of the fluid binding mechanism to support the multiple environments required by concurrent tasks. As with all other PSL implementations, full compilation to machine code of the basic system and application source code is the normal mode, in contrast to the previous byte-code interpreter efforts. Also discussed are other required changes to the PSL system not visible in the syntax or semantics, e.g., compiler support for the future construct. Finally, the underlying hardware is described, and timings for basic operations and speedup results for two examples are given.
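
The future and touch constructs behave much like the futures of later languages: future starts a computation that may run in parallel with its parent, and touch blocks until the value is available. The analogy below is written with C++ std::async and std::future (this page carries no Lisp code); it conveys the semantics only and is not Butterfly PSL's implementation.

```cpp
#include <future>
#include <iostream>

// Naive Fibonacci, used only as a stand-in for "some expensive computation".
long fib(long n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

int main() {
    // (setq f (future (fib 30)))  ; spawn the computation to run in parallel
    std::future<long> f = std::async(std::launch::async, fib, 30);

    // The parent continues with other work while f may be computed elsewhere.
    long other = fib(25);

    // (touch f)  ; block until the value is available, then use it
    std::cout << f.get() + other << "\n";
    return 0;
}
```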


international conference on parallel processing | 1998

ASCOMA: an adaptive hybrid shared memory architecture

Chen-Chi Kuo; John B. Carter; Ravindra Kuramkote; Mark R. Swanson

Scalable shared memory multiprocessors traditionally use either a cache-coherent non-uniform memory access (CC-NUMA) architecture or a simple cache-only memory architecture (S-COMA). Recently, hybrid architectures that combine aspects of both CC-NUMA and S-COMA have emerged. We present two improvements over other hybrid architectures. The first improvement is a page allocation algorithm that prefers S-COMA pages at low memory pressure. Once the local free page pool is drained, additional pages are mapped in CC-NUMA mode until they suffer sufficient remote misses to warrant upgrading to S-COMA mode. The second improvement is a page replacement algorithm that dynamically backs off the rate of page remappings from CC-NUMA to S-COMA mode at high memory pressure. This design dramatically reduces kernel overhead and the number of induced cold misses caused by needless thrashing of the page cache. The resulting hybrid architecture is called adaptive S-COMA (AS-COMA). AS-COMA exploits the best of S-COMA and CC-NUMA, performing like an S-COMA machine at low memory pressure and like a CC-NUMA machine at high memory pressure. AS-COMA outperforms CC-NUMA under almost all conditions, and outperforms other hybrid architectures by up to 17% at low memory pressure and up to 90% at high memory pressure.
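
The two policies can be summarized in a few lines. The sketch below is a schematic rendering of the decisions described above; the function names, the pressure parameter, and the thresholds are invented for illustration and are not the AS-COMA kernel code.

```cpp
#include <cstddef>

enum class PageMode { SCOMA, CCNUMA };

struct PolicyState {
    std::size_t free_local_pages;   // local page pool available for S-COMA replication
    std::size_t upgrade_threshold;  // remote misses needed before upgrading a page
};

// First improvement: prefer S-COMA pages while the local pool lasts,
// otherwise fall back to CC-NUMA mapping.
PageMode allocate_page(PolicyState& s) {
    if (s.free_local_pages > 0) {
        --s.free_local_pages;
        return PageMode::SCOMA;
    }
    return PageMode::CCNUMA;
}

// A CC-NUMA page is upgraded to S-COMA only once it has suffered enough
// remote misses to make replication worthwhile.
bool should_upgrade(std::size_t remote_misses, const PolicyState& s) {
    return remote_misses >= s.upgrade_threshold;
}

// Second improvement: as memory pressure rises, raise the upgrade threshold
// so remapping (and the page-cache thrashing it can cause) backs off.
void adjust_for_pressure(PolicyState& s, double memory_pressure /* 0..1 */) {
    s.upgrade_threshold = 4 + static_cast<std::size_t>(memory_pressure * 64);
}

int main() {
    PolicyState s{2, 4};
    adjust_for_pressure(s, 0.1);
    allocate_page(s);                   // S-COMA
    allocate_page(s);                   // S-COMA; pool now empty
    PageMode m = allocate_page(s);      // CC-NUMA
    return (m == PageMode::CCNUMA && !should_upgrade(2, s)) ? 0 : 1;
}
```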


Lecture Notes in Computer Science | 1998

Memory System Support for Irregular Applications

John B. Carter; Wilson C. Hsieh; Mark R. Swanson; Lixin Zhang; Erik Brunvand; Al Davis; Chen-Chi Kuo; Ravindra Kuramkote; Michael A. Parker; Lambert Schaelicke; Leigh Stoller; Terry Tateyama

Because irregular applications have unpredictable memory access patterns, their performance is dominated by memory behavior. The Impulse configurable memory controller will enable significant performance improvements for irregular applications, because it can be configured to optimize memory accesses on an application-by-application basis. In this paper we describe the optimizations that the Impulse controller supports for sparse matrix-vector product, an important computational kernel, and outline the transformations that the compiler and runtime system must perform to exploit these optimizations.
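
The kernel in question is ordinary compressed-row-storage (CRS) sparse matrix-vector product. The indirect loads through the column-index array are the irregular accesses an Impulse controller can be configured to gather into a dense shadow region; the plain C++ version below simply shows where that indirection occurs.

```cpp
#include <cstddef>
#include <vector>

// y = A * x with A stored in compressed row storage (CRS).
void spmv(const std::vector<double>& val,
          const std::vector<int>& col,
          const std::vector<int>& row_ptr,     // size = number of rows + 1
          const std::vector<double>& x,
          std::vector<double>& y) {
    for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col[j]];   // irregular access to x through col[]
        y[i] = sum;
    }
}

int main() {
    // 2x2 matrix [[1 0], [2 3]] times x = [1, 1]
    std::vector<double> val{1, 2, 3};
    std::vector<int> col{0, 0, 1};
    std::vector<int> row_ptr{0, 1, 3};
    std::vector<double> x{1, 1}, y(2);
    spmv(val, col, row_ptr, x, y);
    return (y[0] == 1 && y[1] == 5) ? 0 : 1;
}
```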


parallel computing | 1997

Efficient Communication Mechanisms for Cluster Based Parallel Computing

Al Davis; Mark R. Swanson; Michael A. Parker

The key to crafting an effective scalable parallel computing system lies in minimizing the delays imposed by the system. Of particular importance are communications delays, since parallel algorithms must communicate frequently. The communication delay is a system-imposed latency. The existence of relatively inexpensive high performance workstations and emerging high performance interconnect options provide compelling economic motivation to investigate NOW/COW (network/cluster of workstation) architectures. However, these commercial components have been designed for generality. Cluster nodes are connected by longer physical wire paths than found in special-purpose supercomputer systems. Both effects tend to impose intractable latencies on communication. Even larger system-imposed delays result from the overhead of sending and receiving messages. This overhead can come in several forms, including CPU occupancy by protocol and device code as well as interference with CPU access to various levels of the memory hierarchy. Access contention becomes even more onerous when the nodes in the system are themselves symmetric multiprocessors. Additional delays are incurred if the communication mechanism requires processes to run concurrently in order to communicate with acceptable efficiency. This paper presents the approach taken by the Utah Avalanche project which spans user level code, operating system support, and network interface hardware. The result minimizes the constraining effects of latency, overhead, and loosely coupled scheduling that are common characteristics in NOW-based architectures.


international parallel and distributed processing symposium | 1993

Compiling Distributed C++

Harold Carr; Robert R. Kessler; Mark R. Swanson

Distributed C++ (DC++) is a language for writing parallel applications on loosely coupled distributed systems in C++. Its key idea is to extend the C++ class into three categories: gateway classes, which act as communication and synchronization entry points between abstract processors; classes whose instances may be passed by value between abstract processors via gateways; and vanilla C++ classes. DC++ code is compiled to C++ code with calls to the DC++ runtime system. The DC++ compiler wraps gateway classes with handle classes so that remote procedure calls are transparent. It adds static variables to value classes and produces code which is used to marshal and unmarshal arguments when these value classes are used in remote procedure calls. Value classes are deep copied and preserve structure sharing. This paper describes DC++ compilation and presents performance results.
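
To make the three class categories concrete, the sketch below renders them as plain C++. The gateway is simulated by an ordinary local object; in DC++ the compiler would wrap it in a handle class and generate the marshalling code, so a call through the handle becomes a remote procedure call. All class names here are invented for illustration.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Category 1: a vanilla C++ class, purely local.
class Counter {
    int n = 0;
public:
    void bump() { ++n; }
    int value() const { return n; }
};

// Category 2: a value class. Instances may be passed by value between
// abstract processors via gateways, so the compiler emits marshalling code
// that deep-copies them while preserving structure sharing.
struct Record {
    std::string key;
    std::vector<double> samples;
};

// Category 3: a gateway class, the communication and synchronization entry
// point of an abstract processor. In DC++ a handle class makes calls such as
// server.store(r) transparent remote procedure calls; here it is just a
// local object standing in for that behaviour.
class Server {
    std::map<std::string, Record> table;
public:
    void store(Record r) { table[r.key] = r; }            // r arrives by value
    Record fetch(const std::string& key) { return table[key]; }
};

int main() {
    Server server;                        // would live on another abstract processor
    server.store({"run1", {1.0, 2.0}});   // looks like an ordinary member call
    std::cout << server.fetch("run1").samples.size() << "\n";
    Counter c; c.bump();
    return c.value() == 1 ? 0 : 1;
}
```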


Higher-Order and Symbolic Computation / Lisp and Symbolic Computation | 1992

Implementing Concurrent Scheme for the Mayfly distributed parallel processing system

Robert R. Kessler; Harold Carr; Leigh Stoller; Mark R. Swanson

This paper discusses a parallel Lisp system developed for a distributed-memory parallel processor, the Mayfly. The language has been adapted to the problems of distributed data by providing a tight coupling of control and data, including mechanisms for mutual exclusion and data sharing. The language was primarily designed to execute on the Mayfly, but also runs on networked workstations. Initially, we show the relevant parts of the language as seen by the user. Then we concentrate on the system-Lisp-level implementation of these constructs, with particular attention to agents, a mechanism for limiting the cost of remote operations. Briefly mentioned are the low-level kernel hardware and software support for the system Lisp primitives.


SIGPLAN Notices | 1993

Distributed C++

Harold Carr; Robert R. Kessler; Mark R. Swanson

Distributed C++ (DC++) is a language in which to write parallel applications on loosely coupled distributed systems. Its key idea is to extend the C++ class into three categories: vanilla C++ classes, classes that act as communication and synchronization gateways between abstract processors, and classes whose instances may be passed by value between abstract processors via gateways. Value classes are deep copied and preserve structure sharing. DC++ marshals arguments when value classes are used in remote method invocations. An important result of making the three categories syntactically identical is that member function invocation of gateways looks identical to vanilla member function invocation. Therefore, gateways may be redistributed without program modification. Concurrency is achieved by creating multiple abstract processors and starting multiple threads of control operating within these abstract processors. DC++ transparently supports multitasking so that the number of actual processors may be changed without program modification. DC++ provides support for concurrency (threads), communication and synchronization (explicit ports, remote method invocation, and futures), and locality (domains). DC++ is designed to run on the Mayfly Distributed Processing System. It also runs on homogeneous workstations (HP Series 9000, Model 300/400/800/700) using BSD sockets or Mach ports for communication.
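
The concurrency model described above can be approximated with standard threads and futures. The example below uses std::thread and std::future as stand-ins for DC++ abstract processors, explicit ports, and futures; it illustrates the programming model only, not the DC++ runtime primitives.

```cpp
#include <future>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// A gateway-like object: its methods are the synchronization points, so
// concurrent callers are serialized here rather than in client code.
class Accumulator {
    std::mutex m;
    long total = 0;
public:
    void add(long v) { std::lock_guard<std::mutex> g(m); total += v; }
    long value() { std::lock_guard<std::mutex> g(m); return total; }
};

int main() {
    Accumulator acc;

    // Multiple threads of control operating against one gateway-like object.
    std::vector<std::thread> workers;
    for (int i = 1; i <= 4; ++i)
        workers.emplace_back([&acc, i] { acc.add(i); });
    for (auto& t : workers) t.join();

    // A future-returning invocation: the caller can keep working and
    // synchronize on the result later, as DC++ futures allow.
    std::future<long> f = std::async(std::launch::async, [&acc] { return acc.value(); });
    std::cout << f.get() << "\n";   // prints 10
    return 0;
}
```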
