Publication


Featured research published by Klaus E. Schauser.


International Symposium on Computer Architecture | 1992

Active messages: a mechanism for integrated communication and computation

Thorsten von Eicken; David E. Culler; Seth Copen Goldstein; Klaus E. Schauser

The design challenge for large-scale multiprocessors is (1) to minimize communication overhead, (2) to allow communication to overlap computation, and (3) to coordinate the two without sacrificing processor cost/performance. We show that existing message-passing multiprocessors have unnecessarily high communication costs. Research prototypes of message-driven machines demonstrate low communication overhead, but poor processor cost/performance. We introduce a simple communication mechanism, Active Messages, and show that it is intrinsic to both architectures, that it allows cost-effective use of the hardware, and that it offers tremendous flexibility. Implementations on nCUBE/2 and CM-5 are described and evaluated using a split-phase shared-memory extension to C, Split-C. We further show that active messages are sufficient to implement the dynamically scheduled languages for which message-driven machines were designed. With this mechanism, latency tolerance becomes a programming/compiling concern. Hardware support for active messages is desirable, and we outline a range of enhancements to mainstream processors.
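
The core idea is that each message names a handler that the receiver runs immediately on arrival, rather than buffering the message for a later receive. The following is a minimal sketch of that dispatch pattern only, not the paper's implementation; all names (Node, am_send, poll, the handlers) are invented for illustration.

```python
from collections import deque

class Node:
    def __init__(self, name):
        self.name = name
        self.inbox = deque()          # stands in for the network interface
        self.handlers = {}            # handler table, indexed by name

    def register(self, name, fn):
        self.handlers[name] = fn

    def am_send(self, dest, handler, *args):
        dest.inbox.append((handler, args))   # message = handler id + arguments

    def poll(self):
        # On arrival, run the handler to completion; it typically just
        # deposits data into the ongoing computation or sends a reply.
        while self.inbox:
            handler, args = self.inbox.popleft()
            self.handlers[handler](self, *args)

# Usage: a split-phase remote read, in the spirit of Split-C's split-phase access.
a, b = Node("A"), Node("B")
table = {"x": 42}
result = {}

def get_handler(node, key, requester):
    node.am_send(requester, "put", key, table[key])   # reply with another active message

def put_handler(node, key, value):
    result[key] = value                               # deposit data into the computation

b.register("get", get_handler)
a.register("put", put_handler)

a.am_send(b, "get", "x", a)   # request phase
b.poll()
a.poll()
print(result)                 # {'x': 42}
```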


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1993

LogP: towards a realistic model of parallel computation

David E. Culler; Richard M. Karp; David A. Patterson; Abhijit Sahay; Klaus E. Schauser; Eunice E. Santos; Ramesh Subramonian; Thorsten von Eicken

A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.
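
As a hedged sketch of how the four parameters are typically used in LogP analyses: a single message is charged o at the sender, L in the network, and o at the receiver, and a processor may inject at most one message per gap g. The helper names and parameter values below are illustrative, not from the paper.

```python
def one_message(L, o, g):
    # end-to-end time of a single message under the usual LogP accounting
    return 2 * o + L

def n_messages_from_one_sender(n, L, o, g):
    # the last message leaves after (n - 1) gaps, then needs o + L + o to arrive
    return (n - 1) * max(g, o) + 2 * o + L

# Example with made-up, CM-5-like parameters in microseconds:
print(one_message(L=6, o=2, g=4))               # 10
print(n_messages_from_one_sender(8, 6, 2, 4))   # 38
```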


ACM Symposium on Parallel Algorithms and Architectures | 1995

LogGP: incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation

Albert Alexandrov; Mihai F. Ionescu; Klaus E. Schauser; Chris J. Scheiman

We present a new model of parallel computation, the LogGP model, and use it to analyze a number of algorithms, most notably, the single node scatter (one-to-all personalized broadcast). The LogGP model is an extension of the LogP model for parallel computation which abstracts the communication of fixed-sized short messages through the use of four parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). As evidenced by experimental data, the LogP model can accurately predict communication performance when only short messages are sent (as on the CM-5). However, many existing parallel machines have special support for long messages and achieve a much higher bandwidth for long messages compared to short messages (e.g., IBM SP-2, Paragon, Meiko CS-2, nCUBE/2). We extend the basic LogP model with a linear model for long messages. This combination, which we call the LogGP model of parallel computation, has one additional parameter, G, which captures the bandwidth obtained for long messages. Experimental data collected on the Meiko CS-2 shows that this simple extension of the LogP model can quite accurately predict communication performance for both short and long messages. This paper discusses algorithm design and analysis under the new model, examining the all-to-all remap, FFT, and radix sort. We also examine, in more detail, the single node scatter problem. We derive solutions for this problem and prove their optimality under the LogGP model. These solutions are qualitatively different from those obtained under the simpler LogP model, reflecting the importance of capturing long messages in a model.
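
To make the contrast concrete, here is a hedged sketch of the usual cost accounting: under LogP a k-byte payload must be split into fixed-size short messages, while under LogGP a single long message is commonly charged o + (k - 1)G + L + o. The parameter values and helper names are made up for illustration.

```python
def logp_long_send(k, w, L, o, g):
    # k-byte payload sent as ceil(k / w) short messages of w bytes each
    msgs = -(-k // w)
    return (msgs - 1) * max(g, o) + 2 * o + L

def loggp_long_send(k, L, o, g, G):
    # one long message: per-byte cost G replaces the per-message gap
    return o + (k - 1) * G + L + o

k = 64 * 1024                                      # 64 KB payload
print(logp_long_send(k, w=16, L=6, o=2, g=4))      # ~16,000 microseconds
print(loggp_long_send(k, L=6, o=2, g=4, G=0.03))   # ~2,000 microseconds at ~33 MB/s
```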


Concurrency and Computation: Practice and Experience | 1997

Javelin: Internet-based parallel computing using Java

Bernd Oliver Christiansen; Peter R. Cappello; Mihai F. Ionescu; Michael O. Neary; Klaus E. Schauser; Daniel Wu

The JAVM (Java Astra Virtual Machine) project is about harnessing the immense computational resource available in the Internet for parallel processing. In this paper, the suitability of Java for Internet-based parallel computing is explored. Next, existing implementations of systems that make use of Java for network parallel computing are presented and categorized. A critique of these implementations follows. Basing on the critique, the requirements and goals of an effective parallel computing system in the Internet environment are singled out. These serve as the blueprint for the development of the JAVM system. Its infrastructure and features, namely ease of use, heterogeneity, portability, security, fault tolerance, load balancing, scalability and accountability, are discussed. Lastly, experimental results based on the running of several parallel applications in the JAVM environment are presented. Basing on the results, the kind of parallel applications that would be well suited for running in JAVM are identified.


ACM Symposium on Parallel Algorithms and Architectures | 1993

Optimal broadcast and summation in the LogP model

Richard M. Karp; Abhijit Sahay; Eunice E. Santos; Klaus E. Schauser

We consider several natural broadcasting problems for the LogP model of distributed memory machines recently proposed by Culler et al. For each of these problems, we present algorithms that yield an optimal communication schedule. Our algorithms are absolutely best possible in that not even the constant factors can be improved upon. We also devise an (absolutely) optimal algorithm for summing a list of elements (using a non-commutative operation) using one of the optimal broadcast algorithms.
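
A hedged sketch of the greedy schedule underlying optimal LogP broadcast: every processor that already knows the datum keeps retransmitting it, one send per gap g, and each transmission informs its receiver L + 2o time units after it is issued. The function and parameter values below are illustrative only.

```python
import heapq

def broadcast_finish_time(P, L, o, g):
    """Time at which all P processors know the datum under the greedy schedule."""
    ready = [0.0]                 # times at which some informed processor may next send
    informed, finish = 1, 0.0     # the root already knows the datum
    while informed < P:
        t = heapq.heappop(ready)
        arrival = t + o + L + o   # the receiver of this transmission is informed here
        informed += 1
        finish = max(finish, arrival)
        heapq.heappush(ready, t + max(g, o))   # the same sender can transmit again
        heapq.heappush(ready, arrival)         # the newly informed processor joins in
    return finish

print(broadcast_finish_time(P=8, L=6, o=2, g=4))   # made-up parameters, microseconds
```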


Journal of Parallel and Distributed Computing | 1993

TAM—a compiler controlled threaded abstract machine

David E. Culler; Seth Copen Goldstein; Klaus E. Schauser; Thorsten von Eicken

The Threaded Abstract Machine (TAM) refines dataflow execution models to address the critical constraints that modern parallel architectures place on the compilation of general-purpose parallel programming languages. TAM defines a self-scheduled machine language of parallel threads, which provides a path from data-flow-graph program representations to conventional control flow. The most important feature of TAM is the way it exposes the interaction between the handling of asynchronous message events, the scheduling of computation, and the utilization of the storage hierarchy. This paper provides a complete description of TAM and codifies the model in terms of a pseudo machine language, TL0. Issues in compilation from a high level parallel language to TL0 are discussed in general and specifically in regard to the Id90 language. The implementation of TL0 on the CM-5 multiprocessor is explained in detail. Using this implementation, a cost model is developed for the various TAM primitives. The TAM approach is evaluated on sizable Id90 programs on a 64 processor system. The scheduling hierarchy of quanta and threads is shown to provide substantial locality while tolerating long latencies. This allows the average thread scheduling cost to be extremely low.
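
The following is a rough sketch, not TAM or TL0 itself, of the scheduling idea described above: threads in a frame carry synchronization counters, an arriving input enables a thread when its counter reaches zero, and all enabled threads of a frame run back to back as a quantum, which is where the locality comes from. Class and method names are invented and details are heavily simplified.

```python
class Frame:
    def __init__(self):
        self.counters = {}        # thread id -> inputs still missing
        self.threads = {}         # thread id -> callable body
        self.enabled = []         # runnable threads of this frame

    def add_thread(self, tid, body, inputs_needed):
        self.threads[tid] = body
        self.counters[tid] = inputs_needed

    def post(self, tid):
        # Called when one input (e.g. an arriving message or a local result)
        # becomes available for the given thread.
        self.counters[tid] -= 1
        if self.counters[tid] == 0:
            self.enabled.append(tid)

    def run_quantum(self):
        # Run all currently enabled threads of this frame back to back.
        while self.enabled:
            self.threads[self.enabled.pop()]()

# Usage: a thread that needs two inputs before it may run.
f = Frame()
f.add_thread("t1", lambda: print("t1 runs"), inputs_needed=2)
f.post("t1"); f.run_quantum()   # nothing yet
f.post("t1"); f.run_quantum()   # prints "t1 runs"
```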


Journal of Parallel and Distributed Computing | 1996

Lazy Threads

Seth Copen Goldstein; Klaus E. Schauser; David E. Culler

In this paper, we describe lazy threads, a new approach for implementing multithreaded execution models on conventional machines. We show how they can implement a parallel call at nearly the efficiency of a sequential call. The central idea is to specialize the representation of a parallel call so that it can execute as a parallel-ready sequential call. This allows excess parallelism to degrade into sequential calls with the attendant efficient stack management and direct transfer of control and data, yet a call that truly needs to execute in parallel gets its own thread of control. The efficiency of lazy threads is achieved through careful attention to storage management and a code generation strategy that allows us to represent potential parallel work with no overhead.
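
As a toy illustration of the "parallel-ready sequential call" idea only (the paper's mechanism works at the level of stack frames and compiler code generation, not Python), the sketch below runs a potentially-parallel child as an ordinary call first and promotes it to its own thread only if it actually suspends. Generators stand in for suspendable calls; all names are invented.

```python
import threading

def lazy_call(child):
    """child is a generator that yields only when it would have to block."""
    try:
        next(child)                       # run eagerly on the caller's stack
    except StopIteration as done:
        return done.value                 # finished like a plain sequential call
    # The child suspended: give it its own thread of control.
    box = {}
    def drain():
        try:
            while True:
                next(child)               # resume until the child completes
        except StopIteration as done:
            box["result"] = done.value
    worker = threading.Thread(target=drain)
    worker.start()
    worker.join()                         # joined here only so this toy can return a value
    return box.get("result")

def square_later(x):
    if x < 10:
        return x * x                      # fast path: never suspends
    yield                                 # pretend we had to wait for remote data
    return x * x

print(lazy_call(square_later(5)))         # 25, via the sequential fast path
print(lazy_call(square_later(20)))        # 400, via a promoted thread
```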


Concurrency and Computation: Practice and Experience | 1997

SuperWeb: research issues in Java-based global computing

Albert Alexandrov; Maximilian Ibel; Klaus E. Schauser; Chris J. Scheiman

The Internet, in particular the World Wide Web, continues to expand at an amazing pace. We propose a new infrastructure, SuperWeb, to harness global resources, such as CPU cycles or disk storage, and make them available to every user on the Internet. SuperWeb has the potential for solving parallel supercomputing applications involving thousands of co-operating components on the Internet. However, we anticipate that initial implementations will be used inside large organizations with large heterogeneous intranets. Our approach is based on recent advances in Internet connectivity and the implementation of safe distributed computing realized by languages such as Java. Our SuperWeb prototype consists of brokers, clients and hosts. Hosts register a fraction of their computing resources (CPU time, memory, bandwidth, disk space) with resource brokers. Clients submit tasks that need to be executed. The broker maps client computations onto the registered hosts. We examine an economic model for trading computing resources, and discuss several technical challenges associated with such a global computing environment.
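
A hypothetical, much-simplified sketch of the broker role described above: hosts register a fraction of their resources, clients submit tasks with requirements, and the broker maps each task onto a host with enough capacity. All class names, fields, and numbers are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    cpu_fraction: float                  # share of CPU time offered
    memory_mb: int
    assigned: list = field(default_factory=list)

@dataclass
class Task:
    name: str
    cpu_needed: float
    memory_mb: int

class Broker:
    def __init__(self):
        self.hosts = []

    def register(self, host):
        self.hosts.append(host)

    def submit(self, task):
        # first-fit mapping of the task onto a registered host
        for host in self.hosts:
            if host.cpu_fraction >= task.cpu_needed and host.memory_mb >= task.memory_mb:
                host.cpu_fraction -= task.cpu_needed
                host.memory_mb -= task.memory_mb
                host.assigned.append(task.name)
                return host.name
        return None                      # no capacity: a real system would queue or reject

broker = Broker()
broker.register(Host("lab-pc-1", cpu_fraction=0.5, memory_mb=64))
print(broker.submit(Task("render-frame-17", cpu_needed=0.25, memory_mb=32)))  # lab-pc-1
```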


IEEE Transactions on Parallel and Distributed Systems | 1996

Fast parallel sorting under LogP: experience with the CM-5

Andrea C. Dusseau; David E. Culler; Klaus E. Schauser; Richard P. Martin

In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) with the LogP model. LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes. We evaluate the robustness of the algorithms by varying the distribution and ordering of the key values. We also briefly examine the sensitivity of the algorithms to the communication parameters. We show that the LogP model is a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention. With an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes. We find that communication performance is oblivious to the distribution of the key values, whereas the local processor performance is not; some communication phases are sensitive to the ordering of keys due to contention. Finally, our analysis shows that overhead is the most critical communication parameter in the sorting algorithms.
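
To illustrate the style of accounting used in such analyses (a rough sketch only, not the paper's exact cost equations): in a balanced communication phase each processor packs its n/P keys into messages of k keys, and its busy time is dominated by the per-message overhead o, which is consistent with the finding that o is the most critical parameter. Helper names and numbers are made up.

```python
def comm_phase_busy_time(n, P, keys_per_msg, o):
    msgs_per_proc = (n / P) / keys_per_msg      # messages each processor injects
    return 2 * o * msgs_per_proc                # o paid once to send, once to receive

def comm_phase_gap_bound(n, P, keys_per_msg, g):
    return g * (n / P) / keys_per_msg           # injection-rate bound from the gap g

n, P = 1_000_000, 64
print(comm_phase_busy_time(n, P, keys_per_msg=4, o=2.0))   # microseconds, illustrative
print(comm_phase_gap_bound(n, P, keys_per_msg=4, g=4.0))
```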


ACM Transactions on Computer Systems | 1998

UFO: a personal global file system based on user-level extensions to the operating system

Albert Alexandrov; Maximilian Ibel; Klaus E. Schauser; Chris J. Scheiman

In this article we show how to extend a wide range of functionality of standard operating systems completely at the user level. Our approach works by intercepting selected system calls at the user level, using tracing facilities such as the /proc file system provided by many Unix operating systems. The behavior of some intercepted system calls is then modified to implement new functionality. This approach does not require any relinking or recompilation of existing applications. In fact, the extensions can even be dynamically “installed” into already running processes. The extensions work completely at the user level and install without system administrator assistance. Individual users can choose what extensions to run, in effect creating a personalized operating system view for themselves. We used this approach to implement a global file system, called Ufo, which allows users to treat remote files exactly as if they were local. Currently, Ufo supports file access through the FTP and HTTP protocols and allows new protocols to be plugged in. While several other projects have implemented global file system abstractions, they all require either changes to the operating system or modifications to standard libraries. The article gives a detailed performance analysis of our approach to extending the OS and establishes that Ufo introduces acceptable overhead for common applications even though intercepting individual system calls incurs a high cost.
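
The sketch below is not Ufo's implementation (which intercepts system calls via tracing facilities such as /proc); it only illustrates the name-space idea in the abstract, in which paths that name remote files are routed to a protocol handler while ordinary paths fall through to the local file system. The path prefixes and function name are invented.

```python
import io
import urllib.request

def open_global(path, mode="r"):
    if path.startswith("/http/"):
        # e.g. /http/example.com/index.html  ->  http://example.com/index.html
        data = urllib.request.urlopen("http://" + path[len("/http/"):]).read()
        return io.BytesIO(data) if "b" in mode else io.StringIO(data.decode())
    if path.startswith("/ftp/"):
        data = urllib.request.urlopen("ftp://" + path[len("/ftp/"):]).read()
        return io.BytesIO(data)
    return open(path, mode)              # local file: use the ordinary open()

# Usage (requires network access):
# with open_global("/http/example.com/") as f:
#     print(f.read()[:60])
```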
