Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Kirk L. Johnson is active.

Publication


Featured research published by Kirk L. Johnson.


symposium on operating systems principles | 1995

CRL: high-performance all-software distributed shared memory

Kirk L. Johnson; M.F. Kaashoek; Deborah A. Wallach

The C Region Library (CRL) is a new all-software distributed shared memory (DSM) system. CRL requires no special compiler, hardware, or operating system support beyond the ability to send and receive messages between processing nodes. It provides a simple, portable, region-based shared address space programming model that is capable of delivering good performance on a wide range of multiprocessor and distributed system architectures. Each region is an arbitrarily sized, contiguous area of memory. The programmer defines regions and delimits accesses to them using annotations. CRL implementations have been developed for two platforms: the Thinking Machines CM-5, a commercial multicomputer, and the MIT Alewife machine, an experimental multiprocessor offering efficient hardware support for both message passing and shared memory. Results are presented for up to 128 processors on the CM-5 and up to 32 processors on Alewife. Using Alewife as a vehicle, this thesis presents results from the first completely controlled comparison of scalable hardware and software DSM systems. These results indicate that CRL is capable of delivering performance that is competitive with hardware DSM systems: CRL achieves speedups within 15% of those provided by Alewife's native hardware-supported shared memory, even for challenging applications (e.g., Barnes-Hut) and small problem sizes. A second set of experimental results provides insight into the sensitivity of CRL's performance to increased communication costs (both higher latency and lower bandwidth). These results demonstrate that even for relatively challenging applications, CRL should be capable of delivering reasonable performance on current-generation distributed systems. Taken together, these results indicate the substantial promise of CRL and other all-software approaches to providing shared memory functionality and suggest that in many cases special-purpose hardware support for shared memory may not be necessary.
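The region-based programming model described in the abstract can be sketched as follows. This is a minimal single-node Python simulation with illustrative names, not CRL's actual C API; the start/end annotations mark where a real DSM implementation would run its coherence protocol.

```python
# Hypothetical sketch of a CRL-style region model (names are illustrative).
# A region is a contiguous, arbitrarily sized block of memory, and every
# access to it is delimited by start/end annotations.

class Region:
    def __init__(self, size):
        self.data = bytearray(size)   # contiguous area of memory
        self.readers = 0
        self.writer = False

    def start_write(self):
        # In a real DSM system this would acquire exclusive ownership
        # of the region before returning a pointer to its data.
        assert not self.writer and self.readers == 0
        self.writer = True
        return self.data

    def end_write(self):
        self.writer = False

    def start_read(self):
        # In a real DSM system this would fetch a coherent copy of the
        # region's data if the local copy were stale.
        assert not self.writer
        self.readers += 1
        return self.data

    def end_read(self):
        self.readers -= 1

r = Region(16)
buf = r.start_write()
buf[0:4] = b"CRL!"                    # write access, bracketed by annotations
r.end_write()

view = r.start_read()
first = bytes(view[0:4])              # read access, bracketed by annotations
r.end_read()
```

Because all coherence work happens inside the annotations, the same program text can run unchanged on hardware with or without shared-memory support, which is the portability argument the paper makes.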


acm sigplan symposium on principles and practice of parallel programming | 1993

Integrating message-passing and shared-memory: early experience

David A. Kranz; Beng-Hong Lim; Kirk L. Johnson; John Kubiatowicz; Anant Agarwal

This paper discusses some of the issues involved in implementing a shared-address space programming model on large-scale, distributed-memory multiprocessors. While such a programming model can be implemented on both shared-memory and message-passing architectures, we argue that the transparent, coherent caching of global data provided by many shared-memory architectures is of crucial importance. Because message-passing mechanisms are much more efficient than shared-memory loads and stores for certain types of interprocessor communication and synchronization operations, however, we argue for building multiprocessors that efficiently support both shared-memory and message-passing mechanisms. We describe an architecture, Alewife, that integrates support for shared-memory and message-passing through a simple interface; we expect the compiler and runtime system to cooperate in using appropriate hardware mechanisms that are most efficient for specific operations. We report on both integrated and exclusively shared-memory implementations of our runtime system and two applications. The integrated runtime system drastically cuts down the cost of communication incurred by the scheduling, load balancing, and certain synchronization operations. We also present preliminary performance results comparing the two systems.
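The efficiency argument above can be illustrated with a back-of-the-envelope cost comparison. The numbers and cost functions here are my own illustration, not Alewife measurements: moving a block of data through cache-coherent loads and stores pays roughly one round trip per cache line, while one explicit message carries the whole payload at once.

```python
# Toy cost comparison (illustrative numbers, not measured Alewife costs):
# bulk transfer via shared-memory cache-line misses vs. a single message.

def shared_memory_cost(n_words, words_per_line, round_trip):
    # Each cache line fetched costs one network round trip.
    lines = -(-n_words // words_per_line)   # ceiling division
    return lines * round_trip

def message_cost(n_words, overhead, per_word):
    # One message: fixed send/receive overhead plus per-word transfer cost.
    return overhead + n_words * per_word

sm = shared_memory_cost(64, words_per_line=4, round_trip=100)
mp = message_cost(64, overhead=200, per_word=2)
```

With these assumed costs the message wins handily for bulk transfer (328 vs. 1600 cycles), while a single cached word is far cheaper as a load, which is why the paper argues for supporting both mechanisms and letting the compiler and runtime pick per operation.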


acm sigplan symposium on principles and practice of parallel programming | 1995

Optimistic active messages: a mechanism for scheduling communication with computation

Deborah A. Wallach; Wilson C. Hsieh; Kirk L. Johnson; M. Frans Kaashoek; William E. Weihl

Low-overhead message passing is critical to the performance of many applications. Active Messages reduce the software overhead for message handling: messages are run as handlers instead of as threads, which avoids the overhead of thread management and the unnecessary data copying of other communication models. Scheduling the execution of Active Messages is typically done by disabling and enabling interrupts, or by polling the network. This primitive scheduling control, combined with the fact that handlers are not schedulable entities, puts severe restrictions on the code that can be run in a message handler. This paper describes a new software mechanism, Optimistic Active Messages (OAM), that eliminates these restrictions; OAMs allow arbitrary user code to execute in handlers, and also allow handlers to block. Despite this gain in expressiveness, OAMs perform as well as Active Messages. We used OAM as the base for an RPC system, Optimistic RPC (ORPC), for the Thinking Machines CM-5 multiprocessor; it consists of an optimized thread package and a stub compiler that hides communication details from the programmer. ORPC is 1.5 to 5 times faster than traditional RPC (TRPC) for small messages and performs as well as Active Messages (AM). Applications that primarily communicate using large data transfers or are fairly coarse-grained perform equally well, independent of whether AMs, ORPCs, or TRPCs are used. For applications that send many short messages, however, the ORPC and AM implementations are up to three times faster than the TRPC implementations. Using ORPC, programmers obtain the benefits of well-proven programming abstractions such as threads, mutexes, and condition variables, do not have to be concerned with communication details, and yet obtain nearly the performance of hand-coded Active Message programs.
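The core mechanism described above can be sketched in a few lines. This is a minimal simulation of the optimistic idea with illustrative names, not the CM-5 implementation: run the handler inline, assuming it will not block; if it does need to block, abort the optimistic attempt and rerun it as a real thread, paying thread-management cost only on that slow path.

```python
# Sketch of optimistic active message delivery (illustrative, not the
# actual ORPC implementation): handlers run inline unless they block.

import threading

class WouldBlock(Exception):
    """Raised by a handler that needs to block (e.g., on a contended lock)."""

def deliver(handler, *args):
    try:
        handler(*args, may_block=False)     # optimistic inline execution
        return "inline"
    except WouldBlock:
        # Pessimistic fallback: only now do we pay for a schedulable thread.
        t = threading.Thread(target=handler, args=args,
                             kwargs={"may_block": True})
        t.start()
        t.join()
        return "thread"

results = []
shared_lock = threading.Lock()

def fast_handler(x, may_block):
    results.append(x + 1)                   # never blocks: stays inline

def blocking_handler(x, may_block):
    if not may_block:
        raise WouldBlock()                  # cannot block inside a handler
    with shared_lock:                       # safe to block as a thread
        results.append(x * 2)

mode1 = deliver(fast_handler, 10)
mode2 = deliver(blocking_handler, 10)
```

In the common case no thread is created at all, which is how OAMs keep Active Message performance while still letting handlers use threads, mutexes, and condition variables on the fallback path.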


Proceedings of the IEEE | 1999

The MIT Alewife Machine

Anant Agarwal; Ricardo Bianchini; David Chaiken; Frederic T. Chong; Kirk L. Johnson; David M. Kranz; John Kubiatowicz; Beng-Hong Lim; Kenneth Mackenzie; Donald Yeung

A variety of models for parallel architectures, such as shared memory, message passing, and data flow, have converged in the recent past to a hybrid architecture form called distributed shared memory (DSM). Alewife, an early prototype of such DSM architectures, uses hybrid software and hardware mechanisms to support coherent shared memory, efficient user-level messaging, fine-grain synchronization, and latency tolerance. Alewife supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. Four mechanisms combine to achieve Alewife's goals of scalability and programmability: software-extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine-grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms, including block multithreading and prefetching, mask unavoidable delays due to communication. Extensive results from microbenchmarks, together with over a dozen complete applications running on a 32-node prototype, demonstrate that integrating message passing with shared memory enables a cost-efficient solution to the cache coherence problem and provides a rich set of programming primitives. Our results further show that messaging and shared memory operations are both important because each helps the programmer to achieve the best performance for various machine configurations.
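The latency-tolerance mechanism mentioned above, block multithreading, admits a simple utilization model. The formula below is my own textbook-style simplification, not one from the paper: a thread computes for R cycles, then stalls for L cycles on a remote access, and with t resident threads the processor can overlap one thread's stall with the others' computation.

```python
# Toy utilization model for block multithreading (my simplification,
# not a formula from the Alewife paper): each thread runs R cycles then
# stalls L cycles; t threads can cover at most t*R cycles of useful work
# per R+L cycle period.

def utilization(threads, run_cycles, miss_latency):
    return min(1.0, threads * run_cycles / (run_cycles + miss_latency))

single = utilization(1, run_cycles=20, miss_latency=80)   # mostly stalled
multi  = utilization(5, run_cycles=20, miss_latency=80)   # latency hidden
```

Under these assumed parameters one thread keeps the processor only 20% busy, while five threads fully hide the 80-cycle remote latency, which is the effect block multithreading (and, analogously, prefetching) is designed to exploit.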


international symposium on computer architecture | 1992

The impact of communication locality on large-scale multiprocessor performance

Kirk L. Johnson

As multiprocessor sizes scale and computer architects turn to interconnection networks with non-uniform communication latencies, the lure of exploiting communication locality to increase performance becomes inevitable. Models that accurately quantify locality effects provide invaluable insight into the importance of exploiting locality as machine sizes and features change. This paper presents a framework for modeling the impact of communication locality on system performance. The framework provides a means for combining simple models of application, processor, and network behavior to obtain a combined model that accurately reflects feedback effects between processors and networks. We introduce a model that characterizes application behavior with three parameters that capture computation grain, sensitivity to communication latency, and amount of locality present at execution time. Using the combined model, we show that exploiting communication locality provides gains which are at most linear in the factor by which average communication distance is reduced when the number of outstanding communication transactions per processor is bounded. The combined model is also used to obtain rough upper bounds on the performance improvement from exploiting locality to minimize communication distance.
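The "at most linear" bound has a simple intuition, sketched below with my own Little's-law-style simplification rather than the paper's actual model: with a bounded number k of outstanding transactions per processor, communication throughput is limited by k divided by latency, so cutting average communication distance (and hence latency) by a factor f can raise that bound by at most the same factor f.

```python
# Toy version of the locality bound (my simplification, not the paper's
# model): with k outstanding transactions and per-transaction latency L,
# a processor can complete at most k/L transactions per cycle.

def max_throughput(k_outstanding, latency):
    return k_outstanding / latency

base      = max_throughput(k_outstanding=4, latency=100.0)
localized = max_throughput(k_outstanding=4, latency=25.0)  # distance cut 4x
gain      = localized / base                               # at most 4x
```

Reducing latency 4x yields at most a 4x throughput gain here, and less in practice once processors spend time computing between transactions, which is why the paper's bounds on locality gains are "rough upper bounds."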


Proceedings of the US/Japan Workshop on Parallel Symbolic Computing: Languages, Systems, and Applications | 1992

Sparcle: A Multithreaded VLSI Processor for Parallel Computing

Anant Agarwal; Jonathan Babb; David Chaiken; Godfrey D'Souza; Kirk L. Johnson; David A. Kranz; John Kubiatowicz; Beng-Hong Lim; Gino K. Maa; Kenneth Mackenzie; Daniel Nussbaum; Mike Parkin; Donald Yeung

The Sparcle chip will clock at no more than 50 MHz. It has no more than 200K transistors. It does not use the latest technologies and dissipates a paltry 2 watts. It has no on-chip cache, no fancy pads, and only 207 pins. It does not even support multiple-instruction issue. Then, why do we think this chip is interesting? Sparcle is a processor chip designed for large-scale multiprocessing. Processors suitable for multiprocessing environments must meet several requirements:


acm sigops european workshop | 1994

Optimistic active messages: structuring systems for high-performance communication

M. Frans Kaashoek; William E. Weihl; Deborah A. Wallach; Wilson C. Hsieh; Kirk L. Johnson

Recent networks and network interfaces promise remarkable communication performance with very little overhead, but current software structures impose substantial overhead that prevents applications from achieving the benefits of these new architectures. We propose a new software structure that eliminates much of the overhead while preserving the ease of programming of current systems. Our architecture relies on the compiler to bridge the gap between high-level application programs and low-level communication primitives. The compiler incorporates application code into message handlers using a new runtime mechanism called optimistic active messages.


international symposium on computer architecture | 1995

The MIT Alewife machine: architecture and performance

Anant Agarwal; Ricardo Bianchini; David Chaiken; Kirk L. Johnson; David A. Kranz; John Kubiatowicz; Beng-Hong Lim; Kenneth Mackenzie; Donald Yeung


Archive | 1991

The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor

Anant Agarwal; David Chaiken; Kirk L. Johnson; David A. Kranz; John Kubiatowicz; K. Kurihara; Beng-Hong Lim; Gino K. Maa; Daniel Nussbaum; Mike Parkin; Donald Yeung


High-performance all-software distributed shared memory | 1996

High-performance all-software distributed shared memory

Kirk L. Johnson

Collaboration


Dive into Kirk L. Johnson's collaborations.

Top Co-Authors

Anant Agarwal (Massachusetts Institute of Technology)
Beng-Hong Lim (Massachusetts Institute of Technology)
David Chaiken (Massachusetts Institute of Technology)
David A. Kranz (Massachusetts Institute of Technology)
Deborah A. Wallach (Massachusetts Institute of Technology)
Kenneth Mackenzie (Massachusetts Institute of Technology)
Daniel Nussbaum (Massachusetts Institute of Technology)
Gino K. Maa (Massachusetts Institute of Technology)