
Publications


Featured research published by David A. Kranz.


International Symposium on Computer Architecture | 1990

APRIL: a processor architecture for multiprocessing

Anant Agarwal; Beng-Hong Lim; David A. Kranz; John Kubiatowicz

Processors in large-scale multiprocessors must be able to tolerate large communication latencies and synchronization delays. This paper describes the architecture of a rapid-context-switching processor called APRIL with support for fine-grain threads and synchronization. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial RISC-based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles is described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of two over a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. We show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.


IEEE Transactions on Parallel and Distributed Systems | 1991

Lazy task creation: a technique for increasing the granularity of parallel programs

Eric Mohr; David A. Kranz; Robert H. Halstead

Many parallel algorithms are naturally expressed at a fine level of granularity, often finer than a MIMD parallel system can exploit efficiently. Most builders of parallel systems have looked to either the programmer or a parallelizing compiler to increase the granularity of such algorithms. In this paper, the authors explore a third approach to the granularity problem by analyzing two strategies for combining parallel tasks dynamically at run-time. They reject the simpler load-based inlining method, where tasks are combined based on dynamic load level, in favor of the safer and more robust lazy task creation method, where tasks are created only retroactively as processing resources become available. These strategies grew out of work on Mul-T, an efficient parallel implementation of Scheme, but could be used with other languages as well. The authors describe their Mul-T implementations of lazy task creation for two contrasting machines, and present performance statistics which show the method's effectiveness. Lazy task creation allows efficient execution of naturally expressed algorithms of a substantially finer grain than possible with previous parallel Lisp systems.
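The core idea of lazy task creation can be sketched in a few lines of Python. This is a single-threaded toy with hypothetical names (`LazyWorker`, `fork`, `join`, `steal`), not the Mul-T mechanism, which operates on continuations rather than thunks: each potential fork is recorded cheaply on a deque instead of spawning a task, and only a steal by an idle processor would retroactively promote it into a real task.

```python
from collections import deque

# Toy single-threaded sketch of lazy task creation (hypothetical names;
# the real Mul-T implementation works on continuations, not thunks).
# fork() records a potential task instead of creating one eagerly; if no
# thief steals it, join() runs it inline with no task-creation overhead.

class LazyWorker:
    def __init__(self):
        self.deque = deque()   # oldest entries (left end) are stolen first
        self.tasks_created = 0

    def fork(self, thunk):
        # Record a potential task; do not create a real task yet.
        self.deque.append(thunk)

    def join(self):
        # No thief took it: evaluate inline, as ordinary sequential code.
        thunk = self.deque.pop()
        return thunk()

    def steal(self):
        # An idle processor retroactively promotes the oldest potential
        # task into a real one (unused in this single-threaded demo).
        if self.deque:
            self.tasks_created += 1
            return self.deque.popleft()
        return None

def fib(w, n):
    if n < 2:
        return n
    w.fork(lambda: fib(w, n - 1))   # potential parallelism, recorded lazily
    b = fib(w, n - 2)               # continue executing inline
    a = w.join()                    # reclaim the unstolen thunk
    return a + b

w = LazyWorker()
print(fib(w, 10))          # 55
print(w.tasks_created)     # 0: every potential task ran inline
```

Because the deque is used LIFO by the owner and FIFO by thieves, unstolen work costs only a push and a pop, which is the granularity win the paper describes.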


International Symposium on Microarchitecture | 1993

Sparcle: an evolutionary processor design for large-scale multiprocessors

Anant Agarwal; John Kubiatowicz; David A. Kranz; Beng-Hong Lim; Donald Yeung; Godfrey D'Souza; Mike Parkin

The design of the Sparcle chip, which incorporates mechanisms required for massively parallel systems in a SPARC RISC core, is described. Coupled with a communications and memory management unit (CMMU), Sparcle allows a fast, 14-cycle context switch, an 8-cycle user-level message send, and fine-grain full/empty-bit synchronization. Sparcle's fine-grain computation, memory latency tolerance, and efficient message interface are discussed. The implementation of Sparcle as a CPU for the Alewife machine is described.


Proceedings of the US/Japan Workshop on Parallel Lisp: Languages and Systems | 1989

Mul-T: A High-Performance Parallel Lisp

David A. Kranz; Robert H. Halstead; Eric Mohr

The development of Mul-T has been valuable in several ways. First, Mul-T is a complete, working parallel Lisp system, publicly available to interested users. Second, its single-processor performance is competitive with that of “production quality” sequential Lisp implementations, and therefore a parallel program running under Mul-T can show absolute speedups over the best sequential implementation of the same algorithm. This is attractive to application users whose primary interest is raw speed rather than the abstract gratification of having demonstrated speedup via a time-consuming simulation. Finally, implementing Mul-T has allowed us to experiment with and evaluate implementation strategies such as inlining. The Mul-T experience has also allowed us to probe the limits of implementing futures on stock multiprocessors, and has suggested (for example) that hardware assistance for tag management may be a more significant benefit in a machine for parallel Lisp (where it can eliminate the 65% overhead of implicit touches) than it has ever proven to be in machines for sequential Lisps.
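The future construct at the heart of Mul-T can be approximated in Python, with the caveat that this uses OS threads rather than Mul-T's fine-grained tasks, and that the `future`/`touch` helpers here are illustrative names, not a Mul-T API. `future` returns a placeholder for an in-flight computation; `touch` blocks until the value is ready. In Mul-T the touch is implicit on every strict use of a value, which is why the abstract argues for hardware tag support to eliminate that overhead.

```python
from concurrent.futures import ThreadPoolExecutor

# Rough Python analogue of Multilisp/Mul-T futures (illustrative only:
# Python threads stand in for Mul-T's fine-grained parallel tasks).

pool = ThreadPoolExecutor()

def future(thunk):
    # Start evaluating the thunk and return a placeholder immediately.
    return pool.submit(thunk)

def touch(f):
    # Block until the future resolves, then yield its value. In Mul-T
    # this check is implicit on every strict operation, hence the cost
    # of implicit touches on stock hardware.
    return f.result()

x = future(lambda: sum(range(1_000)))
y = future(lambda: 2 * 21)
print(touch(x) + touch(y))   # 499500 + 42 = 499542
```

The sketch shows why futures compose so naturally: callers manipulate placeholders exactly like ordinary values until a strict use forces synchronization.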


IEEE Transactions on Parallel and Distributed Systems | 1995

Automatic partitioning of parallel loops and data arrays for distributed shared-memory multiprocessors

Anant Agarwal; David A. Kranz; Venkat Natarajan

Presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor.
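A heavily simplified sketch of the tiling idea, restricted to rectangular tiles (the paper derives general hyperparallelepiped tiles for affine index expressions): partition a 2-D iteration space across a processor grid and measure each tile's data footprint. The function and variable names here are hypothetical, for illustration only.

```python
# Toy rectangular tiling of a 2-D iteration space (the paper's framework
# handles general hyperparallelepiped tiles). For a direct reference like
# A[i][j], a tile's data footprint is exactly the tile itself, so aligning
# the data partition with the loop partition minimizes communication.

def tile_iteration_space(n, m, pr, pc):
    """Split an n x m iteration space across a pr x pc processor grid."""
    tiles = []
    ti, tj = n // pr, m // pc           # tile side lengths (assumes divisibility)
    for p in range(pr):
        for q in range(pc):
            tiles.append(((p * ti, (p + 1) * ti), (q * tj, (q + 1) * tj)))
    return tiles

def footprint_size(tile):
    # Number of array elements A[i][j] touched by this tile.
    (i0, i1), (j0, j1) = tile
    return (i1 - i0) * (j1 - j0)

tiles = tile_iteration_space(64, 64, 4, 4)
print(len(tiles))                # 16 tiles, one per processor
print(footprint_size(tiles[0]))  # 16 * 16 = 256 elements each
```

With affine but non-identity index expressions (e.g. A[i+j][j]), the footprint is a sheared region rather than the tile itself, which is where the paper's lattice-theoretic footprint computation comes in.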


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1993

Integrating message-passing and shared-memory: early experience

David A. Kranz; Beng-Hong Lim; Kirk L. Johnson; John Kubiatowicz; Anant Agarwal

This paper discusses some of the issues involved in implementing a shared-address space programming model on large-scale, distributed-memory multiprocessors. While such a programming model can be implemented on both shared-memory and message-passing architectures, we argue that the transparent, coherent caching of global data provided by many shared-memory architectures is of crucial importance. Because message-passing mechanisms are much more efficient than shared-memory loads and stores for certain types of interprocessor communication and synchronization operations, however, we argue for building multiprocessors that efficiently support both shared-memory and message-passing mechanisms. We describe an architecture, Alewife, that integrates support for shared-memory and message-passing through a simple interface; we expect the compiler and runtime system to cooperate in using appropriate hardware mechanisms that are most efficient for specific operations. We report on both integrated and exclusively shared-memory implementations of our runtime system and two applications. The integrated runtime system drastically cuts down the cost of communication incurred by the scheduling, load balancing, and certain synchronization operations. We also present preliminary performance results comparing the two systems.


ACM SIGPLAN Conference on Programming Language Design and Implementation | 1989

Mul-T: a high-performance parallel Lisp

David A. Kranz; Robert H. Halstead; Eric Mohr

Mul-T is a parallel Lisp system, based on Multilisp's future construct, that has been developed to run on an Encore Multimax multiprocessor. Mul-T is an extended version of the Yale T system and uses the T system's ORBIT compiler to achieve “production quality” performance on stock hardware, about 100 times faster than Multilisp. Mul-T shows that futures can be implemented cheaply enough to be useful in a production-quality system. Mul-T is fully operational, including a user interface that supports managing groups of parallel tasks.


International Conference on Parallel Processing | 1993

Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors

Anant Agarwal; David A. Kranz; Venkat Natarajan

This paper presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions and multiple arrays has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for data and loop partitioning in distributed memory multiprocessors without caches. Like previous papers, our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use lattice theory to compute precisely the size of data footprints. We have implemented a subset of this framework in a compiler for the Alewife machine.


Languages and Compilers for Parallel Computing | 1996

Communication-Minimal Partitioning of Parallel Loops and Data Arrays for Cache-Coherent Distributed-Memory Multiprocessors

Rajeev Barua; David A. Kranz; Anant Agarwal

Harnessing the full performance potential of cache-coherent distributed shared memory multiprocessors without inordinate user effort requires a compilation technology that can automatically manage multiple levels of memory hierarchy. This paper describes a working compiler for such machines that automatically partitions loops and data arrays to optimize locality of access.


Proceedings of the US/Japan Workshop on Parallel Symbolic Computing: Languages, Systems, and Applications | 1992

MulTVision: A Tool for Visualizing Parallel Program Executions

Robert H. Halstead; David A. Kranz; Patrick G. Sobalvarro

MulTVision is a visualization tool that supports both performance measurement and debugging by helping a programmer see what happens during a specific, traced execution of a program. MulTVision has two components: a debug monitor and a replay engine. A traced execution yields a log as a by-product; both the debug monitor and the replay engine use this log as input. The debug monitor produces a graphical display showing the relationships between tasks in the traced execution. Using this display, a programmer can see bottlenecks or other causes of poor performance. The replay engine can be used to reproduce internal program states that existed during the traced execution. The replay engine uses a novel log protocol, the side-effect touch protocol, oriented toward programs that are mostly functional (have few side effects). Measurements show that the tracing overhead added to mostly functional programs is generally less than the overhead already incurred for task management and touch operations. While currently limited to program executions that create at most tens of thousands of tasks, MulTVision is already useful for an interesting class of programs.

Collaboration


Dive into David A. Kranz's collaborations.

Top Co-Authors

Anant Agarwal, Massachusetts Institute of Technology
Robert H. Halstead, Massachusetts Institute of Technology
Stephen A. Ward, Massachusetts Institute of Technology
Beng-Hong Lim, Massachusetts Institute of Technology
Christopher J. Terman, Massachusetts Institute of Technology
David Chaiken, Massachusetts Institute of Technology
Kirk L. Johnson, Massachusetts Institute of Technology
Daniel Nussbaum, Massachusetts Institute of Technology