Publications


Featured research published by David Callahan.


international conference on supercomputing | 1990

The Tera computer system

Robert L. Alverson; David Callahan; Daniel Cummings; Brian D. Koblenz; Allan Porterfield; Burton J. Smith

The Tera architecture was designed with several major goals in mind. First, it needed to be suitable for very high speed implementations, i.e., admit a short clock period and be scalable to many processors. This goal will be achieved; a maximum configuration of the first implementation of the architecture will have 256 processors, 512 memory units, 256 I/O cache units, 256 I/O processors, and 4096 interconnection network nodes and a clock period less than 3 nanoseconds. The abstract architecture is scalable essentially without limit (although a particular implementation is not, of course). The only requirement is that the number of instruction streams increase more rapidly than the number of physical processors. Although this means that speedup is sublinear in the number of instruction streams, it can still increase linearly with the number of physical processors. The price/performance ratio of the system is unmatched, and puts Tera's high performance within economic reach. Second, it was important that the architecture be applicable to a wide spectrum of problems. Programs that do not vectorize well, perhaps because of a preponderance of scalar operations or too-frequent conditional branches, will execute efficiently as long as there is sufficient parallelism to keep the processors busy. Virtually any parallelism available in the total computational workload can be turned into speed, from operation-level parallelism within program basic blocks to multiuser time- and space-sharing. The architecture …


ieee international conference on high performance computing data and analytics | 2007

Parallel Programmability and the Chapel Language

Bradford L. Chamberlain; David Callahan; Hans P. Zima

In this paper we consider productivity challenges for parallel programmers and explore ways that parallel language design might help improve end-user productivity. We offer a candidate list of desirable qualities for a parallel programming language, and describe how these qualities are addressed in the design of the Chapel language. In doing so, we provide an overview of Chapel's features and how they help address parallel productivity. We also survey current techniques for parallel programming and describe ways in which we consider them to fall short of our idealized productive programming model.


architectural support for programming languages and operating systems | 1991

Software prefetching

David Callahan; Ken Kennedy; Allan Porterfield

We present an approach, called software prefetching, to reducing cache miss latencies. By providing a nonblocking prefetch instruction that causes data at a specified memory address to be brought into cache, the compiler can overlap the memory latency with other computation. Our simulations show that, even when generated by a very simple compiler algorithm, prefetch instructions can eliminate nearly all cache misses, while causing only modest increases in data traffic between memory and cache.
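The paper's technique is compiler-driven; purely as a hand-written illustration of the idea, the C fragment below uses GCC/Clang's __builtin_prefetch with an assumed prefetch distance of 16 elements (a made-up tuning parameter, not a value from the paper).

```c
#include <stddef.h>

/* Illustration only: a compiler applying software prefetching would emit
 * nonblocking prefetch instructions ahead of the loads that need them.
 * Here the prefetch is written by hand; the 16-element distance is an
 * arbitrary assumption for the example. */
double sum_with_prefetch(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/3);
        sum += a[i];   /* memory latency overlapped with the running sum */
    }
    return sum;
}
```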


programming language design and implementation | 1990

Improving register allocation for subscripted variables

David Callahan; Steve Carr; Ken Kennedy

Most conventional compilers fail to allocate array elements to registers because standard data-flow analysis treats arrays like scalars, making it impossible to analyze the definitions and uses of individual array elements. This deficiency is particularly troublesome for floating-point registers, which are most often used as temporary repositories for subscripted variables. In this paper, we present a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables. The objective is to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers. In addition, we present transformations to improve the overall effectiveness of scalar replacement and show how these transformations can be applied in a variety of loop nest types. Finally, we present experimental results showing that these techniques are extremely effective: capable of achieving integer-factor speedups over code generated by good optimizing compilers of conventional design.
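As a hedged sketch of the transformation on a simple recurrence (an illustration of the idea, not the paper's algorithm):

```c
/* Before scalar replacement: a[i - 1] is reloaded from memory even though
 * it was just defined on the previous iteration. */
void smooth(double *a, const double *b, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* After scalar replacement: the reused element lives in a scalar, which a
 * coloring-based register allocator can keep in a register. */
void smooth_sr(double *a, const double *b, int n) {
    double t = a[0];                 /* holds a[i - 1] across iterations */
    for (int i = 1; i < n; i++) {
        t = t + b[i];
        a[i] = t;
    }
}
```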


The Journal of Supercomputing | 1988

Compiling programs for distributed-memory multiprocessors

David Callahan; Ken Kennedy

We describe a new approach to programming distributed-memory computers. Rather than having each node in the system explicitly programmed, we derive an efficient message-passing program from a sequential shared-memory program annotated with directions on how elements of shared arrays are distributed to processors. This article describes one possible input language for describing distributions and then details the compilation process and the optimization necessary to generate an efficient program.
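As a loose illustration of what such a compiler must reason about, the fragment below maps a global index of a BLOCK-distributed array to its owning processor and local index; the distribution and names are assumptions for the example, not the paper's input language.

```c
/* Illustration only: with a BLOCK distribution of an n-element array over
 * p processors, the compiler derives, for each global index, which
 * processor owns it and what the local index is (the basis for the
 * "owner computes" message-passing code it generates). */
typedef struct { int owner; int local; } placement;

static placement block_place(int global, int n, int p) {
    int block = (n + p - 1) / p;     /* ceiling(n / p) elements per processor */
    placement loc;
    loc.owner = global / block;
    loc.local = global % block;
    return loc;
}
```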


programming language design and implementation | 1991

Register allocation via hierarchical graph coloring

David Callahan; Brian D. Koblenz

We present a graph coloring register allocator designed to minimize the number of dynamic memory references. We cover the program with sets of blocks called tiles and group these tiles into a tree reflecting the program's hierarchical control structure. Registers are allocated for each tile using standard graph coloring techniques and the local allocation and conflict information is passed around the tree in a two-phase algorithm. This results in an allocation of registers that is sensitive to local usage patterns while retaining a global perspective. Spill code is placed in less frequently executed portions of the program and the choice of variables to spill is based on usage patterns between the spills and the reloads rather than usage patterns over the entire program.


compiler construction | 1986

Interprocedural constant propagation

David Callahan; Keith D. Cooper; Ken Kennedy; Linda Torczon

In a compiling system that attempts to improve code for a whole program by optimizing across procedures, the compiler can generate better code for a specific procedure if it knows which variables will have constant values, and what those values will be, when the procedure is invoked. This paper presents a general algorithm for determining for each procedure in a given program the set of inputs that will have known constant values at run time. The precision of the answers provided by this method is dependent on the precision of the local analysis of individual procedures in the program. Since the algorithm is intended for use in a sophisticated software development environment in which local analysis would be provided by the source editor, the quality of the answers will depend on the amount of work the editor performs. Several reasonable strategies for local analysis with different levels of complexity and precision are suggested and the results of a prototype implementation in a vectorizing Fortran compiler are presented.
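A toy illustration of the payoff (not taken from the paper): when every call site passes the same constant, the callee can be compiled with that value folded in.

```c
/* Illustration only: interprocedural constant propagation would discover
 * that 'stride' is the constant 1 at every call site, so the callee can
 * be compiled with that value folded in (enabling, e.g., vectorization). */
void scale(double *x, int n, int stride) {
    for (int i = 0; i < n; i += stride)
        x[i] *= 2.0;
}

void caller(double *a, double *b, int n) {
    scale(a, n, 1);   /* stride is 1 here ... */
    scale(b, n, 1);   /* ... and here, so it is a known constant in scale */
}
```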


symposium on principles of programming languages | 1987

Automatic decomposition of scientific programs for parallel execution

Randy Allen; David Callahan; Ken Kennedy

An algorithm for transforming sequential programs into equivalent parallel programs is presented. The method concentrates on finding loops whose separate iterations can be run in parallel without synchronization. Although a simple version of the method can be shown to be optimal, the problem of generating optimal code when loop interchange is employed is shown to be intractable. These methods are implemented in an experimental translation system developed at Rice University.
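As a small illustration of the property the method tests for, the sketch below contrasts a loop with independent iterations against one with a loop-carried dependence; the OpenMP pragma merely marks the parallel result and is not the translation system described in the paper.

```c
/* Illustration only: the first loop carries no cross-iteration dependence,
 * so its iterations can run in parallel without synchronization; the second
 * has a loop-carried dependence (a[i] depends on a[i - 1]) and cannot. */
void example(double *a, const double *b, const double *c, int n) {
    #pragma omp parallel for          /* independent iterations */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    for (int i = 1; i < n; i++)       /* sequential: recurrence on a */
        a[i] = a[i - 1] + b[i];
}
```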


international conference on supercomputing | 2014

Analysis of interprocedural side effects in a parallel programming environment

David Callahan; Ken Kennedy

This paper addresses the analysis of subroutine side effects in the ParaScope programming environment, an ambitious collection of tools for developing, understanding, and compiling parallel programs. In spite of significant progress in the optimization of programs for execution on parallel and vector computers, compilers must still be very conservative when optimizing the code surrounding a call site, due to the lack of information about the code in the subroutine being invoked. This has resulted in the development of algorithms for interprocedural analysis of the side effects of a subroutine, which summarize the body of a subroutine, producing approximate information to improve optimization. This paper reviews the effectiveness of these methods in preparing programs for execution on parallel computers. It is shown that existing techniques are insufficient, and a new technique, called regular section analysis, is described.
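A hedged illustration of what a regular section summary enables (the routine and bounds are assumptions made up for the example, not taken from the paper):

```c
/* Illustration only: if interprocedural analysis summarizes that
 * update_half modifies exactly the regular section a[lo .. hi-1], the
 * caller can see that the two calls write disjoint sections and may run
 * them in parallel; a scalar "may modify a" summary would force it to be
 * conservative. */
void update_half(double *a, int lo, int hi) {
    for (int i = lo; i < hi; i++)
        a[i] += 1.0;
}

void caller(double *a, int n) {
    update_half(a, 0, n / 2);      /* writes a[0 .. n/2 - 1] */
    update_half(a, n / 2, n);      /* writes a[n/2 .. n - 1] */
}
```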


programming language design and implementation | 1988

The program summary graph and flow-sensitive interprocedural data flow analysis

David Callahan

This paper discusses a method for interprocedural data flow analysis which is powerful enough to express flow-sensitive problems but fast enough to apply to very large programs. While such information could be applied toward standard program optimizations, the research described here is directed toward software tools for parallel programming, in which it is crucial. Many of the recent "supercomputers" can be roughly characterized as shared memory multi-processors. These include top-of-the-line systems from Cray Research and IBM, as well as multi-processor computers developed and successfully marketed by many younger companies. Development of efficient, correct programs on these machines presents new challenges to the designers of compilers, debuggers, and programming environments. Powerful analysis mechanisms have been developed for understanding the structure of programs. One such mechanism, data dependence analysis, has been evolving for many years. The product of data dependence analysis is a data dependence graph, a directed multi-graph that describes the interactions of program components through shared memory. Such a graph has been shown useful for a variety of applications from vectorization and parallelization to compiler management of locality. Another application of the data dependence graph is as an aid to static debugging of parallel programs. PTOOL [4] is a software system developed at Rice University to help programmers understand parallel programs. It is within this context that we at Rice have learned of the importance of interprocedural data flow analysis. I will briefly describe the PTOOL system and explain the kind of interprocedural information valuable in such an environment. PTOOL is designed to help locate interactions between …
