
Publication


Featured research published by Keith H. Randall.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1995

Cilk: an efficient multithreaded runtime system

Robert D. Blumofe; Christopher F. Joerg; Bradley C. Kuszmaul; Charles E. Leiserson; Keith H. Randall; Yuli Zhou

Cilk (pronounced “silk”) is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the “work” and “critical path” of a Cilk computation can be used to model performance accurately. Consequently, a Cilk programmer can focus on reducing the work and critical path of a computation, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal. The Cilk runtime system currently runs on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, and the MIT Phish network of workstations. Applications written in Cilk include protein folding, graphic rendering, backtrack search, and the *Socrates chess program, which won third prize in the 1994 ACM International Computer Chess Championship.
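The performance model the abstract refers to has a simple closed form: with total work T1 and critical-path length T∞, a work-stealing scheduler runs a fully strict computation on P processors in roughly T1/P + O(T∞) expected time. A minimal sketch of that bound as a back-of-the-envelope predictor (the function name and sample numbers are illustrative, not from the paper):

```python
def predicted_runtime(work, critical_path, processors):
    """Work/critical-path bound: T_P is about T_1/P + T_inf, where T_1 is
    the total number of operations and T_inf is the longest chain of
    dependent operations (the critical path)."""
    return work / processors + critical_path

# When the parallelism T_1/T_inf far exceeds P, speedup is nearly linear:
print(predicted_runtime(work=1_000_000, critical_path=1_000, processors=64))
# 16625.0
```

Reducing either the work or the critical path improves the bound, which is why the abstract says a programmer can focus on those two quantities alone.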


ACM Symposium on Parallel Algorithms and Architectures | 1996

An analysis of dag-consistent distributed shared-memory algorithms

Robert D. Blumofe; Matteo Frigo; Christopher F. Joerg; Charles E. Leiserson; Keith H. Randall

In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a work-stealing thread scheduler and the BACKER algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time TP(C) of a “fully strict” multithreaded computation on P processors, each with an LRU cache of C pages, is O(T1(C)/P + mCT∞), where T1(C) is the total work of the computation including page faults, T∞ is its critical-path length excluding page faults, and m is the minimum page transfer time. As a corollary to this theorem, we show that the expected number FP(C) of page faults incurred by a computation executed on P processors can be related to the number F1(C) of serial page faults by the formula FP(C) ≤ F1(C) + O(CPT∞). Finally, we give simple bounds on the number of page faults and the space requirements for “regular” divide-and-conquer algorithms. We use these bounds to analyze parallel multithreaded algorithms for matrix multiplication and LU-decomposition.


Programming Language Design and Implementation | 2002

Denali: a goal-directed superoptimizer

Rajeev Joshi; Greg Nelson; Keith H. Randall

This paper provides a preliminary report on a new research project that aims to construct a code generator that uses an automatic theorem prover to produce very high-quality (in fact, nearly mathematically optimal) machine code for modern architectures. The code generator is not intended for use in an ordinary compiler, but is intended to be used for inner loops and critical subroutines in those cases where peak performance is required, no available compiler generates adequately efficient code, and where current engineering practice is to use hand-coded machine language. The paper describes the design of the superoptimizer, and presents some encouraging preliminary results.
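Denali itself drives the search with an automatic theorem prover; as a point of contrast, the core idea of superoptimization can be illustrated by a brute-force search over short instruction sequences. This toy searcher, its operation names, and its test harness are illustrative only and are not Denali's method:

```python
import itertools

def apply_seq(seq, x):
    """Run a sequence of (name, function) operations on the input x."""
    for _name, op in seq:
        x = op(x)
    return x

def superoptimize(spec, ops, max_len, tests):
    """Toy brute-force superoptimizer: return the shortest op sequence
    that agrees with spec on every test input.  (Denali instead uses a
    theorem prover to guarantee equivalence on *all* inputs.)"""
    for n in range(1, max_len + 1):
        for seq in itertools.product(ops, repeat=n):
            if all(apply_seq(seq, t) == spec(t) for t in tests):
                return [name for name, _ in seq]
    return None

ops = [('shl1', lambda x: x << 1), ('inc', lambda x: x + 1)]
print(superoptimize(lambda x: 4 * x + 1, ops, max_len=4, tests=range(8)))
# ['shl1', 'shl1', 'inc']
```

The exhaustive search makes the cost of the naive approach obvious: it grows exponentially in sequence length, which is why Denali's goal-directed, prover-based pruning matters.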


ACM Symposium on Parallel Algorithms and Architectures | 1998

Detecting data races in Cilk programs that use locks

Guang-Ien Cheng; Mingdong Feng; Charles E. Leiserson; Keith H. Randall; Andrew F. Stark

When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a “data race” occurs, which is usually a bug. This paper describes the algorithms and strategies used by a debugging tool, called the Nondeterminator-2, which checks for data races in programs coded in the Cilk multithreaded language. Like its predecessor, the Nondeterminator, which checks for simple “determinacy” races, the Nondeterminator-2 is a debugging tool, not a verifier, since it checks for data races only in the computation generated by a serial execution of the program on a given input. We give an algorithm, ALL-SETS, that determines whether the computation generated by a serial execution of a Cilk program on a given input contains a race. For a program that runs serially in time T, accesses V shared memory locations, uses a total of n locks, and holds at most k ≤ n locks simultaneously, ALL-SETS runs in O(nkT α(V, V)) time and O(nkV) space, where α is Tarjan’s functional inverse of Ackermann’s function. Since ALL-SETS may be too inefficient in the worst case, we propose a much more efficient algorithm which can be used to detect races in programs that obey the “umbrella” locking discipline, a programming methodology that is more flexible than similar disciplines proposed in the literature. We present an algorithm, BRELLY, which detects violations of the umbrella discipline in O(kT α(V, V)) time using O(kV) space. We also prove that any “abelian” Cilk program, one whose critical sections commute, produces a determinate final state if it is deadlock free and if it generates any computation which is data-race free. Thus, the Nondeterminator-2’s two algorithms can verify the determinacy of a deadlock-free abelian program running on a given input.
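The lock-set half of this check can be sketched compactly: two accesses to the same location race if at least one is a write and their lock sets are disjoint. The sketch below tests only that condition over a recorded access trace; the real ALL-SETS algorithm additionally maintains the series/parallel relationship of threads (via Tarjan's disjoint-set structure, which is where the α term comes from) so that it flags only logically parallel accesses. All names and the sample trace here are illustrative:

```python
from collections import defaultdict

def find_lockset_races(accesses):
    """accesses: list of (location, kind, lock_set) with kind in {'r', 'w'}.
    Report each location where two accesses hold no lock in common and at
    least one of them is a write -- the lock-set condition checked by
    ALL-SETS (minus the series/parallel test the real algorithm performs)."""
    seen = defaultdict(list)            # location -> [(kind, lock_set)]
    races = []
    for loc, kind, locks in accesses:
        for prev_kind, prev_locks in seen[loc]:
            if (kind == 'w' or prev_kind == 'w') and not (locks & prev_locks):
                races.append(loc)
        seen[loc].append((kind, locks))
    return races

trace = [
    ('x', 'w', frozenset({'A'})),
    ('x', 'w', frozenset({'A'})),       # both hold lock A: no race
    ('y', 'w', frozenset({'A'})),
    ('y', 'w', frozenset({'B'})),       # disjoint lock sets: race on y
]
print(find_lockset_races(trace))        # ['y']
```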


International Conference on Parallel Processing | 1996

DAG-consistent distributed shared memory

Robert D. Blumofe; Matteo Frigo; Christopher F. Joerg; Charles E. Leiserson; Keith H. Randall

Introduces DAG (directed acyclic graph) consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming. We have implemented DAG consistency in software for the Cilk multithreaded runtime system running on a CM5 Connection Machine. Our implementation includes a DAG-consistent distributed cactus stack for storage allocation. We provide empirical evidence of the flexibility and efficiency of DAG consistency for applications that include blocked matrix multiplication, Strassen's (1969) matrix multiplication algorithm, and a Barnes-Hut code. Although Cilk schedules the executions of these programs dynamically, their performances are competitive with statically scheduled implementations in the literature. We also prove that the number F_P of page faults incurred by a user program running on P processors can be related to the number F_1 of page faults running serially by the formula F_P ≤ F_1 + 2Cs, where C is the cache size and s is the number of thread migrations executed by Cilk's scheduler.
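The closing fault bound is easy to apply: beyond the faults a serial run incurs, the analysis charges at most 2C extra faults to each of the s thread migrations on a C-page cache. A one-function sketch with illustrative numbers (not measurements from the paper):

```python
def max_parallel_faults(serial_faults, cache_pages, migrations):
    """Upper bound F_P <= F_1 + 2*C*s from the DAG-consistency analysis:
    the parallel execution incurs at most 2C additional page faults per
    thread migration beyond the F_1 faults of a serial run."""
    return serial_faults + 2 * cache_pages * migrations

print(max_parallel_faults(serial_faults=10_000, cache_pages=256, migrations=8))
# 14096
```

Because work stealing keeps the number of migrations s small relative to the work, the parallel fault count stays close to the serial one.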


Programming Language Design and Implementation | 2000

Field analysis: getting useful and low-cost interprocedural information

Sanjay Ghemawat; Keith H. Randall; Daniel J. Scales

We present a new limited form of interprocedural analysis called field analysis that can be used by a compiler to reduce the costs of modern language features such as object-oriented programming, automatic memory management, and run-time checks required for type safety. Unlike many previous interprocedural analyses, our analysis is cheap, and does not require access to the entire program. Field analysis exploits the declared access restrictions placed on fields in a modular language (e.g. field access modifiers in Java) in order to determine useful properties of fields of an object. We describe our implementation of field analysis in the Swift optimizing compiler for Java, as well as a set of optimizations that exploit the results of field analysis. These optimizations include removal of run-time tests, compile-time resolution of method calls, object inlining, removal of unnecessary synchronization, and stack allocation. Our results demonstrate that field analysis is efficient and effective. Speedups average 7% on a wide range of applications, with some run times reduced by up to 27%. The compile-time overhead of field analysis is about 10%.
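The kind of fact field analysis derives can be illustrated with a toy source-level version: a field assigned only in the constructor keeps whatever property the constructor establishes (exact type, non-nullness) at every use site, which is what enables optimizations such as run-time check removal and method-call resolution. This AST sketch is purely illustrative; the Swift compiler works on Java bytecode and access modifiers, not on source text:

```python
import ast

def fields_assigned_outside_init(src):
    """Toy field analysis over a class's source string: return the set of
    'self.x' fields assigned in methods other than __init__.  Any field
    NOT in this set retains the property __init__ establishes for it."""
    out = set()
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.FunctionDef) and node.name != '__init__':
            for sub in ast.walk(node):
                if isinstance(sub, ast.Assign):
                    for t in sub.targets:
                        if (isinstance(t, ast.Attribute)
                                and isinstance(t.value, ast.Name)
                                and t.value.id == 'self'):
                            out.add(t.attr)
    return out

src = '''
class Point:
    def __init__(self):
        self.x = 0
        self.hist = []
    def move(self, dx):
        self.x = self.x + dx
'''
print(fields_assigned_outside_init(src))   # {'x'}
```

Here `hist` is never written outside `__init__`, so a compiler could assume it always refers to the exact list allocated there, while `x` gets no such guarantee.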


Design Automation Conference | 1993

TIM: A Timing Package for Two-Phase, Level-Clocked Circuitry

Marios C. Papaefthymiou; Keith H. Randall

TIM is a versatile and efficient tool for verifying and optimizing the timing of two-phase, level-clocked circuitry. TIM performs a variety of functions, such as timing verification, clock tuning, retiming for maximum speed of operation, retiming for minimum number of latches, and sensitivity analysis. In this paper, we present new polynomial-time optimization algorithms for retiming and sensitivity analysis, and we describe the implementation of the new and previously reported algorithms in TIM. We also present empirical results from the application of TIM to sequential circuitry obtained from academic and industrial sources. Our experiments show that the number of latches in edge-triggered designs which have been retimed for maximum performance can be substantially reduced in corresponding two-phase, level-clocked designs that operate at the same speed.


Programming Language Design and Implementation | 2001

Related field analysis

Aneesh Aggarwal; Keith H. Randall

We present an extension of field analysis (see [4]) called related field analysis, which is a general technique for proving relationships between two or more fields of an object. We demonstrate the feasibility and applicability of related field analysis by applying it to the problem of removing array bounds checks. For array bounds check removal, we define a pair of related fields to be an integer field and an array field for which the integer field has a known relationship to the length of the array. This related field information can then be used to remove array bounds checks from accesses to the array field. Our results show that related field analysis can remove an average of 50% of the dynamic array bounds checks on a wide range of applications. We describe the implementation of related field analysis in the Swift optimizing compiler for Java, as well as the optimizations that exploit the results of related field analysis.
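The array-bounds case can be made concrete with a hypothetical container in which an integer field is provably bounded by the length of an array field. The class and its names below are illustrative only:

```python
class IntStack:
    """Illustrative pattern for related field analysis: the invariant
    0 <= self.count <= len(self.elems) holds at every program point, so a
    compiler that proves it can drop bounds checks on elems[count]."""

    def __init__(self, capacity):
        self.elems = [0] * capacity
        self.count = 0

    def push(self, x):
        if self.count == len(self.elems):       # grow before the write
            self.elems = self.elems + [0] * len(self.elems)
        # Here count < len(elems) is guaranteed, so once the invariant is
        # proven, the indexed store below needs no run-time bounds check.
        self.elems[self.count] = x
        self.count += 1

s = IntStack(2)
for v in (1, 2, 3):
    s.push(v)
print(s.count, len(s.elems))   # 3 4
```

Once the `count <= len(elems)` relationship is established, every access guarded by `i < count` can skip its bounds check, which is the kind of dynamic-check removal the abstract measures.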


ACM Symposium on Parallel Algorithms and Architectures | 1995

Parallel algorithms for the circuit value update problem

Charles E. Leiserson; Keith H. Randall

The circuit value update problem is the problem of updating values in a representation of a combinational circuit when some of the inputs are changed. We assume for simplicity that each combinational element has bounded fan-in and fan-out and can be evaluated in constant time. This problem is easily solved on an ordinary serial computer in O(W+D) time, where W is the number of elements in the altered subcircuit and D is the subcircuit's embedded depth (its depth measured in the original circuit).
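The serial version can be sketched as a change-driven propagation: starting from the altered inputs, re-evaluate a gate only when one of its fan-ins actually changed, so only the altered subcircuit is touched. This sketch (gate names and representation are illustrative) uses a simple work queue; achieving the stated O(W+D) bound additionally requires processing gates in depth order so each gate is evaluated at most once, which the queue alone does not guarantee under reconvergent fan-out:

```python
from collections import deque

def update_circuit(gates, values, changed_inputs):
    """Serial circuit value update.  gates: name -> (func, input_names);
    values: name -> current value.  Propagates from the changed primary
    inputs, re-evaluating a gate only when a fan-in value changed."""
    fanout = {}
    for g, (_func, ins) in gates.items():
        for i in ins:
            fanout.setdefault(i, []).append(g)
    work = deque(changed_inputs)
    while work:
        node = work.popleft()
        for g in fanout.get(node, []):
            func, ins = gates[g]
            new = func(*(values[i] for i in ins))
            if new != values[g]:
                values[g] = new       # value changed: propagate further
                work.append(g)
    return values

gates = {
    'n1': (lambda a, b: a and b, ('x', 'y')),
    'n2': (lambda a: not a, ('n1',)),
}
vals = {'x': True, 'y': True, 'n1': True, 'n2': False}
vals['x'] = False                     # flip a primary input
update_circuit(gates, vals, ['x'])
print(vals['n1'], vals['n2'])         # False True
```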


Programming Language Design and Implementation | 1998

The implementation of the Cilk-5 multithreaded language

Matteo Frigo; Charles E. Leiserson; Keith H. Randall

Collaboration


Dive into Keith H. Randall's collaborations.

Top Co-Authors

Charles E. Leiserson, Massachusetts Institute of Technology
Christopher F. Joerg, Massachusetts Institute of Technology
Robert D. Blumofe, University of Texas at Austin
Andrew F. Stark, Massachusetts Institute of Technology
Bradley C. Kuszmaul, Massachusetts Institute of Technology