
Publications


Featured research published by Lawrence Rauchwerger.


IEEE Transactions on Parallel and Distributed Systems | 1999

The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization

Lawrence Rauchwerger; David A. Padua

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we advocate a novel framework for their identification: speculatively execute the loop as a doall and apply a fully parallel data dependence test to determine if it had any cross-iteration dependences; if the test fails, then the loop is reexecuted serially. Since, from our experience, a significant amount of the available parallelism in Fortran programs can be exploited by loops transformed through privatization and reduction parallelization, our methods can speculatively apply these transformations and then check their validity at run-time. Another important contribution of this paper is a novel method for reduction recognition which goes beyond syntactic pattern matching: it detects at run-time if the values stored in an array participate in a reduction operation, even if they are transferred through private variables and/or are affected by statically unpredictable control flow. We present experimental results on loops from the PERFECT Benchmarks, which substantiate our claim that these techniques can yield significant speedups which are often superior to those obtainable by inspector/executor methods.
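
To make the mechanism concrete, here is a minimal C++ sketch of the test's core idea; the names (IterShadow, lrpd_passes) and the simplifications are mine, not the paper's implementation, and reduction validation and last-value copy-out are ignored.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Per-iteration shadow bookkeeping: each speculative iteration records,
    // per array element, whether it wrote the element (w) and whether it
    // performed an "exposed" read (rx) -- a read not preceded by a write in
    // the same iteration, which privatization cannot hide.
    struct IterShadow {
        std::vector<std::uint8_t> w, rx;
        explicit IterShadow(std::size_t n) : w(n, 0), rx(n, 0) {}
        void on_read(std::size_t e)  { if (!w[e]) rx[e] = 1; }
        void on_write(std::size_t e) { w[e] = 1; }
    };

    // Post-pass: the speculative doall (with privatization) was valid iff
    // no element was written in one iteration and exposed-read in another.
    // Shown as a simple nested loop; the paper performs this test fully in
    // parallel over per-processor shadow structures.
    bool lrpd_passes(const std::vector<IterShadow>& s, std::size_t n) {
        for (std::size_t e = 0; e < n; ++e)
            for (std::size_t i = 0; i < s.size(); ++i)
                if (s[i].w[e])
                    for (std::size_t j = 0; j < s.size(); ++j)
                        if (j != i && s[j].rx[e])
                            return false;  // cross-iteration dependence
        return true;
    }

If lrpd_passes returns false, the speculative results are discarded and the loop is re-executed serially, as the framework above prescribes.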


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2005

A framework for adaptive algorithm selection in STAPL

Nathan Thomas; Gabriel Tanase; Olga Tkachyshyn; Jack Perdue; Nancy M. Amato; Lawrence Rauchwerger

Writing portable programs that perform well on multiple platforms or for varying input sizes and types can be very difficult because performance is often sensitive to the system architecture, the run-time environment, and input data characteristics. This is even more challenging on parallel and distributed systems due to the wide variety of system architectures. One way to address this problem is to adaptively select the best parallel algorithm for the current input data and system from a set of functionally equivalent algorithmic options. Toward this goal, we have developed a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL). Our framework uses machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at run-time. We apply a prototype implementation of our framework to two important parallel operations, sorting and matrix multiplication, on multiple platforms and show that the framework determines run-time tests that correctly select the best performing algorithm from among several competing algorithmic options in 86-100% of the cases studied, depending on the operation and the system.
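
As a toy illustration of such a run-time test (the thresholds and the presortedness feature are invented for this sketch, not taken from STAPL's learned models):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    enum class SortAlgo { Insertion, Merge, Quick };

    // A decision rule of the kind the framework derives from installation
    // benchmarks: pick among functionally equivalent sorts using cheap,
    // run-time-measurable features of the input.
    SortAlgo select_sort(std::size_t n, double presortedness) {
        if (n < 64)              return SortAlgo::Insertion; // tiny input
        if (presortedness > 0.9) return SortAlgo::Merge;     // nearly sorted
        return SortAlgo::Quick;                              // general case
    }

    void adaptive_sort(std::vector<int>& v, double presortedness) {
        switch (select_sort(v.size(), presortedness)) {
        case SortAlgo::Insertion:
            for (std::size_t i = 1; i < v.size(); ++i)
                for (std::size_t j = i; j > 0 && v[j] < v[j - 1]; --j)
                    std::swap(v[j], v[j - 1]);
            break;
        case SortAlgo::Merge: std::stable_sort(v.begin(), v.end()); break;
        case SortAlgo::Quick: std::sort(v.begin(), v.end());        break;
        }
    }

STAPL's actual framework learns the decision model (e.g., a decision tree) from benchmark data gathered at installation time rather than hard-coding thresholds.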


Languages and Compilers for Parallel Computing | 2001

STAPL: an adaptive, generic parallel C++ library

Ping An; Alin Jula; Silvius Rus; Steven Saunders; Timmie G. Smith; Gabriel Tanase; Nathan Thomas; Nancy M. Amato; Lawrence Rauchwerger

The Standard Template Adaptive Parallel Library (STAPL) is a parallel library designed as a superset of the ANSI C++ Standard Template Library (STL). It is sequentially consistent for functions with the same name, and executes on uni- or multi-processor systems that utilize shared or distributed memory. STAPL is implemented using simple parallel extensions of C++ that currently provide a SPMD model of parallelism, and supports nested parallelism. The library is intended to be general purpose, but emphasizes irregular programs to allow the exploitation of parallelism for applications which use dynamically linked data structures such as particle transport calculations, molecular dynamics, geometric modeling, and graph algorithms. STAPL provides several different algorithms for some library routines, and selects among them adaptively at runtime. STAPL can replace STL automatically by invoking a preprocessing translation phase. In the applications studied, the performance of translated code was within 5% of the results obtained using STAPL directly. STAPL also provides functionality to allow the user to further optimize the code and achieve additional performance gains. We present results obtained using STAPL for a molecular dynamics code and a particle transport code.
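
The drop-in flavor of the library can be illustrated by analogy with C++17's parallel algorithms (this is standard C++, not STAPL's API, which additionally targets distributed memory and adaptive algorithm selection):

    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> v(1'000'000);
        std::iota(v.rbegin(), v.rend(), 0);  // descending input
        std::sort(v.begin(), v.end());                      // sequential STL call
        std::sort(std::execution::par, v.begin(), v.end()); // parallel drop-in
    }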


International Parallel and Distributed Processing Symposium | 2002

The R-LRPD test: speculative parallelization of partially parallel loops

Francis H. Dang; Hao Yu; Lawrence Rauchwerger

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. In our previously proposed framework we have speculatively executed a loop as a doall, and applied a fully parallel data dependence test to determine if it had any cross-processor dependences; if the test failed, then the loop was re-executed serially. While this method exploits doall parallelism well, it can cause slowdowns for loops with even one cross-processor flow dependence because we have to re-execute sequentially. Moreover, the existing, partial parallelism of loops is not exploited. We now propose a generalization of our speculative doall parallelization technique, called the Recursive LRPD test, that can extract and exploit the maximum available parallelism of any loop and that limits potential slowdowns to the overhead of the runtime dependence test itself. We present the base algorithm and an analysis of the different heuristics for its practical application and a few experimental results on loops from Track, Spice, and FMA3D.
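
The control structure of the recursive scheme can be sketched as follows (the hook interface is hypothetical; run_speculative executes iterations [lo, hi) in parallel against a checkpoint and returns the earliest iteration involved in a detected dependence, or hi if the test passed):

    #include <cstddef>
    #include <functional>

    void r_lrpd(std::size_t lo, std::size_t hi,
                const std::function<std::size_t(std::size_t, std::size_t)>& run_speculative,
                const std::function<void(std::size_t, std::size_t)>& commit,
                const std::function<void(std::size_t)>& restore_from) {
        while (lo < hi) {
            std::size_t first_bad = run_speculative(lo, hi);
            commit(lo, first_bad);       // the prefix before the first detected
                                         // dependence executed correctly
            if (first_bad == hi) return; // the whole remaining range passed
            restore_from(first_bad);     // roll back only the failed suffix
            lo = first_bad;              // and re-apply the scheme to it
            // progress is guaranteed: the earliest uncommitted iteration reads
            // only committed state, so first_bad is always greater than lo
        }
    }

Because every pass commits at least one iteration, the worst case degenerates to serial execution plus the cost of the run-time test, which is why slowdowns are bounded by the test's overhead.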


High-Performance Computer Architecture | 1998

Hardware for speculative run-time parallelization in distributed shared-memory multiprocessors

Ye Zhang; Lawrence Rauchwerger; Josep Torrellas

Run-time parallelization is often the only way to execute the code in parallel when data dependence information is incomplete at compile time. This situation is common in many important applications. Unfortunately, known techniques for run-time parallelization are often computationally expensive or not general enough. To address this problem, we propose new hardware support for efficient run-time parallelization in distributed shared-memory (DSM) multiprocessors. The idea is to execute the code in parallel speculatively and use extensions to the cache coherence protocol hardware to detect any dependence violations. As soon as a dependence is detected, execution stops, the state is restored, and the code is re-executed serially. This scheme, which we apply to loops, allows iterations to execute and complete in potentially any order. This scheme requires hardware extensions to the cache coherence protocol and memory hierarchy of a DSM. It has low overhead. We present the algorithms and a hardware design of the scheme. Overall, the scheme delivers average loop speedups of 7.3 for 16 processors and is 50% faster than a related software-only method.
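
A toy software model of the violation check such protocol extensions perform (my simplification: one writer/reader id per word, whereas a real design would keep per-processor access bits in the caches and directory):

    #include <cstddef>
    #include <vector>

    struct SpecWord { int writer = -1; int reader = -1; };

    // Speculative access tracking per memory word. Returning true models the
    // hardware detecting a dependence violation, upon which execution stops,
    // the checkpointed state is restored, and the loop reruns serially.
    struct SpecState {
        std::vector<SpecWord> word;
        explicit SpecState(std::size_t n) : word(n) {}

        bool on_read(std::size_t w, int iter) {
            if (word[w].writer != -1 && word[w].writer != iter)
                return true;             // flow dependence across iterations
            word[w].reader = iter;
            return false;
        }
        bool on_write(std::size_t w, int iter) {
            if (word[w].writer != -1 && word[w].writer != iter)
                return true;             // output dependence
            if (word[w].reader != -1 && word[w].reader != iter)
                return true;             // anti dependence
            word[w].writer = iter;
            return false;
        }
    };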


International Conference on Supercomputing | 1995

Run-time methods for parallelizing partially parallel loops

Lawrence Rauchwerger; Nancy M. Amato; David A. Padua

In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop. In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop.
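
A compact sketch of the scheduling idea (written sequentially for clarity; the paper's inspector is itself fully parallel and synchronization-free): given, for each iteration, the earlier iterations it depends on, assign every iteration to a wavefront, then execute each wavefront as a doall.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // deps[i] lists earlier iterations that iteration i depends on.
    // Returns wavefronts: iterations in wf[k] are mutually independent.
    std::vector<std::vector<std::size_t>>
    schedule_wavefronts(const std::vector<std::vector<std::size_t>>& deps) {
        std::size_t n = deps.size();
        std::vector<std::size_t> level(n, 0);
        for (std::size_t i = 0; i < n; ++i)   // deps point backward, so a
            for (std::size_t d : deps[i])     // single forward pass works
                level[i] = std::max(level[i], level[d] + 1);
        std::size_t depth =
            n ? *std::max_element(level.begin(), level.end()) + 1 : 0;
        std::vector<std::vector<std::size_t>> wf(depth);
        for (std::size_t i = 0; i < n; ++i) wf[level[i]].push_back(i);
        return wf;  // the executor runs wf[0], wf[1], ... each in parallel
    }

The number of wavefronts equals the length of the longest dependence chain, which is the minimum number of synchronized parallel steps for the loop.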


IEEE Parallel & Distributed Technology: Systems & Applications | 1994

Automatic Detection of Parallelism: A grand challenge for high performance computing

William Blume; Rudolf Eigenmann; Jay Hoeflinger; David A. Padua; Paul M. Petersen; Lawrence Rauchwerger; Peng Tu

The limited ability of compilers to find the parallelism in programs is a significant barrier to the use of high-performance computers. However, a combination of static and runtime techniques can improve compilers to the extent that a significant group of scientific programs can be parallelized automatically.


International Conference on Supercomputing | 1994

The privatizing DOALL test: a run-time technique for DOALL loop identification and array privatization

Lawrence Rauchwerger; David A. Padua

Current parallelizing compilers cannot identify a significant fraction of fully parallel loops because they have complex or statically insufficiently defined access patterns. For this reason, we have developed the Privatizing DOALL test, a technique for identifying fully parallel loops at run-time and dynamically privatizing scalars and arrays. The test itself is fully parallel and can be applied to any loop, regardless of the structure of its data and/or control flow. The technique can be utilized in two modes: (i) the test is performed before executing the loop and indicates whether the loop can be executed as a DOALL; (ii) speculatively: the loop and the test are executed simultaneously, and it is determined later whether the loop was in fact parallel. The test can also be used for debugging parallel programs. We discuss how the test can be inserted automatically by the compiler and outline a cost/performance analysis that can be performed to decide when to use the test. Our conclusion is that the test should almost always be applied because, as we show, the expected speedup for fully parallel loops is significant, and the cost of a failed test (a loop that is not fully parallel) is minimal. We present some experimental results on loops from the PERFECT Benchmarks which confirm our conclusion that this test can lead to significant speedups.
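
The two usage modes translate into a simple driver (the hook names here are hypothetical, not the paper's interface):

    #include <functional>

    struct PDHooks {
        std::function<bool()> test_only;         // mode (i): inspect accesses first
        std::function<bool()> run_speculatively; // mode (ii): loop and test together;
                                                 // returns false if the test failed
        std::function<void()> run_parallel, run_serial, restore_checkpoint;
    };

    void execute(const PDHooks& h, bool speculate) {
        if (!speculate) {               // mode (i): pay for the test up front
            if (h.test_only()) h.run_parallel();
            else               h.run_serial();
        } else {                        // mode (ii): overlap the test with the loop
            if (!h.run_speculatively()) {
                h.restore_checkpoint(); // a failed test costs one rollback...
                h.run_serial();         // ...plus the serial re-execution
            }
        }
    }

This makes the cost/performance trade-off discussed above visible: speculation wins when the loop is usually parallel, because the test's cost is then hidden under useful work.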


Languages and Compilers for Parallel Computing | 1994

Polaris: Improving the Effectiveness of Parallelizing Compilers

William Blume; Rudolf Eigenmann; Keith A. Faigin; John R. Grout; Jay Hoeflinger; David A. Padua; Paul M. Petersen; William M. Pottenger; Lawrence Rauchwerger; Peng Tu; Stephen A. Weatherford

It is the goal of the Polaris project to develop a new parallelizing compiler that will overcome limitations of current compilers. While current parallelizing compilers may succeed on small kernels, they often fail to extract any meaningful parallelism from large applications. After a study of application codes, it was concluded that by adding a few new techniques to current compilers, automatic parallelization becomes possible. The techniques needed are interprocedural analysis, scalar and array privatization, symbolic dependence analysis, and advanced induction and reduction recognition and elimination, along with run-time techniques to allow data dependent behavior.
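
One of the listed techniques, induction variable recognition and elimination, is easy to show in miniature (illustrative C++, not Polaris output; Polaris itself operates on Fortran):

    #include <vector>

    // Before: the update of k is loop-carried, so the iterations look
    // serial to a dependence test.
    void before(std::vector<double>& a, const std::vector<double>& b, int n) {
        int k = 0;
        for (int i = 0; i < n; ++i) {
            k += 2;              // cross-iteration dependence on k
            a[k] = b[i];
        }
    }

    // After: k is replaced by its closed form, so every iteration is
    // independent and the loop can run as a doall.
    void after(std::vector<double>& a, const std::vector<double>& b, int n) {
        for (int i = 0; i < n; ++i) {
            int k = 2 * (i + 1);
            a[k] = b[i];
        }
    }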


International Conference on Supercomputing | 2000

Adaptive reduction parallelization techniques

Hao Yu; Lawrence Rauchwerger

In this paper, we propose to adapt parallelizing transformations, more specifically reduction parallelizations, to the actual reference pattern executed by a loop, i.e., to the particular input data and dynamic phase of a program. More precisely, we show how, after validating a reduction at run-time (when this is not possible at compile time), we can dynamically characterize its reference pattern and choose the most appropriate method for parallelizing it. For this purpose, we develop a library of parallel reduction algorithms, including both previously known and novel schemes, with algorithms specialized for different classes of access behavior. In particular, each algorithm in our library has identified strengths related to specific reference pattern characteristics, which are matched, at run-time, with measured characteristics of the actual reference pattern. The matching of algorithm to reference pattern is performed using a decision-tree based selection scheme. The contribution of this work consists of new optimizations for reduction parallelization and the introduction of a new approach to the optimization of irregular applications: characteristic-based customization.
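
A miniature version of the idea, with two reduction schemes and an invented selection threshold standing in for the decision tree (OpenMP is used for brevity; the actual library and its selection model are the paper's):

    #include <cstddef>
    #include <vector>
    #include <omp.h>

    // The irregular reduction hist[idx[i]] += val[i].

    // Scheme A: replicated private buffers, merged at the end. Strong when
    // the reduction array is small and heavily contended.
    void reduce_replicated(std::vector<double>& hist,
                           const std::vector<int>& idx,
                           const std::vector<double>& val) {
        int nt = omp_get_max_threads();
        std::vector<std::vector<double>> priv(
            nt, std::vector<double>(hist.size(), 0.0));
        #pragma omp parallel
        {
            auto& mine = priv[omp_get_thread_num()];
            #pragma omp for
            for (long i = 0; i < (long)idx.size(); ++i)
                mine[idx[i]] += val[i];
        }
        for (const auto& p : priv)
            for (std::size_t e = 0; e < hist.size(); ++e) hist[e] += p[e];
    }

    // Scheme B: direct atomic updates. Strong when the array is large and
    // the reference pattern sparse, so replication would waste memory and
    // merge time.
    void reduce_atomic(std::vector<double>& hist,
                       const std::vector<int>& idx,
                       const std::vector<double>& val) {
        #pragma omp parallel for
        for (long i = 0; i < (long)idx.size(); ++i) {
            #pragma omp atomic
            hist[idx[i]] += val[i];
        }
    }

    // Toy stand-in for the decision-tree selection: match a measured
    // characteristic of the reference pattern to an algorithm.
    void reduce_adaptive(std::vector<double>& hist,
                         const std::vector<int>& idx,
                         const std::vector<double>& val) {
        double density = (double)idx.size() / (double)hist.size();
        if (density > 8.0) reduce_replicated(hist, idx, val);
        else               reduce_atomic(hist, idx, val);
    }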
