Lucas Roh
Colorado State University
Publications
Featured research published by Lucas Roh.
Software - Practice and Experience | 1997
Christian H. Bischof; Lucas Roh; A. J. Mauer-Oats
In scientific computing, we often require the derivatives ∂f/∂x of a function f expressed as a program with respect to some input parameter(s) x, say. Automatic Differentiation (AD) techniques augment the program with derivative computation by applying the chain rule of calculus to elementary operations in an automated fashion. This article introduces ADIC (Automatic Differentiation of C), a new AD tool for ANSI-C programs. ADIC is currently the only tool for ANSI-C that employs a source-to-source program transformation approach; that is, it takes a C code and produces a new C code that computes the original results as well as the derivatives. We first present ADIC ‘by example’ to illustrate the functionality and ease of use of ADIC and then describe in detail the architecture of ADIC. ADIC incorporates a modular design that provides a foundation for both rapid prototyping of better AD algorithms and their sharing across AD tools for different languages. A component architecture called AIF (Automatic Differentiation Intermediate Form) separates core AD concepts from their language-specific implementation and allows the development of generic AD modules that can be reused directly in other AIF-based AD tools. The language-specific ADIC front-end and back-end canonicalize C programs to make them fit for semantic augmentation and manage, for example, the association of a program variable with its derivative object. We also report on applications of ADIC to a semiconductor device simulator, 3-D CFD grid generator, vehicle simulator, and neural network code.
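To make the source-to-source idea concrete, here is a minimal hand-written sketch of what forward-mode augmentation does for f(x) = x*sin(x): each active variable carries a derivative alongside its value, and every elementary operation gets its chain-rule counterpart. The struct and names are illustrative, not actual ADIC output.

```c
#include <math.h>
#include <stdio.h>

/* Original function: f(x) = x * sin(x). */
double f(double x) {
    return x * sin(x);
}

/* Hand-written analogue of a forward-mode transformation: each active
 * variable carries a value and a derivative, and every elementary
 * operation is augmented with its chain-rule counterpart. */
typedef struct { double val; double deriv; } active_t;

active_t f_ad(active_t x) {
    active_t s, r;
    s.val   = sin(x.val);
    s.deriv = cos(x.val) * x.deriv;               /* d/dx sin(x) = cos(x) */
    r.val   = x.val * s.val;
    r.deriv = x.deriv * s.val + x.val * s.deriv;  /* product rule */
    return r;
}

int main(void) {
    active_t x = { 2.0, 1.0 };                    /* seed dx/dx = 1 */
    active_t y = f_ad(x);
    printf("f(2) = %g, f'(2) = %g\n", y.val, y.deriv);
    return 0;
}
```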
international symposium on microarchitecture | 1993
A. P. W. Böhm; Walid A. Najjar; Bhanu Shankar; Lucas Roh
Presents a model of coarse-grain dataflow execution. The authors present one top-down and two bottom-up methods for generating multithreaded code, and evaluate their effectiveness. The bottom-up techniques start from a fine-grain dataflow graph and coalesce it into coarse-grain clusters. The top-down technique generates clusters directly from the intermediate data dependence graph used for compiler optimizations. The authors discuss the relevant phases in the compilation process. They compare the effectiveness of the strategies by measuring the total number of clusters executed, the total number of instructions executed, cluster size, and number of matches per cluster. It turns out that the top-down method generates more efficient code and larger clusters. However, the number of matches per cluster is also larger for the top-down method, which could incur higher cluster synchronization costs.
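The bottom-up coalescing idea can be sketched in a few lines: in a toy encoding of a fine-grain dataflow graph, a node joins its producer's cluster whenever the two are connected one-to-one, which is the essence of a basic-block-style method; everything else starts a new cluster. The data layout below is hypothetical, not the authors' compiler.

```c
#include <stdio.h>

#define NNODES 6

typedef struct {
    int producer;   /* one representative predecessor, -1 if none */
    int fan_in;     /* number of inputs to this node              */
    int fan_out;    /* number of consumers of this node's token   */
    int cluster;    /* assigned cluster id                        */
} node_t;

int main(void) {
    /* A small graph in topological order:
     * a linear chain 0 -> 1 -> 2, plus 3 -> 5 and 4 -> 5. */
    node_t g[NNODES] = {
        { -1, 0, 1, -1 }, { 0, 1, 1, -1 }, { 1, 1, 0, -1 },
        { -1, 0, 1, -1 }, { -1, 0, 1, -1 }, { 3, 2, 0, -1 },
    };
    int nclusters = 0;

    for (int i = 0; i < NNODES; i++) {
        int p = g[i].producer;
        if (p >= 0 && g[i].fan_in == 1 && g[p].fan_out == 1)
            g[i].cluster = g[p].cluster;   /* extend producer's cluster */
        else
            g[i].cluster = nclusters++;    /* start a new cluster      */
    }
    for (int i = 0; i < NNODES; i++)
        printf("node %d -> cluster %d\n", i, g[i].cluster);
    return 0;
}
```

The chain 0 -> 1 -> 2 lands in one cluster, so two matching operations disappear; nodes with multiple inputs or consumers remain cluster boundaries.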
international symposium on symbolic and algebraic computation | 1997
Jason Michael Abate; Christian H. Bischof; Lucas Roh; Alan Carle
This article describes approaches to computing second-order derivatives with automatic differentiation (AD) based on the forward mode and the propagation of univariate Taylor series. Performance results are given that show the speedup possible with these techniques relative to existing approaches. The authors also describe a new source transformation AD module for computing second-order derivatives of C and Fortran codes and the underlying infrastructure used to create a language-independent translation tool.
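A minimal sketch of the underlying Taylor technique, assuming the standard truncated-series rules: seeding x(t) = x + t and propagating degree-2 coefficients through each operation yields f, f', and f'' in one forward pass, since f''(x) = 2*y2. The code is illustrative, not the authors' tool.

```c
#include <stdio.h>

typedef struct { double c[3]; } taylor2;  /* series truncated at degree 2 */

/* Convolution rule for multiplying two truncated Taylor series. */
taylor2 t2_mul(taylor2 a, taylor2 b) {
    taylor2 r;
    r.c[0] = a.c[0]*b.c[0];
    r.c[1] = a.c[0]*b.c[1] + a.c[1]*b.c[0];
    r.c[2] = a.c[0]*b.c[2] + a.c[1]*b.c[1] + a.c[2]*b.c[0];
    return r;
}

int main(void) {
    double x0 = 3.0;
    taylor2 x = {{ x0, 1.0, 0.0 }};        /* seed x(t) = x0 + t */
    taylor2 y = t2_mul(t2_mul(x, x), x);   /* y = x^3            */
    printf("f(x)   = %g\n", y.c[0]);       /* x^3  = 27          */
    printf("f'(x)  = %g\n", y.c[1]);       /* 3x^2 = 27          */
    printf("f''(x) = %g\n", 2.0 * y.c[2]); /* 6x   = 18          */
    return 0;
}
```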
international conference on functional programming | 1993
Lucas Roh; Walid A. Najjar; A. P. Wim Böhm
Multithreaded or hybrid von Neumann/dataflow execution models have an advantage over the fine-grain dataflow model in that they significantly reduce the run-time overhead incurred by matching. In this paper, we look at two issues related to the evaluation of a coarse-grain dataflow model of execution. The first issue concerns the compilation of coarse-grain code from fine-grain code. In this study, the concept of coarse-grain code is captured by clusters, which can be thought of as mini-dataflow graphs that execute strictly, deterministically, and without blocking. We look at two bottom-up algorithms, the basic block and the dependence sets methods, to partition dataflow graphs into clusters. The second issue is the actual performance of cluster-based execution as several architecture parameters are varied (e.g., number of processors, matching cost, network latency). From the extensive simulation data we evaluate (1) the potential speedup over fine-grain execution and (2) the effects of the various architecture parameters on the coarse-grain execution time, allowing us to draw conclusions on their effectiveness. The results indicate that even with a simple bottom-up algorithm for generating clusters, cluster execution offers a good speedup over fine-grain execution across a wide range of architectures. They also indicate that coarse-grain execution is scalable and tolerates network latency and high matching cost well; it can benefit from higher processor output bandwidth, and a simple superscalar processor with an issue rate of two is sufficient to exploit the internal parallelism of a cluster.
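The phrase "execute strictly, deterministically, and without blocking" can be illustrated with a small sketch: a cluster activation keeps one counter of outstanding inputs, each arriving token performs a single match, and the whole cluster body runs to completion once the counter reaches zero. Names and layout are illustrative.

```c
#include <stdio.h>

#define NINPUTS 3

typedef struct {
    double inputs[NINPUTS];
    int    missing;          /* tokens still outstanding */
} cluster_frame;

/* Run the whole cluster body strictly, with no internal blocking. */
static void fire(cluster_frame *f) {
    double r = f->inputs[0] + f->inputs[1] * f->inputs[2];
    printf("cluster fired: result = %g\n", r);
}

/* One match operation per arriving token instead of one per instruction. */
static void token_arrive(cluster_frame *f, int slot, double value) {
    f->inputs[slot] = value;
    if (--f->missing == 0)
        fire(f);
}

int main(void) {
    cluster_frame f = { {0}, NINPUTS };
    token_arrive(&f, 0, 1.0);   /* nothing fires yet     */
    token_arrive(&f, 2, 4.0);   /* still one token short */
    token_arrive(&f, 1, 2.5);   /* last match: fire      */
    return 0;
}
```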
Proceedings of Workshop on Programming Models for Massively Parallel Computers | 1993
W. Bohm; Walid A. Najjar; Bhanu Shankar; Lucas Roh
Presents top-down and bottom-up methods for generating coarse-grain dataflow or multithreaded code, and evaluates their effectiveness. The top-down technique generates clusters directly from the intermediate data dependence graph used for compiler optimizations. Bottom-up techniques coalesce fine-grain dataflow code into clusters. We measure the resulting number of clusters executed, cluster size, and number of inputs per cluster for the Livermore and Purdue benchmarks. The top-down method executes fewer clusters and instructions, but incurs a higher number of matches per cluster, which underscores the need for efficient matching of more than two inputs per cluster.
International Journal of Parallel Programming | 1994
Walid A. Najjar; Lucas Roh; A. P. Wim Böhm
In this paper, we study several issues related to the medium grain dataflow model of execution. We present bottom-up compilation of medium grain clusters from a fine grain dataflow graph. We compare the basic block and the dependence sets algorithms that partition dataflow graphs into clusters. For an extensive set of benchmarks we assess the average number of instructions in a cluster and the reduction in matching operations compared with fine grain dataflow execution. We study the performance of medium grain dataflow when several architectural parameters, such as the number of processors, matching cost, and network latency, are varied. The results indicate that medium grain execution offers a good speedup over the fine grain model, that it is scalable, and that it tolerates network latency and high matching costs well. Medium grain execution can benefit from a higher output bandwidth of a processor and, finally, a simple superscalar processor with an issue rate of two is sufficient to exploit the internal parallelism of a cluster.
Journal of Parallel and Distributed Computing | 1996
Lucas Roh; Walid A. Najjar; A. P. Wim Böhm
The recent advent of multithreaded architectures holds many promises: the exploitation of intra-thread locality and the latency tolerance of multithreaded synchronization can result in more efficient processor utilization and higher scalability. The challenge for a code generation scheme is to make effective use of the underlying hardware by generating large threads with a large degree of internal locality, without limiting the program-level parallelism or increasing latency. Top-down code generation, where threads are created directly from the compiler's intermediate form, is effective at creating relatively large threads. However, having only a limited view of the code at any one time limits the quality of the threads generated. These top-down generated threads can therefore be optimized by global, bottom-up optimization techniques. In this paper, we introduce the Pebbles multithreaded model of computation and analyze a code generation scheme whereby top-down code generation is combined with bottom-up optimizations. We evaluate the effectiveness of this scheme in terms of overall performance and specific thread characteristics such as size, length, instruction-level parallelism, number of inputs, and synchronization costs.
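One of the thread characteristics mentioned, instruction-level parallelism, can be estimated as instruction count divided by the critical-path length of the thread's dependence graph. The sketch below computes this for a hypothetical six-instruction thread given in topological order; the encoding is illustrative.

```c
#include <stdio.h>

#define NINSTR  6
#define MAXPRED 2

int main(void) {
    /* Predecessors of each instruction (-1 = none), topological order:
     * 0 and 1 are independent, 2 needs both, 3 and 4 need 2, 5 needs 3 and 4. */
    int pred[NINSTR][MAXPRED] = {
        {-1,-1}, {-1,-1}, {0,1}, {2,-1}, {2,-1}, {3,4},
    };
    int depth[NINSTR];
    int critical = 0;

    for (int i = 0; i < NINSTR; i++) {
        depth[i] = 1;                       /* the instruction itself */
        for (int j = 0; j < MAXPRED; j++) {
            int p = pred[i][j];
            if (p >= 0 && depth[p] + 1 > depth[i])
                depth[i] = depth[p] + 1;    /* longest path to node i */
        }
        if (depth[i] > critical)
            critical = depth[i];
    }
    printf("ILP ~ %g (%d instructions, critical path %d)\n",
           (double)NINSTR / critical, NINSTR, critical);
    return 0;
}
```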
Journal of Parallel and Distributed Computing | 2001
Lucas Roh; Bhanu Shankar; A. P. Wim Böhm; Walid A. Najjar
Due to the large amount of potential parallelism, resource management is a critical issue in multithreaded execution. The challenge in code generation is to control the parallelism without reducing the machine's ability to exploit it. Controlled parallelism reduces idle time, communication, and delay caused by synchronization; at the same time, it increases the potential for exploiting program data structure locality. In this paper, we evaluate the performance of methods to control program parallelism and resource usage in the context of the fine-grain dataflow execution model. The methods are not in themselves new, but their performance analysis is. The two methods to control parallelism studied here are slicing and chunking. We present the methods and their compilation strategy and evaluate their effectiveness in terms of run time and matching store occupancy. Communication is categorized into memory, loop, call, and expression communication. Input and output message locality is measured. Two techniques to reduce communication are introduced: grouping allocates loop and function bodies to one processor, and bundling combines messages with the same sender and receiver into one. Their effects on the total communication volume are quantified.
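Under one common reading of the two methods (the paper's precise definitions may differ), chunking hands each loop activation a contiguous block of iterations, while slicing caps the number of simultaneously active iterations by running the loop in successive groups of K. The bounds arithmetic is sketched below with illustrative constants.

```c
#include <stdio.h>

#define N 10  /* loop trip count                            */
#define K 3   /* number of concurrent activations permitted */

int main(void) {
    /* Chunking: worker w runs the contiguous block [w*N/K, (w+1)*N/K). */
    for (int w = 0; w < K; w++) {
        printf("chunk %d:", w);
        for (int i = w * N / K; i < (w + 1) * N / K; i++)
            printf(" %d", i);
        printf("\n");
    }
    /* Slicing: at most K iterations are active at once; the loop runs
     * as successive slices of K iterations. */
    for (int base = 0; base < N; base += K) {
        printf("slice starting at %d:", base);
        for (int i = base; i < base + K && i < N; i++)
            printf(" %d", i);
        printf("\n");
    }
    return 0;
}
```

Either way, the matching store holds state for at most K activations at a time instead of all N, which is the resource-control effect the paper measures.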
Archive | 2006
Jooyong Kim; Jason Michael Abate; Lucas Roh
PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques | 1994
Lucas Roh; Walid A. Najjar; Bhanu Shankar; A. P. Wim Böhm