Rudolf Eigenmann
Purdue University
Publications
Featured research published by Rudolf Eigenmann.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2009
Seyong Lee; Seung-Jai Min; Rudolf Eigenmann
GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing. Although the new Compute Unified Device Architecture (CUDA) programming model from NVIDIA offers improved programmability for general computing, programming GPGPUs is still complex and error-prone. This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications. The goal of this translation is to further improve programmability and make existing OpenMP applications amenable to execution on GPGPUs. We identify several key transformation techniques that enable efficient GPU global memory access and thereby achieve high performance. Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial).
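To make the translation concrete, the sketch below pairs a simple OpenMP loop with a hand-written CUDA kernel of the kind such a translator produces. It illustrates the mapping only and is not output of the paper's compiler; the saxpy computation and launch configuration are choices made here for illustration.

    // Illustration only: an OpenMP loop and a hand-written CUDA equivalent of
    // the kind an OpenMP-to-CUDA translator generates; not the output of the
    // paper's compiler. Compile with: nvcc -Xcompiler -fopenmp saxpy.cu
    #include <stdio.h>
    #include <cuda_runtime.h>

    #define N (1 << 20)

    // Input form: a standard OpenMP work-sharing loop (host code).
    void saxpy_omp(float a, const float *x, float *y) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];
    }

    // Translated form: the loop body becomes a kernel, and the loop index is
    // reconstructed from block and thread coordinates. Consecutive threads
    // touch consecutive elements, the coalesced global-memory access pattern
    // that the paper's optimizations aim for.
    __global__ void saxpy_kernel(float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)                            // guard the last partial block
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        float *x, *y;
        cudaMallocManaged(&x, N * sizeof(float));
        cudaMallocManaged(&y, N * sizeof(float));
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy_kernel<<<(N + 255) / 256, 256>>>(2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);          // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }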
IEEE Computer | 1996
William Blume; Ramón Doallo; Rudolf Eigenmann; John R. Grout; Jay Hoeflinger; Thomas R. Lawrence
Parallel programming tools are limited, making effective parallel programming difficult and cumbersome. Compilers that translate conventional sequential programs into parallel form would liberate programmers from the complexities of explicit, machine-oriented parallel programming. The paper discusses parallel programming with Polaris, an experimental translator of conventional Fortran programs that targets machines such as the Cray T3D.
Proceedings of the IEEE | 1993
Utpal Banerjee; Rudolf Eigenmann; Alexandru Nicolau; David A. Padua
An overview of automatic program parallelization techniques is presented. It covers dependence analysis techniques, followed by a discussion of program transformations, including straight-line code parallelization, do-loop transformations, and parallelization of recursive routines. Several experimental studies on the effectiveness of parallelizing compilers are surveyed.
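As a concrete aside (not an example from the survey itself), the fragment below contrasts the two cases dependence analysis must distinguish: a loop whose iterations must run in order, and one that a parallelizing compiler may safely run in parallel.

    // Illustration only: two loops that dependence analysis classifies
    // differently.
    #include <stdio.h>

    #define N 16

    int main(void) {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (double)i; b[i] = 1.0; }

        // Loop-carried dependence: iteration i reads a[i-1], which iteration
        // i-1 wrote, so the iterations cannot simply execute concurrently.
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + b[i];

        // No cross-iteration dependence: each iteration touches distinct
        // elements, so the loop can be parallelized, e.g. by emitting a
        // parallel-loop directive around it.
        for (int i = 0; i < N; i++)
            c[i] = a[i] * b[i];

        printf("a[%d] = %.1f, c[%d] = %.1f\n", N - 1, a[N - 1], N - 1, c[N - 1]);
        return 0;
    }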
International Workshop on OpenMP | 2001
Vishal Aslot; Max J. Domeika; Rudolf Eigenmann; Greg Gaertner; Wesley B. Jones; Bodo K. Parady
We present a new benchmark suite for parallel computers. SPEComp targets mid-size parallel servers. It includes a number of science/engineering and data processing applications. Parallelism is expressed in the OpenMP API. The suite includes two data sets, Medium and Large, of approximately 1.6 and 4 GB in size. Our overview also describes the organization developing SPEComp, issues in creating OpenMP parallel benchmarks, the benchmarking methodology underlying SPEComp, and basic performance characteristics.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2010
Seyong Lee; Rudolf Eigenmann
General-Purpose Graphics Processing Units (GPGPUs) are promising parallel platforms for high performance computing. The CUDA (Compute Unified Device Architecture) programming model provides improved programmability for general computing on GPGPUs. However, its unique execution model and memory model still pose significant challenges for developers of efficient GPGPU code. This paper proposes a new programming interface, called OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations. We have developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC. In addition to a range of compiler transformations and optimizations, the system includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants. Our results demonstrate that OpenMPC offers both programmability and tunability. Our system achieves 88% of the performance of the hand-coded CUDA programs.
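The fragment below is a hypothetical illustration of that idea, not actual OpenMPC syntax: the programmer keeps a standard OpenMP region, while CUDA-specific knobs such as the thread-block size are exposed to the compiler and tuner through additional annotations (the gpu_tune directive shown is invented for this sketch).

    // Hypothetical illustration only: the gpu_tune directive is an invented
    // stand-in, not real OpenMPC syntax. It sketches how CUDA-level controls
    // can be layered on an otherwise unchanged OpenMP region.
    #include <stdio.h>

    #define N 1024

    int main(void) {
        static float a[N], b[N];
        for (int i = 0; i < N; i++) b[i] = (float)i;

        // Hypothetical tuning hint a translator/tuner could consume when
        // generating and tuning CUDA variants of the region below:
        //   #pragma gpu_tune threadblock(256) readonly(b)
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0f * b[i];

        printf("a[10] = %.1f\n", a[10]);
        return 0;
    }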
IEEE Computer | 2009
Chirag Dave; Hansang Bae; Seung-Jai Min; Seyong Lee; Rudolf Eigenmann; Samuel P. Midkiff
The Cetus tool provides an infrastructure for research on multicore compiler optimizations that emphasizes automatic parallelization. The compiler infrastructure, which targets C programs, supports source-to-source transformations, is user-oriented and easy to use, and provides the most important parallelization passes as well as the underlying enabling techniques.
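As one illustration of an enabling technique behind such parallelization passes (illustrative only, not literal Cetus output), scalar privatization lets a loop with a per-iteration temporary run in parallel:

    // Illustration only, not Cetus output: the temporary t is written and
    // then read within each iteration, so giving every thread its own copy
    // (privatization) removes the only obstacle to parallel execution.
    #include <stdio.h>

    #define N 1000

    int main(void) {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) a[i] = (double)i;

        double t;
        #pragma omp parallel for private(t)
        for (int i = 0; i < N; i++) {
            t = a[i] * a[i];
            b[i] = t + 1.0;
        }

        printf("b[3] = %.1f\n", b[3]);
        return 0;
    }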
IEEE Transactions on Parallel and Distributed Systems | 1992
William Blume; Rudolf Eigenmann
The speedups of the Perfect Benchmarks codes that result from automatic parallelization are reported. The performance gains caused by individual restructuring techniques have also been measured. Specific reasons for the successes and failures of the transformations are discussed, and potential improvements that result in measurably better program performance are analyzed. The most important findings are that available restructurers often cause insignificant performance gains in real programs and that only a few restructuring techniques contribute to this gain. However, it can be shown that there is potential for advancing compiler technology so that many of the most important loops in these programs can be parallelized.
Symposium on Code Generation and Optimization | 2006
Zhelong Pan; Rudolf Eigenmann
Although compile-time optimizations generally improve program performance, degradations caused by individual techniques are to be expected. One promising research direction to overcome this problem is the development of dynamic, feedback-directed optimization orchestration algorithms, which automatically search for the combination of optimization techniques that achieves the best program performance. The challenge is to develop an orchestration algorithm that finds, in an exponential search space, a solution that is close to the best, in acceptable time. In this paper, we build such a fast and effective algorithm, called combined elimination (CE). The key advance of CE over existing techniques is that it takes the least tuning time (57% of the closest alternative), while achieving the same program performance. We conduct the experiments on both a Pentium IV machine and a SPARC II machine, by measuring performance of SPEC CPU2000 benchmarks under a large set of 38 GCC compiler options. Furthermore, through orchestrating a small set of optimizations causing the most degradation, we show that the performance achieved by CE is close to the upper bound obtained by an exhaustive search algorithm. The gap is less than 0.2% on average.
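The skeleton below sketches the iterative-elimination idea in simplified form. It is not the paper's exact CE algorithm; measure_runtime() is a stand-in for compiling the benchmark with the currently selected options and timing it, and the option effects are invented numbers.

    // Simplified sketch of feedback-directed option orchestration: start with
    // every option enabled, repeatedly measure the effect of disabling each
    // still-enabled option, and permanently drop the one that hurts the most,
    // until no enabled option hurts. Not the paper's exact CE algorithm.
    #include <stdio.h>
    #include <stdbool.h>

    #define NOPTS 5

    // Invented per-option effect on run time (negative = option helps).
    static const double effect[NOPTS] = { -0.10, +0.07, -0.03, +0.02, -0.05 };

    // Stand-in for "compile with these options and time the program".
    static double measure_runtime(const bool on[NOPTS]) {
        double t = 10.0;                      // invented base run time (s)
        for (int i = 0; i < NOPTS; i++)
            if (on[i]) t += t * effect[i];
        return t;
    }

    int main(void) {
        bool on[NOPTS];
        for (int i = 0; i < NOPTS; i++) on[i] = true;   // baseline: all on

        for (;;) {
            double base = measure_runtime(on);
            int worst = -1;
            double best_gain = 0.0;
            for (int i = 0; i < NOPTS; i++) {           // probe each option
                if (!on[i]) continue;
                on[i] = false;
                double gain = base - measure_runtime(on);
                on[i] = true;
                if (gain > best_gain) { best_gain = gain; worst = i; }
            }
            if (worst < 0) break;                       // nothing left to eliminate
            on[worst] = false;
            printf("disabled option %d (gain %.3f s)\n", worst, best_gain);
        }

        printf("final run time: %.3f s\n", measure_runtime(on));
        return 0;
    }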
Languages and Compilers for Parallel Computing | 2003
Sang Ik Lee; Troy A. Johnson; Rudolf Eigenmann
Cetus is a compiler infrastructure for the source-to-source transformation of programs. We created Cetus out of the need for a compiler research environment that facilitates the development of interprocedural analysis and parallelization techniques for C, C++, and Java programs. We will describe our rationale for creating a new compiler infrastructure and give an overview of the Cetus architecture. The design is intended to be extensible for multiple languages and will become more flexible as we incorporate feedback from any difficulties we encounter introducing other languages. We will characterize Cetus’ runtime behavior of parsing and IR generation in terms of execution time, memory usage, and parallel speedup of parsing, as well as motivate its usefulness through examples of projects that use Cetus. We will then compare these results with those of the Polaris Fortran translator.
Programming Language Design and Implementation | 2004
Troy A. Johnson; Rudolf Eigenmann; T. N. Vijaykumar
With billion-transistor chips on the horizon, single-chip multiprocessors (CMPs) are likely to become commodity components. Speculative CMPs use hardware to enforce dependence, allowing the compiler to improve performance by speculating on ambiguous dependences without absolute guarantees of independence. The compiler is responsible for decomposing a sequential program into speculatively parallel threads, while considering multiple performance overheads related to data dependence, load imbalance, and thread prediction. Although the decomposition problem lends itself to a min-cut-based approach, the overheads depend on the thread size, requiring the edge weights to be changed as the algorithm progresses. The changing weights make our approach different from graph-theoretic solutions to the general problem of task scheduling. One recent work uses a set of heuristics, each targeting a specific overhead in isolation, and gives precedence to thread prediction, without comparing the performance of the threads resulting from each heuristic. By contrast, our method uses a sequence of balanced min-cuts that give equal consideration to all the overheads, and adjusts the edge weights after every cut. This method achieves a geometric-mean speedup of 74% for floating-point programs and 23% for integer programs on a four-processor chip, improving on the 52% and 13% achieved by the previous heuristics.
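The toy example below (not the paper's algorithm) illustrates only the central complication the abstract points out: because part of a cut's weight depends on the sizes of the threads it creates, weights must be re-evaluated as the decomposition proceeds.

    // Toy illustration of size-dependent cut weights in thread decomposition;
    // not the paper's balanced min-cut algorithm. A program is modeled as a
    // chain of blocks; a cut starts a new thread, and its weight combines the
    // values flowing across it with a load-imbalance penalty.
    #include <stdio.h>
    #include <stdlib.h>

    #define NBLOCKS 8

    // Invented program: instruction counts per block, and the number of
    // values crossing each potential cut point.
    static const int blk_size[NBLOCKS]       = { 4, 6, 3, 8, 5, 2, 7, 5 };
    static const int cross_deps[NBLOCKS - 1] = { 2, 1, 3, 1, 2, 4, 1 };

    // Weight of cutting segment [lo, hi) after block i: dependence traffic
    // plus a penalty for imbalance between the two resulting threads.
    static double cut_weight(int lo, int hi, int i) {
        int left = 0, right = 0;
        for (int b = lo; b <= i; b++)    left  += blk_size[b];
        for (int b = i + 1; b < hi; b++) right += blk_size[b];
        return cross_deps[i] + 0.5 * abs(left - right);
    }

    // Cheapest cut of [lo, hi); returns the block after which to cut, or -1.
    static int best_cut(int lo, int hi) {
        int best = -1;
        double best_w = 1e9;
        for (int i = lo; i < hi - 1; i++) {
            double w = cut_weight(lo, hi, i);
            if (w < best_w) { best_w = w; best = i; }
        }
        if (best >= 0)
            printf("cut [%d,%d) after block %d (weight %.1f)\n", lo, hi, best, best_w);
        return best;
    }

    int main(void) {
        // The first cut splits the whole chain; later cuts see different
        // weights because the imbalance term depends on the segment being cut.
        int c1 = best_cut(0, NBLOCKS);
        if (c1 >= 0) {
            best_cut(0, c1 + 1);
            best_cut(c1 + 1, NBLOCKS);
        }
        return 0;
    }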