Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Thomas Rauber is active.

Publication


Featured research published by Thomas Rauber.


Journal of Systems Architecture | 1999

Compiler support for task scheduling in hierarchical execution models

Thomas Rauber; Gudula Rünger

Algorithms from scientific computing often exhibit a two-level parallelism based on potential method parallelism and potential system parallelism. We consider the parallel implementation of such algorithms on distributed memory machines. The two-level potential parallelism of an algorithm is expressed in a specification consisting of an upper-level hierarchy of multiprocessor tasks, each of which has an internal structure of uniprocessor tasks. Achieving an optimal parallel execution time requires an optimal scheduling of the multiprocessor tasks and an appropriate treatment of the uniprocessor tasks. For an important subclass of structured method parallelism we present a scheduling methodology that takes data redistributions between multiprocessor tasks into account. The cost model is based on realistic parallel runtimes. The scheduling methodology is designed for integration into a parallel compiler tool. We illustrate the multitask scheduling with several examples from numerical analysis.
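
The core scheduling trade-off can be illustrated with a small sketch: two independent multiprocessor tasks may run one after another on all processors, or concurrently on disjoint subgroups, and the cheaper option depends on the tasks' runtime functions and the redistribution cost. The cost function and all constants below are invented for illustration; they stand in for the realistic parallel runtimes the methodology uses.

```c
/* Illustrative M-task scheduling decision: run two independent
 * multiprocessor tasks consecutively on all p processors, or
 * concurrently on two halves of the processor set. All cost
 * functions and constants are invented for illustration. */
#include <stdio.h>

/* Hypothetical runtime of task i on q processors: total work divided
 * by q plus an overhead that grows with the group size. */
static double task_runtime(int i, int q) {
    static const double work[2] = { 800.0, 600.0 };
    return work[i] / q + 0.5 * q;
}

int main(void) {
    int p = 16;              /* total number of processors */
    double redist = 3.0;     /* assumed data redistribution cost */

    /* Consecutive execution: each task uses all p processors. */
    double t_cons = task_runtime(0, p) + redist + task_runtime(1, p);

    /* Concurrent execution: each task runs on p/2 processors. */
    double t0 = task_runtime(0, p / 2);
    double t1 = task_runtime(1, p / 2);
    double t_conc = (t0 > t1 ? t0 : t1) + redist;

    printf("consecutive %.1f vs concurrent %.1f -> schedule %s\n",
           t_cons, t_conc, t_conc < t_cons ? "concurrently" : "consecutively");
    return 0;
}
```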


Concurrency and Computation: Practice and Experience | 2004

A comparison of task pools for dynamic load balancing of irregular algorithms

Matthias Korch; Thomas Rauber

Since a static work distribution does not allow for satisfactory speed-ups of parallel irregular algorithms, there is a need for a dynamic distribution of work and data that can be adapted to the runtime behavior of the algorithm. Task pools are data structures which can distribute tasks dynamically to different processors, where each task specifies computations to be performed and provides the data for these computations. This paper discusses the characteristics of task-based algorithms and describes the implementation of selected types of task pools for shared-memory multiprocessors. Several task pools have been implemented in C with POSIX threads and in Java. The task pools differ in the data structures used to store the tasks, the mechanism used to achieve load balance, and the memory manager used to store the tasks. Runtime experiments have been performed on three different shared-memory systems using a synthetic algorithm, the hierarchical radiosity method, and a volume rendering algorithm.
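
As a concrete reference point, here is a minimal central task pool in C with POSIX threads, in the spirit of the simplest variant compared in the paper: a single linked list protected by one mutex. It is a sketch under assumed simplifications (no termination detection, no specialized memory manager), not the authors' code.

```c
/* Minimal central task pool: one linked list guarded by one mutex.
 * Sketch only: real variants add distributed queues, task stealing,
 * termination detection, and a specialized memory manager. */
#include <pthread.h>
#include <stdlib.h>

typedef struct task {
    void (*run)(void *arg);   /* computation to be performed */
    void *arg;                /* data for the computation */
    struct task *next;
} task_t;

static task_t *head = NULL;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Insert a task; called both initially and from running tasks. */
void pool_put(void (*run)(void *), void *arg) {
    task_t *t = malloc(sizeof *t);
    t->run = run;
    t->arg = arg;
    pthread_mutex_lock(&lock);
    t->next = head;
    head = t;
    pthread_mutex_unlock(&lock);
}

/* Remove a task, or return NULL if the pool is currently empty. */
static task_t *pool_get(void) {
    pthread_mutex_lock(&lock);
    task_t *t = head;
    if (t != NULL)
        head = t->next;
    pthread_mutex_unlock(&lock);
    return t;
}

/* Worker loop run by each thread. Stopping on an empty pool is a
 * simplification: running tasks may still create new ones, so a
 * real pool needs proper termination detection. */
void *worker(void *unused) {
    (void)unused;
    task_t *t;
    while ((t = pool_get()) != NULL) {
        t->run(t->arg);
        free(t);
    }
    return NULL;
}
```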


IEEE Transactions on Software Engineering | 2000

A transformation approach to derive efficient parallel implementations

Thomas Rauber; Gudula Rünger

The construction of efficient parallel programs usually requires expert knowledge in the application area and a deep insight into the architecture of a specific parallel machine. Often, the resulting performance is not portable, i.e., a program that is efficient on one machine is not necessarily efficient on another machine with a different architecture. Transformation systems provide a more flexible solution. They start with a specification of the application problem and allow the generation of efficient programs for different parallel machines. The programmer has to give an exact specification of the algorithm expressing the inherent degree of parallelism and is relieved of the low-level details of the architecture. We propose such a transformation system with an emphasis on the exploitation of data parallelism combined with a hierarchically organized structure of task parallelism. Starting with a specification of the maximum degree of task and data parallelism, the transformations generate a specification of a parallel program for a specific parallel machine. The transformations are based on a cost model and are applied in a predefined order, fixing the most important design decisions such as the scheduling of independent multitask activations, data distributions, pipelining of tasks, and assignment of processors to task activations. We demonstrate the usefulness of the approach with examples from scientific computing.


Journal of Parallel and Distributed Computing | 2006

Optimizing locality and scalability of embedded Runge-Kutta solvers using block-based pipelining

Matthias Korch; Thomas Rauber

The increasing gap between the speeds of processors and main memory has led to hardware architectures with an increasing number of caches to reduce average memory access times. Such deep memory hierarchies make the sequential and parallel efficiency of computer programs strongly dependent on their memory access pattern. In this paper, we consider embedded Runge-Kutta methods for the solution of ordinary differential equations and study their efficient implementation on different parallel platforms. In particular, we focus on ordinary differential equations which are characterized by a special access pattern as it results from the spatial discretization of partial differential equations by the method of lines. We explore how the potential parallelism in the stage vector computation of such equations can be exploited in a pipelining approach leading to a better locality behavior and a higher scalability. Experiments show that this approach results in efficiency improvements on several recent sequential and parallel computers.
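
To make the stage-vector structure concrete, the following sketch performs one step of a simple embedded Runge-Kutta pair (Heun's method with an embedded Euler solution). It is a generic textbook pair chosen for brevity, not one of the solvers benchmarked in the paper; the stage vectors k1 and k2 are the objects whose computation the pipelining approach reorders for locality.

```c
/* One step of a simple embedded Runge-Kutta pair: Heun's method
 * (order 2) with an embedded Euler solution (order 1) for the error
 * estimate. For method-of-lines right-hand sides, each component of
 * f touches only a few neighboring components, which is the access
 * pattern the pipelining approach exploits. */
#include <math.h>

/* Right-hand side of the ODE system y' = f(t, y), y of dimension n. */
typedef void (*rhs_t)(double t, const double *y, double *f, int n);

/* Advance y in place by one step of size h; return the error estimate. */
double embedded_rk_step(rhs_t f, double t, double *y, int n, double h) {
    double k1[n], k2[n], ytmp[n];     /* stage vectors (C99 VLAs) */
    double err = 0.0;

    f(t, y, k1, n);                   /* stage 1 */
    for (int i = 0; i < n; i++)
        ytmp[i] = y[i] + h * k1[i];
    f(t + h, ytmp, k2, n);            /* stage 2 */

    for (int i = 0; i < n; i++) {
        double lo = y[i] + h * k1[i];                 /* order 1 (Euler) */
        double hi = y[i] + 0.5 * h * (k1[i] + k2[i]); /* order 2 (Heun) */
        err = fmax(err, fabs(hi - lo)); /* drives step size control */
        y[i] = hi;                      /* continue with higher order */
    }
    return err;
}
```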


Computer Languages, Systems & Structures | 2011

Memory-optimal evaluation of expression trees involving large objects

Chi-Chung Lam; Thomas Rauber; Gerald Baumgartner; Daniel Cociorva; P. Sadayappan

The need to evaluate expression trees involving large objects arises in scientific computing applications such as electronic structure calculations. Often, the tree node objects are so large that only a subset of them can fit into memory at a time. This paper addresses the problem of finding an evaluation order of the nodes in a given expression tree that uses the least amount of memory. We present an algorithm that finds an optimal evaluation order in Θ(n log² n) time for an n-node expression tree and prove its correctness. We demonstrate the utility of our algorithm using representative equations from quantum chemistry.
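
A simplified version of the problem for full binary trees can be sketched as a recursion that tries both child evaluation orders at each node. This conveys the cost structure only: the paper's algorithm also considers evaluation orders that interleave subtrees and solves the general problem optimally in Θ(n log² n) time.

```c
/* Simplified peak-memory recursion for a full binary expression tree:
 * at each inner node, try both child evaluation orders and keep the
 * cheaper one. Illustration only; not the paper's optimal algorithm. */
typedef struct node {
    double size;                /* memory held by this node's result */
    struct node *left, *right;  /* both NULL for a leaf */
} node_t;

static double max2(double a, double b) { return a > b ? a : b; }

/* Peak memory needed to evaluate the subtree rooted at v. */
double peak(const node_t *v) {
    if (v->left == NULL)        /* leaf: only its own result */
        return v->size;
    /* Whichever child is evaluated first, its result stays alive
     * while the other subtree is evaluated. */
    double left_first  = max2(peak(v->left),
                              v->left->size + peak(v->right));
    double right_first = max2(peak(v->right),
                              v->right->size + peak(v->left));
    double best = left_first < right_first ? left_first : right_first;
    /* Finally both child results and v's result coexist briefly. */
    return max2(best, v->left->size + v->right->size + v->size);
}
```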


Journal of Parallel and Distributed Computing | 2005

Tlib: a library to support programming with hierarchical multi-processor tasks

Thomas Rauber; Gudula Rünger

The paper considers modular programming with hierarchically structured multi-processor tasks on top of SPMD tasks for distributed memory machines. The parallel execution requires a corresponding decomposition of the set of processors into a hierarchical group structure onto which the tasks are mapped. The result is a multi-level group SPMD computation model with varying processor group structures. The advantage of this kind of mixed task and data parallelism is a potential to reduce the communication overhead and to increase scalability. We present a runtime library to support the coordination of hierarchically structured multi-processor tasks. The library exploits an extended parallel group SPMD programming model and manages the entire task execution including the dynamic hierarchy of processor groups. The library is built on top of MPI, has an easy-to-use interface, and leads to only a marginal overhead while allowing static planning and dynamic restructuring.
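
The group decomposition at the heart of this model can be sketched with standard MPI calls: MPI_Comm_split partitions the processors into disjoint communicators, one per multiprocessor task, and each task may split its communicator again for nested tasks. The split into exactly two groups and the task functions below are illustrative assumptions; this is not the Tlib interface itself.

```c
/* Group decomposition for two concurrent M-tasks with standard MPI
 * calls. task_a and task_b are hypothetical SPMD computations; each
 * receives its own communicator and may split it again for nested
 * multi-processor tasks. */
#include <mpi.h>

static void task_a(MPI_Comm comm) { (void)comm; /* SPMD computation */ }
static void task_b(MPI_Comm comm) { (void)comm; /* SPMD computation */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Partition the processor set into two disjoint groups. */
    int color = (rank < size / 2) ? 0 : 1;
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group);

    /* Both M-tasks execute concurrently, one per processor group. */
    if (color == 0)
        task_a(group);
    else
        task_b(group);

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}
```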


Conference on High Performance Computing (Supercomputing) | 2004

Performance Evaluation of Task Pools Based on Hardware Synchronization

Ralf Hoffmann; Matthias Korch; Thomas Rauber

A task-based execution provides a universal approach to dynamic load balancing for irregular applications. Tasks are arbitrary units of work that are created dynamically at run-time and stored in a parallel data structure, the task pool, until they are scheduled onto a processor for execution. In this paper, we evaluate the performance of different task pool implementations for shared-memory computer systems using several realistic applications. We consider task pools with different data structures, different load balancing strategies, and a specialized memory management. In particular, we use synchronization operations based on hardware support that is available on many modern microprocessors. We show that the resulting task pool implementations lead to much better performance than implementations using Pthreads library calls for synchronization. The applications considered are parallel quicksort, volume rendering, ray tracing, and hierarchical radiosity. The target machines are an IBM p690 server and a SunFire 6800.
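
The following sketch shows the kind of synchronization the paper evaluates: a central task stack whose push and pop use a hardware compare-and-swap instead of Pthreads mutex calls. It uses C11 atomics for portability (the paper predates C11 and relied on processor-specific operations), and it omits the ABA protection that a production version must add, which is one reason a specialized memory management matters in practice.

```c
/* Central task stack with compare-and-swap push/pop instead of mutex
 * calls, using C11 atomics. Sketch only: pop as written is exposed to
 * the ABA problem and to use-after-free if nodes are freed eagerly. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct task {
    void (*run)(void *arg);
    void *arg;
    struct task *next;
} task_t;

static _Atomic(task_t *) top = NULL;

void push(task_t *t) {
    task_t *old = atomic_load(&top);
    do {
        t->next = old;    /* link against the currently observed top */
    } while (!atomic_compare_exchange_weak(&top, &old, t));
}

task_t *pop(void) {
    task_t *old = atomic_load(&top);
    /* On failure the CAS reloads `old`, so the loop simply retries. */
    while (old != NULL &&
           !atomic_compare_exchange_weak(&top, &old, old->next))
        ;
    return old;           /* NULL if the stack was empty */
}
```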


Conference on High Performance Computing (Supercomputing) | 2002

Library Support for Hierarchical Multi-Processor Tasks

Thomas Rauber; Gudula Rünger

The paper considers modular programming with hierarchically structured multi-processor tasks on top of SPMD tasks for distributed memory machines. The parallel execution requires a corresponding decomposition of the set of processors into a hierarchical group structure onto which the tasks are mapped. This results in a multi-level group SPMD computation model with varying processor group structures. The advantage of this kind of mixed task and data parallelism is a potential to reduce the communication overhead and to increase scalability. We present a runtime library to support the coordination of hierarchically structured multi-processor tasks. The library exploits an extended parallel group SPMD programming model and manages the entire task execution including the dynamic hierarchy of processor groups. The library is built on top of MPI, has an easy-to-use interface, and leads to only a marginal overhead while allowing static planning and dynamic restructuring.


International Conference on Supercomputing | 2004

Multilevel hierarchical matrix multiplication on clusters

Sascha Hunold; Thomas Rauber; Gudula Rünger

Matrix-matrix multiplication is one of the core computations in many algorithms from scientific computing or numerical analysis, and many efficient realizations have been invented over the years, including many parallel ones. The current trend to use clusters of PCs or SMPs for scientific computing suggests revisiting matrix-matrix multiplication and investigating the efficiency and scalability of different versions on clusters. In this paper we present parallel algorithms for matrix-matrix multiplication which are built up from several algorithms in a multilevel structure. Each level is associated with a hierarchical partition of the set of available processors into disjoint subsets so that deeper levels of the algorithm employ smaller groups of processors in parallel. We perform runtime experiments on several parallel platforms and show that multilevel algorithms can lead to significant performance gains compared with state-of-the-art methods.
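
A sequential sketch of the multilevel structure is given below: the multiplication is split into quadrant subproblems, and each recursion level corresponds to a level at which the parallel versions assign disjoint processor subgroups. The quadrant decomposition is standard; the group assignment and the concrete algorithms combined at each level in the paper are not shown.

```c
/* Recursive quadrant decomposition of C += A * B for n x n row-major
 * matrices (n a power of two, C zeroed beforehand). Sequential sketch:
 * in the parallel multilevel algorithms, each recursion level hands
 * independent subproblems to disjoint processor subgroups. */
void matmul_rec(const double *A, const double *B, double *C,
                int n, int stride, int cutoff) {
    if (n <= cutoff) {    /* base case: plain triple loop */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i * stride + j] += A[i * stride + k] * B[k * stride + j];
        return;
    }
    int h = n / 2;        /* quadrant offsets in the row-major layout */
    const double *A11 = A, *A12 = A + h;
    const double *A21 = A + h * stride, *A22 = A + h * stride + h;
    const double *B11 = B, *B12 = B + h;
    const double *B21 = B + h * stride, *B22 = B + h * stride + h;
    double *C11 = C, *C12 = C + h;
    double *C21 = C + h * stride, *C22 = C + h * stride + h;

    /* Eight half-size subproblems; pairs writing the same C quadrant
     * must be serialized, the rest can run concurrently. */
    matmul_rec(A11, B11, C11, h, stride, cutoff);
    matmul_rec(A12, B21, C11, h, stride, cutoff);
    matmul_rec(A11, B12, C12, h, stride, cutoff);
    matmul_rec(A12, B22, C12, h, stride, cutoff);
    matmul_rec(A21, B11, C21, h, stride, cutoff);
    matmul_rec(A22, B21, C21, h, stride, cutoff);
    matmul_rec(A21, B12, C22, h, stride, cutoff);
    matmul_rec(A22, B22, C22, h, stride, cutoff);
}
```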


International Parallel and Distributed Processing Symposium | 2004

A source code analyzer for performance prediction

Matthias Kühnemann; Thomas Rauber; Gudula Rünger

Performance prediction is crucial for dealing with multidimensional performance effects on parallel systems. The increasing use of parallel supercomputers and cluster systems to solve large-scale scientific problems has generated a need for tools that can predict scalability trends of applications written for these machines. In this paper, we describe a compiler tool that automates performance prediction by deriving closed-form runtime formulas for the execution times of parallel programs. For an arbitrary parallel MPI source program, the tool generates a corresponding runtime function modeling the CPU execution time and the message passing overhead. The tool is designed to support the development process and the performance engineering activities that accompany the whole software life cycle. It is shown to be effective in analyzing a representative application for varying problem sizes on several platforms using different numbers of processors.
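
A runtime formula in closed form, of the kind such a tool derives, might look like the following sketch. The shape of the formula and all coefficients here are invented for illustration; the tool extracts the actual terms from the MPI source program.

```c
/* Illustrative closed-form runtime function T(n, p): a computation
 * term shared among p processors plus a logarithmic communication
 * term (e.g., a tree-structured reduction). All coefficients and the
 * shape of the formula are invented for illustration. */
#include <math.h>

double predicted_runtime(double n, double p) {
    const double t_op    = 2.0e-9;  /* assumed time per arithmetic op */
    const double t_start = 5.0e-6;  /* assumed message startup time   */
    const double t_byte  = 1.0e-9;  /* assumed per-byte transfer time */

    double t_comp = t_op * (n * n) / p;                 /* CPU time */
    double t_comm = log2(p) * (t_start + t_byte * n);   /* MPI time */
    return t_comp + t_comm;
}
```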

Collaboration


Dive into Thomas Rauber's collaborations.

Top Co-Authors

Gudula Rünger
Chemnitz University of Technology

Jörg Dümmler
Chemnitz University of Technology

Robert Reilein
Chemnitz University of Technology

Björn Krellner
Chemnitz University of Technology