Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Rolf Rabenseifner is active.

Publication


Featured research published by Rolf Rabenseifner.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2005

Optimization of Collective Communication Operations in MPICH

Rajeev Thakur; Rolf Rabenseifner; William Gropp

We describe our work on improving the performance of collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages. Although we have implemented new algorithms for all MPI (Message Passing Interface) collective operations, because of limited space we describe only the algorithms for allgather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce. Performance results on a Myrinet-connected Linux cluster and an IBM SP indicate that, in all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM's MPI on the SP. We also explore in further detail the optimization of two of the most commonly used collective operations, allreduce and reduce, particularly for long messages and non-power-of-two numbers of processes. The optimized algorithms for these operations perform several times better than the native algorithms on a Myrinet cluster, IBM SP, and Cray T3E. Our results indicate that to achieve the best performance for a collective communication operation, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
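
As a rough illustration of the selection strategy described above, the sketch below dispatches a broadcast to a latency-oriented binomial tree for short messages and falls back to the library's MPI_Bcast for long ones. The function names (smart_bcast, bcast_binomial) and the 12 KiB threshold are hypothetical and not taken from the paper; a production implementation would switch to a bandwidth-oriented scatter-plus-allgather algorithm for long messages instead of the plain fallback.

    #include <mpi.h>

    #define SHORT_MSG_BYTES 12288   /* illustrative threshold only */

    /* Binomial-tree broadcast: O(log p) steps, good for short (latency-bound) messages. */
    static void bcast_binomial(void *buf, int count, MPI_Datatype dt,
                               int root, MPI_Comm comm)
    {
        int rank, size, mask = 1;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int rel = (rank - root + size) % size;        /* rank relative to the root */

        while (mask < size) {                         /* receive from the tree parent */
            if (rel & mask) {
                MPI_Recv(buf, count, dt, (rank - mask + size) % size,
                         0, comm, MPI_STATUS_IGNORE);
                break;
            }
            mask <<= 1;
        }
        mask >>= 1;
        while (mask > 0) {                            /* forward to the tree children */
            if (rel + mask < size)
                MPI_Send(buf, count, dt, (rank + mask) % size, 0, comm);
            mask >>= 1;
        }
    }

    /* Pick an algorithm by message size, mirroring the strategy described above. */
    void smart_bcast(void *buf, int count, MPI_Datatype dt, int root, MPI_Comm comm)
    {
        int type_size;
        MPI_Type_size(dt, &type_size);
        if ((long)count * type_size <= SHORT_MSG_BYTES)
            bcast_binomial(buf, count, dt, root, comm);
        else
            MPI_Bcast(buf, count, dt, root, comm);    /* stand-in for a bandwidth-optimized path */
    }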


Parallel, Distributed and Network-Based Processing | 2009

Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes

Rolf Rabenseifner; Georg Hager; Gabriele Jost

Today most systems in high-performance computing (HPC) feature a hierarchical hardware design: shared memory nodes with several multi-core CPUs are connected via a network infrastructure. Parallel programming must combine distributed memory parallelization on the node interconnect with shared memory parallelization inside each node. We describe potentials and challenges of the dominant programming models on hierarchically structured hardware: pure MPI (Message Passing Interface), pure OpenMP (with distributed shared memory extensions), and hybrid MPI+OpenMP in several flavors. We pinpoint cases where a hybrid programming model can indeed be the superior solution because of reduced communication needs and memory consumption, or improved load balance. Furthermore, we show that machine topology has a significant impact on performance for all parallelization strategies and that topology awareness should be built into all applications in the future. Finally, we give an outlook on possible standardization goals and extensions that could make hybrid programming easier to do with performance in mind.
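
A minimal sketch of the hybrid MPI+OpenMP style discussed above, assuming the common "masteronly" scheme in which only the master thread calls MPI; compile with something like mpicc -fopenmp.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* MPI_THREAD_FUNNELED: threads exist, but only the master thread calls MPI,
           which matches the masteronly hybrid style. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Shared-memory parallelism inside the SMP node. */
        #pragma omp parallel
        {
            printf("MPI rank %d: OpenMP thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        /* Distributed-memory communication between nodes (master thread only). */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

A typical launch places one MPI process per node or per socket and sets OMP_NUM_THREADS to the number of cores available to that process, so that the process-to-hardware mapping reflects the machine topology.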


International Conference on Computational Science | 2004

Optimization of Collective Reduction Operations

Rolf Rabenseifner

Five years of profiling in production mode at the University of Stuttgart have shown that more than 40% of the execution time of Message Passing Interface (MPI) routines is spent in the collective communication routines MPI_Allreduce and MPI_Reduce. Although MPI implementations have been available for about 10 years and all vendors are committed to this Message Passing Interface standard, the vendors' and publicly available reduction algorithms could be accelerated with new algorithms by a factor between 3 (IBM, sum) and 100 (Cray T3E, maxloc) for long vectors. This paper presents five algorithms optimized for different choices of vector size and number of processes. The focus is on bandwidth-dominated protocols for power-of-two and non-power-of-two numbers of processes, optimizing the load balance in communication and computation.
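
The bandwidth-dominated protocols mentioned above follow a reduce-scatter plus allgather structure. The sketch below expresses that two-phase structure with MPI's built-in collectives (the function name is hypothetical, and the vector length is assumed divisible by the number of processes); the paper's algorithms instead hand-code the recursive-halving and recursive-doubling phases, and a good MPI_Allreduce may already do something similar internally for long vectors.

    #include <mpi.h>
    #include <stdlib.h>

    /* Two-phase allreduce sketch: reduce-scatter, then allgather.
       recvbuf must hold n doubles; n is assumed divisible by the communicator size. */
    void allreduce_two_phase(const double *sendbuf, double *recvbuf, int n, MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);
        int chunk = n / size;
        double *partial = malloc((size_t)chunk * sizeof(double));

        /* Phase 1: every rank ends up owning the fully reduced chunk of the vector. */
        MPI_Reduce_scatter_block(sendbuf, partial, chunk, MPI_DOUBLE, MPI_SUM, comm);

        /* Phase 2: collect all reduced chunks so every rank holds the complete result. */
        MPI_Allgather(partial, chunk, MPI_DOUBLE, recvbuf, chunk, MPI_DOUBLE, comm);

        free(partial);
    }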


IEEE International Conference on High Performance Computing, Data, and Analytics | 2003

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures

Rolf Rabenseifner; Gerhard Wellein

Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler based parallelization, and other parallel programming models on hybrid architectures. The paper focuses on bandwidth and latency aspects, and also on whether programming paradigms can separate the optimization of communication and computation. Benchmark results are presented for hybrid and pure MPI communication. This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes.


Lecture Notes in Computer Science | 2004

More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems

Rolf Rabenseifner; Jesper Larsson Träff

We present improved algorithms for global reduction operations for message-passing systems. Each of p processors has a vector of m data items, and we want to compute the element-wise “sum” under a given, associative function of the p vectors. The result, which is also a vector of m items, is to be stored at either a given root processor (MPI_Reduce) or all p processors (MPI_Allreduce). A further constraint is that for each data item and each processor the result must be computed in the same order, and with the same bracketing. Both problems can be solved in O(m + log2 p) communication and computation time. Such reduction operations are part of MPI (the Message Passing Interface), and the algorithms presented here achieve significant improvements over currently implemented algorithms for the important case where p is not a power of 2. Our algorithm requires ⌈log2 p⌉ + 1 rounds – one round off from optimal – for small vectors. For large vectors twice the number of rounds is needed, but the communication and computation time is less than 3mβ and 3/2mγ, respectively, an improvement from 4mβ and 2mγ achieved by previous algorithms (with the message transfer time modeled as α + mβ, and the reduction-operation execution time as mγ). For p = 3×2^n and p = 9×2^n and small m ≤ b for some threshold b, and for p = q×2^n with small q, our algorithm achieves the optimal ⌈log2 p⌉ number of rounds.
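
Plugging numbers into the cost model quoted above (transfer time α + mβ per message, mγ per reduction, with the latency terms omitted for large vectors) shows the expected gain. The small program below only illustrates that arithmetic; the parameter values are made up and carry no meaning beyond the example.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative parameters only: per-byte transfer cost beta,
           per-byte reduction cost gamma, vector of m bytes. */
        double beta = 1e-9, gamma = 0.5e-9, m = 8e6;

        double t_old = 4.0 * m * beta + 2.0 * m * gamma;   /* previous algorithms */
        double t_new = 3.0 * m * beta + 1.5 * m * gamma;   /* bound quoted in the abstract */

        printf("old: %.3f ms, new: %.3f ms, ratio: %.2f\n",
               t_old * 1e3, t_new * 1e3, t_old / t_new);
        return 0;
    }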


Concurrency and Computation: Practice and Experience | 2011

The scalable process topology interface of MPI 2.2

Torsten Hoefler; Rolf Rabenseifner; Hubert Ritzdorf; Bronis R. de Supinski; Rajeev Thakur; Jesper Larsson Träff

The Message Passing Interface (MPI) standard provides basic means for adaptations of the mapping of MPI process ranks to processing elements to better match the communication characteristics of applications to the capabilities of the underlying systems. The MPI process topology mechanism enables the MPI implementation to rerank processes by creating a new communicator that reflects user-supplied information about the application communication pattern. With the newly released MPI 2.2 version of the MPI standard, the process topology mechanism has been enhanced with new interfaces for scalable and informative user-specification of communication patterns. Applications with relatively static communication patterns are encouraged to take advantage of the mechanism whenever convenient by specifying their communication pattern to the MPI library. Reference implementations of the new mechanism can be expected to be readily available (and come at essentially no cost), but non-trivial implementations pose challenging problems for the MPI implementer. This paper is first and foremost addressed to application programmers wanting to use the new process topology interfaces. It explains the use and the motivation for the enhanced interfaces and the advantages gained even with a straightforward implementation. For the MPI implementer, the paper summarizes the main issues in the efficient implementation of the interface and explains the optimization problems that need to be (approximately) solved by a good MPI library.
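
From the application programmer's side, using the scalable distributed graph interface added in MPI 2.2 looks roughly like the sketch below. The ring pattern is chosen purely for illustration; the key point is that each rank names only its own neighbours instead of the full global graph.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank specifies only its own neighbours (here: a ring),
           which is what makes the MPI 2.2 interface scalable. */
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;
        int sources[2]      = { left, right };
        int destinations[2] = { left, right };

        MPI_Comm ring_comm;
        MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                       2, sources,      MPI_UNWEIGHTED,
                                       2, destinations, MPI_UNWEIGHTED,
                                       MPI_INFO_NULL,
                                       1 /* reorder: allow the library to remap ranks */,
                                       &ring_comm);

        int newrank;
        MPI_Comm_rank(ring_comm, &newrank);
        printf("old rank %d -> new rank %d\n", rank, newrank);

        MPI_Comm_free(&ring_comm);
        MPI_Finalize();
        return 0;
    }

With reorder set to 1, the library may return a communicator in which ranks are renumbered to better match the machine topology; the application then uses the new ranks for its communication.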


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Communication Bandwidth of Parallel Programming Models on Hybrid Architectures

Rolf Rabenseifner

Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node. This paper introduces several programming models for hybrid systems. It focuses on programming methods that can achieve optimal inter-node communication bandwidth, and on the hybrid MPI+OpenMP approach and its programming rules. The communication behavior is compared with the pure MPI programming paradigm and with RDMA- and NUMA-based programming models.


Journal of Computer and System Sciences | 2008

Performance evaluation of supercomputers using HPCC and IMB Benchmarks

Subhash Saini; Robert Ciotti; Brian T. N. Gunney; Thomas E. Spelce; Alice Koniges; Don Dossa; Panagiotis Adamidis; Rolf Rabenseifner; Sunil R. Tiyyagura; Matthias S. Mueller

The HPC Challenge (HPCC) Benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of the processor, memory subsystem, and interconnect fabric of five leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon Cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC Benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmark results to study the performance of 11 MPI communication functions on these systems.
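
The point-to-point numbers in such suites come from ping-pong style kernels. The sketch below is not the IMB code itself, just an illustration of the idea; the 1 MiB message size and repeat count are arbitrary, and the program expects at least two MPI processes.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        enum { NBYTES = 1 << 20, REPS = 100 };        /* illustrative sizes only */
        int rank;
        char *buf = calloc(1, NBYTES);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {                          /* ping */
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {                   /* pong */
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = MPI_Wtime() - t0;

        if (rank == 0)
            printf("ping-pong bandwidth: %.1f MB/s\n",
                   2.0 * REPS * NBYTES / t / 1e6);    /* each iteration moves 2*NBYTES */

        free(buf);
        MPI_Finalize();
        return 0;
    }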


HPSC | 2005

Comparison of Parallel Programming Models on Clusters of SMP Nodes

Rolf Rabenseifner; Gerhard Wellein

Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results of several platforms are presented. This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. There are several mismatch problems between the (hybrid) programming schemes and the hybrid hardware architectures. Benchmark results on a Myrinet cluster and on recent Cray, NEC, IBM, Hitachi, SUN and SGI platforms show that the hybrid-masteronly programming model can be used more efficiently on some vector-type systems, but also on clusters of dual-CPU nodes. On other systems, one CPU is not able to saturate the inter-node network, and the commonly used masteronly programming model suffers from insufficient inter-node bandwidth. This paper analyzes strategies to overcome typical drawbacks of this easily usable programming scheme on systems with weaker interconnects. The best performance can be achieved by overlapping communication and computation, but this scheme lacks ease of use.
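
The overlap scheme named in the last sentence above can be sketched as follows, under stated assumptions: the function name and the placeholder work routines are hypothetical, and the example shows a halo exchange started with nonblocking MPI before the interior computation.

    #include <mpi.h>

    /* Placeholder work routines, standing in for real computation. */
    static void compute_interior(void) { /* work that needs no halo data */ }
    static void compute_boundary(void) { /* work that depends on the received halo */ }

    /* Overlap the halo exchange with the interior computation. */
    void exchange_and_compute(double *halo_send, double *halo_recv, int halo_n,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        /* Start the halo exchange without blocking. */
        MPI_Irecv(halo_recv, halo_n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Isend(halo_send, halo_n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        compute_interior();                        /* overlapped with the transfer */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); /* complete communication */

        compute_boundary();                        /* now the halo data is available */
    }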


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2001

Effective Communication and File-I/O Bandwidth Benchmarks

Rolf Rabenseifner; Alice Koniges

We describe the design and MPI implementation of two benchmarks created to characterize the balanced system performance of high-performance clusters and supercomputers: b_eff, a communication-specific benchmark that examines the parallel message-passing performance of a system, and b_eff_io, which characterizes the effective I/O bandwidth. Both benchmarks have two goals: (a) to get a detailed insight into the performance strengths and weaknesses of different parallel communication and I/O patterns, and, based on this, (b) to obtain a single bandwidth number that characterizes the average performance of the system, namely its communication and I/O bandwidth. Both benchmarks use a time-driven approach and loop over a variety of communication and access patterns to characterize a system in an automated fashion. Results of the two benchmarks are given for several systems, including IBM SPs, Cray T3E, NEC SX-5, and Hitachi SR 8000. After a redesign of b_eff_io, I/O bandwidth results for several compute partition sizes are achieved in an appropriate time for rapid benchmarking.
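
In the spirit of b_eff's time-driven looping over patterns and message sizes, one such pattern could look like the schematic sketch below (this is not the benchmark itself; the ring pattern, message sizes, and repeat count are illustrative): every rank sends to its right neighbour and receives from its left neighbour, and an aggregate bandwidth is reported per message size.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int sizes[] = { 1 << 10, 1 << 14, 1 << 18, 1 << 22 };  /* illustrative */
        const int nsizes = 4, reps = 50;
        int rank, np;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        int right = (rank + 1) % np, left = (rank - 1 + np) % np;

        char *sbuf = calloc(1, sizes[nsizes - 1]);
        char *rbuf = calloc(1, sizes[nsizes - 1]);

        for (int s = 0; s < nsizes; s++) {
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++)
                MPI_Sendrecv(sbuf, sizes[s], MPI_CHAR, right, 0,
                             rbuf, sizes[s], MPI_CHAR, left,  0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double t = MPI_Wtime() - t0;
            double bw = (double)np * reps * sizes[s] / t / 1e6;  /* aggregate MB/s */
            if (rank == 0)
                printf("ring pattern, %8d bytes: %.1f MB/s aggregate\n", sizes[s], bw);
        }

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }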

Collaboration


Dive into Rolf Rabenseifner's collaborations.

Top Co-Authors

Alice Koniges

Lawrence Berkeley National Laboratory

Felix Wolf

Technische Universität Darmstadt

Georg Hager

University of Erlangen-Nuremberg

Gerhard Wellein

University of Erlangen-Nuremberg
