Christian Terboven
RWTH Aachen University
Publications
Featured research published by Christian Terboven.
International Conference on Parallel Processing | 2012
Sandra Wienke; Paul Springer; Christian Terboven; Dieter an Mey
Today's trend to use accelerators like GPGPUs in heterogeneous computer systems has entailed several low-level APIs for accelerator programming. However, programming these APIs is often tedious and therefore unproductive. To tackle this problem, recent approaches employ directive-based high-level programming for accelerators. In this work, we present our first experiences with OpenACC, an API consisting of compiler directives to offload loops and regions of C/C++ and Fortran code to accelerators. We compare the performance of OpenACC to PGI Accelerator and OpenCL for two real-world applications and evaluate programmability and productivity. We find that OpenACC offers a promising ratio of development effort to performance and that a directive-based approach to programming accelerators is more efficient than low-level APIs, even if suboptimal performance is achieved.
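A minimal sketch of the directive-based style the paper evaluates: a SAXPY-style loop offloaded to an accelerator with a single OpenACC directive. The function and data names are illustrative, not taken from the paper's application codes.

```cpp
#include <cstdio>
#include <vector>

// The compiler derives the device kernel and the host<->device data
// movement from the directive; no low-level accelerator API is needed.
void saxpy(int n, float a, const float* x, float* y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main() {
    std::vector<float> x(1000, 1.0f), y(1000, 2.0f);
    saxpy(1000, 3.0f, x.data(), y.data());
    std::printf("y[0] = %f\n", y[0]);  // expect 5.0
}
```

Built with an OpenACC compiler (e.g. `nvc++ -acc`) the loop runs on the accelerator; without one, the pragma is ignored and the loop runs sequentially on the host, which is part of the approach's appeal.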
International Conference on Parallel Processing | 2013
Dirk Schmidl; Tim Cramer; Sandra Wienke; Christian Terboven; Matthias S. Müller
The Intel Xeon Phi has been introduced as a new type of compute accelerator that is capable of executing native x86 applications. It supports programming models that are well-established in the HPC community, namely MPI and OpenMP, thus removing the necessity to refactor codes for accelerator-specific programming paradigms. Because of its native x86 support, the Xeon Phi may also be used stand-alone, meaning codes can be executed directly on the device without the need for interaction with a host. In this sense, the Xeon Phi resembles a big SMP on a chip if its 240 logical cores are compared to a common Xeon-based compute node offering up to 32 logical cores. In this work, we compare a Xeon-based two-socket compute node with the Xeon Phi stand-alone in scalability and performance using OpenMP codes. Considering both as individual SMP systems, they come at a very similar price and power envelope, but our results show significant differences in absolute application performance and scalability. We also show to what extent common programming idioms for the Xeon multi-core architecture are applicable to the Xeon Phi many-core architecture, and which challenges the changing ratio of core count to single-core performance poses for the application programmer.
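A hedged sketch of the kind of micro-kernel such comparisons rest on: the identical OpenMP code runs on the Xeon node and natively on the Xeon Phi, with only the thread count changing (e.g. OMP_NUM_THREADS up to 32 on the host versus up to 240 on the coprocessor). The kernel below is illustrative, not from the paper.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const long n = 1L << 24;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double sum = 0.0;

    double t = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += a[i] * b[i];              // bandwidth-bound dot product
    t = omp_get_wtime() - t;

    std::printf("threads=%d  sum=%.0f  time=%.3fs\n",
                omp_get_max_threads(), sum, t);
}
```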
International Workshop on OpenMP | 2004
Yuan Lin; Christian Terboven; Dieter an Mey; Nawal Copty
The process of manually specifying scopes of variables when writing an OpenMP program is both tedious and error-prone. To improve productivity, an autoscoping feature was proposed in [1]. This feature leverages the analysis capability of a compiler to determine the appropriate scopes of variables. In this paper, we present the proposed autoscoping rules and describe the autoscoping feature provided in the Sun Studio 9 Fortran 95 compiler. To investigate how much work can be saved by using autoscoping and the performance impact of this feature, we study the process of parallelizing PANTA, a 50,000-line 3D Navier-Stokes solver, using OpenMP. With pure manual scoping, a total of 1389 variables have to be explicitly privatized by the programmer. With the help of autoscoping, only 13 variables have to be manually scoped. Both versions of PANTA achieve the same performance.
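The study targets Fortran, but the burden autoscoping removes can be sketched in C++: without it, the programmer must classify every variable in the parallel region by hand, as in this illustrative loop (names are hypothetical).

```cpp
// With default(none), every variable must be scoped explicitly --
// exactly the manual work that autoscoping delegates to the compiler.
void relax(int n, double* u, const double* f, double h) {
    double res;  // per-iteration temporary: must be private
    #pragma omp parallel for default(none) private(res) shared(n, u, f, h)
    for (int i = 0; i < n; ++i) {
        res = f[i] * h * h;   // a shared 'res' would race across threads
        u[i] += res;
    }
}
```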
International Workshop on OpenMP | 2007
Christian Terboven; Dieter an Mey; Samuel Sarholz
Dual-core processors are already ubiquitous, quad-core processors will spread out this year, systems with a larger number of cores exist, and more are planned. Some cores even execute multiple threads. Are these processors just SMP systems on a chip? Is OpenMP ready to be used for these architectures? We take a look at the cache and memory architecture of some popular modern processors using kernel programs and some application codes that were previously developed for large shared memory machines.
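A sketch of the kind of kernel program used to probe the cache and memory hierarchy: a STREAM-style triad whose array size can be varied to move the working set across the cache levels into main memory. Sizes are illustrative, not the paper's.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const long n = 1L << 26;   // three 512 MiB arrays: memory-bound
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        a[i] = b[i] + 3.0 * c[i];   // triad: 2 loads, 1 store per element
    t = omp_get_wtime() - t;

    std::printf("triad: %.2f GB/s\n",
                3.0 * n * sizeof(double) / t / 1e9);
}
```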
European Conference on Parallel Processing | 2014
Sandra Wienke; Christian Terboven; James C. Beyer; Matthias S. Müller
Nowadays, HPC systems frequently emerge as clusters of commodity processors with attached accelerators. For moving from tedious low-level accelerator programming to increased development productivity, the directive-based programming models OpenACC and OpenMP are promising candidates. While OpenACC was completed about two years ago, OpenMP only recently added support for accelerator programming. To assist developers in deciding which approach to take, we compare both models with respect to their programmability. Besides investigating their expressiveness by putting their constructs side by side, we focus on evaluating their power based on structured parallel programming patterns (aka algorithmic skeletons). These patterns describe the basic entities of parallel algorithms, of which we cover map, stencil, reduction, fork-join, superscalar sequence, nesting, and geometric decomposition. The architectural targets of this work are NVIDIA-type accelerators (GPUs) and the specialties of Intel-type accelerators (Xeon Phis). Additionally, we assess the prospects of OpenACC and OpenMP concerning future developments in software and hardware design.
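One of the covered patterns, reduction, put side by side in both models to illustrate the kind of construct-level comparison the paper makes; a minimal sketch, not code from the paper.

```cpp
// OpenACC: one combined directive offloads the loop and the reduction.
double sum_acc(int n, const double* x) {
    double s = 0.0;
    #pragma acc parallel loop reduction(+:s) copyin(x[0:n])
    for (int i = 0; i < n; ++i) s += x[i];
    return s;
}

// OpenMP 4.0: the same pattern via the target/teams constructs.
double sum_omp(int n, const double* x) {
    double s = 0.0;
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: s) reduction(+:s)
    for (int i = 0; i < n; ++i) s += x[i];
    return s;
}
```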
International Workshop on OpenMP | 2012
Christian Terboven; Dirk Schmidl; Tim Cramer; Dieter an Mey
The introduction of task-level parallelization promises to raise the level of abstraction compared to thread-centric expression of parallelism. However, tasks might exhibit poor performance on NUMA systems if locality cannot be maintained. In contrast to traditional OpenMP worksharing constructs, for which threads can be bound, the behavior of tasks is much less predetermined by the OpenMP specification, and implementations have a high degree of freedom in task scheduling. Employing different approaches to express task parallelism, namely the single-producer and parallel-producer patterns with different data initialization strategies, we compare the behavior and quality of OpenMP implementations with task-parallel codes on NUMA architectures. For the programmer, we propose recipes for expressing parallelism with tasks that preserve data locality while optimizing the degree of parallelism. Our proposals are evaluated on reasonably large NUMA systems with both important application kernels and a real-world simulation code.
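The two producer patterns compared can be sketched as follows; process() is a hypothetical placeholder for the per-chunk work.

```cpp
void process(int chunk);   // placeholder for per-chunk work

// Single-producer: one thread creates all tasks; on NUMA systems the
// tasks may then execute far from the data they touch.
void single_producer(int nchunks) {
    #pragma omp parallel
    #pragma omp single
    for (int c = 0; c < nchunks; ++c) {
        #pragma omp task firstprivate(c)
        process(c);
    }
}

// Parallel-producer: all threads create tasks; combined with parallel
// (first-touch) data initialization, tasks tend to run near their data.
void parallel_producer(int nchunks) {
    #pragma omp parallel for
    for (int c = 0; c < nchunks; ++c) {
        #pragma omp task firstprivate(c)
        process(c);
    }
}
```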
International Workshop on OpenMP | 2008
Christian Terboven; Dieter an Mey; Dirk Schmidl; Marcus Wagner
MPI and OpenMP are the de-facto standards for distributed-memory and shared-memory parallelization, respectively. By employing a hybrid approach, that is, combining OpenMP and MPI parallelization in one program, a cluster of SMP systems can be exploited. Nevertheless, mixing programming paradigms and writing explicit message passing code might increase the parallel program development time significantly. Intel Cluster OpenMP is the first commercially available OpenMP implementation for a cluster, aiming to combine the ease of use of the OpenMP parallelization paradigm with the cost efficiency of a commodity cluster. In this paper we present our first experiences with Intel Cluster OpenMP.
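For contrast, the hybrid style the paper mentions, sketched minimally: MPI carries data between the SMP nodes while OpenMP parallelizes within each node. Cluster OpenMP aims to hide exactly this explicit message passing part behind its software distributed shared memory.

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; ++i)
        local += 1.0;                     // shared-memory work per node

    // Explicit message passing across nodes -- the part Cluster OpenMP
    // would make implicit.
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("global = %g\n", global);
    MPI_Finalize();
}
```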
International Journal of Parallel Programming | 2007
Dieter an Mey; Samuel Sarholz; Christian Terboven
OpenMP is widely accepted as a de facto standard for shared memory parallel programming in Fortran, C and C++. Nested parallelization has been included since the first OpenMP specification, but it took a few years until the first commercially available compilers supported this optional feature. We employed nested parallelization using OpenMP in three production codes: a C++ code for content-based image retrieval, a C++ code for the computation of critical points in multi-block CFD datasets, and a multi-block Navier-Stokes solver written in Fortran 90. In this paper we discuss the opportunities as well as the deficiencies of the nested parallelization support in OpenMP.
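Nested parallelism in the form the paper builds on can be sketched as follows; omp_set_nested() was the enabling call in OpenMP of that era, and the team sizes here are illustrative.

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    omp_set_nested(1);                      // enable nested regions
    #pragma omp parallel num_threads(2)     // outer level: coarse units
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(4) // inner level: loop work
        {
            std::printf("outer %d, inner %d\n",
                        outer, omp_get_thread_num());
        }
    }
}
```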
International Conference on Cluster Computing | 2010
Dirk Schmidl; Christian Terboven; Andreas Wolf; Dieter an Mey; Christian H. Bischof
The novel ScaleMP vSMP architecture employs commodity x86-based servers with an InfiniBand network to assemble a large shared memory system at an attractive price point. We examine this combined hardware and software approach to a DSM system using both system-level kernel benchmarks and real-world application codes. We compare this architecture with traditional shared memory machines and elaborate on strategies to tune application codes parallelized with OpenMP on multiple levels. Finally, we summarize the necessary conditions a scalable application has to fulfill in order to profit from the full potential of the ScaleMP approach.
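One tuning strategy typical for such DSM systems (an assumption here; the abstract only names "strategies") is first-touch-aware initialization: data is initialized with the same loop layout that later computes on it, so each memory page ends up on the board that will use it.

```cpp
void compute(long n) {
    double* a = new double[n];   // uninitialized: pages not yet touched

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = 0.0;              // first touch places each page locally

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] += 1.0;             // same iteration-to-thread mapping,
                                 // so threads work on local pages
    delete[] a;
}
```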
International Workshop on OpenMP | 2005
Christian Terboven; Alexander Spiegel; Dieter an Mey; Sven Gross; Volker Reichelt
In order to speed up the Navier-Stokes solver DROPS, which is developed at the IGPM (Institut für Geometrie und Praktische Mathematik) at RWTH Aachen University, the most compute-intensive parts have been tuned and parallelized using OpenMP. The combination of the template programming techniques employed in the C++ code and the OpenMP parallelization approach caused problems with many C++ compilers, and the performance of the parallel version did not meet expectations.
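One concrete friction point of that era, sketched here as an illustration rather than code from DROPS: OpenMP 2.x worksharing required loop variables of integer type, so templated iterator loops had to be recast in index form before a pragma could apply.

```cpp
#include <vector>

template <typename T>
void scale(std::vector<T>& v, T factor) {
    // Not parallelizable under OpenMP 2.x worksharing rules:
    // for (typename std::vector<T>::iterator it = v.begin();
    //      it != v.end(); ++it) *it *= factor;

    const long n = static_cast<long>(v.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)   // integer index form conforms
        v[i] *= factor;
}
```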