Network


Latest external collaboration at the country level.

Hotspot


Dive into the research topics where Dieter an Mey is active.

Publication


Featured research published by Dieter an Mey.


International Conference on Parallel Processing | 2012

OpenACC: first experiences with real-world applications

Sandra Wienke; Paul Springer; Christian Terboven; Dieter an Mey

Today's trend to use accelerators like GPGPUs in heterogeneous computer systems has entailed several low-level APIs for accelerator programming. However, programming with these APIs is often tedious and therefore unproductive. To tackle this problem, recent approaches employ directive-based high-level programming for accelerators. In this work, we present our first experiences with OpenACC, an API consisting of compiler directives for offloading loops and regions of C/C++ and Fortran code to accelerators. We compare the performance of OpenACC to PGI Accelerator and OpenCL for two real-world applications and evaluate programmability and productivity. We find that OpenACC offers a promising ratio of development effort to performance and that a directive-based approach to programming accelerators is more efficient than low-level APIs, even if suboptimal performance is achieved.
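As an illustration of the directive-based style the paper evaluates, here is a minimal OpenACC sketch in C (not taken from the paper): a single directive offloads a SAXPY loop, whereas an OpenCL version would need explicit device setup, buffer management, and a separate kernel.

    /* Minimal OpenACC sketch (illustrative): offload a SAXPY loop.
     * The data clauses describe the host-device transfers that a
     * low-level API would require by hand. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }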


Parallel Tools Workshop | 2012

Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir

Andreas Knüpfer; Christian Rössel; Dieter an Mey; Scott Biersdorff; Kai Diethelm; Dominic Eschweiler; Markus Geimer; Michael Gerndt; Daniel Lorenz; Allen D. Malony; Wolfgang E. Nagel; Yury Oleynik; Peter Philippen; Pavel Saviankou; Dirk Schmidl; Sameer Shende; Ronny Tschüter; Michael Wagner; Bert Wesarg; Felix Wolf

This paper gives an overview of the Score-P performance measurement infrastructure, which is being jointly developed by leading HPC performance tools groups. It motivates the advantages of the joint undertaking from both the developer and the user perspectives, and presents the design and components of the newly developed Score-P performance measurement infrastructure. Furthermore, it contains first evaluation results in comparison with existing performance tools and presents an outlook on the long-term cooperative development of the new system.
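As a hedged illustration of how an application interacts with such an infrastructure, the sketch below uses Score-P's manual user-instrumentation API to mark a code region; the instrumented routine is a placeholder, and instrumentation is normally enabled by building with the scorep compiler wrapper and defining SCOREP_USER_ENABLE.

    /* Sketch of manual region instrumentation with the Score-P user API.
     * The surrounding routine is a placeholder, not code from the paper. */
    #include <scorep/SCOREP_User.h>

    void solve_timestep(void)
    {
        SCOREP_USER_REGION_DEFINE( solver_region )
        SCOREP_USER_REGION_BEGIN( solver_region, "solver",
                                  SCOREP_USER_REGION_TYPE_COMMON )
        /* ... computation whose time is attributed to the "solver" region ... */
        SCOREP_USER_REGION_END( solver_region )
    }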


Archive | 2011

Score-P: A Unified Performance Measurement System for Petascale Applications

Dieter an Mey; Scott Biersdorff; Christian H. Bischof; Kai Diethelm; Dominic Eschweiler; Michael Gerndt; Andreas Knüpfer; Daniel Lorenz; Allen D. Malony; Wolfgang E. Nagel; Yury Oleynik; Christian Rössel; Pavel Saviankou; Dirk Schmidl; Sameer Shende; Michael Wagner; Bert Wesarg; Felix Wolf

The rapidly growing number of cores on modern supercomputers imposes scalability demands not only on applications but also on the software tools needed for their development. At the same time, increasing application and system complexity makes the optimization of parallel codes more difficult, creating a need for scalable performance-analysis technology with advanced functionality. However, delivering such an expensive technology can hardly be accomplished by single tool developers and requires higher degrees of collaboration within the HPC community. The unified performance-measurement system Score-P is a joint effort of several academic performance-tool builders, funded under the BMBF program HPC-Software für skalierbare Parallelrechner in the SILC project (Skalierbare Infrastruktur zur automatischen Leistungsanalyse paralleler Codes). It is being developed with the objective of creating a common basis for several complementary optimization tools in the service of enhanced scalability, improved interoperability, and reduced maintenance cost.


International Workshop on OpenMP | 2004

Automatic scoping of variables in parallel regions of an OpenMP program

Yuan Lin; Christian Terboven; Dieter an Mey; Nawal Copty

The process of manually specifying scopes of variables when writing an OpenMP program is both tedious and error-prone. To improve productivity, an autoscoping feature was proposed in [1]. This feature leverages the analysis capability of a compiler to determine the appropriate scopes of variables. In this paper, we present the proposed autoscoping rules and describe the autoscoping feature provided in the Sun Studio™ 9 Fortran 95 compiler. To investigate how much work can be saved by using autoscoping and the performance impact of this feature, we study the process of parallelizing PANTA, a 50,000-line 3D Navier-Stokes solver, using OpenMP. With pure manual scoping, a total of 1389 variables have to be explicitly privatized by the programmer. With the help of autoscoping, only 13 variables have to be manually scoped. Both versions of PANTA achieve the same performance.
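To illustrate the manual-scoping burden that autoscoping removes, here is a small hedged C/OpenMP example (the paper's feature applies to the Sun Studio Fortran compiler, where DEFAULT(__AUTO) lets the compiler infer these clauses): every variable used in the parallel region must be classified by the programmer.

    /* Manual scoping in C/OpenMP: the programmer decides, variable by
     * variable, what is private, shared, or a reduction -- exactly the
     * classification that autoscoping performs automatically. */
    #include <stddef.h>

    double dot(size_t n, const double *x, const double *y)
    {
        double sum = 0.0;
        size_t i;
        #pragma omp parallel for private(i) shared(n, x, y) reduction(+:sum)
        for (i = 0; i < n; ++i)
            sum += x[i] * y[i];
        return sum;
    }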


European Journal of Human Genetics | 2004

Efficient two-trait-locus linkage analysis through program optimization and parallelization: application to hypercholesterolemia

Johannes Dietter; Alexander Spiegel; Dieter an Mey; Hans-Joachim Pflug; Hussam Al-Kateb; Katrin Hoffmann; Thomas F. Wienker; Konstantin Strauch

We have optimized and parallelized the GENEHUNTER-TWOLOCUS program, which makes it possible to perform linkage analysis with two trait loci in the multimarker context. The optimization of the serial program, before parallelization, results in a speedup of more than a factor of 10. The parallelization affects the two-locus-score calculation, which is predominant in terms of computation time. We obtain perfect speedup, that is, the computation time decreases exactly by a factor equal to the number of processors. In addition, two-locus LOD and NPL scores are now calculated for varying genetic positions of both disease loci, rather than, as before, varying one locus while keeping the position of the other disease locus fixed. This results in easily interpretable 3-D plots. We have reanalyzed a pedigree with hypercholesterolemia using our new version of GENEHUNTER-TWOLOCUS. Whereas originally two individuals had to be discarded due to excessive computation-time demands, the entire 17-bit pedigree could now be analyzed as a whole. We obtain a two-trait-locus LOD score of 5.49 under a multiplicative model, compared to LOD scores of 3.08 and 2.87 under a heterogeneity and an additive model, respectively. This further increases the evidence for linkage to both the 1p36.1–p35 and 13q22–q32 regions, and corroborates the hypothesis that the two genes act in a multiplicative way on LDL cholesterol level. Furthermore, we compare the computation times for two-trait-locus analysis needed by the programs GENEHUNTER-TWOLOCUS, TLINKAGE, and SUPERLINK. Altogether, our algorithmic improvements of GENEHUNTER-TWOLOCUS allow researchers to analyze complex diseases under realistic two-trait-locus models with pedigrees of reasonable size and using many markers.
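The abstract does not show code, but the structure it describes, independent score evaluations over a grid of positions for both disease loci, maps naturally onto a parallel loop nest. The following OpenMP sketch is purely illustrative; the scoring routine is hypothetical and not part of GENEHUNTER-TWOLOCUS.

    /* Hypothetical sketch: two-locus scores for all pairs of genetic
     * positions are independent, so the grid can be distributed across
     * processors, which is why speedup scales with the processor count. */
    double two_locus_score(int p1, int p2);   /* hypothetical scoring routine */

    void score_grid(int npos, double *score)  /* npos x npos result matrix */
    {
        #pragma omp parallel for collapse(2) schedule(dynamic)
        for (int p1 = 0; p1 < npos; ++p1)
            for (int p2 = 0; p2 < npos; ++p2)
                score[p1 * npos + p2] = two_locus_score(p1, p2);
    }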


International Workshop on OpenMP | 2007

OpenMP on Multicore Architectures

Christian Terboven; Dieter an Mey; Samuel Sarholz

Dual-core processors are already ubiquitous, quad-core processors will spread this year, systems with a larger number of cores exist, and more are planned. Some cores even execute multiple threads. Are these processors just SMP systems on a chip? Is OpenMP ready to be used for these architectures? We take a look at the cache and memory architecture of some popular modern processors using kernel programs and some application codes that were previously developed for large shared-memory machines.
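A small, hedged example of the kind of kernel program such a study relies on: a STREAM-style triad whose scaling across cores is bounded by shared caches and memory bandwidth rather than by compute capability.

    /* Illustrative bandwidth kernel (not from the paper): how this loop
     * scales when run on more cores of one chip exposes the behavior of
     * the shared cache and memory subsystem. */
    #include <stddef.h>

    void triad(size_t n, double a, const double *x, const double *y, double *z)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; ++i)
            z[i] = x[i] + a * y[i];
    }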


Archive | 2013

Euro-Par 2013 Parallel Processing

Felix Wolf; Bernd Mohr; Dieter an Mey

For a long period in the development of computers and computing, efficient applications were characterized only by computational and memory complexity or, in more practical terms, by elapsed computing time and required main memory capacity. The history of Euro-Par and its predecessor organizations stands for research on the development of ever more powerful computer architectures that shorten the compute time both by faster clocking and by parallel execution, as well as the development of algorithms that can exploit these parallel architectural features. The success of enhancing architectures and algorithms is best described by exponential curves for the peak computing power of architectures and the efficiency of algorithms. As microprocessor parts get more and more power hungry and electricity gets more and more expensive, "energy to solution" is a new optimization criterion for large applications. This calls for energy-aware solutions. Components of energy-aware computing: in order to reduce the power used to run an application, four components have to be optimized; three of them relate to the computer system and the programs to be executed, one relates to the infrastructure of the computer system.
– Energy-aware infrastructure: this component relates to the fact that computers need climate control, cooling, uninterruptible power supply, buildings with lighting, heating, and additional infrastructure components that consume power. Examples of measures to reduce energy are the use of liquid cooling, direct cooling, free cooling, waste-heat reuse, adsorption machines, monitoring and optimization of energy consumption and infrastructure control, and coupling of infrastructure power requirements with the behavior of computers and the application execution.
– Energy-aware system hardware: this component describes all mechanisms in new hardware to reduce power in the system itself: sleep modes of inactive parts, clock control of the parts, fine-grain hardware monitoring of …


International Workshop on OpenMP | 2012

Performance analysis techniques for task-based OpenMP applications

Dirk Schmidl; Peter Philippen; Daniel Lorenz; Christian Rössel; Markus Geimer; Dieter an Mey; Bernd Mohr; Felix Wolf

Version 3.0 of the OpenMP specification introduced the task construct for the explicit expression of dynamic task parallelism. Although automated load-balancing capabilities make it an attractive parallelization approach for programmers, the difficulty of integrating this new dimension of parallelism into traditional models of performance data has so far prevented the emergence of appropriate performance tools. Based on our earlier work, where we have introduced instrumentation for task-based programs, we present initial concepts for analyzing the data delivered by this instrumentation. We define three typical performance problems related to tasking and show how they can be visually explored using event traces. Special emphasis is placed on the event model used to capture the execution of task instances and on how the time consumed by the program is mapped onto tasks in the most meaningful way. We illustrate our approach with practical examples.
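For context, a minimal (illustrative) task-parallel C fragment of the kind such instrumentation records: each task construct creates task instances that appear as events in the trace, so that execution time can be attributed to individual tasks.

    /* Minimal tasking example: recursive tree traversal. Every generated
     * task instance becomes an event in the trace; taskwait marks where
     * the parent must wait for its children. */
    typedef struct node { struct node *left, *right; } node_t;

    void traverse(node_t *n)
    {
        if (n == NULL) return;
        #pragma omp task
        traverse(n->left);
        #pragma omp task
        traverse(n->right);
        /* ... per-node work ... */
        #pragma omp taskwait
    }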


International Workshop on OpenMP | 2012

Assessing OpenMP tasking implementations on NUMA architectures

Christian Terboven; Dirk Schmidl; Tim Cramer; Dieter an Mey

The introduction of task-level parallelization promises to raise the level of abstraction compared to the thread-centric expression of parallelism. However, tasks might exhibit poor performance on NUMA systems if locality cannot be maintained. In contrast to traditional OpenMP worksharing constructs, for which threads can be bound, the behavior of tasks is much less predetermined by the OpenMP specification, and implementations have a high degree of freedom in task scheduling. Employing different approaches to express task parallelism, namely the single-producer and parallel-producer patterns with different data initialization strategies, we compare the behavior and quality of OpenMP implementations with task-parallel codes on NUMA architectures. For the programmer, we propose recipes for expressing parallelism with tasks that preserve data locality while optimizing the degree of parallelism. Our proposals are evaluated on reasonably large NUMA systems with both important application kernels and a real-world simulation code.
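A hedged sketch of one of the patterns named above, a parallel producer combined with first-touch data initialization, follows; the worker routine is a placeholder. The idea is that each thread places the memory pages it will later work on by touching them first, and then creates the tasks that operate on that data, so a locality-aware scheduler can keep tasks near their data.

    /* Parallel-producer pattern with first-touch initialization (illustrative).
     * process_chunk is a placeholder worker routine. */
    #include <stddef.h>

    void process_chunk(double *chunk, size_t len);

    void run(size_t n, size_t chunk, double *data)
    {
        #pragma omp parallel
        {
            /* every thread produces tasks for the chunks it touched first,
             * instead of a single producer thread creating all tasks */
            #pragma omp for schedule(static) nowait
            for (size_t i = 0; i < n; i += chunk) {
                size_t len = (i + chunk <= n) ? chunk : n - i;
                for (size_t j = 0; j < len; ++j)
                    data[i + j] = 0.0;        /* first touch places the pages */
                #pragma omp task
                process_chunk(&data[i], len);
            }
        }   /* remaining tasks complete at the implicit barrier */
    }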


Computer Science - Research and Development | 2012

Brainware for green HPC

Christian H. Bischof; Dieter an Mey; Christian Iwainsky

The reduction of the infrastructural costs of HPC, in particular power consumption, is currently driven mainly by architectural advances in hardware. Recently, in the quest for the EFlop/s, hardware-software codesign has been advocated, owing to the realization that without some software support only heroic programmers could use high-end HPC machines. However, in the topically diverse world of universities, the EFlop/s is still very far off for most users, and yet their computational demands will shape the HPC landscape for the foreseeable future. Based on experiences made at RWTH Aachen University and in the context of the distributed Computational Science and Engineering support of the UK HECToR program, we argue on economic grounds that HPC hardware and software installations need to be complemented by a “brainware” component, i.e., trained HPC specialists supporting performance optimization of users’ codes. This statement itself is not new, and the establishment of simulation labs at HPC centers echoes this fact. However, based on our experiences, we quantify the savings resulting from brainware, thus providing an economic argument that sufficient brainware must be an integral part of any “green” HPC installation. It also follows that the current HPC funding regimes, which favor iron over staff, are fundamentally flawed, and that long-term efficient HPC deployment must emphasize brainware development to a much greater extent.

Collaboration


Dive into Dieter an Mey's collaboration.

Top Co-Authors

Christian H. Bischof, Technische Universität Darmstadt
Matthias S. Müller, Dresden University of Technology
Felix Wolf, Technische Universität Darmstadt
Bernd Mohr, Forschungszentrum Jülich