
Publication


Featured research published by Dirk Schmidl.


Parallel Tools Workshop | 2012

Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir

Andreas Knüpfer; Christian Rössel; Dieter an Mey; Scott Biersdorff; Kai Diethelm; Dominic Eschweiler; Markus Geimer; Michael Gerndt; Daniel Lorenz; Allen D. Malony; Wolfgang E. Nagel; Yury Oleynik; Peter Philippen; Pavel Saviankou; Dirk Schmidl; Sameer Shende; Ronny Tschüter; Michael Wagner; Bert Wesarg; Felix Wolf

This paper gives an overview of the Score-P performance measurement infrastructure, which is being jointly developed by leading HPC performance-tool groups. It motivates the advantages of the joint undertaking from both the developer and the user perspective, presents the design and components of the newly developed Score-P performance measurement infrastructure, and contains first evaluation results in comparison with existing performance tools as well as an outlook on the long-term cooperative development of the new system.


International Conference on Parallel Processing | 2013

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

Dirk Schmidl; Tim Cramer; Sandra Wienke; Christian Terboven; Matthias S. Müller

The Intel Xeon Phi has been introduced as a new type of compute accelerator that is capable of executing native x86 applications. It supports programming models that are well-established in the HPC community, namely MPI and OpenMP, thus removing the necessity to refactor codes for accelerator-specific programming paradigms. Because of its native x86 support, the Xeon Phi may also be used stand-alone, meaning codes can be executed directly on the device without the need for interaction with a host. In this sense, the Xeon Phi resembles a big SMP on a chip if its 240 logical cores are compared to a common Xeon-based compute node offering up to 32 logical cores. In this work, we compare a Xeon-based two-socket compute node with the Xeon Phi stand-alone in scalability and performance using OpenMP codes. Considering both as individual SMP systems, they come at a very similar price and power envelope, but our results show significant differences in absolute application performance and scalability. We also show to what extent common programming idioms for the Xeon multi-core architecture are applicable to the Xeon Phi many-core architecture, and which challenges the changed ratio of core count to single-core performance poses for the application programmer.


Archive | 2011

Score-P: A Unified Performance Measurement System for Petascale Applications

Dieter an Mey; Scott Biersdorf; Christian H. Bischof; Kai Diethelm; Dominic Eschweiler; Michael Gerndt; Andreas Knüpfer; Daniel Lorenz; Allen D. Malony; Wolfgang E. Nagel; Yury Oleynik; Christian Rössel; Pavel Saviankou; Dirk Schmidl; Sameer Shende; Michael Wagner; Bert Wesarg; Felix Wolf

The rapidly growing number of cores on modern supercomputers imposes scalability demands not only on applications but also on the software tools needed for their development. At the same time, increasing application and system complexity makes the optimization of parallel codes more difficult, creating a need for scalable performance-analysis technology with advanced functionality. However, delivering such technology can hardly be accomplished by single tool developers and requires a higher degree of collaboration within the HPC community. The unified performance-measurement system Score-P is a joint effort of several academic performance-tool builders, funded under the BMBF program "HPC-Software für skalierbare Parallelrechner" in the SILC project (Skalierbare Infrastruktur zur automatischen Leistungsanalyse paralleler Codes, i.e. scalable infrastructure for the automatic performance analysis of parallel codes). It is being developed with the objective of creating a common basis for several complementary optimization tools in the service of enhanced scalability, improved interoperability, and reduced maintenance cost.


International Workshop on OpenMP | 2012

Performance analysis techniques for task-based OpenMP applications

Dirk Schmidl; Peter Philippen; Daniel Lorenz; Christian Rössel; Markus Geimer; Dieter an Mey; Bernd Mohr; Felix Wolf

Version 3.0 of the OpenMP specification introduced the task construct for the explicit expression of dynamic task parallelism. Although automated load-balancing capabilities make it an attractive parallelization approach for programmers, the difficulty of integrating this new dimension of parallelism into traditional models of performance data has so far prevented the emergence of appropriate performance tools. Based on our earlier work, where we have introduced instrumentation for task-based programs, we present initial concepts for analyzing the data delivered by this instrumentation. We define three typical performance problems related to tasking and show how they can be visually explored using event traces. Special emphasis is placed on the event model used to capture the execution of task instances and on how the time consumed by the program is mapped onto tasks in the most meaningful way. We illustrate our approach with practical examples.


International Workshop on OpenMP | 2010

How to reconcile event-based performance analysis with tasking in OpenMP

Daniel Lorenz; Bernd Mohr; Christian Rössel; Dirk Schmidl; Felix Wolf

With version 3.0, the OpenMP specification introduced a task construct and with it an additional dimension of concurrency. While offering a convenient means to express task parallelism, the new construct presents a serious challenge to event-based performance analysis. Since tasking may disrupt the classic sequence of region entry and exit events, essential analysis procedures such as reconstructing dynamic call paths or correctly attributing performance metrics to individual task region instances may become impossible. To overcome this limitation, we describe a portable method to distinguish individual task instances and to track their suspension and resumption with event-based instrumentation. Implemented as an extension of the OPARI source-code instrumenter, our portable solution supports C/C++ programs with tied tasks and with untied tasks that are suspended only at implied scheduling points, while introducing only negligible measurement overhead. Finally, we discuss possible extensions of the OpenMP specification to provide general support for task identifiers with untied tasks.


International Conference on Parallel Processing | 2012

Profiling of OpenMP Tasks with Score-P

Daniel Lorenz; Peter Philippen; Dirk Schmidl; Felix Wolf

With the task construct, the OpenMP 3.0 specification introduces an additional level of parallelism that challenges established schemes of performance profiling. First, a thread may execute a sequence of interleaved task fragments, which the profiling system must properly distinguish to enable correct performance analyses. Furthermore, the additional parallelization dimension requires new visualization methods for presenting analysis results. Finally, as a new programming paradigm, tasking implicitly introduces paradigm-specific performance issues and creates a need for corresponding optimization strategies. This paper presents solutions to overcome the challenges of profiling applications based on OpenMP tasks and describes metrics that may help uncover performance problems related to tasking. We present an implementation of our solution within the Score-P performance measurement system, which we evaluate using the Barcelona OpenMP Task Suite.


International Workshop on OpenMP | 2012

Assessing OpenMP tasking implementations on NUMA architectures

Christian Terboven; Dirk Schmidl; Tim Cramer; Dieter an Mey

The introduction of task-level parallelization promises to raise the level of abstraction compared to the thread-centric expression of parallelism. However, tasks might exhibit poor performance on NUMA systems if locality cannot be maintained. In contrast to traditional OpenMP worksharing constructs, for which threads can be bound, the behavior of tasks is much less predetermined by the OpenMP specification, and implementations have a high degree of freedom in task scheduling. Employing different approaches to express task parallelism, namely the single-producer and parallel-producer patterns with different data initialization strategies, we compare the behavior and quality of OpenMP implementations with task-parallel codes on NUMA architectures. For the programmer, we propose recipes to express parallelism with tasks that preserve data locality while optimizing the degree of parallelism. Our proposals are evaluated on reasonably large NUMA systems with both important application kernels and a real-world simulation code.


International Workshop on OpenMP | 2008

First Experiences with Intel Cluster OpenMP

Christian Terboven; Dieter an Mey; Dirk Schmidl; Marcus Wagner

MPI and OpenMP are the de-facto standards for distributed-memory and shared-memory parallelization, respectively. By employing a hybrid approach, that is, combining OpenMP and MPI parallelization in one program, a cluster of SMP systems can be exploited. Nevertheless, mixing programming paradigms and writing explicit message-passing code might increase the parallel program development time significantly. Intel Cluster OpenMP is the first commercially available OpenMP implementation for a cluster, aiming to combine the ease of use of the OpenMP parallelization paradigm with the cost efficiency of a commodity cluster. In this paper we present our first experiences with Intel Cluster OpenMP.


International Conference on Cluster Computing | 2010

How to Scale Nested OpenMP Applications on the ScaleMP vSMP Architecture

Dirk Schmidl; Christian Terboven; Andreas Wolf; Dieter an Mey; Christian H. Bischof

The novel ScaleMP vSMP architecture employs commodity x86-based servers with an InfiniBand network to assemble a large shared-memory system at an attractive price point. We examine this combined hardware and software approach of a DSM system using both system-level kernel benchmarks and real-world application codes. We compare this architecture with traditional shared-memory machines and elaborate on strategies to tune application codes parallelized with OpenMP on multiple levels. Finally, we summarize the necessary conditions a scalable application has to fulfill in order to profit from the full potential of the ScaleMP approach.


International Workshop on OpenMP | 2010

Binding nested OpenMP programs on hierarchical memory architectures

Dirk Schmidl; Christian Terboven; Dieter an Mey; H. Martin Bücker

In this work we discuss the performance problems of nested OpenMP programs concerning thread and data locality, particularly on cc-NUMA architectures. We provide a user-friendly solution and demonstrate its benefits by comparing the performance of kernel benchmarks and real-world applications with and without our affinity optimizations.
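The paper's own binding mechanism predates the standardized OpenMP affinity controls, but the idea can be sketched with today's environment variables (OMP_PLACES and OMP_PROC_BIND are OpenMP 4.0 controls, newer than the paper; the values and the application name are illustrative):

```shell
# Nested parallelism on a hierarchical-memory machine: outer teams spread
# across NUMA nodes, inner teams packed onto nearby cores (values illustrative).
export OMP_NESTED=true             # allow inner parallel regions
export OMP_NUM_THREADS=4,8         # 4 outer threads, 8 per inner team
export OMP_PLACES=cores            # one place per physical core
export OMP_PROC_BIND=spread,close  # outer level: spread; inner level: close
# ./nested_app                     # hypothetical application binary
```

With this configuration each outer thread lands on a different NUMA node and its inner team stays on cores local to that node, which is the thread-and-data locality effect the paper optimizes for.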

Collaboration


Dive into Dirk Schmidl's collaborations.

Top Co-Authors

Christian H. Bischof (Technische Universität Darmstadt)
Daniel Lorenz (Forschungszentrum Jülich)
Felix Wolf (Technische Universität Darmstadt)
Tim Cramer (RWTH Aachen University)
Bo Wang (RWTH Aachen University)
Peter Philippen (Forschungszentrum Jülich)