Publication


Featured research published by Marc-André Hermanns.


Parallel Tools Workshop | 2010

Recent Developments in the Scalasca Toolset

Markus Geimer; Felix Wolf; Brian J. N. Wylie; Daniel Becker; David Böhme; Wolfgang Frings; Marc-André Hermanns; Bernd Mohr; Zoltán Szebenyi

The number of processor cores on modern supercomputers is increasing from generation to generation, and as a consequence HPC applications are required to harness much higher degrees of parallelism to satisfy their growing demand for computing power. However, writing code that runs efficiently on large processor configurations remains a significant challenge. The situation is exacerbated by the rising number of cores imposing scalability demands not only on applications but also on the software tools needed for their development.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Scalable detection of MPI-2 remote memory access inefficiency patterns

Marc-André Hermanns; Markus Geimer; Bernd Mohr; Felix Wolf

Wait states in parallel applications can be identified by scanning event traces for characteristic patterns. In our earlier work we defined such inefficiency patterns for MPI-2 one-sided communication, although still based on a serial trace-analysis scheme with limited scalability. In this article we show how wait states in one-sided communications can be detected in a more scalable fashion by taking advantage of a new scalable trace-analysis approach based on a parallel replay, which was originally developed for MPI-1 point-to-point and collective communication. Moreover, we demonstrate the scalability of our method and its usefulness for the optimization cycle with applications running on up to 32,768 cores.
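The core idea, scanning event traces for characteristic inefficiency patterns by comparing timestamps across processes, can be sketched in a few lines. The following is a toy illustration, not Scalasca's actual implementation: it detects a "Late Post"-style wait state in MPI-2 general active target synchronization, where the origin enters its access epoch before the target has posted its exposure epoch. The event names and trace layout are assumptions made for the example.

```python
# Hypothetical sketch: detect a "Late Post"-style wait state by comparing
# timestamps of matching events from two process-local traces.
# Event kinds and the trace layout are illustrative, not Scalasca's format.

from dataclasses import dataclass

@dataclass
class Event:
    rank: int       # process that recorded the event
    time: float     # timestamp in seconds
    kind: str       # e.g. "start_access", "post_exposure"

def late_post_wait(origin_events, target_events):
    """Waiting time incurred when the origin enters its access epoch
    (cf. MPI_Win_start) before the target has posted its exposure epoch
    (cf. MPI_Win_post): wait = post_time - start_time, if positive."""
    start = next(e for e in origin_events if e.kind == "start_access")
    post = next(e for e in target_events if e.kind == "post_exposure")
    return max(0.0, post.time - start.time)

origin = [Event(rank=0, time=1.0, kind="start_access")]
target = [Event(rank=1, time=3.5, kind="post_exposure")]
print(late_post_wait(origin, target))  # 2.5 seconds spent waiting
```

The scalable variant described in the paper performs this timestamp comparison during a parallel replay of the traces rather than in a serial post-processing step.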


Proceedings of the 20th European MPI Users' Group Meeting | 2013

Understanding the formation of wait states in applications with one-sided communication

Marc-André Hermanns; Manfred Miklosch; David Böhme; Felix Wolf

To better understand the formation of wait states in MPI programs and to support the user in finding optimization targets in the case of load imbalance, a major source of wait states, we added two new trace-analysis techniques to Scalasca, a performance-analysis tool designed for large-scale applications, in our earlier work. In this paper, we show how the two techniques, which were originally restricted to two-sided and collective MPI communication, are extended to also cover one-sided communication. We demonstrate our experiences with benchmark programs and a mini-application representing the core of the POP ocean model.


European Conference on Parallel Processing | 2005

Event-Based Measurement and Analysis of One-Sided Communication

Marc-André Hermanns; Bernd Mohr; Felix Wolf

To analyze the correctness and the performance of a program, information about the dynamic behavior of all participating processes is needed. The dynamic behavior can be modeled as a stream of events, annotated with appropriate attributes, for later analysis. Based on this idea, KOJAK, a trace-based toolkit for performance analysis, records and analyzes the activities of MPI-1 point-to-point and collective communication. To support remote-memory access (RMA) hardware in a portable way, MPI-2 introduced a standardized interface for remote memory access. However, potential performance gains come at the expense of more complex semantics. From a programmer's point of view, an MPI-2 data transfer is only completed after a sequence of communication and associated synchronization calls. This paper describes the integration of performance measurement and analysis methods for RMA communication into the KOJAK toolkit. Special emphasis is put on the underlying event model used to represent the dynamic behavior of MPI-2 RMA operations. We show that our model reflects the relationships between communication and synchronization more accurately than existing models. In addition, the model is general enough to also cover alternative but simpler RMA interfaces, such as SHMEM and Co-Array Fortran.
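The central point of the event model, that a one-sided transfer is only semantically complete at its closing synchronization call, can be illustrated with a small sketch. The event names below are hypothetical and greatly simplified relative to the KOJAK model; they merely show how completion is deferred to the next synchronization point.

```python
# Illustrative sketch of an event model for one-sided transfers: a put is
# recorded as a transfer event at the origin, but it is only guaranteed
# complete at the next closing synchronization (cf. MPI_Win_fence).
# Event names are hypothetical, not KOJAK's actual record types.

def completion_times(events):
    """events: list of (time, kind) tuples in timestamp order.
    Returns a dict mapping each 'put' timestamp to the timestamp of the
    next 'fence' event, which is when the transfer is known complete."""
    done = {}
    pending = []
    for time, kind in events:
        if kind == "put":
            pending.append(time)
        elif kind == "fence":
            for t in pending:
                done[t] = time
            pending.clear()
    return done

trace = [(1.0, "put"), (1.2, "put"), (2.0, "fence"), (3.0, "put"), (4.0, "fence")]
print(completion_times(trace))  # {1.0: 2.0, 1.2: 2.0, 3.0: 4.0}
```

Linking transfer events to their completing synchronization events in this way is what lets a later analysis reason about when data actually became visible at the target.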


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2009

Scalable Detection of MPI-2 Remote Memory Access Inefficiency Patterns

Marc-André Hermanns; Markus Geimer; Bernd Mohr; Felix Wolf

Wait states in parallel applications can be identified by scanning event traces for characteristic patterns. In our earlier work, we have defined such patterns for MPI-2 one-sided communication, although still based on a trace-analysis scheme with limited scalability. Taking advantage of a new scalable trace-analysis approach based on a parallel replay, which was originally developed for MPI-1 point-to-point and collective communication, we show how wait states in one-sided communications can be detected in a more scalable fashion. We demonstrate the scalability of our method and its usefulness for the optimization cycle with applications running on up to 8,192 cores.


Parallel Computing | 2013

A scalable infrastructure for the performance analysis of passive target synchronization

Marc-André Hermanns; Sriram Krishnamoorthy; Felix Wolf

Partitioned global address space (PGAS) languages combine the convenient abstraction of shared memory with the notion of affinity, extending multi-threaded programming to large-scale systems with physically distributed memory. However, in spite of their obvious advantages, PGAS languages still lack appropriate tool support for performance analysis, one of the reasons why their adoption is still in its infancy. Some of the performance problems for which tool support is needed occur at the level of the underlying one-sided communication substrate, such as the Aggregate Remote Memory Copy Interface (ARMCI). One such example is the waiting time in situations where asynchronous data transfers cannot be completed without software intervention at the target side. This is not uncommon on systems with reduced operating-system kernels such as IBM Blue Gene/P where the use of progress threads would double the number of cores necessary to run an application. In this paper, we present an extension of the Scalasca trace-analysis infrastructure aimed at the identification and quantification of progress-related waiting times at larger scales. We demonstrate its utility and scalability using a benchmark running with up to 32,768 processes.
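The progress-related waiting time described here can be made concrete with a small model: without a progress thread, the target only services an incoming one-sided request the next time it enters the communication library, and the origin waits for exactly that interval. The function below is a simplified stand-in for the analysis, under the assumption that we know the request arrival time and the target's library-entry timestamps.

```python
# Hypothetical sketch: quantifying progress-related waiting time under
# passive-target synchronization. Without a progress thread, the target can
# only service an incoming one-sided request the next time it enters the
# communication library; the origin waits for that interval.

import bisect

def progress_wait(request_arrival, target_library_calls):
    """request_arrival: timestamp at which a request reaches the target.
    target_library_calls: sorted timestamps at which the target enters the
    library (and can make progress). Returns the origin's waiting time,
    or None if the request is never serviced in the recorded window."""
    i = bisect.bisect_left(target_library_calls, request_arrival)
    if i == len(target_library_calls):
        return None
    return target_library_calls[i] - request_arrival

calls = [0.5, 2.0, 7.0]
print(progress_wait(1.0, calls))  # 1.0: the request waits until the t=2.0 call
```

Summing such intervals over all requests is one way to attribute waiting time to missing target-side progress, which is the quantity the extended Scalasca infrastructure measures at scale.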


International Conference on Parallel Processing | 2006

Specification of inefficiency patterns for MPI-2 one-sided communication

Andrej Kühnal; Marc-André Hermanns; Bernd Mohr; Felix Wolf

Automatic performance analysis of parallel programs can be accomplished by scanning event traces of program execution for patterns representing inefficient behavior. The temporal and spatial relationships between individual runtime events recorded in the event trace allow the recognition of wait states as a result of suboptimal parallel interaction. In our earlier work [1], we have shown how patterns related to MPI point-to-point and collective communication can be easily specified using common abstractions that represent execution-state information and links between related events. In this article, we present new abstractions targeting remote memory access (also referred to as one-sided communication) as defined in the MPI-2 standard. We also describe how the general structure of these abstractions differs from our earlier work to accommodate the more complicated sequence of data-transfer and synchronization operations required for this type of communication. To demonstrate the benefits of our methodology, we specify typical performance properties related to one-sided communication.


Proceedings of the 2nd Workshop on Visual Performance Analysis | 2015

Separating the wheat from the chaff: identifying relevant and similar performance data with visual analytics

Laura von Rüden; Marc-André Hermanns; Michael Behrisch; Daniel A. Keim; Bernd Mohr; Felix Wolf

Performance-analysis tools are indispensable for understanding and optimizing the behavior of parallel programs running on increasingly powerful supercomputers. However, with size and complexity of hardware and software on the rise, performance data sets are becoming so voluminous that their analysis poses serious challenges. In particular, the search space that must be traversed and the number of individual performance views that must be explored to identify phenomena of interest becomes too large. To mitigate this problem, we use visual analytics. Specifically, we accelerate the analysis of performance profiles by automatically identifying (1) relevant and (2) similar data subsets and their performance views. We focus on views of the virtual-process topology, showing that their relevance can be well captured with visual-quality metrics and that they can be further assigned to topical groups according to their visual features. A case study demonstrates that our approach helps reduce the search space by up to 80%.
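The idea of ranking performance views by a visual-quality metric can be sketched with a deliberately simple stand-in: score each 2D view of the virtual-process topology by how non-uniform its values are, so that views with visible structure (e.g. load imbalance) rank above flat, uninteresting ones. The variance-based metric below is an assumption made for illustration, not the authors' actual visual-quality metric.

```python
# Illustrative sketch of scoring performance views for relevance. As a
# stand-in for the paper's visual-quality metrics, this ranks 2D views of a
# virtual-process topology by how non-uniform they are (population variance).
# The metric choice is an assumption, not the authors' actual method.

def relevance(view):
    """view: 2D list of per-process metric values. Uniform views score 0;
    views with pronounced structure score higher."""
    values = [v for row in view for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

flat = [[1.0, 1.0], [1.0, 1.0]]          # nothing to see here
imbalanced = [[1.0, 1.0], [1.0, 9.0]]    # one overloaded process stands out
views = {"flat": flat, "imbalanced": imbalanced}
ranked = sorted(views, key=lambda name: relevance(views[name]), reverse=True)
print(ranked)  # ['imbalanced', 'flat']
```

Automatically discarding low-scoring views is what shrinks the search space the analyst has to traverse; the paper additionally groups the remaining views by visual similarity.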


Proceedings of the 21st European MPI Users' Group Meeting | 2014

Catching Idlers with Ease: A Lightweight Wait-State Profiler for MPI Programs

Guoyong Mao; David Böhme; Marc-André Hermanns; Markus Geimer; Daniel Lorenz; Felix Wolf

Load imbalance usually introduces wait states into the execution of parallel programs. Being able to identify and quantify wait states is therefore essential for the diagnosis and remediation of this phenomenon. An established method of detecting wait states is to generate event traces and compare relevant timestamps across process boundaries. However, large trace volumes usually prevent the analysis of longer execution periods. In this paper, we present an extremely lightweight wait-state profiler that does not rely on traces and can therefore be used to estimate wait states in MPI codes with arbitrarily long runtimes. The profiler combines scalability with portability and low overhead.
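The accounting pattern behind a trace-free profiler can be sketched as follows: instead of logging every event, a wrapper around each blocking synchronization call folds its duration into a running summary, so memory stays constant no matter how long the program runs. This is a minimal stand-in, assuming a hypothetical wrapper layer; the actual profiler additionally distinguishes genuine waiting from useful work inside the call.

```python
# Hypothetical sketch of trace-free wait-state profiling: a wrapper around
# each blocking call accumulates time into a running summary instead of
# recording events, keeping memory constant for arbitrarily long runs.

import time

class WaitProfiler:
    def __init__(self):
        self.total_wait = 0.0  # accumulated time spent in blocking calls
        self.calls = 0         # number of wrapped calls observed

    def timed(self, blocking_call):
        """Wrap a blocking (synchronization) call and record its duration."""
        def wrapper(*args, **kwargs):
            t0 = time.perf_counter()
            result = blocking_call(*args, **kwargs)
            self.total_wait += time.perf_counter() - t0
            self.calls += 1
            return result
        return wrapper

profiler = WaitProfiler()
barrier = profiler.timed(lambda: time.sleep(0.01))  # stand-in for MPI_Barrier
for _ in range(3):
    barrier()
print(profiler.calls)  # 3
```

After the loop, `profiler.total_wait` holds at least the 30 ms spent blocked, in two counters rather than thousands of trace records, which is what makes the approach viable for arbitrarily long runtimes.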


Parallel Tools Workshop | 2013

Generic Support for Remote Memory Access Operations in Score-P and OTF2

Andreas Knüpfer; Robert Dietrich; Jens Doleschal; Markus Geimer; Marc-André Hermanns; Christian Rössel; Ronny Tschüter; Bert Wesarg; Felix Wolf

Remote memory access (RMA) describes the ability of a process to access all or parts of the memory belonging to a remote process directly, without explicit participation of the remote side. There are a number of parallel programming models based on RMA operations that are relevant for High Performance Computing (HPC). On the one hand, Partitioned Global Address Space (PGAS) language extensions use RMA operations as their underlying communication substrate, e.g., Co-Array Fortran and UPC. On the other hand, RMA programming APIs provide so-called one-sided data-transfer primitives as an alternative to classic two-sided message passing. In this paper, we describe how Score-P, a scalable performance measurement infrastructure for parallel applications, is extended to support trace-based performance analyses of RMA parallelization models. Emphasis is given to the generic event model we designed to record RMA operations in the OTF2 trace format across a range of one-sided APIs and libraries.

Collaboration


Dive into Marc-André Hermanns's collaborations.

Top Co-Authors

Felix Wolf
Technische Universität Darmstadt

Bernd Mohr
Forschungszentrum Jülich

Markus Geimer
Forschungszentrum Jülich

David Böhme
Lawrence Livermore National Laboratory

Andrej Kühnal
Forschungszentrum Jülich