Markus Geimer
Forschungszentrum Jülich
Publications
Featured research published by Markus Geimer.
Parallel Tools Workshop | 2012
Andreas Knüpfer; Christian Rössel; Dieter an Mey; Scott Biersdorff; Kai Diethelm; Dominic Eschweiler; Markus Geimer; Michael Gerndt; Daniel Lorenz; Allen D. Malony; Wolfgang E. Nagel; Yury Oleynik; Peter Philippen; Pavel Saviankou; Dirk Schmidl; Sameer Shende; Ronny Tschüter; Michael Wagner; Bert Wesarg; Felix Wolf
This paper gives an overview of the Score-P performance measurement infrastructure, which is being jointly developed by leading HPC performance tools groups. It motivates the advantages of the joint undertaking from both the developer and the user perspective, and presents the design and components of the newly developed Score-P performance measurement infrastructure. Furthermore, it contains first evaluation results in comparison with existing performance tools and presents an outlook on the long-term cooperative development of the new system.
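For readers unfamiliar with Score-P, the following minimal sketch shows how a code region can be annotated with Score-P's user-level instrumentation API. The macro names follow the Score-P user manual; the region name and surrounding code are illustrative only.

```c
/* Minimal sketch of Score-P user instrumentation. Build with the Score-P
 * compiler wrapper, e.g. "scorep --user mpicc ...", which enables the
 * user API and links the measurement system. */
#include <scorep/SCOREP_User.h>
#include <stdio.h>

void solver_step(void)
{
    SCOREP_USER_REGION_DEFINE( solve_handle )
    SCOREP_USER_REGION_BEGIN( solve_handle, "solver_step",
                              SCOREP_USER_REGION_TYPE_COMMON )

    /* ... computation to be measured (placeholder) ... */
    printf("one solver iteration\n");

    SCOREP_USER_REGION_END( solve_handle )
}
```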
Parallel Computing | 2009
Markus Geimer; Felix Wolf; Brian J. N. Wylie; Bernd Mohr
When scaling message-passing applications to thousands of processors, their performance is often affected by wait states that occur when processes fail to reach synchronization points simultaneously. As a first step in reducing the performance impact, we have shown in our earlier work that wait states can be diagnosed by searching event traces for characteristic patterns. However, our initial sequential search method did not scale beyond several hundred processes. Here, we present a scalable approach, based on a parallel replay of the target application's communication behavior, that can efficiently identify wait states at the previously inaccessible scale of 65,536 processes and that has potential for even larger configurations. We explain how our new approach has been integrated into a comprehensive parallel tool architecture, which we use to demonstrate that wait states may consume a major fraction of the execution time at larger scales.
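The replay idea can be illustrated with a minimal sketch: each process re-enacts its own trace, and the original timestamps travel inside the replayed messages, so a late sender is detected directly at the receiver. The event record and toy traces below are hypothetical simplifications, not Scalasca's actual data structures.

```c
/* Sketch of wait-state detection via parallel trace replay
 * (late-sender pattern). Run with two ranks: mpirun -np 2 ./replay */
#include <mpi.h>
#include <stdio.h>

typedef enum { EV_SEND, EV_RECV } EventType;

typedef struct {
    EventType type;
    int       peer;     /* communication partner            */
    int       tag;
    double    t_enter;  /* original timestamp of the event  */
} Event;

static double replay(const Event *trace, int n)
{
    double waiting_time = 0.0;
    for (int i = 0; i < n; ++i) {
        const Event *ev = &trace[i];
        if (ev->type == EV_SEND) {
            /* Re-send the message, carrying the original send timestamp. */
            MPI_Send(&ev->t_enter, 1, MPI_DOUBLE, ev->peer, ev->tag,
                     MPI_COMM_WORLD);
        } else {
            double t_send;
            MPI_Recv(&t_send, 1, MPI_DOUBLE, ev->peer, ev->tag,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* Late sender: the receive was posted before the send began. */
            if (t_send > ev->t_enter)
                waiting_time += t_send - ev->t_enter;
        }
    }
    return waiting_time;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two-rank toy trace: rank 0 sends at t=5.0, rank 1 posts the
     * receive at t=1.0 and therefore waits 4.0 time units. */
    Event t0[] = { { EV_SEND, 1, 0, 5.0 } };
    Event t1[] = { { EV_RECV, 0, 0, 1.0 } };

    double wait = (rank == 0) ? replay(t0, 1) : replay(t1, 1);
    printf("rank %d: waiting time %.1f\n", rank, wait);

    MPI_Finalize();
    return 0;
}
```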
International Parallel and Distributed Processing Symposium | 2012
David Böhme; Felix Wolf; Bronis R. de Supinski; Martin Schulz; Markus Geimer
The critical path, which describes the longest execution sequence without wait states in a parallel program, identifies the activities that determine the overall program runtime. Combining knowledge of the critical path with traditional parallel profiles, we have defined a set of compact performance indicators that help answer a variety of important performance-analysis questions, such as identifying load imbalance, quantifying the impact of imbalance on runtime, and characterizing resource consumption. By replaying event traces in parallel, we can calculate these performance indicators in a highly scalable way, making them a suitable analysis instrument for massively parallel programs with thousands of processes. Case studies with real-world parallel applications confirm that, in comparison to traditional profiles, our indicators provide enhanced insight into program behavior, especially when evaluating partitioning schemes of MPMD programs.
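As a toy illustration of one such indicator, the sketch below contrasts each call path's time on the critical path with its average time across processes; the data layout and numbers are invented for the example, and the real indicators are computed scalably from replayed event traces.

```c
/* Toy sketch of a critical-path-based imbalance indicator: how much each
 * region's time on the critical path exceeds its average time across
 * processes. Inputs are assumed to come from a prior profiling/replay
 * step; the data here is made up for illustration. */
#include <stdio.h>

#define NPROCS   4
#define NREGIONS 3

int main(void)
{
    const char *region[NREGIONS] = { "compute", "exchange", "reduce" };
    /* cp[r]: time of region r on the critical path */
    double cp[NREGIONS] = { 6.0, 2.5, 0.5 };
    /* t[p][r]: time of region r on process p */
    double t[NPROCS][NREGIONS] = {
        { 6.0, 2.5, 0.5 }, { 4.0, 1.0, 0.5 },
        { 3.5, 1.0, 0.5 }, { 4.5, 1.5, 0.5 },
    };

    for (int r = 0; r < NREGIONS; ++r) {
        double avg = 0.0;
        for (int p = 0; p < NPROCS; ++p)
            avg += t[p][r];
        avg /= NPROCS;
        /* Positive values mark work that delays the whole run. */
        printf("%-10s imbalance impact: %.2f\n", region[r], cp[r] - avg);
    }
    return 0;
}
```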
Parallel Computing | 2010
Markus Geimer; Pavel Saviankou; Alexandre Strube; Zoltán Szebenyi; Felix Wolf; Brian J. N. Wylie
Scalasca is an open-source toolset that can be used to analyze the performance behavior of parallel applications and to identify opportunities for optimization. Target applications include simulation codes from science and engineering based on the parallel programming interfaces MPI and/or OpenMP. Scalasca, which has been specifically designed for use on large-scale machines such as IBM Blue Gene and Cray XT, integrates runtime summaries suitable for obtaining a performance overview with in-depth studies of concurrent behavior via event tracing. Although Scalasca has already been used successfully with codes running on 294,912 cores of a 72-rack Blue Gene/P system, the current software design shows scalability limitations that adversely affect the user experience and that will present a serious obstacle on the way to mastering larger scales in the future. In this paper, we outline how to address the two most important ones, namely the unification of local identifiers at measurement finalization and the collation and display of analysis reports.
2006 IEEE Symposium on Interactive Ray Tracing | 2006
Oliver Abert; Markus Geimer; Stefan Müller
Recently it has been shown that Bézier surfaces can be used as a geometric primitive for interactive ray tracing on a single commodity PC. However, the Bézier representation is restricted, as a large number of control points also implies a high polynomial degree, reducing the frame rate significantly. In this work we present a fast, efficient, and robust algorithm to ray trace trimmed NURBS surfaces of arbitrary degree. Furthermore, our approach is largely independent of the number of control points of a surface with respect to rendering performance. Additionally, neither the degree nor the number of control points of a surface influences the numerical stability of the intersection algorithm. The desired high performance is achieved by a novel approach to surface evaluation that requires only minimal preprocessing. We present a method to transform the computationally expensive Cox-de Boor recursion into a SIMD-suitable form that maximizes performance by avoiding the recursion and drastically reduces the number of executed instructions.
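The core ingredient, evaluating the spline without the Cox-de Boor recursion, can be sketched for a one-dimensional B-spline curve as follows; the paper's implementation additionally vectorizes such loops with SIMD instructions and generalizes them to trimmed surfaces.

```c
/* Iterative de Boor evaluation of a degree-p B-spline curve, avoiding the
 * Cox-de Boor recursion. Scalar sketch for 1-D control points; evaluation
 * of higher-dimensional points proceeds componentwise.
 *   t: knot vector, c: control points, p: degree,
 *   k: knot span index with t[k] <= x < t[k+1]. */
double de_boor(int k, double x, const double *t, const double *c, int p)
{
    double d[16];                      /* scratch buffer, assumes p < 16 */
    for (int j = 0; j <= p; ++j)
        d[j] = c[j + k - p];

    for (int r = 1; r <= p; ++r)
        for (int j = p; j >= r; --j) {
            double alpha = (x - t[j + k - p])
                         / (t[j + 1 + k - r] - t[j + k - p]);
            d[j] = (1.0 - alpha) * d[j - 1] + alpha * d[j];
        }
    return d[p];                       /* point on the curve at x */
}
```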
International Parallel and Distributed Processing Symposium | 2007
Daniel Becker; Felix Wolf; Wolfgang Frings; Markus Geimer; Brian J. N. Wylie; Bernd Mohr
The processing power and memory capacity of independent and heterogeneous parallel machines can be combined to form a single parallel system that is more powerful than any of its constituents. However, achieving satisfactory application performance on such a metacomputer is hard because the high latency of inter-machine communication, as well as differences in the hardware of the constituent machines, may introduce various types of wait states. In our earlier work, we demonstrated that automatic pattern search in event traces can identify the sources of wait states in parallel applications running on a single computer. In this article, we describe how this approach can be extended to metacomputing environments, with special emphasis on performance problems related to inter-machine communication. In addition, we demonstrate the benefits of our solution using a real-world multi-physics application.
International Conference on Parallel Processing | 2010
David Böhme; Markus Geimer; Felix Wolf; Lukas Arnold
Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira Jr. et al., we present a scalable approach that identifies program wait states and attributes their costs in terms of resource waste to their original cause. By replaying event traces in parallel both in forward and backward direction, we can identify the processes and call paths responsible for the most severe imbalances even for runs with tens of thousands of processes.
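A minimal sketch of the backward-replay step, assuming a forward replay has already annotated each receive with its measured waiting time: traversing the trace in reverse swaps the sender and receiver roles, so costs flow against the original message direction back toward their cause. The event record and attribution rule are simplified and hypothetical.

```c
/* Illustrative backward replay for attributing wait-state costs. Each
 * receive returns its waiting time to the matching sender, where it is
 * charged to the call path active at the send. */
#include <mpi.h>

typedef enum { EV_SEND, EV_RECV } EventType;

typedef struct {
    EventType type;
    int       peer, tag;
    double    wait;      /* waiting time from the forward replay      */
    int       callpath;  /* call path active at this event            */
} Event;

void backward_replay(const Event *trace, int n, double *delay_cost)
{
    for (int i = n - 1; i >= 0; --i) {   /* traverse in reverse order */
        const Event *ev = &trace[i];
        if (ev->type == EV_RECV) {
            /* Return the waiting time incurred here to the sender. */
            MPI_Send(&ev->wait, 1, MPI_DOUBLE, ev->peer, ev->tag,
                     MPI_COMM_WORLD);
        } else {
            double cost;
            MPI_Recv(&cost, 1, MPI_DOUBLE, ev->peer, ev->tag,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* Accumulate the cost this message caused downstream. */
            delay_cost[ev->callpath] += cost;
        }
    }
}
```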
Parallel Processing Letters | 2010
Brian J. N. Wylie; Markus Geimer; Bernd Mohr; David Böhme; Zoltán Szebenyi; Felix Wolf
Cray XT and IBM Blue Gene systems represent current alternative approaches to constructing leadership computer systems, relying on applications being able to exploit very large configurations of processor cores; associated analysis tools must scale commensurately to isolate and quantify performance issues that manifest at the largest scales. In studying the scalability of the Scalasca performance analysis toolset to several hundred thousand MPI processes on XT5 and BG/P systems, we investigated a progressive deterioration of the execution performance of the well-known ASCI Sweep3D compact application. Scalasca runtime summarization quantified MPI communication time that correlated with computational imbalance, and automated trace analysis confirmed growing amounts of MPI waiting time. Further instrumentation, measurement, and analysis pinpointed a conditional section of highly imbalanced computation which amplified waiting times inherent in the associated wavefront communication and seriously degraded overall execution efficiency at very large scales. By employing effective data collation, management, and graphical presentation in a portable and straightforward-to-use toolset, Scalasca was thereby able to demonstrate performance measurements and analyses with 294,912 processes.
Parallel Computing | 2006
Markus Geimer; Felix Wolf; Andreas Knüpfer; Bernd Mohr; Brian J. N. Wylie
Automatic trace analysis is an effective method of identifying complex performance phenomena in parallel applications. To simplify the development of complex trace-analysis algorithms, the EARL library interface offers high-level access to individual events contained in a global trace file. However, as the size of parallel systems grows further and the number of processors used by individual applications is continuously raised, the traditional approach of analyzing a single global trace file becomes increasingly constrained by the large number of events. To enable scalable trace analysis, we present a new design of the aforementioned EARL interface that accesses multiple local trace files in parallel while offering means to conveniently exchange events between processes. This article describes the modified view of the trace data as well as related programming abstractions provided by the new PEARL library interface and discusses its application in performance analysis.
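The basic idea, each analysis process loading only its own local trace and exchanging individual events with peers, might look roughly like the sketch below. The file layout, event record, and function names are hypothetical and do not reproduce the actual PEARL API.

```c
/* Illustrative sketch of parallel local trace access with event exchange.
 * Everything here (file naming, record layout, helpers) is invented. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int type; int peer; double time; } Event;

/* Each analysis rank reads only its own portion of the trace. */
Event *load_local_trace(int rank, int *n)
{
    char path[64];
    snprintf(path, sizeof path, "trace.%d", rank);  /* hypothetical layout */
    FILE *f = fopen(path, "rb");
    if (!f) { *n = 0; return NULL; }
    fseek(f, 0, SEEK_END);
    long bytes = ftell(f);
    rewind(f);
    Event *ev = malloc((size_t)bytes);
    *n = (int)fread(ev, sizeof(Event), (size_t)bytes / sizeof(Event), f);
    fclose(f);
    return ev;
}

/* Convenience wrapper for exchanging one event between analysis ranks. */
void exchange_event(const Event *out, int dest, Event *in, int src)
{
    MPI_Sendrecv(out, (int)sizeof(Event), MPI_BYTE, dest, 0,
                 in,  (int)sizeof(Event), MPI_BYTE, src,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```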
International Workshop on OpenMP | 2012
Dirk Schmidl; Peter Philippen; Daniel Lorenz; Christian Rössel; Markus Geimer; Dieter an Mey; Bernd Mohr; Felix Wolf
Version 3.0 of the OpenMP specification introduced the task construct for the explicit expression of dynamic task parallelism. Although automated load-balancing capabilities make it an attractive parallelization approach for programmers, the difficulty of integrating this new dimension of parallelism into traditional models of performance data has so far prevented the emergence of appropriate performance tools. Based on our earlier work, in which we introduced instrumentation for task-based programs, we present initial concepts for analyzing the data delivered by this instrumentation. We define three typical performance problems related to tasking and show how they can be visually explored using event traces. Special emphasis is placed on the event model used to capture the execution of task instances and on how the time consumed by the program is mapped onto tasks in the most meaningful way. We illustrate our approach with practical examples.
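For context, a minimal example of the OpenMP 3.0 task construct whose dynamic behavior such instrumentation has to capture: each recursive call spawns new task instances, which may be suspended, resumed, or executed by different threads, so measured time must be mapped back onto the right task instance.

```c
/* Classic task-parallel Fibonacci: each call spawns two child tasks;
 * the taskwait is a suspension point visible in an event trace. */
#include <stdio.h>
#include <omp.h>

static long fib(long n)
{
    if (n < 2)
        return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait           /* wait for both child task instances */
    return a + b;
}

int main(void)
{
    long result;
    #pragma omp parallel
    #pragma omp single             /* one thread spawns the task tree */
    result = fib(10);
    printf("fib(10) = %ld\n", result);
    return 0;
}
```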