David Böhme
Lawrence Livermore National Laboratory
Publication
Featured research published by David Böhme.
international parallel and distributed processing symposium | 2012
David Böhme; Felix Wolf; Bronis R. de Supinski; Martin Schulz; Markus Geimer
The critical path, which describes the longest execution sequence without wait states in a parallel program, identifies the activities that determine the overall program runtime. Combining knowledge of the critical path with traditional parallel profiles, we have defined a set of compact performance indicators that help answer a variety of important performance-analysis questions, such as identifying load imbalance, quantifying the impact of imbalance on runtime, and characterizing resource consumption. By replaying event traces in parallel, we can calculate these performance indicators in a highly scalable way, making them a suitable analysis instrument for massively parallel programs with thousands of processes. Case studies with real-world parallel applications confirm that, in comparison to traditional profiles, our indicators provide enhanced insight into program behavior, especially when evaluating partitioning schemes of MPMD programs.
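The idea of a critical path can be illustrated with a minimal sketch. It assumes a toy trace of two processes that each run activity A, synchronize, and then run activity B; the node names, durations, and the `critical_path` helper are hypothetical, and the imbalance figure at the end is only an illustrative definition (time on the critical path minus average time per process), not necessarily the indicator defined in the paper. The sketch runs sequentially and does not reproduce the parallel trace replay.

```python
from collections import defaultdict, deque

def critical_path(durations, edges):
    """Longest path through a DAG of activities (hypothetical helper).

    durations: dict mapping node -> execution time
    edges:     list of (predecessor, successor) dependencies
    Returns (path_length, list_of_nodes_on_the_path).
    """
    succs = defaultdict(list)
    indegree = {node: 0 for node in durations}
    for pred, succ in edges:
        succs[pred].append(succ)
        indegree[succ] += 1

    # Process nodes in topological order so each finish time is final
    # before it is propagated to successors.
    ready = deque(n for n in durations if indegree[n] == 0)
    finish = dict(durations)
    best_pred = {n: None for n in durations}
    while ready:
        node = ready.popleft()
        for succ in succs[node]:
            candidate = finish[node] + durations[succ]
            if candidate > finish[succ]:
                finish[succ] = candidate
                best_pred[succ] = node
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    # Walk back from the node that finishes last.
    end = max(finish, key=finish.get)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = best_pred[node]
    path.reverse()
    return finish[end], path

# Toy trace (assumed data): (process, activity) -> seconds.
durations = {("p0", "A"): 4.0, ("p1", "A"): 1.0,
             ("p0", "B"): 2.0, ("p1", "B"): 3.0}
# Program order on each process plus a synchronization between A and B.
edges = [(("p0", "A"), ("p0", "B")), (("p1", "A"), ("p1", "B")),
         (("p0", "A"), ("p1", "B")), (("p1", "A"), ("p0", "B"))]

length, path = critical_path(durations, edges)

# Illustrative indicator: critical-path time minus average per-process time.
on_path = defaultdict(float)
for proc, activity in path:
    on_path[activity] += durations[(proc, activity)]
total = defaultdict(float)
for (proc, activity), seconds in durations.items():
    total[activity] += seconds
num_procs = len({proc for proc, _ in durations})
imbalance = {a: on_path[a] - total[a] / num_procs for a in total}

print(length, path, imbalance)
```

In this toy trace the critical path switches from process p0 to process p1 at the synchronization point, which a per-process profile alone would not reveal.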
international conference on parallel processing | 2010
David Böhme; Markus Geimer; Felix Wolf; Lukas Arnold
Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira Jr. et al., we present a scalable approach that identifies program wait states and attributes their costs in terms of resource waste to their original cause. By replaying event traces in parallel both in forward and backward direction, we can identify the processes and call paths responsible for the most severe imbalances even for runs with tens of thousands of processes.
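A wait state of the kind described above can be sketched for the simplest case, a receiver that blocks because the matching send starts late. The record layout, the field names, and the `late_sender_waits` helper are assumptions for illustration; the sketch charges each wait directly to the sending process and does not model propagation along longer cause-effect chains or the forward and backward trace replay.

```python
from collections import defaultdict

def late_sender_waits(messages):
    """Compute receiver waiting time per message (hypothetical helper).

    messages: list of dicts with keys sender, receiver,
              send_enter, recv_enter (timestamps in seconds).
    Returns (waits, cost_by_sender).
    """
    waits = {}
    cost_by_sender = defaultdict(float)
    for m in messages:
        # The receiver waits only if it enters the receive before
        # the sender enters the matching send.
        wait = max(0.0, m["send_enter"] - m["recv_enter"])
        waits[(m["sender"], m["receiver"])] = wait
        if wait > 0.0:
            # Simplified attribution: charge the wait to the late sender.
            cost_by_sender[m["sender"]] += wait
    return waits, dict(cost_by_sender)

# Toy trace (assumed data).
messages = [
    {"sender": 0, "receiver": 1, "send_enter": 5.0, "recv_enter": 2.0},
    {"sender": 1, "receiver": 2, "send_enter": 6.0, "recv_enter": 6.5},
]
waits, cost_by_sender = late_sender_waits(messages)
print(waits, cost_by_sender)
```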
Parallel Processing Letters | 2010
Brian J. N. Wylie; Markus Geimer; Bernd Mohr; David Böhme; Zoltán Szebenyi; Felix Wolf
Cray XT and IBM Blue Gene systems represent alternative current approaches to constructing leadership computer systems. Both rely on applications being able to exploit very large configurations of processor cores, and the associated analysis tools must scale commensurately to isolate and quantify performance issues that manifest at the largest scales. In studying the scalability of the Scalasca performance analysis toolset to several hundred thousand MPI processes on XT5 and BG/P systems, we investigated a progressive execution performance deterioration of the well-known ASCI Sweep3D compact application. Scalasca runtime summarization analysis quantified MPI communication time that correlated with computational imbalance, and automated trace analysis confirmed growing amounts of MPI waiting times. Further instrumentation, measurement, and analyses pinpointed a conditional section of highly imbalanced computation that amplified the waiting times inherent in the associated wavefront communication, seriously degrading overall execution efficiency at very large scales. By employing effective data collation, management, and graphical presentation in a portable and straightforward-to-use toolset, Scalasca was thereby able to demonstrate performance measurements and analyses with 294,912 processes.
Computer Physics Communications | 2009
Lukas Osterloh; Carlos Perez; David Böhme; José María Baldasano; Christine Böckmann; Lars Schneidenbach; David Vicente
We present new software for the retrieval of the volume distribution – and thus, other relevant microphysical properties such as the effective radius – of stratospheric and tropospheric aerosols from multiwavelength LIDAR data. We consider the basic equation as a linear ill-posed problem and solve the linear system derived from spline collocation. We consider as well the technical implications of the algorithm implementation. In order to reduce runtime which is incurred by the vast theoretical search space, experiments on the MareNostrum Supercomputer were made to understand the significance of the different search space dimensions on the quality of the solution with the goal of restricting or eliminating entirely certain dimensions of the search space, to massively reduce calculation time for later production runs. Results show that the search space can be reduced according to available computation power to still yield reasonable results. Also, the scalability of the parallel software proved to be good.
Parallel Tools Workshop | 2010
Markus Geimer; Felix Wolf; Brian J. N. Wylie; Daniel Becker; David Böhme; Wolfgang Frings; Marc-André Hermanns; Bernd Mohr; Zoltán Szebenyi
The number of processor cores on modern supercomputers is increasing from generation to generation, and as a consequence HPC applications are required to harness much higher degrees of parallelism to satisfy their growing demand for computing power. However, writing code that runs efficiently on large processor configurations remains a significant challenge. The situation is exacerbated by the rising number of cores imposing scalability demands not only on applications but also on the software tools needed for their development.
ieee international symposium on parallel distributed processing workshops and phd forum | 2010
Brian J. N. Wylie; David Böhme; Bernd Mohr; Zoltán Szebenyi; Felix Wolf
In studying the scalability of the Scalasca performance analysis toolset to several hundred thousand MPI processes on IBM Blue Gene/P, we investigated a progressive execution performance deterioration of the well-known ASCI Sweep3D compact application. Scalasca runtime summarization analysis quantified MPI communication time that correlated with computational imbalance, and automated trace analysis confirmed growing amounts of MPI waiting times. Further instrumentation, measurement and analyses pinpointed a conditional section of highly imbalanced computation which amplified waiting times inherent in the associated wavefront communication that seriously degraded overall execution efficiency at very large scales. By employing effective data collation, management and graphical presentation, Scalasca was thereby able to demonstrate performance measurements and analyses with 294,912 processes for the first time.
Proceedings of the 20th European MPI Users' Group Meeting | 2013
Marc-André Hermanns; Manfred Miklosch; David Böhme; Felix Wolf
To better understand the formation of wait states in MPI programs and to support the user in finding optimization targets in the case of load imbalance, a major source of wait states, we added in our earlier work two new trace-analysis techniques to Scalasca, a performance analysis tool designed for large-scale applications. In this paper, we show how the two techniques, which were originally restricted to two-sided and collective MPI communication, are extended to cover also one-sided communication. We demonstrate our experiences with benchmark programs and a mini-application representing the core of the POP ocean model.
international parallel and distributed processing symposium | 2012
David Böhme; Felix Wolf; Markus Geimer
Load or communication imbalance prevents many codes from taking advantage of the parallelism available on modern supercomputers. We present two scalable methods to highlight imbalance in parallel programs: The first method identifies delays that inflict wait states at subsequent synchronization points, and attributes their costs in terms of resource waste to the original cause. The second method combines knowledge of the critical path with traditional parallel profiles to derive a set of compact performance indicators that help answer a variety of important performance-analysis questions, such as identifying load imbalance, quantifying the impact of imbalance on runtime, and characterizing resource consumption. Both methods employ a highly scalable parallel replay of event traces, making them a suitable analysis instrument for massively parallel MPI programs with tens of thousands of processes.
international conference on parallel processing | 2009
David Böhme; Marc-André Hermanns; Markus Geimer; Felix Wolf
In our previous work [1], we introduced performance simulation as an instrument to verify hypotheses on causality between locally and spatially distant performance phenomena without altering the application itself. This is accomplished by modifying MPI event traces and using them to simulate hypothetical message-passing behavior. Here, we present enhancements to our approach, which was previously restricted to blocking communication, that now allow us to correctly simulate MPI non-blocking communication. We enhanced the underlying trace data format to record communication requests, and extended the simulator to even retain the inherently non-deterministic behavior of operations such as MPI_Waitany.
ieee international conference on high performance computing data and analytics | 2015
Katherine E. Isaacs; Abhinav Bhatele; Jonathan Lifflander; David Böhme; Todd Gamblin; Martin Schulz; Bernd Hamann; Peer-Timo Bremer
Asynchrony and non-determinism in Charm++ programs present a significant challenge in analyzing their event traces. We present a new framework to organize event traces of parallel programs written in Charm++. Our reorganization allows one to more easily explore and analyze such traces by providing context through logical structure. We describe several heuristics to compensate for missing dependencies between events that currently cannot be easily recorded. We introduce a new task ordering that recovers logical structure from the non-deterministic execution order. Using the logical structure, we define several metrics to help guide developers to performance problems. We demonstrate our approach through two proxy applications written in Charm++. Finally, we discuss the applicability of this framework to other task-based runtimes and provide guidelines for tracing to support this form of analysis.