Brian J. N. Wylie
Forschungszentrum Jülich
Publications
Featured research published by Brian J. N. Wylie.
Parallel Computing | 2009
Markus Geimer; Felix Wolf; Brian J. N. Wylie; Bernd Mohr
When scaling message-passing applications to thousands of processors, their performance is often affected by wait states that occur when processes fail to reach synchronization points simultaneously. As a first step in reducing the performance impact, we have shown in our earlier work that wait states can be diagnosed by searching event traces for characteristic patterns. However, our initial sequential search method did not scale beyond several hundred processes. Here, we present a scalable approach, based on a parallel replay of the target application's communication behavior, that can efficiently identify wait states at the previously inaccessible scale of 65,536 processes and that has potential for even larger configurations. We explain how our new approach has been integrated into a comprehensive parallel tool architecture, which we use to demonstrate that wait states may consume a major fraction of the execution time at larger scales.
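The core of the approach is that each process replays its own recorded communication, exchanging timestamps instead of the original payloads so that, for instance, a receiver can compute how long it waited for a late sender. The C/MPI sketch below illustrates that idea for the classic late-sender pattern; the trace and event structures are simplified stand-ins for illustration, not Scalasca's actual trace data model.

```c
/* Minimal sketch of the parallel-replay idea for detecting "late sender"
 * wait states.  Each rank is assumed to hold its own local event trace in
 * memory; the event type below is a simplified stand-in, not Scalasca's
 * actual trace data model.  Run with two ranks, e.g. mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

typedef enum { EV_SEND, EV_RECV } ev_kind_t;

typedef struct {
    ev_kind_t kind;    /* send or receive event               */
    int       peer;    /* communication partner (MPI rank)    */
    int       tag;     /* message tag recorded in the trace   */
    double    t_enter; /* timestamp when the call was entered */
} event_t;

/* Replay the local trace: re-enact each communication, but exchange the
 * recorded enter timestamps instead of the original payload.  A receive
 * entered earlier than the matching send indicates late-sender waiting. */
static double replay_late_sender(const event_t *trace, int n)
{
    double waiting = 0.0;
    for (int i = 0; i < n; ++i) {
        if (trace[i].kind == EV_SEND) {
            double t = trace[i].t_enter;
            MPI_Send(&t, 1, MPI_DOUBLE, trace[i].peer, trace[i].tag,
                     MPI_COMM_WORLD);
        } else { /* EV_RECV */
            double t_send_enter;
            MPI_Recv(&t_send_enter, 1, MPI_DOUBLE, trace[i].peer,
                     trace[i].tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (t_send_enter > trace[i].t_enter)
                waiting += t_send_enter - trace[i].t_enter;
        }
    }
    return waiting; /* accumulated late-sender waiting time on this rank */
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Toy trace: rank 0 enters its send at t=5.0, rank 1 its receive at
     * t=1.0, so rank 1 waited 4.0 time units for the late sender. */
    event_t trace[1];
    if (rank == 0) trace[0] = (event_t){ EV_SEND, 1, 0, 5.0 };
    else           trace[0] = (event_t){ EV_RECV, 0, 0, 1.0 };

    double w = replay_late_sender(trace, 1);
    printf("rank %d: late-sender waiting time %.1f\n", rank, w);

    MPI_Finalize();
    return 0;
}
```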
International Parallel and Distributed Processing Symposium | 2007
Daniel Becker; Felix Wolf; Wolfgang Frings; Markus Geimer; Brian J. N. Wylie; Bernd Mohr
The processing power and memory capacity of independent and heterogeneous parallel machines can be combined to form a single parallel system that is more powerful than any of its constituents. However, achieving satisfactory application performance on such a metacomputer is hard because the high latency of inter-machine communication as well as differences in hardware of constituent machines may introduce various types of wait states. In our earlier work, we have demonstrated that automatic pattern search in event traces can identify the sources of wait states in parallel applications running on a single computer. In this article, we describe how this approach can be extended to metacomputing environments with special emphasis on performance problems related to inter-machine communication. In addition, we demonstrate the benefits of our solution using a real-world multi-physics application.
Parallel Computing | 2006
Markus Geimer; Felix Wolf; Andreas Knüpfer; Bernd Mohr; Brian J. N. Wylie
Automatic trace analysis is an effective method of identifying complex performance phenomena in parallel applications. To simplify the development of complex trace-analysis algorithms, the EARL library interface offers high-level access to individual events contained in a global trace file. However, as the size of parallel systems grows further and the number of processors used by individual applications is continuously raised, the traditional approach of analyzing a single global trace file becomes increasingly constrained by the large number of events. To enable scalable trace analysis, we present a new design of the aforementioned EARL interface that accesses multiple local trace files in parallel while offering means to conveniently exchange events between processes. This article describes the modified view of the trace data as well as related programming abstractions provided by the new PEARL library interface and discusses its application in performance analysis.
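The sketch below illustrates the underlying idea in C with plain MPI: each analysis process works on its own local trace data and exchanges whole event records with a partner through a derived datatype. It is an illustration of the concept only; the file naming, event layout and calls shown are assumptions, not the actual PEARL interface.

```c
/* Illustrative sketch (not the actual PEARL API): each analysis process
 * handles its own local trace file and exchanges whole event records with
 * its communication partners via a derived MPI datatype. */
#include <mpi.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    int    kind; /* event type code               */
    int    peer; /* communication partner, if any */
    double time; /* event timestamp               */
} event_t;

/* Build an MPI datatype matching event_t so events can be sent as-is. */
static MPI_Datatype make_event_type(void)
{
    MPI_Datatype type;
    int          blocklens[3] = { 1, 1, 1 };
    MPI_Aint     displs[3]    = { offsetof(event_t, kind),
                                  offsetof(event_t, peer),
                                  offsetof(event_t, time) };
    MPI_Datatype types[3]     = { MPI_INT, MPI_INT, MPI_DOUBLE };

    MPI_Type_create_struct(3, blocklens, displs, types, &type);
    MPI_Type_commit(&type);
    return type;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank would read its own local trace file, e.g. "trace.<rank>"
     * (a hypothetical layout), instead of one global file. */
    char fname[64];
    snprintf(fname, sizeof fname, "trace.%d", rank);

    MPI_Datatype ev_type = make_event_type();
    event_t mine = { 1, (rank + 1) % size, 42.0 }, theirs;

    /* Exchange one event record with the neighbouring analysis process. */
    MPI_Sendrecv(&mine, 1, ev_type, (rank + 1) % size, 0,
                 &theirs, 1, ev_type, (rank + size - 1) % size, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d (%s): received event at time %.1f from rank %d\n",
           rank, fname, theirs.time, (rank + size - 1) % size);

    MPI_Type_free(&ev_type);
    MPI_Finalize();
    return 0;
}
```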
Scientific Programming | 2008
Brian J. N. Wylie; Markus Geimer; Felix Wolf
Developers of applications with large-scale computing requirements are currently presented with a variety of high-performance systems optimised for message passing; however, effectively exploiting the available computing resources remains a major challenge. In addition to fundamental application scalability characteristics, application and system peculiarities often only manifest at extreme scales, requiring highly scalable performance measurement and analysis tools that are convenient to incorporate in application development and tuning activities. We present our experiences with a multigrid solver benchmark and state-of-the-art real-world applications for numerical weather prediction and computational fluid dynamics, on three quite different multi-thousand-processor supercomputer systems (Cray XT3/4, MareNostrum and Blue Gene/L), using the newly developed SCALASCA toolset to quantify and isolate a range of significant performance issues.
European Conference on Parallel Processing | 2009
Zoltán Szebenyi; Brian J. N. Wylie; Felix Wolf
PEPC (Pretty Efficient Parallel Coulomb-solver) is a complex HPC application developed at the Jülich Supercomputing Centre, scaling to thousands of processors. This is a case study of challenges faced when applying the Scalasca parallel performance analysis toolset to this intricate example at relatively high processor counts. The Scalasca version used in this study has been extended to distinguish iteration/timestep phases to provide a better view of the underlying mechanisms of the application execution. The added value of the additional analyses and presentations is then assessed to determine requirements for possible future integration within Scalasca.
PVM/MPI'07: Proceedings of the 14th European Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2007
Brian J. N. Wylie; Markus Geimer; Mike Nicolai; Markus Probst
The XNS computational fluid dynamics code was successfully running on Blue Gene/L; however, its scalability was unsatisfactory until the first Jülich Blue Gene/L Scaling Workshop provided an opportunity for the application developers and performance analysts to start working together. Investigation of solver performance pinpointed a communication bottleneck that appeared with approximately 900 processes, and subsequent remediation allowed the application to continue scaling, with a four-fold simulation performance improvement at 4,096 processes. This experience also validated the Scalasca performance analysis toolset when working with a complex application at large scale, and helped direct the development of more comprehensive analyses. Performance properties have now been incorporated to automatically quantify point-to-point synchronisation time and wait states in scan operations, both of which were significant for XNS on Blue Gene/L.
International Parallel and Distributed Processing Symposium | 2011
Zoltán Szebenyi; Todd Gamblin; Martin Schulz; Bronis R. de Supinski; Felix Wolf; Brian J. N. Wylie
We can profile the performance behavior of parallel programs at the level of individual call paths through sampling or direct instrumentation. While we can easily control measurement dilation by adjusting the sampling frequency, the statistical nature of sampling and the difficulty of accessing the parameters of sampled events make it unsuitable for obtaining certain communication metrics, such as the size of message payloads. Alternatively, direct instrumentation, which is preferable for capturing message-passing events, can excessively dilate measurements, particularly for C++ programs, which often have many short but frequently called class member functions. Thus, we combine these techniques in a unified framework that exploits the strengths of each approach while avoiding their weaknesses: we use direct instrumentation to intercept MPI routines while we record the execution of the remaining code through low-overhead sampling. One of the main technical hurdles we had to master was the inexpensive and portable determination of call-path information during the invocation of MPI routines. We show that the overhead of our implementation is sufficiently low to support substantial performance improvement of a C++ fluid-dynamics code.
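A minimal illustration of the direct-instrumentation half of this scheme is an interposed wrapper using the standard PMPI profiling interface, sketched below in C. The glibc backtrace() call here is only a stand-in for the cheaper call-path lookup developed in the paper, MPI-3 const bindings are assumed, and the sampling of non-MPI code is not shown.

```c
/* Sketch of direct instrumentation of an MPI routine via the standard PMPI
 * profiling interface (MPI-3 const bindings assumed).  The backtrace() call
 * stands in for the cheaper call-path lookup described in the paper; the
 * sampling of non-MPI code is not shown.  Build as a wrapper library or
 * object linked ahead of the MPI library. */
#include <mpi.h>
#include <execinfo.h> /* backtrace(), glibc */
#include <stdio.h>

/* Wrapper seen by the application; the MPI library's own implementation
 * remains reachable through the PMPI_ entry point. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    void  *frames[32];
    int    depth = backtrace(frames, 32); /* capture the current call path */
    double t0    = MPI_Wtime();

    int ret = PMPI_Send(buf, count, datatype, dest, tag, comm);

    /* A real tool would hash 'frames' into a call-path table and attribute
     * the elapsed time to that path's MPI metric; here we just report it. */
    fprintf(stderr, "MPI_Send to %d: %.6f s, call-path depth %d\n",
            dest, MPI_Wtime() - t0, depth);
    return ret;
}
```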
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2010
Brian J. N. Wylie; David Böhme; Bernd Mohr; Zoltán Szebenyi; Felix Wolf
In studying the scalability of the Scalasca performance analysis toolset to several hundred thousand MPI processes on IBM Blue Gene/P, we investigated a progressive execution performance deterioration of the well-known ASCI Sweep3D compact application. Scalasca runtime summarization analysis quantified MPI communication time that correlated with computational imbalance, and automated trace analysis confirmed growing amounts of MPI waiting time. Further instrumentation, measurement and analyses pinpointed a conditional section of highly imbalanced computation that amplified waiting times inherent in the associated wavefront communication and seriously degraded overall execution efficiency at very large scales. By employing effective data collation, management and graphical presentation, Scalasca was thereby able to demonstrate performance measurements and analyses with 294,912 processes for the first time.
Parallel Computing | 2006
Brian J. N. Wylie; Felix Wolf; Bernd Mohr; Markus Geimer
Straightforward trace collection and processing becomes increasingly challenging and ultimately impractical for more complex, long-running, highly parallel applications. Accordingly, the SCALASCA project is extending the KOJAK measurement system for MPI, OpenMP and partitioned global address space (PGAS) parallel applications to incorporate runtime management and summarisation capabilities. This provides a more scalable and effective profile of parallel execution performance, both for an initial overview and to direct instrumentation and event tracing to the key functions and call paths for comprehensive analysis. The design and re-structuring of the revised measurement system are described, highlighting the synergies possible from integrated runtime call-path summarisation and event tracing for scalable parallel execution performance diagnosis. Early results from measurements of 16,384 MPI processes on IBM Blue Gene/L already demonstrate considerably improved scalability.
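The essence of runtime summarisation is to fold the event stream into per-call-path statistics as the application runs, rather than storing every event for later processing. The C sketch below shows such an accumulation in a small call-path tree; the structures and names are illustrative assumptions, not the internal data model of KOJAK or Scalasca.

```c
/* Sketch of runtime call-path summarisation: instead of logging every
 * enter/exit event, accumulate per-call-path visit counts and time in a
 * small tree.  Structures and names are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHILDREN 8 /* no overflow handling in this sketch */

typedef struct cnode {
    const char   *region;  /* function/region name       */
    long          visits;  /* number of times entered    */
    double        time;    /* accumulated inclusive time */
    struct cnode *child[MAX_CHILDREN];
    int           nchild;
} cnode_t;

/* Find or create the child call-path node of 'parent' for 'region'. */
static cnode_t *cnode_enter(cnode_t *parent, const char *region)
{
    for (int i = 0; i < parent->nchild; ++i)
        if (strcmp(parent->child[i]->region, region) == 0)
            return parent->child[i];

    cnode_t *n = calloc(1, sizeof *n);
    n->region = region;
    parent->child[parent->nchild++] = n;
    return n;
}

/* Called on region exit with the measured duration. */
static void cnode_exit(cnode_t *node, double duration)
{
    node->visits += 1;
    node->time   += duration;
}

int main(void)
{
    cnode_t root = { "main", 0, 0.0, { 0 }, 0 };

    /* Call path main -> solve -> MPI_Allreduce, visited twice. */
    cnode_t *solve  = cnode_enter(&root, "solve");
    cnode_t *reduce = cnode_enter(solve, "MPI_Allreduce");
    cnode_exit(reduce, 0.02);
    cnode_exit(reduce, 0.03);
    cnode_exit(solve, 1.5);

    printf("solve/MPI_Allreduce: %ld visits, %.2f s\n",
           reduce->visits, reduce->time);
    return 0;
}
```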
Extreme Science and Engineering Discovery Environment | 2013
Brian J. N. Wylie; Wolfgang Frings
Intel Xeon Phi coprocessors based on the Many Integrated Core (MIC) architecture are starting to appear in HPC systems, with Stampede being a prominent example available within the XSEDE cyber-infrastructure. Porting MPI and OpenMP applications to such systems often requires no more than a simple recompilation; however, execution performance needs to be carefully analyzed and tuned to exploit their unique capabilities effectively. Performance measurement and analysis tools must support the variety of execution modes in a consistent and convenient manner, especially configurations involving large numbers of compute nodes, each with several multicore host processors and many-core coprocessors. Early experience using the open-source Scalasca toolset for runtime summarization and automatic trace analysis with the NPB BT-MZ MPI+OpenMP parallel application on Stampede is reported, along with discussion of ongoing and future work.
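For orientation, the pattern such measurements must capture on both host processors and coprocessors is the hybrid MPI+OpenMP model used by codes like NPB BT-MZ, sketched generically below. This is not the benchmark itself, and no Stampede-specific launch options are shown.

```c
/* Generic illustration of the hybrid MPI+OpenMP model used by codes such
 * as NPB BT-MZ: MPI ranks distributed across host processors and
 * coprocessors, OpenMP threads within each rank.  Not the benchmark
 * itself; no Stampede-specific launch options are shown. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Thread-level parallelism inside each MPI rank. */
    double local = 0.0;
    #pragma omp parallel for reduction(+ : local)
    for (int i = 0; i < 1000000; ++i)
        local += 1.0 / (1.0 + i);

    /* Process-level parallelism across ranks. */
    double global;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f (each rank used up to %d OpenMP threads)\n",
               global, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```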