Christian Iwainsky | Researchain

Publication

Featured research published by Christian Iwainsky.

Computer Science - Research and Development | 2012

Brainware for green HPC

Christian H. Bischof; Dieter an Mey; Christian Iwainsky

The reduction of the infrastructural costs of HPC, in particular power consumption, is currently driven mainly by architectural advances in hardware. Recently, in the quest for the EFlop/s, hardware-software codesign has been advocated, owing to the realization that without some software support only heroic programmers could use high-end HPC machines. However, in the topically diverse world of universities, the EFlop/s is still very far off for most users, and yet their computational demands will shape the HPC landscape for the foreseeable future. Based on experience gained at RWTH Aachen University and in the context of the distributed Computational Science and Engineering support of the UK HECToR program, we claim, on economic grounds, that HPC hardware and software installations need to be complemented by a “brainware” component, i.e., trained HPC specialists supporting performance optimization of users’ codes. This statement itself is not new, and the establishment of simulation labs at HPC centers reflects this fact. However, based on our experience, we quantify the savings resulting from brainware, thus providing an economic argument that sufficient brainware must be an integral part of any “green” HPC installation. It also follows that the current HPC funding regimes, which favor iron over staff, are fundamentally flawed, and long-term efficient HPC deployment must emphasize brainware development to a much greater extent.

European Conference on Parallel Processing | 2009

Comparing the Usability of Performance Analysis Tools

Christian Iwainsky; Dieter an Mey

We take a look at the performance analysis tools Vampir, Scalasca, Sun Performance Analyzer and the Intel Trace Analyzer and Collector, which provide execution analysis of parallel programs for optimization and scaling purposes. We investigate, from a novice user's point of view, to what extent these tools support frequently used programming languages and constructs, and we discuss their performance impact and the insight they provide, focusing on instrumentation and program analysis. For this we analyzed codes currently used at RWTH Aachen University: XNS, DROPS and HPL.

International Parallel and Distributed Processing Symposium | 2016

Calltree-Controlled Instrumentation for Low-Overhead Survey Measurements

Christian Iwainsky; Christian H. Bischof

Survey-style or overview measurements are important for performance analysis and monitoring. Here, the goal is to capture sufficient performance information of the target to enable identification and assessment of the performance-relevant regions. A full measurement typically exhibits too much overhead, and analysts must trade off data quality against measurement overhead to achieve a suitable performance proxy. This is challenging, especially if the code is complex or unfamiliar, as many current tools rely on manual filtering to reduce the inevitable overhead. Existing semi-automatic approaches, such as call-depth or statement-controlled instrumentation, provide little support, as critical context information, such as the call context, is often lost or the remaining measurement data is insufficient. We present a call-tree-controlled instrumentation approach that avoids these pitfalls while providing improved overview measurement capability by aggregating the statements of sub-trees and instrumenting regions of likely low overhead. Applied to the coral.lulesh, coral.miniFE and DROPS benchmarks, we observe low, nearly negligible measurement overhead of less than 1 percent, while preserving a good representation of the overall application structure and associated performance behavior. Compared to existing methods, our new approach provides a much better trade-off between instrumentation overhead and data fidelity and is much less dependent on the particular programming style of an application. In particular, it is well suited for object-oriented coding styles.

European Conference on Parallel Processing | 2015

How Many Threads will be too Many? On the Scalability of OpenMP Implementations

Christian Iwainsky; Sergei Shudler; Alexandru Calotoiu; Alexandre Strube; Michael Knobloch; Christian H. Bischof; Felix Wolf

Exascale systems will exhibit much higher degrees of parallelism both in terms of the number of nodes and the number of cores per node. OpenMP is a widely used standard for exploiting parallelism on the level of individual nodes. Although successfully used on today’s systems, it is unclear how well OpenMP implementations will scale to much higher numbers of threads. In this work, we apply automated performance modeling to examine the scalability of OpenMP constructs across different compilers and platforms. We ran tests on Intel Xeon multi-board, Intel Xeon Phi, and Blue Gene with compilers from GNU, IBM, Intel, and PGI. The resulting models reveal a number of scalability issues in implementations of OpenMP constructs and show unexpected differences between compilers.

European Conference on Parallel Processing | 2014

Catwalk: A Quick Development Path for Performance Models

Felix Wolf; Christian H. Bischof; Torsten Hoefler; Bernd Mohr; Gabriel Wittum; Alexandru Calotoiu; Christian Iwainsky; Alexandre Strube; Andreas Vogel

Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made—a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. The objective of the Catwalk project, which is carried out as part of the DFG Priority Programme 1648 Software for Exascale Computing (SPPEXA), is to automate key activities of the performance modeling process, making this powerful methodology easier to use and expanding its coverage. This article gives an overview of the project objectives, describes the results achieved so far, and outlines future work.

European Conference on Parallel Processing | 2010

An approach to visualize remote socket traffic on the Intel Nehalem-EX

Christian Iwainsky; Thomas Reichstein; Christopher Dahnken; Dieter an Mey; Christian Terboven; Andrey Semin; Christian H. Bischof

The integration of the memory controller on the processor die enables ever larger core counts in commodity hardware shared memory systems with Non-Uniform Memory Architecture properties. Shared memory parallelization with OpenMP is an elegant and widely used approach to leverage the power of such systems. The binding of the OpenMP threads to compute cores and the corresponding memory association are becoming even more critical in order to obtain optimal performance. In this work we provide a method to measure the amount of remote socket memory accesses a thread generates. We use available performance monitoring CPU counters in combination with thread binding on a quad socket Nehalem EX system. For visualization of the collected data we use Vampir.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Leveraging multicore cluster nodes by adding OpenMP to flow solvers parallelized with MPI

Christian Iwainsky; Samuel Sarholz; Dieter an Mey; Ralph Altenfeld

MPI is the predominant model for parallel programming in technical high performance computing. With an increasing number of cores and threads in cluster nodes, the question arises whether pure MPI is an appropriate approach to utilize today’s compute clusters or whether it is profitable to add another layer of parallelism within the nodes by applying OpenMP on a lower level. Investing a limited amount of manpower, we add OpenMP directives to three MPI production codes and compare and analyze the performance while varying the number of MPI processes per node and the number of OpenMP threads per MPI process on current CMP/CMT architectures.

European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2009

Simulation of Primary Breakup for Diesel Spray with Phase Transition

Peng Zeng; Samuel Sarholz; Christian Iwainsky; B. Binninger; N. Peters; Marcus Herrmann

We perform direct numerical simulation on large distributed-memory parallel computers in order to investigate the primary breakup process of diesel spray direct injection. A local refinement algorithm, the refined level-set method, has been used to reduce the memory requirement, and we analyze the performance by experiments on a 1024-processor parallel computer.

ACM Conference on Systems, Programming, Languages, and Applications: Software for Humanity | 2017

The influence of HPCToolkit and Score-P on hardware performance counters

Jan-Patrick Lehr; Christian Iwainsky; Christian H. Bischof

Performance measurement and analysis are commonly performed tasks for high-performance computing applications. Both sampling and instrumentation approaches for performance measurement can capture hardware performance counter (HWPC) metrics to assess the software's ability to use the functional units of the processor. Since the measurement software usually executes on the same processor, it necessarily competes with the target application for hardware resources. Consequently, the measurement system perturbs the target application, which often results in runtime overhead. While the runtime overhead of different measurement techniques has been previously studied, it has not been thoroughly examined to what extent HWPC values are perturbed by the measurement process. In this paper, we investigate the influence of two widely used performance measurement systems, HPCToolkit (sampling) and Score-P (instrumentation), on HWPC. Our experiments on the SPEC CPU 2006 C/C++ benchmarks show that, while Score-P's default instrumentation can massively increase runtime, it does not always heavily perturb relevant HWPC. On the other hand, HPCToolkit shows no significant runtime overhead, but significantly influences some relevant HWPC. We conclude that for every performance experiment sufficient baseline measurements are essential to identify the HWPC that remain valid indicators of performance for a given measurement technique. Thus, performance analysis tools need to offer easily accessible means to automate the baseline and validation functionality.

Software for Exascale Computing | 2016

Automatic Performance Modeling of HPC Applications

Felix Wolf; Christian H. Bischof; Alexandru Calotoiu; Torsten Hoefler; Christian Iwainsky; Grzegorz Kwasniewski; Bernd Mohr; Sergei Shudler; Alexandre Strube; Andreas Vogel; Gabriel Wittum

Many existing applications suffer from inherent scalability limitations that will prevent them from running at exascale. Current tuning practices, which rely on diagnostic experiments, have drawbacks because (i) they detect scalability problems relatively late in the development process when major effort has already been invested into an inadequate solution and (ii) they incur the extra cost of potentially numerous full-scale experiments. Analytical performance models, in contrast, allow application developers to address performance issues already during the design or prototyping phase. Unfortunately, the difficulties of creating such models combined with the lack of appropriate tool support still render performance modeling an esoteric discipline mastered only by a relatively small community of experts. This article summarizes the results of the Catwalk project, which aimed to create tools that automate key activities of the performance modeling process, making this powerful methodology accessible to a wider audience of HPC application developers.
