Jan Treibig
University of Erlangen-Nuremberg
Publications
Featured research published by Jan Treibig.
International Conference on Parallel Processing | 2010
Jan Treibig; Georg Hager; Gerhard Wellein
Exploiting the performance of today's processors requires intimate knowledge of the microarchitecture as well as an awareness of the ever-growing complexity in thread and cache topology. LIKWID is a set of command-line utilities that addresses four key problems: probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance counter metrics, and toggling hardware prefetchers. An API for using the performance counting features from user code is also included. We clearly state the differences to the widely used PAPI interface. To demonstrate the capabilities of the tool set we show the influence of thread pinning on performance using the well-known OpenMP STREAM triad benchmark, and use the affinity and hardware counter tools to study the performance of a stencil code specifically optimized to utilize shared caches on multicore chips.
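As a rough illustration of how the marker API mentioned in the abstract can be used from user code, here is a minimal sketch instrumenting the STREAM triad. Header name, macro set, and build flags assume a recent LIKWID release; without -DLIKWID_PERFMON the macros compile to nothing.

```cpp
// Hedged sketch: STREAM triad instrumented with LIKWID marker macros.
// Build (assumption): g++ -O3 -fopenmp -DLIKWID_PERFMON triad.cpp -llikwid
// Run  (assumption): likwid-perfctr -C 0-3 -g MEM -m ./a.out
#include <likwid-marker.h>  // provides the LIKWID_MARKER_* macros (LIKWID >= 5)
#include <vector>

int main() {
    const std::size_t N = 20000000;
    std::vector<double> a(N, 0.0), b(N, 1.0), c(N, 2.0);
    const double s = 3.0;

    LIKWID_MARKER_INIT;
    #pragma omp parallel
    {
        LIKWID_MARKER_START("triad");
        #pragma omp for
        for (std::size_t i = 0; i < N; ++i)
            a[i] = b[i] + s * c[i];   // STREAM triad kernel
        LIKWID_MARKER_STOP("triad");
    }
    LIKWID_MARKER_CLOSE;
    return a[0] > 0.0 ? 0 : 1;        // prevent dead-code elimination
}
```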
Concurrency and Computation: Practice and Experience | 2016
Georg Hager; Jan Treibig; Johannes Habich; Gerhard Wellein
Modern multi-core chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model to describe the single-core and multi-core performance of streaming kernels. The model refines the well-known roofline model, because it can predict the scaling and the saturation behavior of bandwidth-limited loop kernels on a multi-core chip. The saturation point is especially relevant for considerations of energy consumption. From power dissipation measurements of benchmark programs with vastly different demands on the hardware, we derive a simple, phenomenological power model for the Sandy Bridge processor. Together with the ECM model, we are able to explain many peculiarities in the performance and power behavior of multi-core processors and derive guidelines for energy-efficient execution of parallel programs. Finally, we show that the ECM and power models can be successfully used to describe the scaling and power behavior of a lattice Boltzmann flow solver code.
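For reference, the two models mentioned above can be summarized as follows (notation assumed here: P_peak is the peak performance, I the computational intensity in flops/byte, b_S the saturated memory bandwidth, T_ECM the predicted single-core execution time per unit of work, and T_Mem the memory transfer time for that same unit):

```latex
% Roofline bound for a loop with computational intensity I:
P_\mathrm{roof} = \min\left( P_\mathrm{peak},\; I \cdot b_S \right)

% ECM-style saturation point: performance scales roughly linearly with
% the number of cores until memory bandwidth saturates at n_S cores:
n_S = \left\lceil \frac{T_\mathrm{ECM}}{T_\mathrm{Mem}} \right\rceil
```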
Journal of Computational Science | 2011
Jan Treibig; Gerhard Wellein; Georg Hager
Stencil computations consume a major part of runtime in many scientific simulation codes. As prototypes for this class of algorithms we consider the iterative Jacobi and Gauss-Seidel smoothers and aim at highly efficient parallel implementations for cache-based multicore architectures. Temporal cache blocking is a known advanced optimization technique, which can reduce the pressure on the memory bus significantly. We apply and refine this optimization for a recently presented temporal blocking strategy designed to explicitly utilize multicore characteristics. Especially for the case of Gauss-Seidel smoothers we show that simultaneous multi-threading (SMT) can yield substantial performance improvements for our optimized algorithm on some architectures.
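For orientation, the baseline that such temporal blocking accelerates is a plain Jacobi sweep; a minimal OpenMP sketch (illustrative only, not the blocked variant from the paper) looks like this:

```cpp
// Baseline 2D Jacobi sweep (5-point stencil): every sweep streams both
// grids through memory once. The paper's temporally blocked variants
// instead reuse data in cache across several sweeps.
#include <vector>

void jacobi_sweep(const std::vector<double>& src, std::vector<double>& dst,
                  int nx, int ny)
{
    #pragma omp parallel for
    for (int j = 1; j < ny - 1; ++j)
        for (int i = 1; i < nx - 1; ++i)
            dst[j*nx + i] = 0.25 * ( src[j*nx + (i-1)] + src[j*nx + (i+1)]
                                   + src[(j-1)*nx + i] + src[(j+1)*nx + i] );
}
// Time stepping then alternates src and dst (grid swap) for each sweep.
```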
parallel processing and applied mathematics | 2009
Jan Treibig; Georg Hager
We present a diagnostic performance model for bandwidth-limited loop kernels, founded on an analysis of modern cache-based microarchitectures. The model allows accurate performance prediction and evaluation for existing instruction code, and provides an in-depth understanding of how the contributions of the different memory hierarchy levels combine into the observed performance. The performance of raw memory load, store, and copy operations and of a stream vector triad is analyzed and benchmarked on three modern x86-type quad-core architectures in order to demonstrate the capabilities of the model.
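A worked example of the kind of bandwidth bound such a model produces, assuming the STREAM triad a(i) = b(i) + s*c(i) in double precision with write-allocate transfers:

```latex
% Per iteration: load b and c (8 B each), store a (8 B), plus the
% write-allocate transfer of a's cache line (8 B effective).
V = 8 + 8 + 8 + 8 = 32~\text{bytes/iteration},
\qquad
P \le \frac{2~\text{flops}}{32~\text{bytes}} \cdot b_S .

% Example: b_S = 10~\mathrm{GB/s} caps the triad at 0.625~\mathrm{GFlop/s}.
```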
SIAM Journal on Scientific Computing | 2012
Klaus Iglberger; Georg Hager; Jan Treibig; Ulrich Rüde
In the last decade, expression templates (ETs) have gained a reputation as an efficient performance optimization tool for C++ codes. This reputation builds on several ET-based linear algebra frameworks focused on combining both elegant and high-performance C++ code. However, on closer examination the assumption that ETs are a performance optimization technique cannot be maintained. In this paper we compare the performance of several generations of ET-based frameworks. We analyze different ET methodologies and explain the inability of some ET implementations to deliver high performance for dense and sparse linear algebra operations. Additionally, we introduce the notion of “smart” ETs, which truly allow for a combination of high performance code with the elegance and maintainability of a domain-specific language.
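A minimal sketch of the classic ET mechanism under discussion (illustrative names, not code from any of the compared frameworks): the sum of two vectors becomes a lightweight proxy that is evaluated element-wise only on assignment, fusing the loops and avoiding temporaries.

```cpp
// Classic expression template: "l + r" builds a proxy holding references;
// the actual arithmetic runs in one fused loop inside operator=.
#include <cstddef>
#include <vector>

template <typename L, typename R>
struct VecSum {                         // proxy for "L + R"
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    template <typename E>
    Vec& operator=(const E& expr) {      // single loop, no temporary vectors
        for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
        return *this;
    }
};

template <typename L, typename R>
VecSum<L, R> operator+(const L& l, const R& r) { return {l, r}; }

int main() {
    Vec a(1000), b(1000, 1.0), c(1000, 2.0), d(1000, 3.0);
    a = b + c + d;   // evaluates as one loop: a[i] = b[i] + c[i] + d[i]
    return a[0] == 6.0 ? 0 : 1;
}
```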
ieee international conference on high performance computing data and analytics | 2013
Jan Treibig; Georg Hager; Hannes G. Hofmann; Joachim Hornegger; Gerhard Wellein
Volume reconstruction by backprojection is the computational bottleneck in many interventional clinical computed tomography (CT) applications. Today, vendors in this field replace special-purpose hardware accelerators with standard hardware such as multicore chips and GPGPUs. Medical imaging algorithms are on the verge of employing high-performance computing (HPC) technology, and are therefore an interesting new candidate for optimization. This paper presents low-level optimizations for the backprojection algorithm, guided by a thorough performance analysis on four generations of Intel multicore processors (Harpertown, Westmere, Westmere EX, and Sandy Bridge). We choose the RabbitCT benchmark, a standardized test case well supported in industry, to ensure transparent and comparable results. Our aim is not only to provide the fastest possible implementation but also to compare with performance models and hardware counter data in order to fully understand the results. We separate the influence of algorithmic optimizations, parallelization, SIMD vectorization, and microarchitectural issues, and pinpoint problems with current SIMD instruction set extensions on standard CPUs (SSE, AVX). The use of assembly language is mandatory for best performance. Finally, we compare our results to the best GPGPU implementations available for this open competition benchmark.
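A simplified scalar form of the backprojection loop conveys the algorithmic structure analyzed in the paper; the projection-matrix layout and names below are assumptions, and the actual RabbitCT kernels are SIMD-vectorized and heavily tuned.

```cpp
// Hedged sketch: for each voxel, project its world coordinate through the
// 3x4 projection matrix P of the current view, bilinearly interpolate the
// detector image, and accumulate with FDK-style 1/w^2 distance weighting.
#include <cstddef>

void backproject_view(float* vol, const float* img, int nx, int ny, int nz,
                      int iw, int ih, const float P[12], float spacing)
{
    for (int z = 0; z < nz; ++z)
      for (int y = 0; y < ny; ++y)
        for (int x = 0; x < nx; ++x) {
            // voxel center in world coordinates (isocenter at volume center)
            float wx = (x - 0.5f * nx) * spacing;
            float wy = (y - 0.5f * ny) * spacing;
            float wz = (z - 0.5f * nz) * spacing;
            // homogeneous projection onto the detector
            float u = P[0]*wx + P[1]*wy + P[2]*wz  + P[3];
            float v = P[4]*wx + P[5]*wy + P[6]*wz  + P[7];
            float w = P[8]*wx + P[9]*wy + P[10]*wz + P[11];
            float iu = u / w, iv = v / w;           // detector coordinates
            int u0 = (int)iu, v0 = (int)iv;
            if (u0 < 0 || u0 + 1 >= iw || v0 < 0 || v0 + 1 >= ih) continue;
            float fu = iu - u0, fv = iv - v0;       // bilinear weights
            float s = (1-fv) * ((1-fu)*img[v0*iw+u0]     + fu*img[v0*iw+u0+1])
                    +    fv  * ((1-fu)*img[(v0+1)*iw+u0] + fu*img[(v0+1)*iw+u0+1]);
            vol[((std::size_t)z*ny + y)*nx + x] += s / (w * w);
        }
}
```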
arXiv: Distributed, Parallel, and Cluster Computing | 2011
Jan Treibig; Georg Hager; Gerhard Wellein
Exploiting the performance of today’s microprocessors requires intimate knowledge of the microarchitecture as well as an awareness of the ever-growing complexity in thread and cache topology. LIKWID is a set of command line utilities that addresses four key problems: Probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance counter metrics, and microbenchmarking for reliable upper performance bounds. Moreover, it includes an mpirun wrapper allowing for portable thread-core affinity in MPI and hybrid MPI/threaded applications. To demonstrate the capabilities of the tool set we show the influence of thread affinity on performance using the well-known OpenMP STREAM triad benchmark, use hardware counter tools to study the performance of a stencil code, and finally show how to detect bandwidth problems on ccNUMA-based compute nodes.
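One classic bandwidth problem on ccNUMA nodes is serial initialization: pages are placed on the NUMA node of the thread that first writes them, so initializing and computing with different thread-data mappings forces remote accesses. A minimal sketch of the "first touch" remedy follows; thread pinning (e.g. via likwid-pin) is assumed so that threads stay on their nodes.

```cpp
// First-touch placement: raw new[] leaves pages unmapped, so writing them
// in the same static OpenMP schedule as the compute loop keeps every
// thread's working set on its local NUMA node. A serial initialization
// loop would place all pages on one node and cut the usable bandwidth.

int main() {
    const long N = 100000000;
    double* a = new double[N];   // uninitialized: pages not yet touched
    double* b = new double[N];

    #pragma omp parallel for schedule(static)   // parallel first touch
    for (long i = 0; i < N; ++i) { a[i] = 0.0; b[i] = 1.0; }

    #pragma omp parallel for schedule(static)   // same schedule: local access
    for (long i = 0; i < N; ++i) a[i] = 2.0 * b[i];

    double result = a[0];
    delete[] a; delete[] b;
    return result == 2.0 ? 0 : 1;
}
```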
acm sigplan symposium on principles and practice of parallel programming | 2014
Johannes Hofmann; Jan Treibig; Georg Hager; Gerhard Wellein
Single Instruction, Multiple Data (SIMD) vectorization is a major driver of performance in current architectures, and is mandatory for achieving good performance with codes that are limited by instruction throughput. We investigate the efficiency of different SIMD-vectorized implementations of the RabbitCT benchmark. RabbitCT performs 3D image reconstruction by backprojection, a vital operation in computed tomography applications. The underlying algorithm is a challenge for vectorization because, apart from a streaming part, it also contains a bilinear interpolation requiring scattered access to image data. We analyze the performance of SSE (128 bit), AVX (256 bit), AVX2 (256 bit), and IMCI (512 bit) implementations on recent Intel x86 systems. A special emphasis is put on the vector gather implementation on the Intel Haswell and Knights Corner microarchitectures. Finally, we discuss why GPU implementations perform much better for this specific algorithm.
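To make the gather discussion concrete, here is a hedged AVX2 sketch of fetching the four bilinear neighbors for eight pixels at once; the row-major index layout v0*width + u0 and all names are assumptions for illustration.

```cpp
// AVX2 vector gather: eight 32-bit indices fetch eight non-contiguous
// image samples per instruction. Build with -mavx2.
#include <immintrin.h>

// Gather img[idx[0..7]] into one 8-wide register (scale = 4 bytes).
static inline __m256 gather8(const float* img, __m256i idx) {
    return _mm256_i32gather_ps(img, idx, sizeof(float));
}

// Gather the four bilinear neighbors for eight pixels at once.
void bilinear_gather(const float* img, int width,
                     __m256i u0, __m256i v0,
                     __m256& tl, __m256& tr, __m256& bl, __m256& br)
{
    __m256i w    = _mm256_set1_epi32(width);
    __m256i row0 = _mm256_add_epi32(_mm256_mullo_epi32(v0, w), u0); // v0*width+u0
    __m256i row1 = _mm256_add_epi32(row0, w);                       // next row
    __m256i one  = _mm256_set1_epi32(1);
    tl = gather8(img, row0);
    tr = gather8(img, _mm256_add_epi32(row0, one));
    bl = gather8(img, row1);
    br = gather8(img, _mm256_add_epi32(row1, one));
}
```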
Parallel Processing Letters | 2010
Markus Wittmann; Georg Hager; Jan Treibig; Gerhard Wellein
Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. Benchmark results are presented for three current x86-based microprocessors, showing clearly that our optimization works best on designs with high-speed shared caches and low memory bandwidth per core. We furthermore demonstrate that simple bandwidth-based performance models are inaccurate for this kind of algorithm and employ a more elaborate, synthetic modeling procedure. Finally we show that temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment, albeit with limited benefit at strong scaling.
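The core idea behind such pipelined temporal blocking can be shown in a serial 1D sketch: a second Jacobi time step trails the first by one cell, so the intermediate grid is consumed while its entries are still in cache. Fixed boundaries and n >= 4 are assumed; the paper's multicore pipeline across shared caches is considerably more involved.

```cpp
// Two fused Jacobi time steps (3-point stencil): the t+2 update trails the
// t+1 update by one cell (a 1D "wavefront"), halving the memory traffic
// compared with two separate sweeps.
#include <cstddef>

void jacobi_two_steps(const double* a, double* b, double* c, std::size_t n) {
    b[0] = a[0]; b[n-1] = a[n-1];            // carry fixed boundaries forward
    c[0] = a[0]; c[n-1] = a[n-1];
    for (std::size_t i = 1; i < n - 1; ++i) {
        b[i] = 0.5 * (a[i-1] + a[i+1]);      // step t -> t+1
        if (i >= 2)
            c[i-1] = 0.5 * (b[i-2] + b[i]);  // step t+1 -> t+2, one cell behind
    }
    c[n-2] = 0.5 * (b[n-3] + b[n-1]);        // drain the pipeline
}
```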
international conference on high performance computing and simulation | 2012
Klaus Iglberger; Georg Hager; Jan Treibig; Ulrich Rüde
Performance is of utmost importance for linear algebra libraries, since they usually form the core of numerical and simulation packages and consume most of the available compute time and resources. However, especially in large-scale simulation frameworks, the readability and ease of use of mathematical expressions is essential for continuous maintenance, modification, and extension of the software framework. Based on these requirements, in the last decade C++ Expression Templates have gained a reputation as a suitable means to combine an elegant, domain-specific, and intuitive user interface with “HPC-grade” performance. Unfortunately, many of the available ET-based frameworks fall short of the expectation to deliver high performance, adding to the general mistrust towards C++ math libraries. In this paper we present performance results for Smart Expression Template libraries, demonstrating that a proper combination of high-level C++ code and low-level compute kernels can achieve both requirements: an elegant interface and high performance.
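One way to read the "smart" ET approach, sketched below under assumed names: assignment inspects the expression type and dispatches the whole operation to a dedicated low-level kernel instead of evaluating element-wise through an ET proxy.

```cpp
// Hedged sketch of smart-ET dispatch: a product expression forwards
// assignment to a compute kernel (in a real library, a blocked/vectorized
// kernel or a BLAS call) rather than evaluating element-wise.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Matrix {
    std::size_t n;                       // square for brevity
    std::vector<double> v;
    explicit Matrix(std::size_t n_) : n(n_), v(n_ * n_, 0.0) {}
    double& operator()(std::size_t i, std::size_t j)       { return v[i*n + j]; }
    double  operator()(std::size_t i, std::size_t j) const { return v[i*n + j]; }

    template <typename Expr>
    Matrix& operator=(const Expr& e) { e.assign_to(*this); return *this; }
};

struct MatMul {                          // proxy produced by operator*
    const Matrix& A; const Matrix& B;
    void assign_to(Matrix& C) const {    // dispatch to the low-level kernel
        std::fill(C.v.begin(), C.v.end(), 0.0);
        kernel_gemm(C, A, B);
    }
    static void kernel_gemm(Matrix& C, const Matrix& A, const Matrix& B) {
        // stand-in for a tuned kernel; ikj order keeps B accesses streaming
        for (std::size_t i = 0; i < C.n; ++i)
            for (std::size_t k = 0; k < C.n; ++k) {
                double aik = A(i, k);
                for (std::size_t j = 0; j < C.n; ++j)
                    C(i, j) += aik * B(k, j);
            }
    }
};

inline MatMul operator*(const Matrix& A, const Matrix& B) { return {A, B}; }
// Usage: Matrix C(n); C = A * B;  // runs kernel_gemm, no per-element proxies
```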