Jan Eitzinger
University of Erlangen-Nuremberg
Publications
Featured research published by Jan Eitzinger.
international conference on parallel processing | 2015
Johannes Hofmann; Dietmar Fey; Michael Riedmann; Jan Eitzinger; Georg Hager; Gerhard Wellein
We investigate the performance characteristics of a numerically enhanced scalar product (dot) kernel loop that uses the Kahan algorithm to compensate for numerical errors, and describe efficient SIMD-vectorized implementations on recent Intel processors. Using low-level instruction analysis and the execution-cache-memory (ECM) performance model we pinpoint the relevant performance bottlenecks for single-core and thread-parallel execution, and predict performance and saturation behavior. We show that the Kahan-enhanced scalar product comes at almost no additional cost compared to the naive (non-Kahan) scalar product if appropriate low-level optimizations, notably SIMD vectorization and unrolling, are applied. We also investigate the impact of architectural changes across four generations of Intel Xeon processors.
automation, robotics and control systems | 2016
Johannes Hofmann; Dietmar Fey; Jan Eitzinger; Georg Hager; Gerhard Wellein
This paper presents an in-depth analysis of Intel's Haswell microarchitecture for streaming loop kernels. Among the new features examined are the dual-ring Uncore design, Cluster-on-Die mode, and Uncore Frequency Scaling, as well as enhancements such as new and improved execution units and improvements throughout the memory hierarchy. The Execution-Cache-Memory diagnostic performance model is used together with a generic set of microbenchmarks to quantify the efficiency of the microarchitecture. The set of microbenchmarks is chosen so that it can serve as a blueprint for other streaming loop kernels.
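In its simplest non-overlapping form, the Execution-Cache-Memory model used here reduces to a small arithmetic rule: in-core execution that can overlap with data transfers competes against the non-overlapping core time plus the sum of all inter-cache and memory transfer times. The sketch below uses illustrative cycle counts, not measurements from the paper, and assumes no overlap among transfers:

```python
import math

def ecm_cycles(t_ol, t_nol, t_data):
    """Single-core cycles per unit of work (e.g. one cache line of
    loop iterations) in a simple non-overlapping ECM variant:
    overlapping core time vs. non-overlapping time plus transfers."""
    return max(t_ol, t_nol + sum(t_data))

def saturation_cores(t_ol, t_nol, t_data, t_mem):
    """Cores needed to saturate memory bandwidth: single-core runtime
    divided by the memory transfer share, rounded up."""
    return math.ceil(ecm_cycles(t_ol, t_nol, t_data) / t_mem)

# Illustrative: 4 cy overlapping, 2 cy non-overlapping, transfers of
# 3 cy (L1-L2), 4 cy (L2-L3) and 6 cy (L3-memory) per cache line.
cycles = ecm_cycles(4, 2, [3, 4, 6])          # max(4, 12) -> 15? no: 2+13 = 15
cores = saturation_cores(4, 2, [3, 4, 6], 6)  # ceil(15/6) = 3
```

This is the kind of back-of-the-envelope arithmetic the model enables; the paper's actual analysis accounts for overlap properties that differ between microarchitectures.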
Concurrency and Computation: Practice and Experience | 2017
Johannes Hofmann; Dietmar Fey; Michael Riedmann; Jan Eitzinger; Georg Hager; Gerhard Wellein
We investigate the performance characteristics of a numerically enhanced scalar product (dot) kernel loop that uses the Kahan algorithm to compensate for numerical errors, and describe efficient single instruction multiple data‐vectorized implementations on recent multi‐core and many‐core processors. Using low‐level instruction analysis and the execution‐cache‐memory performance model, we pinpoint the relevant performance bottlenecks for single‐core and thread‐parallel execution and predict performance and saturation behavior. We show that the Kahan‐enhanced scalar product comes at almost no additional cost compared with the naive (non‐Kahan) scalar product if appropriate low‐level optimizations, notably single instruction multiple data vectorization and unrolling, are applied. The execution‐cache‐memory model is extended appropriately to accommodate not only modern Intel multicore chips but also the Intel Xeon Phi ‘Knights Corner’ coprocessor and an IBM POWER8 CPU. This allows us to discuss the impact of processor features on the performance across four modern architectures that are relevant for high performance computing.
Archive | 2017
Julian Hammer; Jan Eitzinger; Georg Hager; Gerhard Wellein
Achieving optimal program performance requires deep insight into the interaction between hardware and software. For software developers without an in-depth background in computer architecture, understanding and fully utilizing modern architectures is close to impossible. Analytic loop performance modeling is a useful way to understand the relevant bottlenecks of code execution based on simple machine models. The Roofline Model and the Execution-Cache-Memory (ECM) model are proven approaches to performance modeling of loop nests. In comparison to the Roofline model, the ECM model also describes the single-core performance and saturation behavior on a multicore chip. We give an introduction to the Roofline and ECM models, and to stencil performance modeling using layer conditions (LC). We then present Kerncraft, a tool that can automatically construct Roofline and ECM models for loop nests by performing the required code, data transfer, and LC analysis. The layer condition analysis makes it possible to predict optimal spatial blocking factors for loop nests. Together with the models, it enables an ab-initio estimate of the potential benefits of loop blocking optimizations and of useful block sizes. In cases where LC analysis is not easily possible, Kerncraft supports a cache simulator as a fallback option. Using a 25-point long-range stencil we demonstrate the usefulness and predictive power of the Kerncraft tool.
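The Roofline model referred to above bounds attainable performance by the minimum of the compute peak and the bandwidth-limited rate at a given arithmetic intensity. A minimal sketch of that arithmetic, with illustrative machine numbers that are assumptions rather than figures from the chapter:

```python
def roofline_gflops(peak_gflops, mem_bw_gbs, intensity_flops_per_byte):
    """Attainable performance under the Roofline model: the lesser of
    the compute peak and the memory-bandwidth-limited rate."""
    return min(peak_gflops, mem_bw_gbs * intensity_flops_per_byte)

# Illustrative: a STREAM-triad-like kernel a[i] = b[i] + s * c[i] does
# 2 flops per 24 bytes of traffic (three doubles, ignoring write-allocate),
# i.e. an arithmetic intensity of ~0.083 flop/byte.
perf = roofline_gflops(peak_gflops=500.0, mem_bw_gbs=50.0,
                       intensity_flops_per_byte=2 / 24)
# Memory-bound here: ~4.2 GFlop/s, far below the 500 GFlop/s peak.
```

Kerncraft automates the hard part, determining the intensity per memory level from the loop code and the layer conditions, which this toy function simply takes as an input.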
international conference on cluster computing | 2017
Thomas Röhl; Jan Eitzinger; Georg Hager; Gerhard Wellein
System monitoring is an established tool to measure the utilization and health of HPC systems. Usually, system monitoring infrastructures make no connection to job information and do not utilize hardware performance monitoring (HPM) data. To increase the efficient use of HPC systems, automatic and continuous performance monitoring of jobs is an essential component. It can help to identify pathological cases, provide instant performance feedback to users, offer initial data for judging the optimization potential of applications, and help to build a statistical foundation about application-specific system usage. The LIKWID monitoring stack is a modular framework built on top of the LIKWID tools library. It aims at enabling job-specific performance monitoring using HPM data, system metrics, and application-level data for small to medium-sized commodity clusters. Moreover, it is designed to integrate into existing monitoring infrastructures to speed up the change from pure system monitoring to job-aware monitoring.
Archive | 2016
Thomas Röhl; Jan Eitzinger; Georg Hager; Gerhard Wellein
Hardware performance monitoring (HPM) is a crucial ingredient of performance analysis tools. While there are interfaces like LIKWID, PAPI, or the kernel interface perf_event which provide HPM access with some additional features, many higher-level tools combine event counts with results retrieved from other sources, such as function call traces, to derive (semi-)automatic performance advice. However, although HPM has been available on x86 systems since the early 1990s, only a small subset of its features is used in practice. Performance patterns provide a more comprehensive approach, enabling the identification of various performance-limiting effects. Patterns address issues like bandwidth saturation, load imbalance, non-local data access in ccNUMA systems, or false sharing of cache lines. This work defines HPM event sets that are best suited to identify a selection of performance patterns on the Intel Haswell processor. We validate the chosen event sets for accuracy in order to arrive at a reliable pattern detection mechanism, and point out shortcomings that cannot be easily circumvented due to bugs or limitations in the hardware.
ieee international conference on high performance computing data and analytics | 2015
Julian Hammer; Georg Hager; Jan Eitzinger; Gerhard Wellein
arXiv: Distributed, Parallel, and Cluster Computing | 2015
Johannes Hofmann; Jan Eitzinger; Dietmar Fey