
Publication


Featured research published by Simon Heybrock.


Computing in Science and Engineering | 2008

QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine

Gottfried Goldrian; Thomas Huth; Benjamin Krill; J. Lauritsen; Heiko Schick; Ibrahim A. Ouda; Simon Heybrock; Dieter Hierl; T. Maurer; Nils Meyer; A. Schäfer; Stefan Solbrig; Thomas Streuer; Tilo Wettig; Dirk Pleiter; Karl-Heinz Sulanke; Frank Winter; H. Simma; Sebastiano Fabio Schifano; R. Tripiccione

Application-driven computers for lattice gauge theory simulations have often been based on system-on-chip designs, but the development costs can be prohibitive for academic project budgets. An alternative approach uses compute nodes based on a commercial processor tightly coupled to a custom-designed network processor. Preliminary analysis shows that this solution offers good performance, but it also entails several challenges, including those arising from the processor's multicore structure and from implementing the network processor on a field-programmable gate array.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Lattice QCD with domain decomposition on Intel® Xeon Phi™ co-processors

Simon Heybrock; Balint Joo; Dhiraj D. Kalamkar; Mikhail Smelyanskiy; Karthikeyan Vaidyanathan; Tilo Wettig; Pradeep Dubey

The gap between the cost of moving data and the cost of computing continues to grow, making it ever harder to design iterative solvers on extreme-scale architectures. This problem can be alleviated by alternative algorithms that reduce the amount of data movement. We investigate this in the context of Lattice Quantum Chromodynamics and implement such an alternative solver algorithm, based on domain decomposition, on Intel® Xeon Phi co-processor (KNC) clusters. We demonstrate close-to-linear on-chip scaling to all 60 cores of the KNC. With a mix of single and half precision, the domain-decomposition method sustains 400-500 Gflop/s per chip. Compared to an optimized KNC implementation of a standard solver [1], our full multi-node domain-decomposition solver strong-scales to more nodes and reduces the time-to-solution by a factor of 5.
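The structure of such a domain-decomposition solver can be illustrated with a short, generic sketch (Python/NumPy, not the authors' code): the lattice is split into blocks, each block system is solved locally without any communication, and the local solves serve as a preconditioner for an outer iteration. The mixed precision, SIMD layout, and communication optimizations that the paper is about are omitted.

    import numpy as np

    # Minimal block-Jacobi (additive-Schwarz) sketch of a domain-decomposition
    # solver; a 1D Laplacian stands in for the Dirac operator.
    n, nb = 256, 8                                   # global size, number of domains
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.random.default_rng(1).standard_normal(n)

    blocks = np.array_split(np.arange(n), nb)
    local_inv = [np.linalg.inv(A[np.ix_(idx, idx)]) for idx in blocks]  # factor once

    def schwarz(r):
        # Solve each domain independently; no data from other domains is needed.
        z = np.zeros_like(r)
        for idx, Ainv in zip(blocks, local_inv):
            z[idx] = Ainv @ r[idx]
        return z

    x = np.zeros(n)
    for _ in range(50):                  # plain Richardson outer iteration for brevity;
        r = b - A @ x                    # a Krylov method (e.g. GCR) would be used in practice
        x += schwarz(r)
    print("residual norm:", np.linalg.norm(b - A @ x))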


Computer Physics Communications | 2011

A nested Krylov subspace method to compute the sign function of large complex matrices

Jacques Bloch; Simon Heybrock

We present an acceleration of the well-established Krylov–Ritz methods to compute the sign function of large complex matrices, as needed in lattice QCD simulations involving the overlap Dirac operator at both zero and nonzero baryon density. Krylov–Ritz methods approximate the sign function using a projection on a Krylov subspace. To achieve a high accuracy this subspace must be taken quite large, which makes the method too costly. The new idea is to make a further projection on an even smaller, nested Krylov subspace. If additionally an intermediate preconditioning step is applied, this projection can be performed without affecting the accuracy of the approximation, and a substantial gain in efficiency is achieved for both Hermitian and non-Hermitian matrices. The numerical efficiency of the method is demonstrated on lattice configurations of sizes ranging from 4^4 to 10^4, and the new results are compared with those obtained with rational approximation methods.
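The plain Krylov–Ritz projection that this work accelerates can be sketched as follows (Python with NumPy/SciPy; a single-level projection on a Hermitian test matrix, not the nested, preconditioned variant of the paper): the sign function is applied to the small projected matrix and the result is lifted back to the full space.

    import numpy as np
    from scipy.linalg import signm

    def krylov_ritz_sign(A, b, k):
        # Approximate sign(A) @ b via a k-dimensional Arnoldi (Krylov-Ritz) projection.
        n = len(b)
        V = np.zeros((n, k + 1))
        H = np.zeros((k + 1, k))
        beta = np.linalg.norm(b)
        V[:, 0] = b / beta
        for j in range(k):                          # Arnoldi with modified Gram-Schmidt
            w = A @ V[:, j]
            for i in range(j + 1):
                H[i, j] = V[:, i] @ w
                w -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            V[:, j + 1] = w / H[j + 1, j]
        e1 = np.zeros(k)
        e1[0] = 1.0
        return beta * V[:, :k] @ (signm(H[:k, :k]) @ e1)   # sign of the small matrix

    # Hermitian test matrix with a spectral gap around zero (a stand-in for the
    # overlap kernel at zero chemical potential).
    rng = np.random.default_rng(0)
    n = 200
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    lam = np.concatenate([rng.uniform(0.2, 2.0, n // 2), rng.uniform(-2.0, -0.2, n // 2)])
    A = (Q * lam) @ Q.T
    b = rng.standard_normal(n)
    exact = (Q * np.sign(lam)) @ Q.T @ b
    approx = krylov_ritz_sign(A, b, k=60)
    print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))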


arXiv: High Energy Physics - Lattice | 2010

QPACE - a QCD parallel computer based on Cell processors

H. Baier; Hans Boettiger; C. Gomez; Dirk Pleiter; Nils Meyer; A. Nobile; Zoltan Fodor; Joerg-Stephan Vogt; K.-H. Sulanke; Simon Heybrock; Frank Winter; U. Fischer; T. Maurer; Thomas Huth; Ibrahim A. Ouda; M. Drochner; Heiko Schick; F. Schifano; A. Schäfer; H. Simma; J. Lauritsen; Norbert Eicker; Marcello Pivanti; Matthias Husken; Thomas Streuer; Gottfried Goldrian; Tilo Wettig; Thomas Lippert; Dieter Hierl; Benjamin Krill

QPACE is a novel parallel computer which has been developed to be primarily used for lattice QCD simulations. The compute power is provided by the IBM PowerXCell 8i processor, an enhanced version of the Cell processor that is used in the PlayStation 3. The QPACE nodes are interconnected by a custom, application-optimized 3-dimensional torus network implemented on an FPGA. To achieve the very high packaging density of 26 TFlops per rack, a new water-cooling concept has been developed and successfully realized. In this paper we give an overview of the architecture and highlight some important technical details of the system. Furthermore, we provide initial performance results and report on the installation of 8 QPACE racks providing an aggregate peak performance of 200 TFlops.
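As a small illustration of the communication pattern a 3-dimensional torus provides, the sketch below (Python; the dimensions are made up and are not the actual QPACE partition) maps a node's coordinates to the ranks of its six nearest neighbours with periodic wrap-around, which is all the routing a nearest-neighbour lattice-QCD code requires of the network.

    # Hypothetical torus dimensions, for illustration only.
    DIMS = (4, 4, 8)

    def node_rank(coord, dims=DIMS):
        # Linear rank of a node from its (x, y, z) torus coordinates.
        x, y, z = (c % d for c, d in zip(coord, dims))
        return (x * dims[1] + y) * dims[2] + z

    def torus_neighbors(coord, dims=DIMS):
        # Ranks of the six nearest neighbours: +/-1 in each direction, periodic.
        nbrs = {}
        for axis, name in enumerate("xyz"):
            for step in (+1, -1):
                shifted = list(coord)
                shifted[axis] = (shifted[axis] + step) % dims[axis]
                nbrs[name + ("+" if step > 0 else "-")] = node_rank(shifted, dims)
        return nbrs

    print(torus_neighbors((0, 3, 7)))    # wrap-around in x-, y+ and z+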


Computer Physics Communications | 2010

Short-recurrence Krylov subspace methods for the overlap Dirac operator at nonzero chemical potential

Jacques Bloch; Tobias Breu; Andreas Frommer; Simon Heybrock; Katrin Schäfer; Tilo Wettig

The overlap operator in lattice QCD requires the computation of the sign function of a matrix, which is non-Hermitian in the presence of a quark chemical potential. In previous work we introduced an Arnoldi-based Krylov subspace approximation, which uses long recurrences. Even after the deflation of critical eigenvalues, the low efficiency of the method restricts its application to small lattices. Here we propose new short-recurrence methods which strongly enhance the efficiency of the computational method. Using rational approximations to the sign function we introduce two variants, based on the restarted Arnoldi process and on the two-sided Lanczos method, respectively, which become very efficient when combined with multishift solvers. Alternatively, in the variant based on the two-sided Lanczos method the sign function can be evaluated directly. We present numerical results which compare the efficiencies of a restarted Arnoldi-based method and the direct two-sided Lanczos approximation for various lattice sizes. We also show that our new methods gain substantially when combined with deflation.
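The role of the rational approximation and the multishift structure can be sketched for the Hermitian case (Python/NumPy; a Neuberger-type partial-fraction approximation with each shifted system solved directly, standing in for the restarted-Arnoldi and two-sided-Lanczos multishift solvers of the paper, and assuming the spectrum has been scaled into [-1, 1]):

    import numpy as np

    # sign(x) ~ (x/n) * sum_s 1 / (x^2 cos^2 t_s + sin^2 t_s),  t_s = pi (s - 1/2) / (2 n),
    # rewritten as shifted systems (A^2 + c_s I) x = b with weights w_s.  All shifts
    # share a single Krylov space, which is what a multishift solver exploits.
    rng = np.random.default_rng(2)
    m = 128
    Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    lam = np.concatenate([rng.uniform(0.1, 1.0, m // 2), rng.uniform(-1.0, -0.1, m // 2)])
    A = (Q * lam) @ Q.T                       # Hermitian test matrix, spectrum in [-1, 1]
    b = rng.standard_normal(m)

    n_poles = 32
    t = np.pi * (np.arange(1, n_poles + 1) - 0.5) / (2 * n_poles)
    weights = 1.0 / (n_poles * np.cos(t) ** 2)
    shifts = np.tan(t) ** 2

    A2 = A @ A
    y = np.zeros(m)
    for w, c in zip(weights, shifts):         # in practice: one multishift solver run
        y += w * np.linalg.solve(A2 + c * np.eye(m), b)
    y = A @ y                                 # approximates sign(A) b

    exact = (Q * np.sign(lam)) @ Q.T @ b
    print("relative error:", np.linalg.norm(y - exact) / np.linalg.norm(exact))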


arXiv: Computational Physics | 2016

Adaptive algebraic multigrid on SIMD architectures

Simon Heybrock; Matthias Rottmann; Peter Georg; Tilo Wettig

We present details of our implementation of the Wuppertal adaptive algebraic multigrid code DD-αAMG on SIMD architectures, with particular emphasis on the Intel Xeon Phi processor (KNC) used in QPACE 2. As a smoother, the algorithm uses a domain-decomposition-based solver code previously developed for the KNC in Regensburg. We optimized the remaining parts of the multigrid code and conclude that it is a very good target for SIMD architectures. Some of the remaining bottlenecks can be eliminated by vectorizing over multiple test vectors in the setup, which is discussed in the contribution of Daniel Richtmann.
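The target structure can be illustrated with a generic two-grid cycle (Python/NumPy; a toy aggregation-based setup on a 1D Laplacian, not DD-αAMG itself): a cheap smoother reduces high-frequency error, the residual is restricted to a coarse space built from test vectors, solved there, and the correction is prolongated back. The SIMD data layout, the domain-decomposition smoother, and the adaptive setup are exactly the parts the paper addresses and are not shown.

    import numpy as np

    n, agg = 256, 4                        # fine-grid size, points per aggregate
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # stand-in operator
    b = np.random.default_rng(3).standard_normal(n)

    # Prolongator from a single test vector (here the constant vector), chopped into
    # aggregates; an adaptive setup would compute several near-null-space vectors.
    test = np.ones(n)
    P = np.zeros((n, n // agg))
    for j in range(n // agg):
        sl = slice(j * agg, (j + 1) * agg)
        P[sl, j] = test[sl] / np.linalg.norm(test[sl])
    Ac = P.T @ A @ P                       # coarse-grid operator

    def smooth(x, rhs, sweeps=3, omega=0.7):
        # Damped-Jacobi smoother (the paper uses a domain-decomposition smoother).
        d = np.diag(A)
        for _ in range(sweeps):
            x = x + omega * (rhs - A @ x) / d
        return x

    x = np.zeros(n)
    for cycle in range(10):                # two-grid V-cycles
        x = smooth(x, b)
        x = x + P @ np.linalg.solve(Ac, P.T @ (b - A @ x))   # coarse correction
        x = smooth(x, b)
        print(cycle, np.linalg.norm(b - A @ x))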


arXiv: High Energy Physics - Lattice | 2009


H. Baier; Hans Boettiger; Stefan Solbrig; Dirk Pleiter; Nils Meyer; A. Nobile; Zoltan Fodor; K.-H. Sulanke; Simon Heybrock; Frank Winter; U. Fischer; T. Maurer; Thomas Huth; Ibrahim A. Ouda; M. Drochner; Heiko Schick; F. Schifano; H. Simma; J. Lauritsen; Norbert Eicker; Marcello Pivanti; A. Schäfer; Thomas Streuer; Gottfried Goldrian; Tilo Wettig; Thomas Lippert; Dieter Hierl; Benjamin Krill; R. Tripiccione; J. McFadden


arXiv: High Energy Physics - Lattice | 2011


Simon Heybrock


arXiv: High Energy Physics - Lattice | 2016

Multiple right-hand-side setup for the DD-αAMG

Daniel Richtmann; Simon Heybrock; Tilo Wettig

The setup cost of a modern solver such as DD-αAMG (Wuppertal Multigrid) is a significant contribution to the total time spent on solving the Dirac equation, and in HMC it can even be dominant. We present an improved implementation of this algorithm with modified computation order in the setup procedure. By processing multiple right-hand sides simultaneously we can alleviate many of the performance issues of the default single right-hand-side setup. The main improvements are as follows: By combining multiple right-hand sides the message size for off-chip communication is larger, which leads to better utilization of the network bandwidth. Many matrix-vector products are replaced by matrix-matrix products, leading to better cache reuse. The synchronization overhead inflicted by on-chip parallelization (threading), which is becoming crucial on many-core architectures such as the Intel Xeon Phi, is effectively reduced. In the parts implemented so far, we observe a speedup of roughly 3x compared to the optimized version of the single right-hand-side setup on realistic lattices.
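The central performance idea can be shown in a few lines (Python/NumPy, illustrative only): keeping several right-hand sides as the columns of one matrix turns many matrix-vector products into a single matrix-matrix product, which is the transformation behind the better cache reuse and larger messages described above.

    import numpy as np
    from time import perf_counter

    rng = np.random.default_rng(4)
    n, nrhs = 2048, 16
    A = rng.standard_normal((n, n))          # stand-in for the (sparse) operator
    B = rng.standard_normal((n, nrhs))       # right-hand sides stored as columns

    t0 = perf_counter()
    Y1 = np.column_stack([A @ B[:, j] for j in range(nrhs)])   # one vector at a time
    t1 = perf_counter()
    Y2 = A @ B                                                 # all at once (GEMM)
    t2 = perf_counter()

    print("max difference:", np.abs(Y1 - Y2).max())
    print("separate products: %.4f s, blocked product: %.4f s" % (t1 - t0, t2 - t1))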


arXiv: High Energy Physics - Lattice | 2010

Status of the QPACE Project

Jacques Bloch; Tobias Breu; Andreas Frommer; Simon Heybrock; Tilo Wettig

We give an overview of the QPACE project, which is pursuing the development of a massively parallel, scalable supercomputer for LQCD. The machine is a three-dimensional torus of identical processing nodes, based on the PowerXCell 8i processor. The nodes are connected by an FPGA-based, application-optimized network processor attached to the PowerXCell 8i processor. We present a performance analysis of lattice QCD codes on QPACE and corresponding hardware benchmarks.

Collaboration


Dive into Simon Heybrock's collaboration.

Top Co-Authors

Tilo Wettig (University of Regensburg)
Nils Meyer (University of Regensburg)
H. Simma (University of Regensburg)
Dirk Pleiter (University of Regensburg)
Jacques Bloch (University of Regensburg)
Stefan Solbrig (University of Regensburg)
T. Maurer (University of Regensburg)
Thomas Streuer (University of Regensburg)