Network


Latest external collaborations at the country level. Click on the dots to dive into the details.

Hotspot


Dive into the research topics where Harald Köstler is active.

Publication


Featured research published by Harald Köstler.


Journal of Computational Science | 2011

WaLBerla: HPC software design for computational engineering simulations

Christian Feichtinger; Stefan Donath; Harald Köstler; Jan Götz; Ulrich Rüde

Abstract WaLBerla (Widely applicable Lattice-Boltzmann from Erlangen) is a massively parallel software framework supporting a wide range of physical phenomena. This article describes the software designs realizing the major goal of the framework, a good balance between expandability and scalable, highly optimized, hardware-dependent, special purpose kernels. To demonstrate our designs, we discuss the coupling of our Lattice-Boltzmann fluid flow solver and a method for fluid structure interaction. Additionally, we show a software design for heterogeneous computations on GPU and CPU utilizing optimized kernels. Finally, we estimate the software quality of the framework on the basis of software quality factors.
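The computational pattern at the heart of the framework is the lattice Boltzmann stream-collide update. The following is a didactic NumPy sketch of a D2Q9 BGK update step, purely for illustration; it is not WaLBerla's actual kernel code, which is hardware-specific and heavily optimized:

```python
import numpy as np

# D2Q9 lattice: 9 discrete velocities and their quadrature weights
c = np.array([(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)

def lbm_step(f, omega=1.0):
    """One stream-collide update of the BGK lattice Boltzmann method.
    f has shape (9, nx, ny): one distribution field per lattice direction.
    Periodic boundaries via np.roll; omega is the relaxation rate."""
    # Streaming: shift each distribution along its lattice velocity
    for i, (cx, cy) in enumerate(c):
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    # Macroscopic density and velocity (zeroth and first moments of f)
    rho = f.sum(axis=0)
    u = np.tensordot(c.T, f, axes=1) / rho          # shape (2, nx, ny)
    # Equilibrium distribution (second-order expansion in the velocity)
    cu = np.tensordot(c, u, axes=1)                 # shape (9, nx, ny)
    usq = (u**2).sum(axis=0)
    feq = w[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
    # BGK collision: relax toward the local equilibrium
    return f + omega * (feq - f)
```

Frameworks like waLBerla keep this mathematical structure but hide specialized, architecture-dependent implementations of the kernel behind a common interface, which is exactly the expandability-versus-performance balance the abstract describes.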


SIAM Journal on Scientific Computing | 2007

Numerical Mathematics of the Subtraction Method for the Modeling of a Current Dipole in EEG Source Reconstruction Using Finite Element Head Models

Carsten H. Wolters; Harald Köstler; Christian Möller; Jochen Härdtlein; Lars Grasedyck; Wolfgang Hackbusch

In electroencephalography (EEG) source analysis, a dipole is widely used as the model of the current source. The dipole introduces a singularity on the right-hand side of the governing Poisson-type differential equation that has to be treated specifically when solving the equation toward the electric potential. In this paper, we give a proof for existence and uniqueness of the weak solution in the function space of zero-mean potential functions, using a subtraction approach. The method divides the total potential into a singularity and a correction potential. The singularity potential is due to a dipole in an infinite region of homogeneous conductivity. We then state convergence properties of the finite element (FE) method for the numerical solution to the correction potential. We validate our approach using tetrahedra and regular and geometry-conforming node-shifted hexahedra elements in an isotropic three-layer sphere model and a model with anisotropic middle compartment. Validation is carried out using sophisticated visualization techniques, correlation coefficient (CC), and magnification factor (MAG) for a comparison of the numerical results with analytical series expansion formulas at the surface and within the volume conductor. For the subtraction approach, with regard to the accuracy in the anisotropic three-layer sphere model (CC of 0.998 or better and MAG of 4.3% or better over the whole range of realistic eccentricities) and to the computational complexity, 2 mm node-shifted hexahedra achieve the best results. A relative FE solver accuracy of 10^{-4} is sufficient for the algebraic multigrid preconditioned conjugate gradient approach used. Finally, we visualize the computed potentials of the subtraction method in realistically shaped FE head volume conductor models with anisotropic skull compartments.


ieee international conference on high performance computing data and analytics | 2013

A framework for hybrid parallel flow simulations with a trillion cells in complex geometries

Christian Godenschwager; Florian Schornbaum; Martin Bauer; Harald Köstler; Ulrich Rüde

waLBerla is a massively parallel software framework for simulating complex flows with the lattice Boltzmann method (LBM). Performance and scalability results are presented for SuperMUC, the world's fastest x86-based supercomputer, ranked number 6 on the Top500 list, and JUQUEEN, a Blue Gene/Q system ranked number 5. We reach resolutions with more than one trillion cells and perform up to 1.93 trillion cell updates per second using 1.8 million threads. The design and implementation of waLBerla is driven by a careful analysis of the performance on current petascale supercomputers. Our fully distributed data structures and algorithms allow for efficient, massively parallel simulations on these machines. Elaborate node-level optimizations and vectorization using SIMD instructions result in highly optimized compute kernels for the single- and two-relaxation-time LBM. Excellent weak and strong scaling is achieved for a complex vascular geometry of the human coronary tree.


parallel computing | 2011

A flexible Patch-based lattice Boltzmann parallelization approach for heterogeneous GPU-CPU clusters

Christian Feichtinger; Johannes Habich; Harald Köstler; Georg Hager; Ulrich Rüde; Gerhard Wellein

Sustaining a large fraction of single-GPU performance in parallel computations is considered to be the major problem of GPU-based clusters. We address this issue in the context of a lattice Boltzmann flow solver that is integrated in the WaLBerla software framework. Our multi-GPU implementation uses a block-structured MPI parallelization and is suitable for load balancing and heterogeneous computations on CPUs and GPUs. The overhead required for multi-GPU simulations is discussed in detail. It is demonstrated that a large fraction of the kernel performance can be sustained for weak scaling on InfiniBand clusters, leading to excellent parallel efficiency. However, in strong scaling scenarios, using multiple GPUs is much less efficient than running CPU-only simulations on IBM BG/P and x86-based clusters. Hence, a cost analysis must determine the best course of action for a particular simulation task and hardware configuration. Finally, we present weak scaling results of heterogeneous simulations conducted on CPUs and GPUs simultaneously, using clusters equipped with varying node configurations.


Concurrency and Computation: Practice and Experience | 2014

Parallel multigrid on hierarchical hybrid grids: a performance study on current high performance computing clusters

Björn Gmeiner; Harald Köstler; Markus Stürmer; Ulrich Rüde

This article studies the performance and scalability of a geometric multigrid solver implemented within the hierarchical hybrid grids (HHG) software package on current high performance computing clusters with up to nearly 300,000 cores. HHG is based on unstructured tetrahedral finite elements that are regularly refined to obtain a block-structured computational grid. One challenge is the parallel mesh generation from an unstructured input grid that roughly approximates a human head within a 3D magnetic resonance imaging data set. This grid is then regularly refined to create the HHG grid hierarchy. As test platforms, a BlueGene/P cluster located at the Jülich supercomputing center and an Intel Xeon 5650 cluster located at the local computing center in Erlangen are chosen. To estimate the quality of our implementation and to predict runtime for the multigrid solver, a detailed performance and communication model is developed and used to evaluate the measured single-node performance, as well as weak and strong scaling experiments on both clusters. Thus, for a given problem size, one can predict the number of compute nodes that minimize the overall runtime of the multigrid solver. Overall, HHG scales up to the full machines, where the biggest linear system solved on Jugene had more than one trillion unknowns.


parallel computing | 2015

Performance Modeling and Analysis of Heterogeneous Lattice Boltzmann Simulations on CPU-GPU Clusters

Christian Feichtinger; Johannes Habich; Harald Köstler; Ulrich Rüde; Takayuki Aoki

Abstract Computational fluid dynamics simulations are in general very compute-intensive. Only parallel simulations on modern supercomputers can satisfy the computational demands of complex simulation tasks. Facing these demands, GPUs offer high performance, as they provide high floating-point performance and high memory-to-processor-chip bandwidth. To successfully utilize GPU clusters for the daily business of a large community, usable software frameworks must be established on these clusters. The development of such software frameworks is only feasible with maintainable software designs that consider performance as a design objective right from the start. For this work, we extend the software design concepts to achieve a more efficient and highly scalable multi-GPU parallelization within our software framework waLBerla for multi-physics simulations centered around the lattice Boltzmann method. Our software designs now also support a pure-MPI and a hybrid parallelization approach capable of heterogeneous simulations using CPUs and GPUs in parallel. For the first time, weak and strong scaling performance results obtained on the Tsubame 2.0 cluster for more than 1000 GPUs are presented using waLBerla. With the help of a new communication model, the parallel efficiency of our implementation is investigated and analyzed in a detailed and structured performance analysis. The suitability of the waLBerla framework for production runs on large GPU clusters is demonstrated. As one possible application, we show results of strong scaling experiments for flows through a porous medium.


Numerical Linear Algebra With Applications | 2008

A fast full multigrid solver for applications in image processing

Markus Stürmer; Harald Köstler; Ulrich Rüde

We present a fast, cell-centered multigrid solver and apply it to image denoising and non-rigid diffusion-based image registration. In both applications, real-time performance is required in 3D, and the multigrid method is compared with solvers based on the fast Fourier transform (FFT). The optimization of the underlying variational approach results, for image denoising, directly in one time step of a parabolic linear heat equation; for image registration, a non-linear second-order system of partial differential equations is obtained. This system is solved by a fixed-point iteration using a semi-implicit time discretization, where each time step again results in an elliptic linear heat equation. The multigrid implementation comes close to real-time performance for medium-size medical images in 3D for both applications and is compared with an FFT-based solver using available libraries.


Image and Vision Computing | 2007

3D optical flow computation using a parallel variational multigrid scheme with application to cardiac C-arm CT motion

El Mostafa Kalmoun; Harald Köstler; Ulrich Rüde

Motivated by recent applications to 3D medical motion estimation, we consider the problem of 3D optical flow computation in real time. The 3D optical flow model is derived from a straightforward extension of the 2D Horn-Schunck model and discretized using standard finite differences. We compare memory costs and convergence rates of four numerical schemes: Gauss-Seidel and multigrid with three different strategies of coarse-grid operator discretization: direct coarsening, lumping, and Galerkin approaches. Experimental results to compute 3D motion from cardiac C-arm CT images demonstrate that our variational multigrid based on Galerkin discretization significantly outperforms the Gauss-Seidel method. The parallel implementation of the proposed scheme using domain partitioning shows that the algorithm scales well up to 32 processors on a cluster of AMD Opteron CPUs, which consists of four-way nodes connected by an InfiniBand network.


ieee international conference on high performance computing data and analytics | 2015

Massively parallel phase-field simulations for ternary eutectic directional solidification

Martin Bauer; Johannes Hötzer; Marcus Jainta; Philipp Steinmetz; Marco Berghoff; Florian Schornbaum; Christian Godenschwager; Harald Köstler; Britta Nestler; Ulrich Rüde


european conference on parallel processing | 2014

ExaStencils: Advanced Stencil-Code Engineering

Christian Lengauer; Sven Apel; Matthias Bolten; Armin Größlinger; Frank Hannig; Harald Köstler; Ulrich Rüde; Jürgen Teich; Alexander Grebhahn; Stefan Kronawitter; Sebastian Kuckuk; Hannah Rittich; Christian Schmitt
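Several of the entries above (the hierarchical hybrid grids performance study, the image-processing solver, the variational optical flow scheme) build on geometric multigrid. As a self-contained illustration of the method's core recursion (smooth, restrict the residual, solve the coarse error equation, prolongate, smooth again), here is a didactic 1D Poisson V-cycle. It assumes a uniform grid with 2^k + 1 points and zero Dirichlet boundaries, and is not related to the actual HHG or waLBerla code:

```python
import numpy as np

def smooth(u, f, h, nu, omega=2/3):
    # Weighted Jacobi sweeps for -u'' = f (Dirichlet boundary values stay fixed)
    for _ in range(nu):
        u[1:-1] += omega * (0.5*(u[:-2] + u[2:] + h*h*f[1:-1]) - u[1:-1])
    return u

def v_cycle(u, f, h):
    """One geometric multigrid V-cycle for the 1D Poisson equation -u'' = f
    on a uniform grid with spacing h, n = 2^k + 1 points, and zero
    Dirichlet boundaries. Didactic sketch only."""
    n = len(u)
    u = smooth(u, f, h, nu=2)                       # pre-smoothing
    if n <= 3:
        return u                                    # coarsest level
    # Residual of the fine-grid equation, restricted by full weighting
    r = np.zeros(n)
    r[1:-1] = f[1:-1] - (2*u[1:-1] - u[:-2] - u[2:]) / (h*h)
    rc = np.zeros((n + 1) // 2)
    rc[1:-1] = 0.25*r[1:-2:2] + 0.5*r[2:-1:2] + 0.25*r[3::2]
    # Coarse-grid correction: solve the error equation recursively
    ec = v_cycle(np.zeros_like(rc), rc, 2*h)
    # Prolongate the correction by linear interpolation and apply it
    e = np.zeros(n)
    e[0::2] = ec
    e[1::2] = 0.5*(ec[:-1] + ec[1:])
    u += e
    return smooth(u, f, h, nu=2)                    # post-smoothing
```

The point the abstracts above exploit is that each V-cycle reduces the error by a grid-independent factor, so the work per digit of accuracy stays proportional to the number of unknowns, which is what makes multigrid attractive at the trillion-unknown scales reported.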

Collaboration


Dive into Harald Köstler's collaborations.

Top Co-Authors

Ulrich Rüde
University of Erlangen-Nuremberg

Sebastian Kuckuk
University of Erlangen-Nuremberg

Christian Schmitt
University of Erlangen-Nuremberg

Frank Hannig
University of Erlangen-Nuremberg

Jürgen Teich
University of Erlangen-Nuremberg

Markus Stürmer
University of Erlangen-Nuremberg

Martin Bauer
University of Erlangen-Nuremberg

Britta Nestler
Karlsruhe Institute of Technology

Johannes Hötzer
Karlsruhe Institute of Technology