Dominik Göddeke
Technical University of Dortmund
Publications
Featured research published by Dominik Göddeke.
Journal of Computational Physics | 2010
Dimitri Komatitsch; Gordon Erlebacher; Dominik Göddeke; David Michéa
We implement a high-order finite-element application, which performs the numerical simulation of seismic wave propagation resulting for instance from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI. Unlike many finite-element implementations, ours runs successfully in single precision, maximizing the performance of current-generation GPUs. We discuss the implementation and optimization of the code and compare it to an existing, highly optimized C and MPI implementation on a classical cluster of CPU nodes. We use mesh coloring to efficiently handle summation operations over degrees of freedom on an unstructured mesh, and non-blocking MPI messages to overlap both network communication and PCIe data transfers to and from the device with computation on the GPU. We perform a number of numerical tests to validate the single-precision CUDA and MPI implementation and assess its accuracy. We then analyze performance measurements; depending on how the problem is mapped onto the reference CPU cluster, we obtain a speedup of 20x or 12x.
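To make the mesh colouring idea concrete: elements are partitioned into colours such that no two elements of the same colour share a degree of freedom, so all elements of one colour can accumulate their contributions concurrently without atomic operations. A minimal greedy sketch in C++ (illustrative only, not the paper's code):

```cpp
#include <cstdio>
#include <unordered_set>
#include <vector>

// Greedy colouring: an element receives the smallest colour not yet
// claimed by any element it shares a node (degree of freedom) with.
std::vector<int> colourElements(const std::vector<std::vector<int>>& elemNodes,
                                int numNodes) {
    std::vector<int> colour(elemNodes.size(), -1);
    // For each node, the colours already used by elements touching it.
    std::vector<std::unordered_set<int>> nodeColours(numNodes);
    for (std::size_t e = 0; e < elemNodes.size(); ++e) {
        int c = 0;
        while (true) {
            bool clash = false;
            for (int n : elemNodes[e])
                if (nodeColours[n].count(c)) { clash = true; break; }
            if (!clash) break;
            ++c;
        }
        colour[e] = c;
        for (int n : elemNodes[e]) nodeColours[n].insert(c);
    }
    return colour;
}

int main() {
    // Four two-node elements in a chain; neighbours share one node.
    std::vector<std::vector<int>> elems = {{0, 1}, {1, 2}, {2, 3}, {3, 4}};
    std::vector<int> colour = colourElements(elems, 5);
    for (std::size_t e = 0; e < elems.size(); ++e)
        std::printf("element %zu -> colour %d\n", e, colour[e]);  // 0 1 0 1
}
```

Within each colour, one GPU thread per element can then scatter its contributions into the global vector race-free; only the colours themselves are processed sequentially.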
International Journal of Parallel, Emergent and Distributed Systems | 2007
Dominik Göddeke; Robert Strzodka; Stefan Turek
In this survey paper, we compare native double precision solvers with emulated- and mixed-precision solvers of linear systems of equations as they typically arise in finite element discretisations. The emulation utilises two single float numbers to achieve higher precision, while the mixed precision iterative refinement computes residuals and updates the solution vector in double precision but solves the residual systems in single precision. Both techniques have been known since the 1960s, but little attention has been devoted to their performance aspects. Motivated by changing paradigms in processor technology and the emergence of highly parallel devices with outstanding single float performance, we adapt the emulation and mixed precision techniques to coupled hardware configurations, where the parallel devices serve as scientific co-processors. The performance advantages are examined with respect to speedups over a native double precision implementation (time aspect) and reduced area requirements for a chip (space aspect). The paper begins with an overview of the theoretical background, algorithmic approaches and suitable hardware architectures. We then employ several conjugate gradient (CG) and multigrid solvers and study their behaviour for different parameter settings of the iterative refinement technique. Concrete speedup factors are evaluated on the coupled hardware configuration of a general-purpose CPU and a graphics processor. The dual performance aspect of potential area savings is assessed on a field programmable gate array (FPGA). In the last part, we test the applicability of the proposed mixed precision schemes with ill-conditioned matrices. We conclude that the mixed precision approach works very well with the parallel co-processors, gaining speedup factors of four to five and area savings of three to four, while maintaining the same accuracy as a reference solver executing everything in double precision.
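The iterative refinement loop described here keeps the residual and the solution update in double precision while the inner solve runs entirely in single precision. A minimal self-contained sketch, with a simple single-precision Jacobi sweep standing in for the paper's CG and multigrid solvers:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

using VecD = std::vector<double>;
using VecF = std::vector<float>;

// A few single-precision Jacobi sweeps approximately solving A d = r;
// the matrix below is diagonally dominant, so Jacobi converges.
VecF jacobiSingle(const std::vector<VecF>& A, const VecF& r, int sweeps) {
    const std::size_t n = r.size();
    VecF d(n, 0.0f), dNew(n);
    for (int s = 0; s < sweeps; ++s) {
        for (std::size_t i = 0; i < n; ++i) {
            float sigma = 0.0f;
            for (std::size_t j = 0; j < n; ++j)
                if (j != i) sigma += A[i][j] * d[j];
            dNew[i] = (r[i] - sigma) / A[i][i];
        }
        d = dNew;
    }
    return d;
}

int main() {
    // Small diagonally dominant test system A x = b.
    std::vector<VecD> A = {{4, 1, 0}, {1, 4, 1}, {0, 1, 4}};
    VecD b = {1, 2, 3}, x(3, 0.0);
    std::vector<VecF> Af(3, VecF(3));  // single-precision copy of A
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) Af[i][j] = static_cast<float>(A[i][j]);

    for (int it = 0; it < 20; ++it) {
        VecD r(3);  // residual r = b - A x in double precision
        double norm2 = 0.0;
        for (int i = 0; i < 3; ++i) {
            r[i] = b[i];
            for (int j = 0; j < 3; ++j) r[i] -= A[i][j] * x[j];
            norm2 += r[i] * r[i];
        }
        std::printf("iter %2d  ||r|| = %.3e\n", it, std::sqrt(norm2));
        if (norm2 < 1e-28) break;  // ~ full double accuracy reached
        VecF rf(3);                // inner solve in single precision
        for (int i = 0; i < 3; ++i) rf[i] = static_cast<float>(r[i]);
        VecF d = jacobiSingle(Af, rf, 30);
        for (int i = 0; i < 3; ++i) x[i] += d[i];  // update in double
    }
}
```

Because each correction is computed from a freshly evaluated double-precision residual, the outer loop recovers full double accuracy even though the inner solver works in single precision.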
Parallel Computing | 2007
Dominik Göddeke; Robert Strzodka; Jamaludin Mohd-Yusof; Patrick S. McCormick; Sven H. M. Buijssen; Matthias Grajewski; Stefan Turek
The first part of this paper surveys co-processor approaches for commodity-based clusters in general, considering not only raw performance but also system integration and power consumption. We then extend previous work on a small GPU cluster by exploring the heterogeneous hardware approach for a large-scale system with up to 160 nodes. Starting from a conventional commodity-based cluster, we leverage the high bandwidth of graphics processing units (GPUs) to increase the overall system bandwidth, which is the decisive performance factor in this scenario. Thus, even the addition of low-end, out-of-date GPUs leads to improvements in both performance- and power-related metrics.
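As a rough illustration of the kind of measurement behind the claim that bandwidth is decisive, a STREAM-style triad microbenchmark estimates sustained memory bandwidth. Sizes and timing scheme here are illustrative, not from the paper:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 25;  // 32M doubles per array, far beyond cache
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    const double scalar = 3.0;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)  // triad: c = a + scalar * b
        c[i] = a[i] + scalar * b[i];
    auto t1 = std::chrono::steady_clock::now();

    const double secs = std::chrono::duration<double>(t1 - t0).count();
    const double bytes = 3.0 * n * sizeof(double);  // two loads, one store
    std::printf("triad: %.2f GB/s (check %g)\n", bytes / secs / 1e9, c[n / 2]);
}
```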
Computational Science and Engineering | 2008
Dominik Göddeke; Robert Strzodka; Jamaludin Mohd-Yusof; Patrick S. McCormick; Hilmar Wobker; Christian Becker; Stefan Turek
This paper explores the coupling of coarse- and fine-grained parallelism for Finite Element (FE) simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. We demonstrate the viability of our approach using commodity Graphics Processing Units (GPUs), chosen for their excellent price/performance ratio, and address the issue of their limited precision by applying a mixed precision, iterative refinement technique. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems.
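The "minimally invasive" integration can be pictured as the application programming against an abstract solver interface behind which an accelerated backend is substituted at runtime. All names in the following sketch are hypothetical, not FEAST's actual API:

```cpp
#include <cstdio>
#include <memory>
#include <vector>

// Interface the application programs against; it never mentions hardware.
struct LocalSolver {
    virtual void solve(std::vector<double>& x, const std::vector<double>& rhs) = 0;
    virtual ~LocalSolver() = default;
};

struct CpuMultigrid : LocalSolver {
    void solve(std::vector<double>& x, const std::vector<double>& rhs) override {
        std::printf("double precision CPU multigrid, %zu unknowns\n", rhs.size());
        x = rhs;  // placeholder for the actual multigrid cycle
    }
};

struct GpuMultigrid : LocalSolver {
    void solve(std::vector<double>& x, const std::vector<double>& rhs) override {
        // Would transfer rhs to the device, run a single precision cycle
        // there, and iteratively refine in double on the host.
        std::printf("mixed precision GPU multigrid, %zu unknowns\n", rhs.size());
        x = rhs;
    }
};

int main() {
    const bool haveGpu = true;  // e.g. read from a runtime configuration
    std::unique_ptr<LocalSolver> solver;
    if (haveGpu) solver = std::make_unique<GpuMultigrid>();
    else         solver = std::make_unique<CpuMultigrid>();

    std::vector<double> x(4), rhs(4, 1.0);
    solver->solve(x, rhs);  // application call site is identical either way
}
```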
Field-Programmable Custom Computing Machines | 2006
Robert Strzodka; Dominik Göddeke
FPGAs are becoming more and more attractive for high-precision scientific computations. One of the main problems in efficient resource utilization is the quadratic growth of multiplier resource usage with operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher-level approach and seek to reduce the intermediate computational precision at the algorithmic level, optimizing the accuracy towards the final result of an algorithm; in our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example, we show that most intermediate operations can be computed with floats or even smaller formats, and that only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double precision solver. Thus the FPGA can be configured with many parallel float units rather than a few resource-hungry double units. To achieve this, we adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the conjugate gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource-efficient mappings of the pipelined algorithm core onto the FPGA.
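A back-of-the-envelope calculation (ours, for illustration only) shows why restricting double precision to a few operations pays off on an FPGA: array multiplier area grows roughly quadratically with mantissa width, so with IEEE-754 mantissas of 53 bits (double) and 24 bits (float),

```latex
\frac{\text{area}(\text{double multiplier})}{\text{area}(\text{float multiplier})}
  \approx \frac{53^2}{24^2} \approx 4.9
```

i.e. roughly four to five float multipliers fit in the area of one double multiplier; real units also contain exponent and normalisation logic, which is consistent with the factor three to four area savings reported for the mixed precision schemes above.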
International Conference on High Performance Computing and Simulation | 2009
Dominik Göddeke; Sven H. M. Buijssen; Hilmar Wobker; Stefan Turek
We have previously suggested a minimally invasive approach to include hardware accelerators in an existing large-scale parallel finite element PDE solver toolkit, and implemented it in our software FEAST. Our concept has the important advantage that applications built on top of FEAST benefit from the acceleration immediately, without changes to application code. In this paper we explore the limitations of our approach by accelerating a Navier-Stokes solver. This nonlinear saddle point problem is much more involved than our previous tests and does not exhibit an equally favourable acceleration potential: not all computational work is concentrated inside the linear solver. Nonetheless, we are able to achieve speedups of more than a factor of two on a small GPU-enhanced cluster. We conclude with a discussion of how our concept can be altered to further improve acceleration.
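The limited acceleration potential is essentially Amdahl's law: if a fraction f of the runtime is spent in the accelerated linear solver and that part is sped up by a factor s, the total speedup is bounded accordingly. With illustrative numbers (not taken from the paper):

```latex
S_{\text{total}} = \frac{1}{(1-f) + f/s}, \qquad
f = 0.75,\; s = 10 \;\Rightarrow\;
S_{\text{total}} = \frac{1}{0.25 + 0.075} \approx 3.1
```

Even a ten-fold solver speedup yields only about a three-fold total speedup when a quarter of the work lies outside the solver, which is why the paper's factor of two on a nonlinear saddle point problem is a reasonable outcome.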
Journal of Computational Physics | 2013
Dominik Göddeke; Dimitri Komatitsch; Markus Geveler; Dirk Ribbrock; Nikola Rajovic; Nikola Puzovic; Alex Ramirez
Power consumption and energy efficiency are becoming critical aspects in the design and operation of large-scale HPC facilities, and it is unanimously recognised that future exascale supercomputers will be strongly constrained by their power requirements. At current electricity costs, operating an HPC system over its lifetime can already be on par with the initial deployment cost. These power consumption constraints, and the benefits a more energy-efficient HPC platform may have on other societal areas, have motivated the HPC research community to investigate the use of energy-efficient technologies originally developed for the embedded and especially mobile markets. However, lower power does not always mean lower energy consumption, since execution time often increases as well; to achieve competitive performance, applications then need to efficiently exploit a larger number of processors. In this article, we discuss how applications can efficiently exploit this new class of low-power architectures to achieve competitive performance, and we evaluate whether they can benefit from the increased energy efficiency that the architecture is supposed to achieve. The applications we consider cover three different classes of numerical solution methods for partial differential equations: a low-order finite element multigrid solver for huge sparse linear systems of equations, a Lattice-Boltzmann code for fluid simulation, and a high-order spectral element method for acoustic or seismic wave propagation modelling. We evaluate weak and strong scalability on a cluster of 96 ARM Cortex-A9 dual-core processors and demonstrate that the ARM-based cluster can be more efficient in terms of energy to solution when executing the three applications, compared to an x86-based reference machine.
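The point that lower power does not imply lower energy follows directly from energy being power integrated over runtime, E = P · t. With purely illustrative numbers (not measurements from the paper):

```latex
E_{\text{x86}} = 200\,\mathrm{W} \times 100\,\mathrm{s} = 20\,\mathrm{kJ}, \qquad
E_{\text{ARM}} = 40\,\mathrm{W} \times 300\,\mathrm{s} = 12\,\mathrm{kJ}
```

Here the lower-power machine wins on energy to solution despite running three times longer; had the slowdown been six-fold instead, it would have lost. This is why the paper evaluates energy to solution rather than power alone.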
Computational Science and Engineering | 2009
Dominik Göddeke; Hilmar Wobker; Robert Strzodka; Jamaludin Mohd-Yusof; Patrick S. McCormick; Stefan Turek
We have previously presented an approach to include graphics processing units as co-processors in a parallel Finite Element multigrid solver called FEAST. In this paper we show that the acceleration transfers to real applications built on top of FEAST, without any modifications of the application code. The chosen solid mechanics code is well suited to assess the practicability of our approach due to higher accuracy requirements and a more diverse CPU/co-processor interaction. We demonstrate in detail that the single precision execution of the co-processor does not affect the final accuracy, and analyse how the local acceleration gains of factors 5.5-9.0 translate into 1.6- to 2.6-fold total speed-up.
Journal of Computational Science | 2011
Markus Geveler; Dirk Ribbrock; Sven Mallach; Dominik Göddeke
We present a software approach to hardware-oriented numerics which builds upon an augmented, previously published set of open-source libraries facilitating portable code development and optimisation on a wide range of modern computer architectures. In order to maximise efficiency, we exploit all levels of parallelism, including vectorisation within CPU cores, the Cell BE and GPUs, shared memory thread-level parallelism between cores, and parallelism between heterogeneous distributed memory resources in clusters. To evaluate and validate our approach, we implement a collection of modular building blocks for the easy and fast assembly and development of CFD applications based on the shallow water equations: We combine the Lattice-Boltzmann method with fluid-structure interaction techniques in order to achieve real-time simulations targeting interactive virtual environments. Our results demonstrate that recent multi-core CPUs outperform the Cell BE, while GPUs are significantly faster than conventional multi-threaded SSE code. In addition, we verify good scalability properties of our application on small clusters.
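For the "vectorisation within CPU cores" level, the kind of hand-written SSE kernel the comparison refers to looks like the following axpy sketch, processing two doubles per instruction (illustrative only, not code from the libraries):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdio>
#include <vector>

// y <- a*x + y, vectorised with 128-bit SSE2 registers.
void axpy_sse(double a, const double* x, double* y, std::size_t n) {
    const __m128d va = _mm_set1_pd(a);
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {  // main loop: two doubles per iteration
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        _mm_storeu_pd(y + i, _mm_add_pd(_mm_mul_pd(va, vx), vy));
    }
    for (; i < n; ++i) y[i] += a * x[i];  // scalar remainder
}

int main() {
    std::vector<double> x(5, 1.0), y(5, 2.0);
    axpy_sse(3.0, x.data(), y.data(), x.size());
    for (double v : y) std::printf("%g ", v);  // prints 5 5 5 5 5
    std::printf("\n");
}
```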
Computer Physics Communications | 2009
Danny van Dyk; Markus Geveler; Sven Mallach; Dirk Ribbrock; Dominik Göddeke; Carsten Gutwenger
We present HONEI, an open-source collection of libraries offering a hardware-oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a two-fold speedup over straightforward C++ code using HONEI's SSE backend, and an additional 3–4 and 4–16 times faster execution on the Cell and a GPU, respectively. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for the development and evaluation of such kernels, significantly simplifying their development.
Program summary
Program title: HONEI
Catalogue identifier: AEDW_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDW_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPLv2
No. of lines in distributed program, including test data, etc.: 216 180
No. of bytes in distributed program, including test data, etc.: 1 270 140
Distribution format: tar.gz
Programming language: C++
Computer: x86, x86_64, NVIDIA CUDA GPUs, Cell blades and PlayStation 3
Operating system: Linux
RAM: at least 500 MB free
Classification: 4.8, 4.3, 6.1
External routines: SSE: none; [1] for the GPU backend, [2] for the Cell backend
Nature of problem: Computational science in general and numerical simulation in particular have reached a turning point. The revolution developers are facing is not primarily driven by a change in (problem-specific) methodology, but rather by the fundamental paradigm shift of the underlying hardware towards heterogeneity and parallelism. This is particularly relevant for data-intensive problems stemming from discretisations with local support, such as finite differences, volumes and elements.
Solution method: To address these issues, we present a hardware-aware collection of libraries combining the advantages of modern software techniques and hardware-oriented programming. Applications built on top of these libraries can be configured trivially to execute on CPUs, GPUs or the Cell processor. In order to evaluate the performance and accuracy of our approach, we provide two domain-specific applications: a multigrid solver for the Poisson problem and a fully explicit solver for the 2D shallow water equations.
Restrictions: HONEI is actively being developed, and its feature list is continuously expanded. Not all combinations of operations and architectures might be supported in earlier versions of the code. Obtaining snapshots from http://www.honei.org is recommended.
Unusual features: The considered applications as well as all library operations can be run on NVIDIA GPUs and the Cell BE.
Running time: Depends on the application and the input sizes. The Poisson solver executes in a few seconds, while the SWE solver requires up to 5 minutes for large spatial discretisations or small timesteps.
References: [1] http://www.nvidia.com/cuda. [2] http://www.ibm.com/developerworks/power/cell.
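The hardware abstraction HONEI describes can be sketched as compile-time tag dispatch: each operation is specialised once per backend, and the application selects a backend via a template tag. The tag and class names below are hypothetical stand-ins, not HONEI's actual identifiers:

```cpp
#include <cstdio>
#include <vector>

// Backend tags; a real library would add more (SSE, multicore, Cell, ...).
namespace tags { struct CPU {}; struct GPU {}; }

// Operation template, specialised per backend: y <- y + a * x.
template <typename Tag> struct ScaledSum;

template <> struct ScaledSum<tags::CPU> {
    static void value(std::vector<double>& y, const std::vector<double>& x, double a) {
        for (std::size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
    }
};

template <> struct ScaledSum<tags::GPU> {
    static void value(std::vector<double>& y, const std::vector<double>& x, double a) {
        // Would copy the operands to the device and launch a kernel;
        // the application call site stays identical.
        for (std::size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
    }
};

int main() {
    std::vector<double> y(4, 1.0), x(4, 2.0);
    ScaledSum<tags::CPU>::value(y, x, 0.5);  // swap the tag to retarget
    std::printf("%g\n", y[0]);               // prints 2
}
```

Adding an optimised application-specific kernel then amounts to providing one more specialisation, which is the extension mechanism the abstract describes.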