Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Andreas Schäfer is active.

Publication


Featured research published by Andreas Schäfer.


International Conference on Computational Science | 2011

High Performance Stencil Code Algorithms for GPGPUs

Andreas Schäfer; Dietmar Fey

In this paper we investigate how stencil computations can be implemented on state-of-the-art general purpose graphics processing units (GPGPUs). Stencil codes can be found at the core of many numerical solvers and physical simulation codes and are therefore of particular interest to scientific computing research. GPGPUs have gained a lot of attention recently because of their superior floating point performance and memory bandwidth. Nevertheless, memory-bound stencil codes in particular have proven to be challenging for GPGPUs, yielding lower-than-expected speedups. We chose the Jacobi method as a standard benchmark to evaluate a set of algorithms on NVIDIA's latest Fermi chipset. One of our fastest algorithms is a parallel wavefront update. It exploits the enlarged on-chip shared memory to perform two time step updates per sweep. To the best of our knowledge, it represents the first successful application of temporal blocking for 3D stencils on GPGPUs and thereby exceeds previous results by a considerable margin. It is also the first paper to study stencil codes on Fermi.
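The Jacobi update benchmarked above can be sketched in plain host-side C++ (no GPU specifics, no temporal blocking; grid sizes and the averaging rule are a minimal illustrative 7-point variant, not the paper's exact kernel):

```cpp
#include <cstddef>
#include <vector>

// One Jacobi sweep of a 7-point stencil on an nx*ny*nz grid.
// Boundary cells stay fixed; each interior cell becomes the
// average of its six axis-aligned neighbors.
std::vector<double> jacobi_sweep(const std::vector<double>& in,
                                 std::size_t nx, std::size_t ny, std::size_t nz)
{
    std::vector<double> out(in);
    auto idx = [&](std::size_t x, std::size_t y, std::size_t z) {
        return (z * ny + y) * nx + x;
    };
    for (std::size_t z = 1; z + 1 < nz; ++z)
        for (std::size_t y = 1; y + 1 < ny; ++y)
            for (std::size_t x = 1; x + 1 < nx; ++x)
                out[idx(x, y, z)] =
                    (in[idx(x - 1, y, z)] + in[idx(x + 1, y, z)] +
                     in[idx(x, y - 1, z)] + in[idx(x, y + 1, z)] +
                     in[idx(x, y, z - 1)] + in[idx(x, y, z + 1)]) / 6.0;
    return out;
}
```

A temporally blocked GPU version performs two such sweeps per pass over the data, which is what makes the memory-bound kernel profitable on Fermi-class hardware.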


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2008

LibGeoDecomp: A Grid-Enabled Library for Geometric Decomposition Codes

Andreas Schäfer; Dietmar Fey

In this paper we present first results obtained with LibGeoDecomp, a work-in-progress library for scientific and engineering simulations on structured grids, geared at multi-cluster and grid systems. Today's parallel computers range from multi-core PCs to highly scaled, heterogeneous grids. With the growing complexity of grid resources on the one hand, and the increasing importance of computer-based simulations on the other, the agile development of highly efficient and adaptable parallel applications is imperative. LibGeoDecomp is to our knowledge the first library to support all state-of-the-art features, from dynamic load balancing and exchangeable domain decomposition techniques to ghost zones of arbitrary width and parallel I/O, along with a hierarchical parallelization whose layers can be adapted to reflect the underlying hierarchy of the grid system.


Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems | 2013

Using HPX and LibGeoDecomp for scaling HPC applications on heterogeneous supercomputers

Thomas Heller; Hartmut Kaiser; Andreas Schäfer; Dietmar Fey

With the general availability of PetaFLOP clusters and the advent of heterogeneous machines equipped with special accelerator cards such as the Xeon Phi[2], computer scientists face the difficult task of improving application scalability beyond what is possible with conventional techniques and programming models today. In addition, the need for highly adaptive runtime algorithms and for applications handling highly inhomogeneous data further impedes our ability to efficiently write code which performs and scales well. In this paper we present the advantages of using HPX[19, 3, 29], a general purpose parallel runtime system for applications of any scale, as a backend for LibGeoDecomp[25] for implementing a three-dimensional N-body simulation with local interactions. We compare scaling and performance results for this application while using the HPX and MPI backends for LibGeoDecomp. LibGeoDecomp is a library for geometric decomposition codes implementing the idea of a user-supplied simulation model, where the library handles the spatial and temporal loops and the data storage. The presented results are acquired from various homogeneous and heterogeneous runs including up to 1024 nodes (16384 conventional cores) combined with up to 16 Xeon Phi accelerators (3856 hardware threads) on TACC's Stampede supercomputer[1]. In the configuration using the HPX backend, more than 0.35 PFLOPS have been achieved, which corresponds to a parallel application efficiency of around 79%. Our measurements demonstrate the advantage of using the intrinsically asynchronous and message-driven programming model exposed by HPX, which enables better latency hiding, fine- to medium-grain parallelism, and constraint-based synchronization. HPX's uniform programming model simplifies writing highly parallel code for heterogeneous resources.
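The latency-hiding idea, overlapping communication with computation via futures, can be illustrated with plain C++ std::async; HPX offers the same future-based style (with its own API, not shown here), and the "exchange" below is a stand-in, not an actual HPX or MPI call:

```cpp
#include <future>
#include <numeric>
#include <vector>

// Overlap a (simulated) boundary exchange with interior computation:
// launch the exchange asynchronously, work on the interior meanwhile,
// and synchronize only when the halo value is actually needed.
double step(const std::vector<double>& interior, double boundary)
{
    // Stand-in for a neighbor exchange; returns the received halo value.
    std::future<double> halo = std::async(std::launch::async,
                                          [boundary] { return boundary; });
    // Interior work proceeds while the "exchange" is in flight.
    double sum = std::accumulate(interior.begin(), interior.end(), 0.0);
    return sum + halo.get();   // implicit synchronization point
}
```

In a bulk-synchronous MPI code the exchange would block before the interior work; the future-based formulation lets the runtime interleave the two, which is the effect the measurements above attribute to HPX.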


2014 Workshop on Exascale MPI at Supercomputing Conference | 2014

To INT_MAX... and beyond!: exploring large-count support in MPI

Jeff R. Hammond; Andreas Schäfer; Robert Latham

In order to describe a structured region of memory, the routines in the MPI standard use a (count, datatype) pair. The C specification for this convention uses an int type for the count. Since C int types are nearly always 32 bits wide and signed, counting more than 2^31 elements poses a challenge. Instead of changing the existing MPI routines, and all consumers of those routines, the MPI Forum asserts that users can build up large datatypes from smaller types. To evaluate this hypothesis and to provide a user-friendly solution to the large-count issue, we have developed BigMPI, a library on top of MPI that maps large-count MPI-like functions to MPI-3 standard features. BigMPI demonstrates a way to perform such a construction, reveals shortcomings of the MPI standard, and uncovers bugs in MPI implementations.
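The core of such a construction is splitting a 64-bit count into int-sized chunks plus a remainder, each of which can then be described by an ordinary derived datatype (e.g. a contiguous type per chunk). A minimal sketch of just that arithmetic, with no MPI calls; the function name is illustrative, not BigMPI's API:

```cpp
#include <climits>
#include <cstdint>
#include <utility>

// Decompose a large element count into (full_chunks, remainder), where each
// full chunk holds INT_MAX elements. A large-count datatype can then be built
// from `full_chunks` repetitions of a contiguous INT_MAX-element type plus a
// trailing contiguous type of `remainder` elements.
std::pair<std::int64_t, int> split_large_count(std::int64_t count)
{
    const std::int64_t chunk = INT_MAX;   // 2^31 - 1 on common ABIs
    return { count / chunk, static_cast<int>(count % chunk) };
}
```

Both resulting counts fit in an int, so the derived-type constructors of MPI-3 can express the whole region even though no single (count, datatype) call could.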


International Conference on Computational Science | 2008

Pollarder: An Architecture Concept for Self-adapting Parallel Applications in Computational Science

Andreas Schäfer; Dietmar Fey

Utilizing grid computing resources has become crucial to advances in today's computational science and engineering. To sustain efficiency, applications have to adapt to changing execution environments. Suitable implementations require huge efforts in terms of time and personnel. In this paper we describe the design of the Pollarder framework, a work in progress which offers a new approach to grid application componentization. It is based on a number of specialized design patterns to improve code reusability and flexibility. An adaptation layer handles environment discovery and is able to construct self-adapting applications from a user-supplied library of components. We provide first experiences gathered with a prototype implementation.


Proceedings of the 21st European MPI Users' Group Meeting | 2014

A Portable Petascale Framework for Efficient Particle Methods with Custom Interactions

Andreas Schäfer; Thomas Heller; Dietmar Fey

We report our advances in extending the computer simulation library LibGeoDecomp for particle-based models. This class of models ranges from N-body codes for astrophysics to molecular dynamics (MD) simulations for drug design. Current software packages primarily aim at offering solutions for specific sciences, e.g. GROMACS for MD. Conversely, our approach caters for users who need to implement new models and methods which are not yet covered by existing packages. Instead of restricting the user to the composition of predefined kernels, our framework allows users to describe their model by means of C++ classes and functions. Our framework takes over parallelization and data storage. The API is based on metaprogramming techniques (templates for code generation, flyweights for efficient callbacks). It is technically a domain specific language (DSL), but embedded into C++. Users do not have to reimplement their existing code in a new language; they just need to move data storage to our library and let it take over the spatial and temporal loops. With the help of our library, petascale applications can be written in a few hundred lines of C++ code. Using a proxy application, which implements a force-based N-body model, we were able to demonstrate that our library scales up to 9.4 PFLOPS on the Titan supercomputer, and up to 1.85 M MPI processes on JUQUEEN.
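The user-supplied-model idea can be shown with a toy cell class: the user writes only the state and the update rule, while a library-style driver owns the loops and the double-buffered storage. The class and driver names below are illustrative, not the actual LibGeoDecomp API:

```cpp
#include <cstddef>
#include <vector>

// User-supplied model: only the cell's state and update rule.
struct HeatCell {
    double t = 0.0;
    void update(const HeatCell& left, const HeatCell& right) {
        t = 0.5 * (left.t + right.t);   // 1D Jacobi-style rule
    }
};

// Library-style driver: owns the spatial and temporal loops and the
// double-buffered grid; boundary cells are kept fixed.
void run(std::vector<HeatCell>& grid, int steps)
{
    for (int s = 0; s < steps; ++s) {
        std::vector<HeatCell> next(grid);
        for (std::size_t i = 1; i + 1 < grid.size(); ++i)
            next[i].update(grid[i - 1], grid[i + 1]);
        grid.swap(next);
    }
}
```

The separation matters for scaling: because the driver controls iteration order and storage, it can transparently swap in domain decomposition, ghost-zone exchange, or an accelerator backend without touching the user's cell class.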


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

A Predictive Performance Model for Stencil Codes on Multicore CPUs

Andreas Schäfer; Dietmar Fey

In this paper we present an analytical performance model which yields estimates for the performance of stencil-based simulations. Unlike previous models, we neither rely on prototype implementations nor examine only the computational intensity. Our model allows for memory optimizations such as cache blocking and non-temporal stores. Multi-threading, loop unrolling, and vectorization are covered, too. The model is built from a sequence of 1D loops. For each loop we map the different parts of the instruction stream to the corresponding CPU pipelines and estimate their throughput. The load/store streams may be affected not only by their destination (the cache level or NUMA domain they target), but also by concurrent access from other threads. Evaluation of a Jacobi solver and the Himeno benchmark shows that the model is accurate enough to capture real-life kernels.


Irregular Applications: Architectures and Algorithms | 2011

Parallel simulation of dendritic growth on unstructured grids

Andreas Schäfer; Julian Hammer; Dietmar Fey

In this paper we present our findings from parallelizing a material science application which simulates dendritic growth in molten metal alloys. The simulation itself is based on an iterative 2D meshfree model. The simulation cells are tightly coupled and depend on neighbors within a relatively large radius, so the code turned out to be communication-bound. We present two different approaches to the parallelization: one written specifically for this application, and one which uses LibGeoDecomp, a stencil code library. Benchmarks show that the stencil code library performs much better than expected, despite not being designed for this use case.


International Conference on Computational Science | 2016

Evaluating Performance and Energy-efficiency of a Parallel Signal Correlation Algorithm on Current Multi and Manycore Architectures

Arne Hendricks; Thomas Heller; Andreas Schäfer; Maximilian Kasparek; Dietmar Fey

Increasing variety and affordability of multi- and many-core embedded architectures can pose both a challenge and an opportunity to developers of high performance computing applications. In this paper we present a case study in which we develop and evaluate a unified parallel approach to a signal-correlation algorithm currently in use in a commercial/industrial locating system. We utilize both the HPX C++ and CUDA runtimes to achieve scalable code for current embedded multi- and many-core architectures (NVIDIA Tegra, Intel Broadwell M, Arm Cortex A-15). We also compare our approach against traditional high-performance hardware as well as a native embedded many-core variant. To increase the accuracy of our performance analysis we introduce a dedicated performance model. The results show that our approach is feasible and enables us to harness the advantages of modern micro-server architectures, but also indicate that there are limitations to some of the currently existing many-core embedded architectures that can lead to traditional hardware being superior in both efficiency and absolute performance.


International Symposium on Computing and Networking | 2015

A Non-intrusive Technique for Interfacing Legacy Fortran Codes with Modern C++ Runtime Systems

Zachary D. Byerly; Hartmut Kaiser; Steven Brus; Andreas Schäfer

Many HPC applications developed over the past two decades have used Fortran and MPI-based parallelization. As the size of today's HPC resources continues to increase, these codes struggle to efficiently utilize the million-way parallelism of these platforms. Rewriting these codes from scratch to leverage modern programming paradigms would be time-consuming and error-prone. We evaluate a robust approach for interfacing with next-generation C++-based libraries and drivers. We have successfully used this technique to modify the Fortran code DGSWEM (Discontinuous Galerkin Shallow Water Equation Model), allowing it to take advantage of the new parallel runtime system HPX. Our goal was to make as few modifications to the DGSWEM Fortran source code as possible, thereby minimizing the chances of introducing bugs and reducing the amount of re-verification that needed to be done.
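Non-intrusive Fortran/C++ interfacing generally hinges on C linkage: the C++ side exports an unmangled symbol, and the Fortran side binds to it via ISO_C_BINDING (bind(C, name="...")). A minimal sketch of the C++ side; the function name and the single-rank copy body are illustrative, not DGSWEM's or HPX's actual interface:

```cpp
// C++ side of a Fortran/C++ bridge. extern "C" suppresses C++ name
// mangling so Fortran can bind to the symbol with
//   subroutine exchange_ghost_cells(send, recv, n) bind(C)
// using ISO_C_BINDING on the Fortran side.
extern "C" void exchange_ghost_cells(const double* send, double* recv, int n)
{
    // Placeholder for a runtime-system call (e.g. an HPX channel);
    // a single-rank exchange reduces to a plain copy.
    for (int i = 0; i < n; ++i)
        recv[i] = send[i];
}
```

Because the Fortran code only ever sees a C-callable subroutine, the C++ runtime behind it can be swapped or extended without touching the legacy source, which is what keeps the required re-verification small.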

Collaboration


Dive into Andreas Schäfer's collaborations.

Top Co-Authors

Dietmar Fey, University of Erlangen-Nuremberg
Thomas Heller, University of Erlangen-Nuremberg
Hartmut Kaiser, Louisiana State University
Adrian Serio, Louisiana State University
Daniel Bourgeois, Louisiana State University
Maciej Brodowicz, Indiana University Bloomington
Matthew Anderson, Indiana University Bloomington
Shuangyang Yang, Louisiana State University