Matthias Korch | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Matthias Korch is active.

Explore More

Publication

Featured researches published by Matthias Korch.

Concurrency and Computation: Practice and Experience | 2004

A comparison of task pools for dynamic load balancing of irregular algorithms

Matthias Korch; Thomas Rauber

Since a static work distribution does not allow for satisfactory speed‐ups of parallel irregular algorithms, there is a need for a dynamic distribution of work and data that can be adapted to the runtime behavior of the algorithm. Task pools are data structures which can distribute tasks dynamically to different processors where each task specifies computations to be performed and provides the data for these computations. This paper discusses the characteristics of task‐based algorithms and describes the implementation of selected types of task pools for shared‐memory multiprocessors. Several task pools have been implemented in C with POSIX threads and in Java. The task pools differ in the data structures to store the tasks, the mechanism to achieve load balance, and the memory manager used to store the tasks. Runtime experiments have been performed on three different shared‐memory systems using a synthetic algorithm, the hierarchical radiosity method, and a volume rendering algorithm. Copyright

Journal of Parallel and Distributed Computing | 2006

Optimizing locality and scalability of embedded Runge--Kutta solvers using block-based pipelining

Matthias Korch; Thomas Rauber

The increasing gap between the speeds of processors and main memory has led to hardware architectures with an increasing number of caches to reduce average memory access times. Such deep memory hierarchies make the sequential and parallel efficiency of computer programs strongly dependent on their memory access pattern. In this paper, we consider embedded Runge-Kutta methods for the solution of ordinary differential equations and study their efficient implementation on different parallel platforms. In particular, we focus on ordinary differential equations which are characterized by a special access pattern as it results from the spatial discretization of partial differential equations by the method of lines. We explore how the potential parallelism in the stage vector computation of such equations can be exploited in a pipelining approach leading to a better locality behavior and a higher scalability. Experiments show that this approach results in efficiency improvements on several recent sequential and parallel computers.

conference on high performance computing (supercomputing) | 2004

Performance Evaluation of Task Pools Based on Hardware Synchronization

Ralf Hoffmann; Matthias Korch; Thomas Rauber

A task-based execution provides a universal approach to dynamic load balancing for irregular applications. Tasks are arbitrary units of work that are created dynamically at run-time and that are stored in a parallel data structure, the task pool, until they are scheduled onto a processor for execution. In this paper, we evaluate the performance of different task pool implementations for shared-memory computer systems using several realistic applications. We consider task pools with different data structures, different load balancing strategies and a specialized memory management. In particular, we use synchronization operations based on hardware support that is available on many modern micro-processors. We show that the resulting task pool implementations lead to a much better performance than implementations using Pthreads library calls for synchronization. The applications considered are parallel quicksort, volume rendering, ray tracing, and hierarchical radiosity. The target machines are an IBM p690 server and a SunFire 6800.

computer software and applications conference | 2008

Transformation of Legacy Software into Client/Server Applications through Pattern-Based Rearchitecturing

Sascha Hunold; Matthias Korch; Björn Krellner; Thomas Rauber; Thomas Reichel; Gudula Rünger

In this article, we address the problem of modularizing legacy applications with monolithic structure, primarily focusing on business software written in an object-oriented programming language. We introduce the TransFormr toolkit that guides the developer through the entire incremental transformation process. It is the goal of the transformation to separate the original software into several independent replaceable components to support the migration of legacy code to new hardware or to integrate legacy components into modern enterprise applications. We show the effectiveness of our approach by demonstrating a pattern-based transformation of classes in a case study.

european conference on parallel processing | 2003

Scalable Parallel RK Solvers for ODEs Derived by the Method of Lines

Matthias Korch; Thomas Rauber

This paper describes how the specific access structure of the Brusselator equation, a typical example for ordinary differential equations (ODEs) derived by the method of lines, can be exploited to obtain scalable distributed-memory implementations of explicit Runge-Kutta (RK) solvers. These implementations need less communication and therefore achieve better speed-ups than general explicit RK implementations. Particularly, we consider implementations based on a pipelining computation scheme leading to an improved locality behavior.

european conference on parallel processing | 2007

Locality optimized shared-memory implementations of iterated runge-kutta methods

Matthias Korch; Thomas Rauber

Iterated Runge-Kutta (IRK) methods are a class of explicit solution methods for initial value problems of ordinary differential equations (ODEs) which possess a considerable potential for parallelism across the method and the ODE system. In this paper, we consider the sequential and parallel implementation of IRK methods with the main focus on the optimization of the locality behavior. We introduce different implementation variants for sequential and shared-memory computer systems and analyze their runtime and cache performance on two modern supercomputer systems.

Concurrency and Computation: Practice and Experience | 2011

Scalability and locality of extrapolation methods on large parallel systems

Matthias Korch; Thomas Rauber; Carsten Scholtes

Time‐dependent processes can often be modeled by systems of ordinary differential equations (ODEs). Solving such a system for a detailed model can be highly computationally intensive. We investigate explicit extrapolation methods for solving such systems efficiently on current highly parallel supercomputer systems with shared‐or distributed‐memory architecture. We analyze and compare the scalability of several parallelization variants, some of them using multiple levels of parallelization. For a large class of ODE systems, data access costs are reduced considerably by exploiting the special structure of the ODE system. Furthermore, by employing a pipeline‐like loop structure, the locality of memory references is increased for such systems resulting in a better utilization of the cache hierarchy. Runtime experiments show that the optimized implementations can deliver a high scalability. Copyright

international conference on parallel processing | 2004

Using hardware operations to reduce the synchronization overhead of task pools

Ralf Hoffmann; Matthias Korch; Thomas Rauber

We consider the task-based execution of parallel irregular applications, which are characterized by an unpredictable computational structure induced by the input data. The dynamic load balancing required to execute such applications efficiently can be provided by task pools. Thus, the performance of a task-based irregular application is tightly coupled to the scalability and the overhead of the task pool used to execute it. In order to reduce this overhead this article considers the use of the hardware-specific synchronization operations compare & swap and load & reserve/store conditional. We present several different realizations of task pools using these operations. Runtime experiments on two shared-memory machines, a SunFire 6800 and an IBM p690, show that the new implementations obtain a significantly higher performance than implementations relying on the POSIX thread library for synchronization.

international conference on parallel processing | 2002

Evaluation of task pools for the implementation of parallel irregular algorithms

Matthias Korch; Thomas Rauber

/sup T/ask pools are data structures for the dynamic distribution of work to processors. This paper compares several realizations of task pools resulting from different internal organizations such as shared or distributed organizations as well as a combination of them. The effect of different memory managers is also considered. The paper gives a detailed comparison of the resulting performance for task pools implemented in C with POSIX threads for selected irregular applications on current multiprocessor machines.

Journal of Parallel and Distributed Computing | 2014

Online auto-tuning for the time-step-based parallel solution of ODEs on shared-memory systems

Natalia Kalinnik; Matthias Korch; Thomas Rauber

Abstract This article considers automatic performance tuning of time-step-based parallel solution methods for initial value problems (IVPs) of systems of ordinary differential equations (ODEs). We apply auto-tuning to the parallel execution of a class of explicit predictor–corrector (PC) methods of Runge–Kutta (RK) type on shared-memory architectures. The performance of parallel multi-threaded implementation variants of these methods depends on various factors only known at runtime, for example, the coupling structure of the ODE system to be solved, the memory access pattern resulting from this coupling structure, and the number of threads executing the program. We propose an online auto-tuning approach that exploits the time-stepping nature of ODE methods by selecting the best parallel implementation variant from a set of candidate implementations at runtime during the first time steps. Thus, the auto-tuning process is not isolated from the computation, but rather contributes to the progress of the solution process. The search space of candidate implementations is a priori reduced by estimating the synchronization overhead of each implementation variant. For implementation variants containing tiled loops, suitable tile sizes are selected using a heuristic empirical search guided by an analytical model. Runtime experiments with two different test problems show the efficiency of the online auto-tuning approach on two different shared-memory systems equipped with 48 and 1040 cores.

Explore More