
Publications

Featured research published by Carsten Scholtes.


Concurrency and Computation: Practice and Experience | 2011

Scalability and locality of extrapolation methods on large parallel systems

Matthias Korch; Thomas Rauber; Carsten Scholtes

Time-dependent processes can often be modeled by systems of ordinary differential equations (ODEs). Solving such a system for a detailed model can be highly computationally intensive. We investigate explicit extrapolation methods for solving such systems efficiently on current highly parallel supercomputer systems with shared- or distributed-memory architecture. We analyze and compare the scalability of several parallelization variants, some of them using multiple levels of parallelization. For a large class of ODE systems, data access costs are reduced considerably by exploiting the special structure of the ODE system. Furthermore, by employing a pipeline-like loop structure, the locality of memory references is increased for such systems, resulting in better utilization of the cache hierarchy. Runtime experiments show that the optimized implementations deliver high scalability.
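As an illustration of this class of methods (a minimal sketch, not the paper's implementation), the following applies Aitken-Neville extrapolation to explicit Euler microsteps; the order `r`, the step-number sequence, and the test problem are arbitrary choices for demonstration:

```python
import numpy as np

def euler_substeps(f, t, y, H, n):
    """Integrate one macrostep of size H with n explicit Euler microsteps."""
    h = H / n
    for i in range(n):
        y = y + h * f(t + i * h, y)
    return y

def extrapolation_step(f, t, y, H, r=4):
    """One macrostep of an explicit extrapolation method based on Euler,
    using the step-number sequence n_j = j + 1. The rows of the table are
    independent, which is one source of parallelism in such methods."""
    n = [j + 1 for j in range(r)]
    T = [[None] * r for _ in range(r)]
    for j in range(r):                       # independent rows
        T[j][0] = euler_substeps(f, t, y, H, n[j])
    for k in range(1, r):                    # Aitken-Neville elimination
        for j in range(k, r):
            T[j][k] = T[j][k-1] + (T[j][k-1] - T[j-1][k-1]) / (n[j] / n[j-k] - 1)
    return T[r-1][r-1]

# Usage: y' = -y, y(0) = 1; ten macrosteps of H = 0.1 approximate e^{-1}
f = lambda t, y: -y
y = np.array([1.0])
t, H = 0.0, 0.1
for _ in range(10):
    y = extrapolation_step(f, t, y, H)
    t += H
```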


european conference on parallel processing | 1996

Shared-Memory Implementation of an Irregular Particle Simulation Method

Thomas Rauber; Gudula Rünger; Carsten Scholtes

We investigate a parallel implementation of an irregular particle simulation algorithm. We concentrate on the question of which programming and system support is needed to obtain an efficient implementation for a large number of processors. As the execution platform we use the SB-PRAM, a shared-memory machine with up to 4096 processors.


high performance computing and communications | 2011

Memory-Intensive Applications on a Many-Core Processor

Matthias Korch; Thomas Rauber; Carsten Scholtes

Future microprocessors are expected to contain an increasing number of cores. Different models exist for efficiently organizing the cores of the resulting many-core processors. The Single-Chip Cloud Computer (SCC) is an experimental processor created by Intel Labs. It is designed to provide each core with a programming model similar to that of a node in a message-passing distributed system. We have examined the performance of a memory-intensive application on the SCC. The application solves initial value problems (IVPs) of ordinary differential equations (ODEs). Experiments with different configurations and optimizations of this application have been performed. The evaluation of these experiments reveals bottlenecks and provides hints for optimizing applications for similar many-core architectures.


Simulation Practice and Theory | 1998

Execution behavior analysis and performance prediction for a shared-memory implementation of an irregular particle simulation method

Thomas Rauber; Gudula Rünger; Carsten Scholtes

Many computationally intensive problems from science and engineering are irregular in nature. This makes it difficult to develop an efficient parallel implementation, even for shared-memory machines. As a typical example, we investigate a parallel implementation of an irregular particle simulation algorithm. We concentrate on the question of which programming and system support is needed to obtain an efficient implementation for a large number of processors. As an execution platform we use the SB-PRAM, a shared-memory machine with up to 2048 processors. The processors of the SB-PRAM can access the global memory in unit time, which is the basis for an exact performance prediction. Common approaches for parallel implementations, such as lock protection for concurrent accesses and sequential or distributed task queues, are replaced by more efficient access mechanisms and data structures which can be realized by the powerful multiprefix operations of the SB-PRAM. Their use simplifies the implementation and yields large speedup values.
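The core of the multiprefix idea can be sketched as a lock-free task queue in which each processor claims its next work item with a single fetch-and-add. The sketch below is an assumption-laden analogy, not the SB-PRAM code: Python threads stand in for PRAM processors, and `itertools.count` stands in for the MPADD operation (its `next()` is effectively atomic on CPython because it is implemented in C):

```python
from itertools import count
from threading import Thread

def run_tasks(tasks, num_workers=4):
    """Execute all tasks in parallel; workers claim indices via a shared
    counter instead of a lock-protected queue."""
    next_index = count()           # shared counter; next() acts as fetch-and-add
    results = [None] * len(tasks)

    def worker():
        while True:
            i = next(next_index)   # claim a unique slot without locking
            if i >= len(tasks):
                return
            results[i] = tasks[i]()  # distinct slots, so no write conflicts

    threads = [Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Usage: square the numbers 0..99 with four workers
out = run_tasks([lambda i=i: i * i for i in range(100)])
```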


european conference on parallel processing | 2010

Scalability and locality of extrapolation methods for distributed-memory architectures

Matthias Korch; Thomas Rauber; Carsten Scholtes

The numerical simulation of systems of ordinary differential equations (ODEs), which arise from the mathematical modeling of time-dependent processes, can be highly computationally intensive. Thus, efficient parallel solution methods are desirable. This paper considers the parallel solution of systems of ODEs by explicit extrapolation methods. We analyze and compare the scalability of several implementation variants for distributed-memory architectures which make use of different load balancing strategies and different loop structures. By exploiting the special structure of a large class of ODE systems, the communication costs can be reduced considerably. Further, by processing the microsteps using a pipeline-like loop structure, the locality of memory references can be increased and a better utilization of the cache hierarchy can be achieved. Runtime experiments on modern parallel computer systems show that the optimized implementations deliver high scalability.


computational science and engineering | 2005

A method to derive the cache performance of irregular applications on machines with direct mapped caches

Carsten Scholtes

A probabilistic method is presented to derive the cache performance of irregular applications on machines with direct mapped caches from inspection of the source code. The method has been applied to analyse both a program to multiply a sparse matrix with a dense matrix and a program for the Cholesky-factorisation of a sparse matrix. The resulting predictions are compared with measurements of the respective programs.
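A source-level prediction of this kind is typically validated against a measured or simulated miss count. As a sketch of such a baseline (with hypothetical cache parameters, not those used in the paper), a direct-mapped cache can be simulated in a few lines; the usage example shows how two arrays whose bases map to the same sets produce a conflict miss on every access:

```python
def direct_mapped_misses(addresses, num_sets=256, line_size=64):
    """Count the misses of a direct-mapped cache for a sequence of byte
    addresses. Each set holds exactly one line, identified by its tag."""
    tags = [None] * num_sets
    misses = 0
    for a in addresses:
        line = a // line_size
        s = line % num_sets          # set index of this address
        if tags[s] != line:          # miss: wrong line (or empty set)
            tags[s] = line
            misses += 1
    return misses

# Usage: interleaved accesses to two arrays 16 KiB apart (= cache size),
# so every pair maps to the same set and evicts the other array's line.
stride_conflict = [x for i in range(1000) for x in (i * 8, 16384 + i * 8)]
conflict_misses = direct_mapped_misses(stride_conflict)   # every access misses

# A purely sequential scan only misses once per 64-byte line.
seq_misses = direct_mapped_misses([i * 8 for i in range(1000)])
```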


International Journal of High Speed Computing | 1999

Scalability of Sparse Cholesky Factorization

Thomas Rauber; Gudula Rünger; Carsten Scholtes

A variety of algorithms have been proposed for sparse Cholesky factorization, including left-looking, right-looking, and supernodal algorithms. This article investigates shared-memory implementations of several variants of these algorithms in a task-oriented execution model with dynamic scheduling. In particular, we consider the degree of parallelism, the scalability, and the scheduling overhead of the different algorithms. Our emphasis lies on the parallel implementation for relatively large numbers of processors. As the execution platform, we use the SB-PRAM, a shared-memory machine with up to 2048 processors. This article can be considered a case study in which we try to answer the question of what performance can be expected for a typical irregular application on an ideal machine, on which the locality of memory accesses can be ignored but the overhead for the management of data structures still takes effect. The investigation shows that certain algorithms are the best choice for a small number of processors, while other algorithms are better for many processors.
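For readers unfamiliar with the terminology, the left-looking variant can be sketched on a dense matrix (the paper studies the sparse case, where only nonzero columns interact). In the sparse Cholesky literature the two kinds of operations below, column modification and column division, are commonly called cmod(j, k) and cdiv(j), and each becomes a schedulable task in a task-oriented execution model:

```python
import math

def left_looking_cholesky(A):
    """Dense left-looking Cholesky sketch: column j is first updated by
    all already-finished columns to its left, then scaled. The input A
    (symmetric positive definite, lower part used) is overwritten."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # cmod(j, k): subtract the contribution of each earlier column k < j
        for k in range(j):
            for i in range(j, n):
                A[i][j] -= L[i][k] * L[j][k]
        # cdiv(j): scale column j by the square root of its diagonal entry
        L[j][j] = math.sqrt(A[j][j])
        for i in range(j + 1, n):
            L[i][j] = A[i][j] / L[j][j]
    return L

# Usage: a small SPD matrix with an exact integer-valued factor
L = left_looking_cholesky([[4.0, 2.0, 2.0],
                           [2.0, 5.0, 3.0],
                           [2.0, 3.0, 6.0]])
```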


international symposium on parallel and distributed computing | 2012

Diamond-Like Tiling Schemes for Efficient Explicit Euler on GPUs

Matthias Korch; Julien Kulbe; Carsten Scholtes

GPU computing offers a high potential of raw processing power at comparatively low costs. This paper investigates optimization techniques for solving initial value problems (IVPs) of ordinary differential equations (ODEs) on GPUs. Different techniques, especially for exploiting the GPU memory hierarchy, are discussed, and corresponding OpenCL implementations of the explicit Euler method are compared using runtime experiments. The results show considerable performance improvements in many situations. Due to the basic character of the explicit Euler method, the results of this investigation can guide the optimization of more complex ODE methods with higher order and better stability on GPUs.
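For reference, the method being tiled is simple: explicit Euler advances the approximation by y_{k+1} = y_k + h * f(t_k, y_k), and each step reads only the previous one, which is what makes tiling across time steps attractive. A minimal scalar sketch (assumed test problem, not the paper's OpenCL kernels):

```python
def explicit_euler(f, y0, t0, t_end, h):
    """Reference explicit Euler for the IVP y' = f(t, y), y(t0) = y0.
    Uses a fixed step count so the final time is reached exactly."""
    n = round((t_end - t0) / h)
    y = list(y0)
    for k in range(n):
        t = t0 + k * h
        fy = f(t, y)                                  # one RHS evaluation
        y = [yi + h * fi for yi, fi in zip(y, fy)]    # componentwise update
    return y

# Usage: y' = y, y(0) = 1 approximates e at t = 1 (first-order accurate)
y = explicit_euler(lambda t, y: y, [1.0], 0.0, 1.0, 0.001)
```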


international conference on parallel and distributed systems | 2010

Mixed-Parallel Implementations of Extrapolation Methods with Reduced Synchronization Overhead for Large Shared-Memory Computers

Matthias Korch; Thomas Rauber; Carsten Scholtes

Extrapolation methods belong to the class of one-step methods for the solution of systems of ordinary differential equations (ODEs). In this paper, we present parallel implementation variants of extrapolation methods for large shared-memory computer systems which exploit pure data parallelism or mixed task and data parallelism and make use of different load balancing strategies and different loop structures. In addition to general implementation variants suitable for ODE systems with arbitrary access structure, we devise specialized implementation variants which exploit the specific access structure of a large class of ODE systems to reduce synchronization costs and to improve the locality of memory references. We analyze and compare the scalability and the locality behavior of the implementation variants on an SGI Altix 4700 using up to 500 threads.


international conference on parallel architectures and compilation techniques | 2007

Trace-based Automatic Padding for Locality Improvement with Correlative Data Visualization Interface

Marco Hobbel; Thomas Rauber; Carsten Scholtes

An important goal for most programs from scientific or engineering computing is to reduce the execution time as far as possible. These programs refer to user data in problem-specific access patterns. For regular applications, it can be expected that repeated program executions exhibit a similar memory access pattern. Many data-intensive applications benefit from a high memory bandwidth, which, on modern architectures, is supported by a multi-level cache hierarchy. The efficiency of caching depends strongly on temporal and spatial reuse; thus, potential cache conflicts should be avoided.
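The padding idea behind such tools can be sketched as follows (a hypothetical helper under assumed cache parameters, not the paper's trace-based tool): when two arrays are accessed in lockstep with equal strides, their accesses keep a constant cache-set distance, so inserting a few cache lines of padding between the arrays is enough to remove all conflict misses in a direct-mapped cache:

```python
def conflict_free_base(base_a, size_a, num_sets=256, line_size=64):
    """Place array b directly behind array a, then grow the inter-array
    padding by whole cache lines until the two base addresses map to
    different sets of a direct-mapped cache with num_sets * line_size
    bytes. Returns the padded base address for b."""
    set_of = lambda addr: (addr // line_size) % num_sets
    base_b = base_a + size_a
    while set_of(base_b) == set_of(base_a):
        base_b += line_size          # add one line of padding
    return base_b

# Usage: an array exactly one cache size (16 KiB) long would alias its
# successor, so one line of padding is inserted; a short array needs none.
padded = conflict_free_base(0, 16384)   # aliasing case
plain = conflict_free_base(0, 64)       # already conflict-free
```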

Collaboration


Dive into Carsten Scholtes's collaboration.

Top Co-Authors

Gudula Rünger

Chemnitz University of Technology
