Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Markus Kowarschik is active.

Publications


Featured research published by Markus Kowarschik.


International Conference on Computational Science | 2004

A Tool Suite for Simulation Based Analysis of Memory Access Behavior

Josef Weidendorfer; Markus Kowarschik; Carsten Trinitis

In this paper, two tools are presented: an execution-driven cache simulator which relates event metrics to a dynamically built-up call graph, and a graphical front end able to visualize the generated data in various ways. To provide a general-purpose, easy-to-use tool suite, the simulation approach takes advantage of runtime instrumentation, i.e., no preparation of the application code is needed, and enables sophisticated preprocessing of the data already in the simulation phase. In an ongoing project, research on advanced cache analysis is based on these tools. Taking a multigrid solver as an example, we present the results obtained from the cache simulation together with real data measured by hardware performance counters.
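
The core of such an execution-driven simulation can be pictured in a few lines. The sketch below is a hypothetical, heavily simplified model (a single direct-mapped cache with 64-byte lines; the names and parameters are made up, and none of this is the paper's actual tool). It classifies each observed memory access as a hit or a miss; the real tool additionally attributes these event counts to the nodes of a dynamically built call graph.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct DirectMappedCache {
        static constexpr int kLineBytes = 64;
        std::vector<uint64_t> tags;   // one tag per cache line; ~0 marks empty
        uint64_t hits = 0, misses = 0;

        explicit DirectMappedCache(int lines) : tags(lines, ~0ull) {}

        void access(uint64_t addr) {
            uint64_t line = addr / kLineBytes;   // which memory line
            size_t   set  = line % tags.size();  // which cache slot it maps to
            if (tags[set] == line) ++hits;
            else { ++misses; tags[set] = line; } // miss: evict and refill
        }
    };

    int main() {
        DirectMappedCache cache(512);            // 512 x 64 B = 32 KiB
        // Feed the model the addresses of two sweeps over a 1 MiB array,
        // as an instrumented program would feed it real load addresses.
        for (int rep = 0; rep < 2; ++rep)
            for (uint64_t addr = 0; addr < 1 << 20; addr += 8)
                cache.access(addr);
        std::printf("hits: %llu, misses: %llu\n",
                    (unsigned long long)cache.hits,
                    (unsigned long long)cache.misses);
    }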


Parallel Processing Letters | 2003

Optimization and Profiling of the Cache Performance of Parallel Lattice Boltzmann Codes

Thomas Pohl; Markus Kowarschik; Jens Wilke; Klaus Iglberger; Ulrich Rüde

When designing and implementing highly efficient scientific applications for parallel computers such as clusters of workstations, it is essential to consider and optimize the single-CPU performance of the codes. For this purpose, it is particularly important that the codes respect the hierarchical memory designs that computer architects employ to hide the effects of the growing gap between CPU performance and main memory speed. In this article, we present techniques to enhance the single-CPU efficiency of lattice Boltzmann methods, which are commonly used in computational fluid dynamics. We show various performance results for both 2D and 3D codes in order to emphasize the effectiveness of our optimization techniques.
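
One family of single-CPU optimizations for such codes concerns the memory layout of the distribution functions. The sketch below (hypothetical types, not the article's code) contrasts the two standard choices for a D2Q9 lattice: the collision step favours keeping a cell's nine values together (array-of-structures), while the streaming step favours keeping each direction contiguous (structure-of-arrays).

    #include <cstddef>
    #include <vector>

    constexpr int Q = 9;   // D2Q9: nine distribution functions per cell

    // Array-of-structures: the Q values of one cell are contiguous.
    // Cache friendly for collision, which reads/writes all Q values of a cell.
    struct CellAoS { double f[Q]; };
    using GridAoS = std::vector<CellAoS>;

    // Structure-of-arrays: all values of one direction are contiguous.
    // Cache friendly for streaming, which shifts each direction independently.
    struct GridSoA {
        std::vector<double> f[Q];
        explicit GridSoA(std::size_t cells) { for (auto& v : f) v.resize(cells); }
    };

Which layout wins depends on whether collision or streaming dominates the memory traffic; that is exactly the kind of trade-off cache profiling of the sort described here can quantify.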


Lecture Notes in Computer Science | 2003

An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms

Markus Kowarschik; Christian Weiß

In order to mitigate the impact of the growing gap between CPU speed and main memory performance, today's computer architectures implement hierarchical memory structures. The idea behind this approach is to hide both the low bandwidth and the high latency of main memory accesses, which are slow in contrast to the floating-point performance of the CPUs. At the top of the hierarchy sit the CPU registers: a small and expensive high-speed memory, integrated within the processor chip, that provides data with low latency and high bandwidth. Moving further away from the CPU, the layers of memory successively become larger and slower. The memory components located between the processor core and main memory are called cache memories, or caches. They are intended to contain copies of main memory blocks in order to speed up accesses to frequently needed data [378], [392]. The next lower level of the memory hierarchy is the main memory, which is large but also comparatively slow. While external memory such as hard disk drives, or remote memory components in a distributed computing environment, represents the lower end of any common hierarchical memory design, this paper focuses on optimization techniques for enhancing cache performance.
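
The practical consequence of this hierarchy is that the order of memory accesses matters as much as their number. Both functions below (a hypothetical illustration, not from the paper) compute the same sum over a row-major n x n matrix; the first walks memory contiguously and misses the cache roughly once per 64-byte line, while the second strides by a full row per access and, for large n, can miss on nearly every element.

    #include <cstddef>
    #include <vector>

    double sum_row_major(const std::vector<double>& a, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)      // row by row: contiguous,
            for (std::size_t j = 0; j < n; ++j)  // cache friendly
                s += a[i * n + j];
        return s;
    }

    double sum_col_major(const std::vector<double>& a, std::size_t n) {
        double s = 0.0;
        for (std::size_t j = 0; j < n; ++j)      // column by column: each access
            for (std::size_t i = 0; i < n; ++i)  // strides by a whole row
                s += a[i * n + j];
        return s;
    }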


Archive | 2006

Parallel Geometric Multigrid

Frank Hülsemann; Markus Kowarschik; Marcus Mohr; Ulrich Rüde

Multigrid methods are among the fastest numerical algorithms for the solution of large sparse systems of linear equations. While these algorithms exhibit asymptotically optimal computational complexity, their efficient parallelisation is hampered by the poor computation-to-communication ratio on the coarse grids. Our contribution discusses parallelisation techniques for geometric multigrid methods. It covers both theoretical approaches and practical implementation issues that may guide code development.
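
To make that structure concrete, the following is a minimal serial sketch of a geometric multigrid V-cycle for the 1D Poisson problem -u'' = f (an illustrative toy under assumed names, not the chapter's code). The comments mark where a distributed implementation would insert halo exchanges; the recursion makes visible why coarse grids carry so little computation per communication step.

    #include <cstdio>
    #include <vector>
    using Vec = std::vector<double>;

    // Weighted-Jacobi smoothing sweeps for -u'' = f, grid spacing h.
    void smooth(Vec& u, const Vec& f, double h, int sweeps) {
        const double omega = 2.0 / 3.0;
        Vec tmp(u);
        for (int s = 0; s < sweeps; ++s) {
            // A parallel version would exchange ghost values of u here.
            for (size_t i = 1; i + 1 < u.size(); ++i)
                tmp[i] = (1 - omega) * u[i]
                       + omega * 0.5 * (u[i-1] + u[i+1] + h * h * f[i]);
            u.swap(tmp);
        }
    }

    Vec residual(const Vec& u, const Vec& f, double h) {
        Vec r(u.size(), 0.0);
        for (size_t i = 1; i + 1 < u.size(); ++i)
            r[i] = f[i] - (2 * u[i] - u[i-1] - u[i+1]) / (h * h);
        return r;
    }

    // Full weighting: fine grid (2^k + 1 points) to coarse grid.
    Vec restrict_fw(const Vec& fine) {
        Vec coarse((fine.size() + 1) / 2, 0.0);
        for (size_t i = 1; i + 1 < coarse.size(); ++i)
            coarse[i] = 0.25 * (fine[2*i-1] + 2 * fine[2*i] + fine[2*i+1]);
        return coarse;
    }

    // Linear interpolation: coarse grid back to the fine grid.
    Vec prolongate(const Vec& coarse, size_t nfine) {
        Vec fine(nfine, 0.0);
        for (size_t i = 0; i + 1 < coarse.size(); ++i) {
            fine[2*i]     = coarse[i];
            fine[2*i + 1] = 0.5 * (coarse[i] + coarse[i+1]);
        }
        fine[nfine - 1] = coarse.back();
        return fine;
    }

    void vcycle(Vec& u, const Vec& f, double h) {
        if (u.size() <= 3) { smooth(u, f, h, 50); return; }  // coarse "solve"
        smooth(u, f, h, 2);                                  // pre-smoothing
        Vec r = restrict_fw(residual(u, f, h));
        Vec e(r.size(), 0.0);
        vcycle(e, r, 2 * h);   // coarse grids: little work, communication dominates
        Vec ef = prolongate(e, u.size());
        for (size_t i = 0; i < u.size(); ++i) u[i] += ef[i]; // correction
        smooth(u, f, h, 2);                                  // post-smoothing
    }

    int main() {
        const size_t n = 129; const double h = 1.0 / (n - 1);
        Vec u(n, 0.0), f(n, 1.0);               // -u'' = 1, u(0) = u(1) = 0
        for (int k = 0; k < 10; ++k) vcycle(u, f, h);
        std::printf("u(0.5) ~ %f (exact 0.125)\n", u[n / 2]);
    }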


Computing | 2000

Cache-aware multigrid methods for solving Poisson's equation in two dimensions

Markus Kowarschik; Christian Weiß; Wolfgang Karl; Ulrich Rüde

Conventional implementations of iterative numerical algorithms, especially multigrid methods, reach merely a disappointingly small percentage of the theoretically available CPU performance when applied to representative large problems. One of the most important reasons for this phenomenon is that the need for data locality, due to poor main memory latency and limited bandwidth, is entirely neglected by many developers designing numerical software. Fast program execution can only be expected when most of the data to be accessed during the computation are found in the system cache (or in one of the caches, if the machine architecture comprises a cache hierarchy). Otherwise, i.e., in case of a significant rate of cache misses, the processor must stay idle until the necessary operands are fetched from main memory, whose cycle time is in general extremely long compared to the time needed to execute a floating-point instruction. In this paper, we describe program transformation techniques developed to improve the cache performance of two-dimensional multigrid algorithms. Although we merely consider the solution of Poisson's equation on the unit square using structured grids, our techniques provide valuable hints towards the efficient treatment of more general problems.
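
One transformation of this kind can be sketched in a few lines: fusing the red and black partial sweeps of a red-black Gauss-Seidel smoother, so the grid passes through the cache once per iteration instead of twice. The code below is a hypothetical illustration (5-point stencil, Dirichlet boundary stored as an extra grid layer; not the paper's implementation). Because every black point of row j-1 depends only on red points that have already been updated, the fused order reproduces the standard red-black result exactly.

    #include <vector>

    // u, f: (n+2) x (n+2) row-major grids including the boundary layer.
    inline void relax(std::vector<double>& u, const std::vector<double>& f,
                      int s, double h, int j, int i) {
        u[j * s + i] = 0.25 * (u[(j - 1) * s + i] + u[(j + 1) * s + i]
                             + u[j * s + i - 1]   + u[j * s + i + 1]
                             + h * h * f[j * s + i]);
    }

    void fused_red_black_sweep(std::vector<double>& u,
                               const std::vector<double>& f,
                               int n, double h) {
        const int s = n + 2;
        for (int j = 1; j <= n; ++j) {
            for (int i = (j % 2 ? 1 : 2); i <= n; i += 2)     // red, row j
                relax(u, f, s, h, j, i);
            if (j >= 2)                                        // black, row j-1,
                for (int i = (j % 2 ? 1 : 2); i <= n; i += 2)  // while cached
                    relax(u, f, s, h, j - 1, i);
        }
        for (int i = (n % 2 ? 2 : 1); i <= n; i += 2)          // trailing black row
            relax(u, f, s, h, n, i);
    }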


International Workshop on Petri Nets and Performance Models | 1997

State space construction and steady-state solution of GSPNs on a shared-memory multiprocessor

Susann C. Allmaier; Markus Kowarschik; Graham Horton

A common approach for the quantitative analysis of a generalized stochastic Petri net (GSPN) is to generate its entire state space and then solve the corresponding continuous-time Markov chain (CTMC) numerically. This analysis often suffers from two major problems: the state space explosion and the stiffness of the CTMC. In this paper we present parallel algorithms for shared-memory machines that attempt to alleviate both of these difficulties: the large main memory capacity of a multiprocessor can be utilized and long computation times are reduced by efficient parallelization. The algorithms comprise both CTMC construction and numerical steady-state solution. We give experimental results obtained with a Convex SPP1600 shared-memory multiprocessor that show the behavior of the algorithms and the parallel speedups obtained.
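
The construction step can be pictured with a toy example. The sketch below (a hypothetical three-place net with made-up names; serial rather than parallel, and not the paper's code) performs the breadth-first state space exploration that the paper parallelizes: in the parallel version, workers pop markings from a shared queue and insert newly discovered successors into a shared, lock-protected state table instead of the plain std::map used here.

    #include <cstdio>
    #include <map>
    #include <queue>
    #include <vector>

    using Marking = std::vector<int>;                 // tokens per place
    struct Transition { std::vector<int> in, out; };  // input/output places

    int main() {
        // Toy net: 3 places, tokens cycle 0 -> 1 -> 2 -> 0.
        std::vector<Transition> trans = {{{0}, {1}}, {{1}, {2}}, {{2}, {0}}};
        Marking initial = {2, 0, 0};

        std::map<Marking, int> id;    // marking -> state index (visited set)
        std::queue<Marking> frontier;
        id[initial] = 0;
        frontier.push(initial);

        while (!frontier.empty()) {
            Marking m = frontier.front(); frontier.pop();
            for (const auto& t : trans) {
                bool enabled = true;
                for (int p : t.in) if (m[p] < 1) { enabled = false; break; }
                if (!enabled) continue;
                Marking next = m;                      // fire the transition
                for (int p : t.in)  --next[p];
                for (int p : t.out) ++next[p];
                if (id.emplace(next, (int)id.size()).second)
                    frontier.push(next);               // newly discovered state
            }
        }
        std::printf("reachable states: %zu\n", id.size());
    }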


Concurrency and Computation: Practice and Experience | 2004

Parallel object-oriented framework optimization

Daniel J. Quinlan; Markus Schordan; Brian Miller; Markus Kowarschik

Sophisticated parallel languages are difficult to develop; most parallel distributed-memory scientific applications are developed in a serial language, expressing parallelism through third-party libraries (e.g. MPI). As a result, frameworks and libraries are often used to encapsulate significant complexities. We define a novel approach to optimizing the use of libraries within applications. The resulting tool, named ROSE, leverages the additional semantics provided by library-defined abstractions, enabling library-specific optimization of application codes. It is a common perception that performance is inversely proportional to the level of abstraction. Our work shows that this is not the case if the additional semantics can be leveraged. We show how ROSE can be used to leverage these semantics during compile-time optimization.
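
The following hypothetical before-and-after pair (not ROSE output; the Array type and function names are made up) illustrates the kind of rewrite that library semantics make safe: because the library's elementwise '+' is known to be pure, a source-to-source tool may replace the temporary-producing expression with one fused loop.

    #include <cstddef>
    #include <vector>
    using Array = std::vector<double>;

    // Library abstraction: convenient, but each '+' allocates a temporary
    // and streams its operands through memory again.
    Array operator+(const Array& x, const Array& y) {
        Array r(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] + y[i];
        return r;
    }

    Array add_naive(const Array& b, const Array& c, const Array& d) {
        return b + c + d;                   // two temporaries, three sweeps
    }

    // What a semantics-aware transformation can generate instead:
    Array add_fused(const Array& b, const Array& c, const Array& d) {
        Array a(b.size());
        for (std::size_t i = 0; i < b.size(); ++i)
            a[i] = b[i] + c[i] + d[i];      // no temporaries, one sweep
        return a;
    }

A general-purpose compiler cannot perform this rewrite on its own, since nothing in the language guarantees that the user-defined '+' is free of side effects; that guarantee is exactly the library-level semantics ROSE exploits.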


European Conference on Parallel Processing | 2003

Cache performance optimizations for parallel lattice Boltzmann codes

Jens Wilke; Thomas Pohl; Markus Kowarschik; Ulrich Rüde

When designing and implementing highly efficient scientific applications for parallel computers such as clusters of workstations, it is essential to consider and optimize the single-CPU performance of the codes. For this purpose, it is particularly important that the codes respect the hierarchical memory designs that computer architects employ to hide the effects of the growing gap between CPU performance and main memory speed. In this paper, we present techniques to enhance the single-CPU efficiency of lattice Boltzmann methods, which are commonly used in computational fluid dynamics. We show various performance results to emphasize the effectiveness of our optimization techniques.
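
One technique in this family can be sketched directly: fusing the collision and streaming steps into a single sweep, so each cell's distribution functions travel through the cache once per time step instead of twice. The code below is a minimal hypothetical illustration (D2Q9, BGK collision, periodic boundaries, a two-grid "push" scheme; the names and details are assumptions, not the paper's implementation).

    #include <vector>

    constexpr int Q = 9;
    constexpr int cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    constexpr int cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
    constexpr double w[Q] = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                             1.0/36, 1.0/36, 1.0/36, 1.0/36};

    // f stored as src[(y*nx + x)*Q + q]; src and dst are separate lattices.
    void collide_and_stream(const std::vector<double>& src,
                            std::vector<double>& dst,
                            int nx, int ny, double omega) {
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                const double* f = &src[(y * nx + x) * Q];
                double rho = 0, ux = 0, uy = 0;
                for (int q = 0; q < Q; ++q) {        // macroscopic moments
                    rho += f[q]; ux += cx[q] * f[q]; uy += cy[q] * f[q];
                }
                ux /= rho; uy /= rho;
                double usq = ux * ux + uy * uy;
                for (int q = 0; q < Q; ++q) {
                    double cu  = cx[q] * ux + cy[q] * uy;
                    double feq = w[q] * rho
                               * (1 + 3*cu + 4.5*cu*cu - 1.5*usq);
                    double fpost = f[q] + omega * (feq - f[q]); // BGK collision
                    int xn = (x + cx[q] + nx) % nx;             // periodic
                    int yn = (y + cy[q] + ny) % ny;             // streaming
                    dst[(yn * nx + xn) * Q + q] = fpost;        // push to neighbour
                }
            }
    }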


Languages and Compilers for Parallel Computing | 2001

The specification of source-to-source transformations for the compile-time optimization of parallel object-oriented scientific applications

Daniel J. Quinlan; Markus Schordan; Bobby Philip; Markus Kowarschik

The performance of object-oriented applications in scientific computing often suffers from the inefficient use of high-level abstractions provided by underlying libraries. Since these library abstractions are user-defined and not part of the programming language itself, there is no compiler mechanism to respect their semantics and thus to perform appropriate optimizations. In this paper we outline the design of ROSE and focus on two approaches for specifying and processing complex source code transformations. These techniques are intended to be as easy and intuitive as possible for potential ROSE users, i.e., designers of object-oriented scientific libraries, who most often have no compiler expertise.


Archive | 2000

Fixed and Adaptive Cache Aware Algorithms for Multigrid Methods

Craig C. Douglas; Jonathan Hu; Wolfgang Karl; Markus Kowarschik; Ulrich Rüde; Christian Weiß

Many current computer designs, including the node architecture of most parallel supercomputers, employ caches and a hierarchical memory structure. Hence, the speed of a multigrid code depends increasingly on how well the cache structure is exploited. Typical multigrid applications operate on data sets much too large to fit into any cache. Thus, applications should reuse data that has been brought into the cache as often as possible. In this paper, suitable fixed and adaptive blocking strategies for both structured and unstructured grids are introduced.
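
As a minimal illustration of the fixed variant, the sketch below (assumed 5-point Jacobi update and hypothetical names; not the paper's code) sweeps an n x n grid in B x B tiles so that a tile's data stays cache-resident while it is reused, instead of streaming whole rows through the cache.

    #include <algorithm>
    #include <vector>

    // uold, unew, f: (n+2) x (n+2) row-major grids including the boundary.
    void blocked_jacobi_sweep(const std::vector<double>& uold,
                              std::vector<double>& unew,
                              const std::vector<double>& f,
                              int n, double h, int B) {
        const int s = n + 2;                         // row stride incl. boundary
        for (int jj = 1; jj <= n; jj += B)           // tile rows
            for (int ii = 1; ii <= n; ii += B)       // tile columns
                for (int j = jj; j < std::min(jj + B, n + 1); ++j)
                    for (int i = ii; i < std::min(ii + B, n + 1); ++i)
                        unew[j * s + i] = 0.25 * (uold[(j-1) * s + i]
                                                + uold[(j+1) * s + i]
                                                + uold[j * s + i - 1]
                                                + uold[j * s + i + 1]
                                                + h * h * f[j * s + i]);
    }

Since the Jacobi update reads from the old grid and writes to a new one, the tiling changes only the traversal order, never the result; the adaptive strategies described here go further and also block across successive sweeps.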

Collaboration


Dive into Markus Kowarschik's collaborations.

Top Co-Authors

Ulrich Rüde (University of Erlangen-Nuremberg)
Daniel J. Quinlan (Lawrence Livermore National Laboratory)
Markus Schordan (Lawrence Livermore National Laboratory)
Jonathan Hu (University of Kentucky)
Brian Miller (Lawrence Livermore National Laboratory)
Wolfgang Karl (Karlsruhe Institute of Technology)
Bobby Philip (Lawrence Livermore National Laboratory)
Jens Wilke (University of Erlangen-Nuremberg)