
Publication


Featured research published by Jarmo Rantakokko.


International Workshop on OpenMP | 2005

Geographical locality and dynamic data migration for OpenMP implementations of adaptive PDE solvers

Markus Nordén; Henrik Löf; Jarmo Rantakokko; Sverker Holmgren

On cc-NUMA multiprocessors, the non-uniformity of main memory latencies motivates the need for co-location of threads and data. We call this special form of data locality geographical locality. In this article, we study the performance of a parallel PDE solver with adaptive mesh refinement. The solver is parallelized using OpenMP, and the adaptive mesh refinement makes dynamic load balancing necessary. Due to the dynamically changing memory access pattern caused by the runtime adaption, it is a challenging task to achieve a high degree of geographical locality. The main conclusions of the study are: (1) geographical locality is very important for the performance of the solver, (2) the performance can be improved significantly using dynamic page migration of misplaced data, (3) a migrate-on-next-touch directive works well, whereas the first-touch strategy is less advantageous for programs exhibiting a dynamically changing memory access pattern, and (4) the overhead for such migration is low compared to the total execution time.


Parallel Algorithms and Applications | 1998

A framework for partitioning structured grids with inhomogeneous workload

Jarmo Rantakokko

A framework is presented for partitioning arrays with irregular workload. Within the framework, structured and unstructured methods are combined in a new approach to partitioning data, based on blocks and block operations. A new variant of the recursive spectral bisection method suitable in this context is suggested. The use of the framework is demonstrated for a real-life application, ocean modeling of the Baltic Sea. In this case study, the new approach gives very good results, while standard partitioning methods cannot fulfill all the listed requirements. The operations have been implemented in a Fortran 90 software package with an object-oriented design.


Modern software tools for scientific computing | 1997

Object-oriented construction of parallel PDE solvers

Michael Thuné; Eva Mossberg; Peter Olsson; Jarmo Rantakokko; Krister Åhlander; Kurt Otto

An object-oriented approach is taken to the problem of formulating portable, easy-to-modify PDE solvers for realistic problems in three space dimensions. The resulting software library, Cogito, contains tools for writing programs to be executed on MIMD computers with distributed memory. Difference methods on composite, structured grids are supported. Most of the Cogito classes have been implemented in Fortran 77, in such a way that the object-oriented design is visible. With respect to parallel performance, these tools yield code that is comparable to parallel solvers written in plain Fortran 77. The resulting programs can be executed without modifications on a large number of multicomputer platforms, and also on serial computers.


International Journal of Parallel Programming | 2007

Dynamic data migration for structured AMR solvers

Markus Nordén; Henrik Löf; Jarmo Rantakokko; Sverker Holmgren

On cc-NUMA multiprocessors, the non-uniformity of main memory latencies motivates the need for co-location of threads and data. We call this special form of data locality geographical locality. In this article, we study the performance of a parallel PDE solver with adaptive mesh refinement (AMR). The solver is parallelized using OpenMP, and the adaptive mesh refinement makes dynamic load balancing necessary. Due to the dynamically changing memory access pattern caused by the runtime adaption, it is a challenging task to achieve a high degree of geographical locality. The main conclusions of the study are: (1) geographical locality is very important for the performance of the solver, (2) the performance can be improved significantly using dynamic page migration of misplaced data, (3) a migrate-on-next-touch directive works well, whereas the first-touch strategy is less advantageous for programs exhibiting a dynamically changing memory access pattern, and (4) the overhead for such migration is low compared to the total execution time.


Parallel Algorithms and Applications | 2002

Performance of PDE solvers on a self-optimizing NUMA architecture

Sverker Holmgren; Markus Nordén; Jarmo Rantakokko; Dan Wallin

The performance of shared-memory (OpenMP) implementations of three different PDE solver kernels, representing finite difference methods, finite volume methods and spectral methods, has been investigated. The experiments have been performed on a self-optimizing NUMA system, the Sun Orange prototype, using different data placement and thread scheduling strategies. The results show that correct data placement is very important for the performance of all solvers. However, the Orange system has the unique capability of automatically changing the data distribution at run time through both migration and replication of data. For reasonably large PDE problems, we find that the time to do this is negligible compared to the total solve time. Also, the performance after the migration and replication process has reached steady state is the same as that achieved when data is optimally placed at the beginning of the execution by hand tuning. This shows that, for the applications studied, the self-optimizing features are successful, and shared-memory code without explicit data distribution directives yields good performance.


Parallel Computing | 1997

Strategies for parallel variational data assimilation

Jarmo Rantakokko

The prospects for parallelizing a variational data assimilation scheme, starting from an existing parallel forecast model and its adjoint equations, have been investigated. Three parallelization strategies for the implementation and two partitioning algorithms for computing data distributions are suggested. Numerical simulations of the parallelizations show that the strategies and the partitioning algorithms can be combined to perform well on both shared-memory and distributed-memory machines, and that they give better results than direct use of standard parallelization methods.


Parallel Computing | 1998

Comparison of Partitioning Strategies for PDE Solvers on Multiblock Grids

Jarmo Rantakokko

Different partitioning strategies for multiblock grids have been compared experimentally. The numerical experiments were performed on a 512-processor Cray T3D using a compressible two-dimensional Navier-Stokes solver. Some complementary results were obtained with an advection equation solver on a Cray T3E-900. The results show that the behavior of the different parallelization strategies depends very much on the number of subgrids and their sizes, as well as on the number of available processors. In order to get optimal performance for a given problem and processor configuration, the partitioning strategy must be chosen with regard to these aspects. Our results give guidelines for this choice.


Archive | 2009

Parallel Structured Adaptive Mesh Refinement

Jarmo Rantakokko; Michael Thuné

Parallel structured adaptive mesh refinement is a technique for efficient utilization of computational resources. It reduces the computational effort and memory requirements needed for numerical simulation of complex phenomena, described by partial differential equations. Structured adaptive mesh refinement (SAMR) is applied in simulations where the domain is divided into logically rectangular patches, where each patch is discretized with a structured mesh. The purpose of adaptive mesh refinement is to automatically adapt the mesh to the resolution required to represent important features of the simulated phenomenon in different subdomains. In a parallel computing context, an important consequence of the adaptation is that the dynamically changing resolution leads to a dynamically changing workload, data volume, and communication pattern at run time. This calls for dynamic load balancing and has implications for data placement as well as parallelization granularity. This chapter gives an overview of structured adaptive mesh refinement approaches. After a brief introductory survey of SAMR techniques and software packages, the main part of the chapter addresses various issues related to the implementation of SAMR on parallel computers. In particular, programming models, data placement and load balancing are discussed, for shared-memory as well as distributed-memory platforms. Various approaches and algorithms are presented. The appropriate choice of dynamic load balancing algorithm, data placement strategy, programming model, etc., depends on both the application state and the computer platform.


Parallel Computing | 2000

A Local Refinement Algorithm for Data Partitioning

Jarmo Rantakokko

A local refinement method for data partitioning has been constructed. The method balances the workload and locally minimizes the number of edge cuts. The arithmetic complexity of the algorithm is low. The method is well suited for refinement in multilevel partitioning, where the intermediate partitions are near optimal but slightly unbalanced. It is also useful for improving global partitioning methods and for repartitioning in dynamic problems where the workload changes slightly. The algorithm has been compared with corresponding methods in Chaco and Metis in the context of multilevel partitioning. The cost of carrying out the partitioning with our method is lower than with Chaco and of the same order as with Metis, yet the quality of the partitioning is comparable and in some cases even better.


International Journal of Parallel, Emergent and Distributed Systems | 2006

Algorithmic Optimizations of a Conjugate Gradient Solver on Shared Memory Architectures

Henrik Löf; Jarmo Rantakokko

OpenMP is an architecture-independent language for programming in the shared-memory model. OpenMP is designed to be simple and powerful in terms of programming abstractions. Unfortunately, the architecture-independent abstractions sometimes come at the price of low parallel performance. This is especially true for applications with an unstructured data access pattern running on distributed shared memory (DSM) systems. Here, proper data distribution and algorithmic optimizations play a vital role for performance. In this article, we have investigated ways of improving the performance of an industrial-class conjugate gradient (CG) solver, implemented in OpenMP and running on two types of shared-memory systems. We have evaluated bandwidth minimization, graph partitioning and reformulations of the original algorithm that reduce global barriers. Through a detailed analysis of barrier time and memory system performance, we found that bandwidth minimization is the most important optimization, reducing both L2 misses and remote memory accesses. On a uniform memory system, we get perfect scaling. On a NUMA system, the performance is significantly improved by the algorithmic optimizations, leaving the system-dependent global reduction operations as a bottleneck.

Collaboration


Dive into Jarmo Rantakokko's collaboration.
