
Publications


Featured research published by Henrik Löf.


European Conference on Parallel Processing | 2003

THROOM: Supporting POSIX multithreaded binaries on a cluster

Henrik Löf; Zoran Radovic; Erik Hagersten



International Conference on Supercomputing | 2005

affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system

Henrik Löf; Sverker Holmgren

The non-uniform memory access times of modern cc-NUMA systems often impair performance for shared memory applications. This is especially true for applications exhibiting complex access patterns. To improve performance, a mechanism for co-locating threads and data during the execution is needed. In this paper, we study how an affinity-on-next-touch procedure can be used to attain this goal. Such a procedure can increase thread-data affinity by migrating data across nodes to better match the access pattern. The migration is triggered by a directive, and it can often be implemented as a re-invocation of a standard first-touch page placement procedure. We study an industrial-class scientific application where the thread-data affinity is low due to serial initializations of data structures accessed indirectly. Adding a single affinity-on-next-touch directive, we observed a performance improvement of 69% for 22 threads. We also perform experiments to study the scalability of the affinity-on-next-touch procedure. Our results indicate that the overhead associated with the procedure is highly dependent on the efficiency of the mechanism used to keep TLBs consistent. Using larger but fewer memory pages in the virtual memory sub-system, we measured a total performance improvement of 166% compared to the original code.
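
An affinity-on-next-touch directive of this kind is typically exposed through the operating system's memory-advice interface. The following is a minimal sketch, assuming the Solaris madvise(3C) call with the MADV_ACCESS_LWP advice (an assumption for illustration; the setup used in the paper may differ, and the function and variable names are hypothetical):

/* Sketch only: affinity-on-next-touch via Solaris madvise(3C), assuming
 * MADV_ACCESS_LWP is available. After the advice is given, each page in the
 * range is migrated to the node of the thread that touches it next. */
#include <stddef.h>
#include <sys/mman.h>
#include <omp.h>

void redistribute(double *x, size_t n)
{
    /* Ask the VM system to place each page near the next thread that touches it.
     * In practice the address range should be page-aligned. */
    madvise((caddr_t)x, n * sizeof(double), MADV_ACCESS_LWP);

    /* The solver's own parallel access pattern performs the "next touch":
     * pages migrate to the nodes where the owning threads run. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        x[i] = 0.0;   /* stand-in for the solver's next parallel sweep over x */
}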


International Conference on Supercomputing | 2006

Multigrid and Gauss-Seidel smoothers revisited: parallelization on chip multiprocessors

Dan Wallin; Henrik Löf; Erik Hagersten; Sverker Holmgren

Efficient solution of partial differential equations requires a match between the algorithm and the target architecture. Many recent chip multiprocessors, CMPs (a.k.a. multi-core), feature low intra-chip communication costs and smaller per-thread caches compared to previous shared memory multi-processor systems. From an algorithmic point of view this means that data locality issues become more important than communication overheads, a fact that may require a re-evaluation of many existing algorithms. We have investigated parallel implementations of multigrid methods using a parallel, temporally blocked, naturally ordered smoother. Compared to the standard multigrid solution based on a red-black ordering, we improve the data locality often as much as ten times, while our use of a fine-grained locking scheme keeps the parallel efficiency high. Our algorithm was initially inspired by CMPs, and it was surprising to see that our OpenMP multigrid implementation ran up to 40 percent faster than the standard red-black algorithm on a contemporary 8-way SMP system. Thanks to the temporal blocking introduced, our smoother implementation often allowed us to apply the smoother twice at the same cost as a single application of a red-black smoother. By executing our smoother on a 32-thread UltraSPARC T1 (Niagara) SMT/CMP and a simulated 32-way CMP, we demonstrate that such architectures can tolerate the increased communication costs implied by the tradeoffs made in our implementation.
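
For reference, the standard red-black smoother that the temporally blocked variant is compared against colours the grid like a checkerboard so that all points of one colour can be updated independently, which is what makes the straightforward OpenMP parallelization possible. A minimal, illustrative sketch for a 2D Poisson model problem (not the authors' code; array layout and boundary handling are assumptions):

/* Illustrative red-black Gauss-Seidel smoother for the 2D Poisson model
 * problem (Δu = f). u and f are (n+2) x (n+2) row-major arrays with fixed
 * boundary values; h is the mesh width. This is the standard parallel
 * baseline the paper compares against. */
void rb_gauss_seidel(double *u, const double *f, int n, double h, int sweeps)
{
    for (int s = 0; s < sweeps; s++) {
        for (int color = 0; color < 2; color++) {
            /* All points of one colour are independent, so the loop parallelizes. */
            #pragma omp parallel for schedule(static)
            for (int i = 1; i <= n; i++) {
                for (int j = 1 + (i + color) % 2; j <= n; j += 2) {
                    u[i*(n+2)+j] = 0.25 * (u[(i-1)*(n+2)+j] + u[(i+1)*(n+2)+j]
                                         + u[i*(n+2)+j-1] + u[i*(n+2)+j+1]
                                         - h*h*f[i*(n+2)+j]);
                }
            }
        }
    }
}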


International Workshop on OpenMP | 2005

Geographical locality and dynamic data migration for OpenMP implementations of adaptive PDE solvers

Markus Nordén; Henrik Löf; Jarmo Rantakokko; Sverker Holmgren

On cc-NUMA multi-processors, the non-uniformity of main memory latencies motivates the need for co-location of threads and data. We call this special form of data locality geographical locality. In this article, we study the performance of a parallel PDE solver with adaptive mesh refinement. The solver is parallelized using OpenMP, and the adaptive mesh refinement makes dynamic load balancing necessary. Due to the dynamically changing memory access pattern caused by the runtime adaption, it is a challenging task to achieve a high degree of geographical locality. The main conclusions of the study are: (1) that geographical locality is very important for the performance of the solver, (2) that the performance can be improved significantly using dynamic page migration of misplaced data, (3) that a migrate-on-next-touch directive works well whereas the first-touch strategy is less advantageous for programs exhibiting a dynamically changing memory access pattern, and (4) that the overhead for such migration is low compared to the total execution time.
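
The first-touch strategy mentioned in conclusion (3) places a page on the node of the thread that first writes it, so geographical locality for a static access pattern is usually obtained by initializing data in parallel with the same schedule the compute loops use. A minimal sketch follows (illustrative only; the function name is hypothetical). It also shows why first-touch placement goes stale once adaptive refinement changes the partitioning, which is what migrate-on-next-touch is meant to repair:

/* First-touch placement sketch: each page ends up on the node of the thread
 * that first writes it, so the initialization mirrors the compute loops'
 * static schedule. After adaptive refinement redistributes work, this
 * placement no longer matches the access pattern. */
#include <stdlib.h>
#include <omp.h>

double *alloc_with_first_touch(size_t n)
{
    double *u = malloc(n * sizeof *u);

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        u[i] = 0.0;              /* first write places the page locally */

    return u;
}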


International Journal of Parallel Programming | 2007

Dynamic data migration for structured AMR solvers

Markus Nordén; Henrik Löf; Jarmo Rantakokko; Sverker Holmgren

On cc-NUMA multi-processors, the non-uniformity of main memory latencies motivates the need for co-location of threads and data. We call this special form of data locality geographical locality. In this article, we study the performance of a parallel PDE solver with adaptive mesh refinement (AMR). The solver is parallelized using OpenMP, and the adaptive mesh refinement makes dynamic load balancing necessary. Due to the dynamically changing memory access pattern caused by the runtime adaption, it is a challenging task to achieve a high degree of geographical locality. The main conclusions of the study are: (1) that geographical locality is very important for the performance of the solver, (2) that the performance can be improved significantly using dynamic page migration of misplaced data, (3) that a migrate-on-next-touch directive works well whereas the first-touch strategy is less advantageous for programs exhibiting a dynamically changing memory access pattern, and (4) that the overhead for such migration is low compared to the total execution time.


International Conference on Computational Science | 2004

Improving Geographical Locality of Data for Shared Memory Implementations of PDE Solvers

Henrik Löf; Markus Nordén; Sverker Holmgren

On cc-NUMA multi-processors, the non-uniformity of main memory latencies motivates the need for co-location of threads and data. We call this special form of data locality geographical locality, as one aspect of the non-uniformity is the physical distance between the cc-NUMA nodes. We compare the well-established first-touch strategy to an application-initiated page migration strategy as a means of increasing the geographical locality for a set of important scientific applications.


International Journal of Parallel, Emergent and Distributed Systems | 2006

Algorithmic Optimizations of a Conjugate Gradient Solver on Shared Memory Architectures

Henrik Löf; Jarmo Rantakokko

OpenMP is an architecture-independent language for programming in the shared memory model. OpenMP is designed to be simple and powerful in terms of programming abstractions. Unfortunately, these architecture-independent abstractions sometimes come at the price of low parallel performance. This is especially true for applications with an unstructured data access pattern running on distributed shared memory systems (DSM). Here, proper data distribution and algorithmic optimizations play a vital role for performance. In this article, we have investigated ways of improving the performance of an industrial-class conjugate gradient (CG) solver, implemented in OpenMP and running on two types of shared memory systems. We have evaluated bandwidth minimization, graph partitioning and reformulations of the original algorithm that reduce the number of global barriers. By a detailed analysis of barrier time and memory system performance, we found that bandwidth minimization is the most important optimization, reducing both L2 misses and remote memory accesses. On a uniform memory system, we get perfect scaling. On a NUMA system, the performance is significantly improved by the algorithmic optimizations, leaving the system-dependent global reduction operations as a bottleneck.
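
The global barriers referred to above stem largely from the dot products in the CG iteration: a standard formulation needs two such reductions per iteration (one for the step length, one for the new search direction), and each OpenMP reduction ends in an implicit barrier. A minimal sketch of one of these reductions (illustrative; not the solver studied in the paper):

/* Sketch of a global reduction as it appears (twice per iteration) in a
 * standard CG formulation. The implicit barrier at the end of the reduction
 * is the system-dependent synchronization point identified as a bottleneck. */
#include <omp.h>

double dot(const double *x, const double *y, long n)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s) schedule(static)
    for (long i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;   /* value is available only after all threads synchronize */
}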


Computational Science and Engineering | 2009

Reconsidering algorithms for iterative solvers in the multicore era

Dan Wallin; Henrik Löf; Erik Hagersten; Sverker Holmgren

Efficient solution of computational problems requires a match between the algorithm and the underlying architecture. New multicore processors feature low intra-chip communication cost and smaller per-thread caches compared to single-core implementations, indicating that data locality issues are more important than communication overheads. We investigate the impact of these changes on parallel multigrid methods. We present a temporally blocked, naturally ordered smoother implementation that improves the data locality as much as ten times compared with the standard red-black algorithm. We present results of the performance of our new algorithm on an SMP system, an UltraSPARC T1 (Niagara) SMT/CMP, and a simulated CMP processor.
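
The temporal blocking idea can be illustrated on a 1D model problem: two naturally ordered Gauss-Seidel sweeps are fused so that the second sweep trails the first by one grid point and reuses the data while it is still in cache, which is how the smoother can be applied twice at roughly the cost of one sweep. A serial, illustrative sketch (the paper's implementation is a parallel 2D/3D smoother with fine-grained locking, which this does not show):

/* Serial 1D illustration of temporal blocking for a Gauss-Seidel smoother
 * applied to u'' = f. Two naturally ordered sweeps are fused; the second
 * trails the first by one grid point, so each value is reused from cache. */
void gs_two_sweeps_fused(double *u, const double *f, int n, double h)
{
    /* interior points are 1 .. n-2; u[0] and u[n-1] hold boundary values */
    for (int i = 1; i <= n - 2; i++) {
        u[i] = 0.5 * (u[i-1] + u[i+1] - h * h * f[i]);       /* first sweep at i    */
        if (i > 1)
            u[i-1] = 0.5 * (u[i-2] + u[i] - h * h * f[i-1]); /* second sweep at i-1 */
    }
    u[n-2] = 0.5 * (u[n-3] + u[n-1] - h * h * f[n-2]);       /* finish second sweep */
}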


Archive | 2003

THROOM: Running POSIX Multithreaded Binaries on a Cluster

Henrik Löf; Zoran Radovic; Erik Hagersten


Computational Geosciences | 2009

Parallel implementations of streamline simulators

Margot Gerritsen; Henrik Löf; Marco R. Thiele
