Dan Wallin
Uppsala University
Publications
Featured research published by Dan Wallin.
International Conference on Supercomputing | 2006
Dan Wallin; Henrik Löf; Erik Hagersten; Sverker Holmgren
Efficient solution of partial differential equations requires a match between the algorithm and the target architecture. Many recent chip multiprocessors, CMPs (a.k.a. multi-core), feature low intra-chip communication costs and smaller per-thread caches compared to previous shared-memory multiprocessor systems. From an algorithmic point of view this means that data locality issues become more important than communication overheads, a fact that may require a re-evaluation of many existing algorithms. We have investigated parallel implementations of multigrid methods using a parallel temporally blocked, naturally ordered smoother. Compared to the standard multigrid solution based on a red-black ordering, we often improve the data locality by as much as a factor of ten, while our use of a fine-grained locking scheme keeps the parallel efficiency high. Our algorithm was initially inspired by CMPs, and it was surprising to see that our OpenMP multigrid implementation ran up to 40 percent faster than the standard red-black algorithm on a contemporary 8-way SMP system. Thanks to the temporal blocking introduced, our smoother implementation often allowed us to apply the smoother twice at the same cost as a single application of a red-black smoother. By executing our smoother on a 32-thread UltraSPARC T1 (Niagara) SMT/CMP and a simulated 32-way CMP, we demonstrate that such architectures can tolerate the increased communication costs implied by the tradeoffs made in our implementation.
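The core idea of a temporally blocked, naturally ordered smoother can be sketched in a few lines of C. The fragment below is an illustrative serial reconstruction, not the authors' code: it time-skews a Gauss-Seidel smoother so that each grid row receives all of its sweeps while it is still cache-resident, preserving the natural update order. The grid size, sweep count, and function names are assumptions, and the paper's fine-grained locking and OpenMP parallelization are omitted.

```c
/* Serial sketch of temporal blocking for a naturally ordered
 * Gauss-Seidel smoother (illustrative only; names are hypothetical). */
#define N      1024   /* grid dimension (assumption) */
#define SWEEPS 2      /* smoother applications fused together */

static double u[N][N];  /* solution grid */
static double f[N][N];  /* right-hand side, h^2 factor absorbed */

/* One naturally ordered Gauss-Seidel update of row i. */
static void smooth_row(int i) {
    for (int j = 1; j < N - 1; j++)
        u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                          u[i][j-1] + u[i][j+1] - f[i][j]);
}

/* Time-skewed ("temporally blocked") smoothing: at outer step t the
 * k-th sweep runs on row t - k. Sweep k of row r needs sweep k of
 * row r-1 (done at step t-1) and sweep k-1 of row r+1 (done earlier
 * in this same step), so the natural ordering is preserved while each
 * row receives all SWEEPS updates close together in time, i.e. while
 * it is still cache-resident. */
static void smooth_blocked(void) {
    for (int t = 1; t < N - 1 + SWEEPS - 1; t++)
        for (int k = 0; k < SWEEPS; k++) {
            int row = t - k;
            if (row >= 1 && row < N - 1)
                smooth_row(row);
        }
}

int main(void) {
    smooth_blocked();   /* two fused sweeps over the whole grid */
    return 0;
}
```

A plain red-black smoother would instead stream the whole grid from memory once per sweep; the skewed schedule is what lets two sweeps cost roughly the same as one.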
Parallel Algorithms and Applications | 2002
Sverker Holmgren; Markus Nordén; Jarmo Rantakokko; Dan Wallin
The performance of shared-memory (OpenMP) implementations of three different PDE solver kernels, representing finite difference methods, finite volume methods and spectral methods, has been investigated. The experiments have been performed on a self-optimizing NUMA system, the Sun Orange prototype, using different data placement and thread scheduling strategies. The results show that correct data placement is very important for the performance of all solvers. However, the Orange system has a unique capability of automatically changing the data distribution at run time through both migration and replication of data. For reasonably large PDE problems, we find that the time to do this is negligible compared to the total solve time. Also, the performance after the migration and replication process has reached steady state is the same as what is achieved if data is optimally placed at the beginning of the execution using hand tuning. This shows that, for the applications studied, the self-optimizing features are successful, and shared-memory code without explicit data distribution directives yields good performance.
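On NUMA systems without the Orange prototype's automatic migration and replication, the "correct data placement" baseline that the paper compares against is typically arranged by hand. The OpenMP sketch below is a minimal illustration of that hand-tuned approach, under the assumption of a first-touch page placement policy; the array names and sizes are hypothetical (compile with -fopenmp or equivalent).

```c
/* Minimal sketch of hand-tuned NUMA placement via first touch:
 * pages land on the node of the thread that first writes them, so
 * initializing with the same loop schedule as the solver co-locates
 * data and computation. Illustrative only. */
#include <stdlib.h>

int main(void) {
    size_t n = 1 << 24;               /* problem size (assumption) */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    if (!a || !b) return 1;

    /* First touch in parallel with a static schedule: each thread's
     * chunk of pages is allocated on that thread's NUMA node. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; }

    /* A solver-like sweep reuses the identical schedule, so each
     * thread mostly accesses node-local memory. */
    for (int iter = 0; iter < 10; iter++) {
        #pragma omp parallel for schedule(static)
        for (size_t i = 1; i < n - 1; i++)
            a[i] = 0.5 * (b[i - 1] + b[i + 1]);
    }

    free(a); free(b);
    return 0;
}
```

The paper's point is that the Orange system reaches the same steady-state performance automatically, without this kind of explicit placement discipline.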
International Parallel and Distributed Processing Symposium | 2003
Dan Wallin; Erik Hagersten
While prefetching has proven useful for reducing cache misses in multiprocessors, traffic is often increased due to extra unused prefetch data. Prefetching in multiprocessors can also increase the cache miss rate due to the false sharing caused by the larger pieces of data retrieved. The capacity prefetching strategy proposed in this paper is built on the assumption that prefetching is most beneficial for reducing capacity and cold misses, but not communication misses. We propose a simple scheme for detecting the most frequent communication misses and suggest that prefetching should be avoided for those. We also suggest a simple and effective strategy, called bundling, for reducing the address traffic while retrieving many sequential cache lines. In order to demonstrate the effectiveness of these approaches, we have evaluated both strategies for one of the simplest forms of prefetching, sequential prefetching. The two new strategies applied to this bandwidth-hungry prefetch technique result in a lower miss rate for all studied applications, while the average amount of address traffic is reduced compared with the same applications run with no prefetching. The proposed strategies could also be applied to more sophisticated prefetching techniques for better overall performance.
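As a reading aid, the decision logic sketched in the abstract can be mimicked with a toy software model. The fragment below is our schematic interpretation, not the paper's hardware mechanism: a miss on a line that was last invalidated by a remote writer is treated as a communication miss and receives no prefetches, while other misses trigger sequential prefetching. The table size, names, and classification rule are all hypothetical.

```c
/* Toy model of "capacity prefetching": prefetch sequentially only on
 * misses that do not look like communication misses. Illustrative
 * only; all names and parameters are invented. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PREFETCH_DEGREE 4
#define TABLE 4096

/* Lines whose last eviction was a remote invalidation. */
static bool remote_inval[TABLE];
static unsigned fetches, prefetches;

static void fetch_line(uint64_t line)    { (void)line; fetches++; }
static void prefetch_line(uint64_t line) { (void)line; prefetches++; }

/* Called on a cache miss: suppress prefetching when the miss is
 * classified as a communication miss. */
void on_cache_miss(uint64_t line) {
    fetch_line(line);
    if (!remote_inval[line % TABLE])
        for (int k = 1; k <= PREFETCH_DEGREE; k++)
            prefetch_line(line + k);
}

/* Called when another processor's write invalidates our copy. */
void on_remote_invalidate(uint64_t line) {
    remote_inval[line % TABLE] = true;
}

int main(void) {
    on_cache_miss(100);        /* capacity-style miss: prefetches */
    on_remote_invalidate(200);
    on_cache_miss(200);        /* communication miss: no prefetch */
    printf("fetches=%u prefetches=%u\n", fetches, prefetches);
    return 0;
}
```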
Parallel Computing | 2003
Dan Wallin; Henrik Johansson; Sverker Holmgren
This chapter discusses three different partial differential equation (PDE) solver kernels with respect to cache memory performance on a simulated shared-memory computer. The kernels implement state-of-the-art solution algorithms for complex application problems, and the simulations are performed for data sets of realistic size. The performance of the studied applications benefits from much longer cache lines than normally found in commercially available computer systems. The reason is that the numerical algorithms are carefully coded and have regular memory access patterns; these programs take advantage of spatial locality, and the amount of false sharing is limited. A simple sequential hardware prefetch strategy, providing cache behavior similar to a large cache line, could potentially yield large performance gains for these applications. Unfortunately, such prefetchers often lead to additional address snoops in multiprocessor caches. However, by applying a bundling technique that lumps several read address transactions together, this large increase in address snoops can be avoided. For all studied algorithms, both the address snoops and cache misses are largely reduced in the bundled prefetch protocol.
European Conference on Parallel Processing | 2001
Sverker Holmgren; Dan Wallin
High-accuracy PDE solvers use multi-dimensional fast Fourier transforms. The FFTs exhibit a static, structured memory access pattern which results in a large amount of communication. Performance analysis of a non-trivial kernel representing a PDE solution algorithm has been carried out on a Sun WildFire computer, where different architecture, system and programming models can be studied. The WildFire system uses self-optimization techniques such as data migration and replication to change the placement of data at runtime. If the data placement is not optimal, the initial performance is degraded. However, after a few iterations the page migration daemon is able to modify the placement of data; the performance then improves and equals what is achieved if the data is optimally placed at the start of the execution using hand tuning. The speedup for the PDE solution kernel is surprisingly good.
International Parallel and Distributed Processing Symposium | 2004
Dan Wallin; Erik Hagersten
Prefetching has proven to be a useful technique for reducing cache misses in multiprocessors at the cost of increased coherence traffic. This is especially troublesome for snoop-based systems, where the available coherence bandwidth is often the scalability bottleneck. The bundling technique reduces the overhead caused by prefetching in two ways: by piggybacking prefetches with normal requests, and by requiring only one device to perform the snoop lookup for each prefetch transaction. This can reduce both the address bandwidth and the number of snoop lookups compared with a non-prefetching system. We describe bundling implementations for two important transaction types: reads and upgrades. While bundling could reduce the overhead of most existing prefetch schemes, our evaluation of bundling has been limited to two of them: sequential prefetching and Dahlgren's adaptive sequential prefetching. Both schemes have their snoop bandwidth halved for all commercial and scientific benchmarks in the study. The combined effect of bundling applied to these prefetch schemes lowers the cache miss rate, the address bandwidth and the snoop bandwidth, compared with a system with no prefetching, for all applications. Bundling will not reduce the data bandwidth introduced by a prefetch scheme. However, we argue that the data bandwidth is more easily scaled than the snoop bandwidth for snoop-based coherence systems.
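The address-traffic saving from piggybacking can be illustrated with a small counting model. The following C sketch is a simplified reading of the abstract: it counts address transactions only, ignoring the single-device snoop-lookup optimization and the read/upgrade distinction, and the struct layout and names are hypothetical.

```c
/* Toy counting model of bundling: the demand address and its
 * sequential prefetch addresses travel in one address transaction,
 * so they are snooped together instead of one transaction per line.
 * Illustrative only; names are invented. */
#include <stdint.h>
#include <stdio.h>

#define BUNDLE 4   /* demand line + (BUNDLE - 1) prefetched lines */

struct bundled_read {
    uint64_t first_line;  /* demand address */
    uint8_t  count;       /* contiguous lines covered by the bundle */
};

static unsigned address_transactions;

static void issue(struct bundled_read r) {
    (void)r;
    address_transactions++;  /* one snooped transaction per bundle */
}

int main(void) {
    /* Without bundling: one transaction per line. */
    for (int i = 0; i < BUNDLE; i++)
        issue((struct bundled_read){ .first_line = 100 + i, .count = 1 });
    unsigned unbundled = address_transactions;

    /* With bundling: the same four lines in a single transaction. */
    address_transactions = 0;
    issue((struct bundled_read){ .first_line = 100, .count = BUNDLE });

    printf("address transactions: unbundled=%u bundled=%u\n",
           unbundled, address_transactions);
    return 0;
}
```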
Computational Science and Engineering | 2009
Dan Wallin; Henrik Löf; Erik Hagersten; Sverker Holmgren
Efficient solution of computational problems requires a match between the algorithm and the underlying architecture. New multicore processors feature low intra-chip communication costs and smaller per-thread caches compared to single-core implementations, indicating that data locality issues are more important than communication overheads. We investigate the impact of these changes on parallel multigrid methods. We present a temporally blocked, naturally ordered smoother implementation that improves the data locality as much as ten times compared with the standard red-black algorithm. We present results of the performance of our new algorithm on an SMP system, an UltraSPARC T1 (Niagara) SMT/CMP, and a simulated CMP processor.
Parallel Computing | 2004
Henrik Johansson; Dan Wallin; Sverker Holmgren
By simulating a real computer it is possible to gain detailed knowledge of the cache memory utilization of an application, e.g., a partial differential equation (PDE) solver. Using this knowledge, we can discover regions with intricate cache memory behavior, and this information makes it possible to identify performance bottlenecks. In this paper, we employ full-system simulation of a shared-memory computer to perform a case study of three different PDE solver kernels with respect to cache memory performance. The kernels implement state-of-the-art solution algorithms for complex application problems, and the simulations are performed for data sets of realistic size. We discovered interesting properties in the solvers, which can help us improve their performance in the future.
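For readers unfamiliar with this methodology: a simulator tracks every memory access through a modelled cache hierarchy, and miss statistics fall out of bookkeeping like the toy direct-mapped model below. This is only a schematic stand-in for the far more detailed full-system simulation environment used in the paper; all parameters are invented.

```c
/* Tiny direct-mapped cache model of the kind used to study miss
 * behaviour. Illustrative only; parameters are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64
#define NUM_SETS   1024    /* 64 KB direct-mapped cache */

static uint64_t tags[NUM_SETS];
static int      valid[NUM_SETS];
static unsigned long accesses, misses;

static void access_addr(uint64_t addr) {
    uint64_t line = addr / LINE_BYTES;
    unsigned set  = (unsigned)(line % NUM_SETS);
    uint64_t tag  = line / NUM_SETS;
    accesses++;
    if (!valid[set] || tags[set] != tag) {   /* miss: install the line */
        misses++;
        valid[set] = 1;
        tags[set]  = tag;
    }
}

int main(void) {
    /* Stream through an 8 MB array twice: the second sweep still
     * misses everywhere because the array exceeds the cache. */
    for (int pass = 0; pass < 2; pass++)
        for (uint64_t a = 0; a < (8u << 20); a += 8)
            access_addr(a);
    printf("accesses=%lu misses=%lu (%.1f%%)\n",
           accesses, misses, 100.0 * misses / accesses);
    return 0;
}
```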
Parallel and Distributed Computing Systems (ISCA) | 2005
Dan Wallin; Håkan Zeffer; Martin Karlsson; Erik Hagersten
Archive | 2004
Dan Wallin; Erik Hagersten