
Publication


Featured research published by Fredrik Dahlgren.


IEEE Transactions on Parallel and Distributed Systems | 1995

Sequential hardware prefetching in shared-memory multiprocessors

Fredrik Dahlgren; Michel Dubois; Per Stenström

To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions of the number of read misses, the read penalty, and of the execution time by up to 78%, 58%, and 25%, respectively.
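The core idea of the abstract can be sketched in a few lines: on each miss, fetch the next `degree` consecutive blocks, and periodically raise or lower `degree` depending on how many prefetched blocks were actually referenced. This is an illustrative simulation under assumed parameters (epoch length, thresholds), not the paper's hardware design:

```python
# Sketch of adaptive sequential prefetching: the prefetch degree is tuned
# by a dynamic usefulness measure. Epoch length and thresholds are
# illustrative assumptions, not values from the paper.

class AdaptiveSequentialPrefetcher:
    def __init__(self, max_degree=8, epoch=16):
        self.cache = set()        # resident block addresses
        self.prefetched = set()   # blocks brought in by prefetch, not yet used
        self.degree = 1           # blocks prefetched per miss (adapted below)
        self.max_degree = max_degree
        self.epoch = epoch        # misses per adaptation interval
        self.misses = 0
        self.useful = 0           # prefetched blocks later referenced
        self.issued = 0           # prefetches issued this epoch

    def access(self, block):
        """Simulate one reference; return True on a cache hit."""
        if block in self.cache:
            if block in self.prefetched:      # a prefetch proved useful
                self.prefetched.discard(block)
                self.useful += 1
            return True
        # Miss: fetch the block and prefetch the next `degree` blocks.
        self.misses += 1
        self.cache.add(block)
        for i in range(1, self.degree + 1):
            if block + i not in self.cache:
                self.cache.add(block + i)
                self.prefetched.add(block + i)
                self.issued += 1
        if self.misses % self.epoch == 0:
            self._adapt()
        return False

    def _adapt(self):
        # Raise the degree when most prefetches were used, lower it otherwise.
        ratio = self.useful / self.issued if self.issued else 0.0
        if ratio > 0.75:
            self.degree = min(self.degree * 2, self.max_degree)
        elif ratio < 0.25:
            self.degree = max(self.degree // 2, 1)
        self.useful = self.issued = 0
```

On a purely sequential scan every prefetch is useful, so the degree climbs and the miss rate falls; on a random access pattern the usefulness ratio collapses and the degree drops back toward one.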


International Conference on Parallel Processing | 1993

Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors

Fredrik Dahlgren; Michel Dubois; Per Stenström

To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show significant reductions of the read penalty and of the overall execution time.


High Performance Computer Architecture | 2000

A prefetching technique for irregular accesses to linked data structures

Magnus Karlsson; Fredrik Dahlgren; Per Stenström

Prefetching offers the potential to improve the performance of linked data structure (LDS) traversals. However, previously proposed prefetching methods only work well when there is enough work processing a node that the prefetch latency can be hidden, or when the LDS is long enough and the traversal path is known a priori. This paper presents a prefetching technique called prefetch arrays which can prefetch both short LDS, such as the lists found in hash tables, and trees when the traversal path is not known a priori. We offer two implementations, one software-only and one which combines software annotations with a prefetch engine in hardware. On a pointer-intensive benchmark suite, we show that our implementations reduce the memory stall time by 23% to 51% for the kernels with linked lists, while the other prefetching methods cause reductions that are substantially less. For binary trees, our hardware method manages to cut nearly 60% of the memory stall time even when the traversal path is not known a priori. However, when the branching factor of the tree is too high, our technique does not improve performance. Another contribution of the paper is that we quantify pointer-chasing found in interesting applications such as OLTP, Expert Systems, DSS, and Java codes and discuss which prefetching techniques are relevant to use in each case.
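The software variant of the prefetch-arrays idea can be sketched as follows: each node carries an array of pointers to the next few nodes, so their fetches can all be issued as soon as the node is reached, even when the traversal path through a short list is unknown. The node layout, `k`, and the `prefetch_issue` callback (standing in for a prefetch instruction) are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the "prefetch arrays" idea for linked lists: jump pointers to
# the next k nodes are stored per node and "prefetched" on arrival. The
# prefetch_issue callback is a hypothetical stand-in for a real prefetch
# instruction; here it only records which nodes would have been covered.

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.prefetch = []   # pointers to the next k nodes (prefetch array)

def build_list(values, k=3):
    """Build a singly linked list whose nodes carry k-entry prefetch arrays."""
    nodes = [Node(v) for v in values]
    for i, n in enumerate(nodes):
        if i + 1 < len(nodes):
            n.next = nodes[i + 1]
        n.prefetch = nodes[i + 1 : i + 1 + k]
    return nodes[0] if nodes else None

def traverse(head, prefetch_issue):
    """Walk the list, issuing prefetches for all jump-pointer targets so
    their memory latency can overlap with the work done at each node."""
    n = head
    while n is not None:
        for target in n.prefetch:
            prefetch_issue(target)
        yield n.value
        n = n.next
```

Because the prefetch array covers several successors at once, even a list too short to pipeline one-step prefetching has its later nodes requested well before the traversal reaches them.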


Annual Simulation Symposium | 1993

The CacheMire Test Bench: A Flexible and Effective Approach for Simulation of Multiprocessors

Mats Brorsson; Fredrik Dahlgren; Håkan Nilsson; Per Stenström



IEEE Transactions on Parallel and Distributed Systems | 1996

Evaluation of hardware-based stride and sequential prefetching in shared-memory multiprocessors

Fredrik Dahlgren; Per Stenström

We study the efficiency of previously proposed stride and sequential prefetching, two promising hardware-based prefetching schemes to reduce read-miss penalties in shared-memory multiprocessors. Although stride accesses dominate in four out of six of the applications we study, we find that sequential prefetching does as well as, and in some cases even better than, stride prefetching for five applications. This is because 1) most strides are shorter than the block size (we assume 32-byte blocks), which means that sequential prefetching is as effective for these stride accesses, and 2) sequential prefetching also exploits the locality of read misses with nonstride accesses. However, since stride prefetching in general results in fewer useless prefetches, it offers the extra advantage of consuming less memory-system bandwidth.
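The stride scheme being compared here is conventionally built around a reference prediction table indexed by the load instruction's PC: the table remembers the last address and stride per load, and issues a prefetch once the same stride repeats. A minimal sketch of that classic mechanism (table organization and confirmation policy are simplified assumptions):

```python
# Sketch of a stride prefetcher using a reference prediction table keyed by
# load PC. A prefetch is issued only after the same nonzero stride is seen
# twice, which is why stride prefetching produces few useless prefetches.

class StridePrefetcher:
    def __init__(self):
        self.table = {}   # pc -> (last_addr, stride, confirmed)

    def access(self, pc, addr):
        """Record one load; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0, False)
            return None
        last_addr, stride, confirmed = entry
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            # Stride confirmed: predict the next reference and prefetch it.
            self.table[pc] = (addr, stride, True)
            return addr + stride
        # Stride changed (or first repeat): retrain without prefetching.
        self.table[pc] = (addr, new_stride, False)
        return None
```

For a load walking an array with an 8-byte stride, the first two accesses train the entry and every subsequent access triggers one prefetch exactly one element ahead; sequential prefetching, by contrast, would cover the same pattern whenever the stride fits inside a cache block.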


High Performance Computer Architecture | 1995

Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors

Fredrik Dahlgren; Per Stenström

We study the relative efficiency of previously proposed stride and sequential prefetching, two promising hardware-based prefetching schemes to reduce read-miss penalties in shared-memory multiprocessors. Although stride accesses dominate in four out of six of the applications we study, we find that sequential prefetching does better than stride prefetching for three applications. This is because (i) most strides are shorter than the block size (we assume 32-byte blocks), which means that sequential prefetching is as effective for stride accesses, and (ii) sequential prefetching also exploits the locality of read misses for non-stride accesses. However, we find that since stride prefetching causes fewer useless prefetches, it consumes less memory-system bandwidth.


Journal of Parallel and Distributed Computing | 1995

Using write caches to improve performance of cache coherence protocols in shared-memory multiprocessors

Fredrik Dahlgren; Per Stenström

Write-invalidate protocols suffer from memory-access penalties due to coherence misses. While write-update or hybrid update/invalidate protocols can reduce coherence misses, the update traffic can increase memory-system contention. We show in this paper that update-based cache protocols can perform significantly better than write-invalidate protocols by incorporating a write cache in each processing node. Because it is legal to delay the propagation of modifications of a block until the next synchronization under relaxed memory consistency models, a write cache can significantly reduce traffic by exploiting locality in write accesses. By concentrating on a cache-coherent NUMA architecture, we study the implementation aspects of augmenting a write-invalidate, a write-update and two hybrid update/invalidate protocols with write caches. Through detailed architectural simulations using five benchmark programs, we find that write caches, with only a few blocks each, help write-invalidate protocols to cut the false-sharing miss rate and hybrid update/invalidate protocols to keep other copies, including the memory copy, clean at an acceptable write traffic level. Overall, the memory-access penalty associated with coherence misses is drastically reduced.
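The traffic-reduction argument rests on a simple observation: under relaxed consistency, updates to a block may be held back and coalesced until the next synchronization point. A toy sketch of that coalescing effect (the single-value-per-block model and the eager baseline are illustrative assumptions, not the paper's protocol machinery):

```python
# Sketch of write-cache coalescing under relaxed consistency: repeated
# writes to the same block cost one coherence transaction at the next
# synchronization point instead of one transaction per write.

class WriteCache:
    def __init__(self):
        self.dirty = {}    # block -> latest value held back locally
        self.traffic = 0   # coherence transactions sent to the memory system

    def write(self, block, value):
        # Coalesce locally: no traffic is generated yet.
        self.dirty[block] = value

    def synchronize(self):
        # Relaxed models allow delaying propagation until synchronization;
        # flush one update per dirty block and count the traffic.
        self.traffic += len(self.dirty)
        flushed = dict(self.dirty)
        self.dirty.clear()
        return flushed

def eager_traffic(writes):
    """Write-update baseline: one transaction per individual write."""
    return len(writes)
```

With three writes to block 0 and two to block 1 between synchronizations, an eager update protocol sends five transactions while the write cache sends two, which is the locality-in-writes effect the abstract exploits.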


International Symposium on Computer Architecture | 1994

Combined performance gains of simple cache protocol extensions

Fredrik Dahlgren; Michel Dubois; Per Stenström

We consider three simple extensions to directory-based cache coherence protocols in shared-memory multiprocessors. These extensions are aimed at reducing the penalties associated with memory accesses and include a hardware prefetching scheme, a migratory sharing optimization, and a competitive-update mechanism. Since they target different components of the read and write penalties, they can be combined effectively. Detailed architectural simulations using five benchmarks show substantial combined performance gains obtained at a modest additional hardware cost. Prefetching in combination with competitive-update is the best combination under release consistency in systems with sufficient network bandwidth. By contrast, prefetching plus the migratory sharing optimization is advantageous under sequential consistency and/or in systems with limited network bandwidth.


International Symposium on Computer Architecture | 1995

Boosting the performance of hybrid snooping cache protocols

Fredrik Dahlgren

Previous studies of bus-based shared-memory multiprocessors have shown hybrid write-invalidate/write-update snooping protocols to be incapable of providing consistent performance improvements over write-invalidate protocols. We analyze the deficiencies of hybrid snooping protocols under release consistency, and show how these deficiencies can be dramatically reduced by using write caches and read snarfing. Our performance evaluation is based on program-driven simulation and a set of five scientific applications with different sharing behaviors including migratory sharing as well as producer-consumer sharing. We show that a hybrid protocol, extended with write caches as well as read snarfing, manages to reduce the number of coherence misses by between 83% and 95% as compared to a write-invalidate protocol for all five applications in this study. In addition, the number of bus transactions is reduced by between 36% and 60% for four of the applications and by 9% for the fifth application. Because of the small implementation cost of the hybrid protocol and the two extensions, we believe that this combination is an effective approach to boost the performance of bus-based multiprocessors.


IEEE Computer | 1997

Boosting the performance of shared memory multiprocessors

Per Stenström; Mats Brorsson; Fredrik Dahlgren; Håkan Grahn; Michel Dubois

Proposed hardware optimizations to CC-NUMA machines (shared-memory multiprocessors that use cache consistency protocols) can shorten the time processors lose because of cache misses and invalidations. The authors look at the cost-performance trade-offs of each.

Collaboration


An overview of Fredrik Dahlgren's research collaborations.

Top Co-Authors

Per Stenström, Chalmers University of Technology
Michel Dubois, University of Southern California
Håkan Grahn, Blekinge Institute of Technology
Jim Nilsson, Chalmers University of Technology
Magnus Karlsson, Chalmers University of Technology
Mats Brorsson, Royal Institute of Technology
Magnus Ekman, Chalmers University of Technology
Mårten Björkman, Royal Institute of Technology
Peter S. Magnusson, Swedish Institute of Computer Science