Jonas Skeppstedt | Researchain

Publication

Featured research published by Jonas Skeppstedt.

International Symposium on Computer Architecture | 1993

The detection and elimination of useless misses in multiprocessors

Michel Dubois; Jonas Skeppstedt; Livio Ricciulli; Krishnan Ramamurthy; Per Stenström

In this paper we introduce a new classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting the correctness of program execution. Based on the new classification we compare the effectiveness of five different protocols which delay and combine invalidations leading to useless misses. In cache-based systems the protocols are very effective and have miss rates close to the essential miss rate. In virtual shared memory systems the techniques are also effective but leave room for improvements.

Architectural Support for Programming Languages and Operating Systems | 1994

Simple compiler algorithms to reduce ownership overhead in cache coherence protocols

Jonas Skeppstedt; Per Stenström

We study in this paper the design and efficiency of compiler algorithms that remove ownership overhead in shared-memory multiprocessors with write-invalidate protocols. These algorithms detect loads followed by stores to the same address. Such loads are marked and constitute a hint to the cache to obtain an exclusive copy of the block. We consider three algorithms where the first one focuses on load-store sequences within each basic block of code and the other two analyse the existence of load-store sequences across basic blocks at the intra-procedural level. Since the dataflow analysis we adopt is a trivial variation of live-variable analysis, the algorithms are easily incorporated into a compiler. Through detailed simulations of a cache-coherent NUMA architecture using five scientific parallel benchmark programs, we find that the algorithms are capable of removing over 95% of the separate ownership acquisitions. Moreover, we also find that even the simplest algorithm is comparable in efficiency with previously proposed hardware-based adaptive cache coherence protocols to attack the same problem.
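The simplest of the three algorithms, which looks for load-store sequences inside one basic block, can be illustrated with a minimal sketch. The instruction encoding, the function name `mark_load_exclusive`, and the use of symbolic addresses compared by plain equality (no alias analysis) are assumptions made for illustration; the abstract does not specify them. The backward scan mirrors the shape of live-variable analysis mentioned above.

```python
from typing import List, Set, Tuple

# Hypothetical encoding: (opcode, symbolic address), opcode is "load" or "store".
Instr = Tuple[str, str]


def mark_load_exclusive(block: List[Instr]) -> List[int]:
    """Return indices of loads that are followed, later in the same
    basic block, by a store to the same symbolic address."""
    stored_later: Set[str] = set()  # addresses stored to after the current point
    marked: List[int] = []
    # Scan backward so that, at each load, we already know which
    # addresses are written further down the block.
    for i in range(len(block) - 1, -1, -1):
        op, addr = block[i]
        if op == "store":
            stored_later.add(addr)
        elif op == "load" and addr in stored_later:
            marked.append(i)
    marked.reverse()
    return marked


if __name__ == "__main__":
    block = [("load", "a"), ("load", "b"), ("store", "a"), ("load", "a")]
    print(mark_load_exclusive(block))
```

In this sketch a marked load would be the place where a compiler emits the hint to fetch the block in exclusive state; the final load of `a` is not marked because no store follows it.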

Journal of Parallel and Distributed Computing | 1995

Essential misses and data traffic in coherence protocols

Michel Dubois; Jonas Skeppstedt; Per Stenström

In this paper, we introduce a classification of misses and of components of the data traffic in shared-memory multiprocessors based on interprocessor communication. We consider protocols with invalidations, updates, and prefetches in systems with infinite and finite caches. We identify the set of essential misses and the essential traffic, i.e., the smallest set of misses and the smallest amount of traffic necessary for correct execution. The rest of the misses and of the data traffic is nonessential and could be ignored without affecting the correctness of program execution. To illustrate the classification of misses and traffic, we apply it to a set of parallel scientific programs and observe the overhead created by different hardware mechanisms when block sizes and cache sizes are varied.

International Conference on Parallel Processing | 1997

Hybrid compiler/hardware prefetching for multiprocessors using low-overhead cache miss traps

Jonas Skeppstedt; Michel Dubois

We propose and evaluate a new data prefetching technique for cache coherent multiprocessors. Prefetches are issued by a prefetch engine which is controlled by the compiler. Second-level cache misses generate cache miss traps, and start the prefetch engine in a trap handler generated by the compiler. The only instruction overhead in our approach is when a trap handler terminates after data arrives. We present the functionality of the prefetch engine and a compiler algorithm to control it. We also study emulation of the prefetch engine in software. Our techniques are evaluated on six parallel applications using a compiler which incorporates our algorithm and a simulated multiprocessor. The prefetch engines remove up to 67% of the memory access stall time at an instruction overhead less than 0.42%. The emulated prefetch engines remove in general less stall time at a higher instruction overhead.

ACM Transactions on Programming Languages and Systems | 1996

Using dataflow analysis techniques to reduce ownership overhead in cache coherence protocols

Jonas Skeppstedt; Per Stenström

In this article, we explore the potential of classical dataflow analysis techniques in removing overhead in write-invalidate cache coherence protocols for shared-memory multiprocessors. We construct three compiler algorithms with varying degrees of sophistication that detect loads followed by stores to the same address. Such loads are marked and constitute a hint to the cache to obtain an exclusive copy of the block so that the subsequent store does not introduce access penalties. The simplest of the three compiler algorithms analyzes the existence of load-store sequences within each basic block of code, whereas the other two analyze load-store sequences across basic blocks at the intraprocedural level. The algorithms have been incorporated into an optimizing C compiler, and we have evaluated their efficiencies by compiling and executing seven parallel programs on a simulated multiprocessor. Our results show that the detection efficiency of the most aggressive algorithm is 96% or higher for four of the seven programs studied. We also compare the efficiency of these static algorithms with that of dynamic hardware-based algorithms that reduce ownership overhead. We find that static analysis using classical dataflow techniques yields performance improvements similar to those of dynamic hardware-based approaches.

Journal of Parallel and Distributed Computing | 2000

Compiler Controlled Prefetching for Multiprocessors Using Low-Overhead Traps and Prefetch Engines

Jonas Skeppstedt; Michel Dubois

In this paper we propose and evaluate a new data-prefetching technique for cache coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40-50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach is when a trap handler completes after data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it uses prefetching with very little instruction overhead, which is a limitation for traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining a high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would not require any additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead less than 0.42%. The emulated prefetch engine removes in general less stall time at a higher instruction overhead.

International Conference on Parallel Architectures and Compilation Techniques | 1997

Overcoming limitations of prefetching in multiprocessors by compiler-initiated coherence actions

Jonas Skeppstedt

In this paper we first identify limitations of compiler-controlled prefetching in a CC-NUMA multiprocessor with a write-invalidate cache coherence protocol. Compiler-controlled prefetch techniques for CC-NUMAs often focus only on stride accesses, and this introduces a major limitation. We consider combining prefetch with two other compiler-controlled techniques to partly remedy the situation: (1) load-exclusive to reduce write latency and (2) store-update to reduce read latency. The purpose of each of these techniques in a machine with prefetch is to reduce latency for accesses that the prefetch technique could not handle. We evaluate two different scenarios, first with a hybrid compiler/hardware prefetch technique and second with an optimal stride prefetcher. We find that the combined gains under the hybrid prefetch technique are significant for the six applications we have studied: on average, 71% of the original write-stall time remains after using the hybrid prefetcher, and of these ownership requests, 60% would be eliminated using load-exclusive; on average, 68% of the read-stall time remains after using the hybrid prefetcher, and of these read misses, 34% were serviced by remote caches and would be converted by store-update into misses serviced by a clean copy in memory, which reduces the read latency. With an optimal stride prefetcher, our results show that it is beneficial to complement prefetch with the two techniques as well.

Microprocessors and Microsystems | 1996

The design of a non-blocking load processor architecture

Per Stenström; Magnus Balldin; Jonas Skeppstedt

We have extended a single-issue pipelined implementation of SPARC with mechanisms to support non-blocking load instructions and analyzed it with respect to speed and complexity. We present the functionality of the non-blocking load scheme as well as a detailed implementation analysis of it. We find that it is possible to implement the non-blocking load mechanisms without significantly complicating the pipeline design and with no increase of the processor cycle time. This is mainly because the non-blocking load mechanisms can work in parallel with the ALU, the register file, and the cache memories, the datapath components that often establish the critical path in a pipelined processor.

Journal of Parallel and Distributed Computing | 1999

Evaluation of Compiler-Controlled Updating to Reduce Coherence-Miss Penalties in Shared-Memory Multiprocessors

Jonas Skeppstedt; Fredrik Dahlgren; Per Stenström

We consider in this paper the effectiveness of a new approach called compiler-controlled updating to reduce coherence-miss penalties in shared-memory multiprocessors. A key part of the method is a compiler algorithm that identifies the last store instruction to a memory block in a flow graph using classic dataflow analysis techniques. Such stores are marked and replaced by update instructions that at run time make the memory copy clean. Whereas this static method shortens the read-miss latency for actively shared blocks, it can cause useless traffic for shared blocks that are effectively private. We therefore complement the static analysis with a simple dynamic heuristic in the cache coherence protocol aiming at classifying blocks as private or shared at run time. We evaluate the performance effects of compiler-controlled updating using six scientific parallel applications compiled by an optimizing compiler that incorporates our static analysis and then running them on a detailed CC-NUMA architectural simulation model. We have found that the compiler algorithm can convert between 83 and 100% of the dirty misses into clean misses. By adding the private/shared heuristic, the update traffic of private memory blocks can be practically eliminated. Overall, the static analysis in combination with the dynamic heuristic is shown to reduce the execution time by as much as 32%.
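The idea of identifying the last store to a memory block can be sketched for the simplest case, a straight-line instruction sequence. This is a simplification: the abstract describes an analysis over a whole flow graph, which this sketch does not attempt. The function name `mark_last_stores`, the instruction encoding, and the 32-byte block size are assumptions for illustration only.

```python
from typing import List, Set, Tuple

# Hypothetical encoding: (opcode, byte address), opcode is "load" or "store".
Instr = Tuple[str, int]


def mark_last_stores(code: List[Instr], block_size: int = 32) -> List[int]:
    """Return indices of stores that are the last store to their
    memory block within a straight-line instruction sequence."""
    seen_blocks: Set[int] = set()  # blocks already written later in the sequence
    marked: List[int] = []
    # Scanning backward, the first store met for a block is its last store.
    for i in range(len(code) - 1, -1, -1):
        op, addr = code[i]
        if op != "store":
            continue
        block = addr // block_size
        if block not in seen_blocks:
            seen_blocks.add(block)
            marked.append(i)
    marked.reverse()
    return marked


if __name__ == "__main__":
    code = [("store", 0), ("store", 4), ("load", 0), ("store", 64), ("store", 8)]
    print(mark_last_stores(code))
```

In this sketch the marked stores are the candidates a compiler would replace with update instructions; earlier stores to the same block are left as ordinary stores.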

Asilomar Conference on Signals, Systems and Computers | 2014

Finding fast action selectors for dataflow actors

Gustav Cedersjö; Jörn W. Janneck; Jonas Skeppstedt

The parallel structure of dataflow programs and their support for processing streams of data make dataflow programming an interesting tool for doing stream processing on parallel processing architectures. The computational kernels of a dataflow program, the actors, communicate with other actors via FIFO channels. The actors in the dataflow model used in this paper may perform different actions depending on the state of the actor and on the data that has been sent to the actor and is present on its incoming channels. For this kind of dataflow program, decisions on what to do in an actor in a given state have to be made at runtime in a process called action selection. Each action is associated with a set of conditions on the state and the input channels. All conditions must be fulfilled for the action to be selected, and the task of the action selector is to test different conditions to select an action. This paper builds upon previous work on the actor machine, a machine model for dataflow actors where the action selection is central. We present two heuristics that, based on profiling data, create fast action selectors using the actor machine. The heuristics are implemented in the Tÿcho Dataflow Compiler and are evaluated using a video decoder written in Cal.
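Action selection as described here, where an action fires only when all of its conditions on state and input channels hold, can be illustrated with a naive selector. This is not the actor machine or either of the paper's heuristics; it is a baseline sketch that tests actions in a fixed order. All names (`select_action`, `State`, `Channels`, the example actions) are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]            # hypothetical actor state
Channels = Dict[str, List[int]]   # tokens waiting on each input channel
Condition = Callable[[State, Channels], bool]
Action = Tuple[str, List[Condition]]


def select_action(actions: List[Action], state: State,
                  inputs: Channels) -> Optional[str]:
    """Return the name of the first action whose conditions all hold,
    or None if no action can be selected."""
    for name, conditions in actions:
        # all() stops at the first failing condition for this action.
        if all(cond(state, inputs) for cond in conditions):
            return name
    return None


if __name__ == "__main__":
    actions: List[Action] = [
        ("consume_pair", [lambda s, ch: len(ch["in"]) >= 2]),
        ("consume_one", [lambda s, ch: len(ch["in"]) >= 1,
                         lambda s, ch: s["mode"] == 0]),
    ]
    print(select_action(actions, {"mode": 0}, {"in": [7]}))
```

A selector like this may re-test the same condition for several actions; reducing that kind of redundant testing is the sort of cost a faster action selector would aim to avoid.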