
Per Stenström

Chalmers University of Technology

Publication

Featured research published by Per Stenström.

ACM Transactions on Embedded Computing Systems | 2008

The worst-case execution-time problem—overview of methods and survey of tools

Reinhard Wilhelm; Jakob Engblom; Andreas Ermedahl; Niklas Holsti; Stephan Thesing; David B. Whalley; Guillem Bernat; Christian Ferdinand; Reinhold Heckmann; Tulika Mitra; Frank Mueller; Isabelle Puaut; Peter P. Puschner; Jan Staschulat; Per Stenström

The determination of upper bounds on execution times, commonly called worst-case execution times (WCETs), is a necessary step in the development and validation process for hard real-time systems. This problem is hard if the underlying processor architecture has components such as caches, pipelines, branch prediction, and other speculative components. This article describes different approaches to this problem and surveys several commercially available tools and research prototypes.

IEEE Computer | 1990

A survey of cache coherence schemes for multiprocessors

Per Stenström

Schemes for cache coherence that exhibit various degrees of hardware complexity, ranging from protocols that maintain coherence in hardware, to software policies that prevent the existence of copies of shared, writable data, are surveyed. Some examples of the use of shared data are examined. These examples help point out a number of performance issues. Hardware protocols are considered. It is seen that consistency can be maintained efficiently, although in some cases with considerable hardware complexity, especially for multiprocessors with many processors. Software schemes are investigated as an alternative capable of reducing the hardware cost.

European Conference on Parallel Processing | 2000

Parallel Computer Architecture

Silvia M. Müller; Per Stenström; Mateo Valero; Stamatis Vassiliadis

Computer architecture is a truly fascinating field in that improvements in the basic technology and innovations in how to make best use of the underlying technology have yielded a performance growth exceeding a million times over the past 50 years. What is even more amazing is the fact that the pressure on maintaining this rate of performance growth shows no decline. In fact, as performance thresholds are passed, application designers face new opportunities that give new challenging problems to work on for computer architects.

Real-Time Systems Symposium | 1999

Timing anomalies in dynamically scheduled microprocessors

Thomas Lundqvist; Per Stenström

Previous timing analysis methods have assumed that the worst-case instruction execution time necessarily corresponds to the worst-case behavior. We show that this assumption is wrong in dynamically scheduled processors. A cache miss, for example, can in some cases result in a shorter execution time than a cache hit. Many examples of such timing anomalies are provided. We first provide necessary conditions under which timing anomalies can show up and identify which architectural features may cause such anomalies. We also show that analyzing the effect of these anomalies with known techniques results in prohibitive computational complexities. Instead, we propose some simple code modification techniques to make it impossible for any anomalies to occur. These modifications make it possible to estimate WCET by known techniques. Our evaluation shows that the pessimism imposed by these techniques is fairly limited; it is less than 27% for the programs in our benchmark suite.

International Symposium on Computer Architecture | 1993

An adaptive cache coherence protocol optimized for migratory sharing

Per Stenström; Mats Brorsson; Lars Sandberg

Parallel programs that use critical sections and are executed on a shared-memory multiprocessor with a write-invalidate protocol result in invalidation actions that could be eliminated. For this type of sharing, called migratory sharing, each processor typically causes a cache miss followed by an invalidation request which could be merged with the preceding cache-miss request. In this paper we propose an adaptive protocol that invokes this optimization dynamically for migratory blocks. For other blocks, the protocol works as an ordinary write-invalidate protocol. We show that the protocol is a simple extension to a write-invalidate protocol. Based on a program-driven simulation model of an architecture similar to the Stanford DASH, and a set of four benchmarks, we evaluate the potential performance improvements of the protocol. We find that it effectively eliminates most single invalidations which improves the performance by reducing the shared access penalty and the network traffic.
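
The effect described in this abstract can be illustrated with a minimal sketch that counts coherence requests for a single block. Everything here is an assumption made for illustration: the function name `count_requests`, the plain invalid/shared/modified model, and the detection rule (a block is flagged migratory when an invalidation arrives while exactly two copies exist and the requester is not the last writer). The sketch also never reverts a block to non-migratory, which a complete protocol would need to handle.

```python
def count_requests(accesses, adaptive):
    """Count (misses, invalidations) for one block under write-invalidate.

    accesses: list of (processor_id, op) with op in {"r", "w"}.
    adaptive: if True, apply the hypothetical migratory optimization.
    """
    sharers = set()        # processors holding a valid copy
    owner = None           # processor holding the exclusive copy, if any
    last_writer = None
    migratory = False      # hypothetical per-block flag
    misses = invalidations = 0

    for proc, op in accesses:
        if proc not in sharers:
            misses += 1
            if op == "w" or (adaptive and migratory):
                # Exclusive copy: a later write by proc needs no request.
                sharers, owner = {proc}, proc
            else:
                sharers.add(proc)
                owner = None
        elif op == "w" and owner != proc:
            # Write to a shared copy: separate invalidation request.
            invalidations += 1
            if adaptive and len(sharers) == 2 and last_writer not in (None, proc):
                migratory = True   # assumed detection heuristic
            sharers, owner = {proc}, proc
        if op == "w":
            last_writer = proc
    return misses, invalidations


# Four processors take turns doing read-then-write, as in a critical section.
trace = [(p, op) for _ in range(2) for p in range(4) for op in ("r", "w")]
print(count_requests(trace, adaptive=False))
print(count_requests(trace, adaptive=True))
```

In this toy trace the number of misses is unchanged, while most invalidation requests disappear once the block is flagged, because the read miss already fetches an exclusive copy.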

IEEE Transactions on Parallel and Distributed Systems | 1995

Sequential hardware prefetching in shared-memory multiprocessors

Fredrik Dahlgren; Michel Dubois; Per Stenström

To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions of the number of read misses, the read penalty, and of the execution time by up to 78%, 58%, and 25% respectively.
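
As a rough illustration of the idea of adapting the prefetch degree, the following sketch simulates fixed and adaptive sequential prefetching on a trace of block numbers. It is not the mechanism from the paper: the unbounded cache, the measurement window, the thresholds `high` and `low`, and the doubling/halving policy are all assumptions chosen to keep the example short.

```python
def simulate(trace, adaptive=True, degree=1, max_degree=8,
             window=16, high=0.75, low=0.40):
    """Return (read_misses, final_degree) for a trace of block numbers.

    On each miss, the next `degree` consecutive blocks are prefetched.
    If `adaptive`, the degree is adjusted after every `window` issued
    prefetches, based on the fraction that were later referenced.
    All parameters are illustrative assumptions.
    """
    cache = set()        # resident blocks (unbounded, for simplicity)
    prefetched = set()   # prefetched blocks not yet referenced
    misses = 0
    issued = useful = 0  # counters for the current window

    for block in trace:
        if block in cache:
            if block in prefetched:
                prefetched.discard(block)
                useful += 1          # a prefetch paid off
        else:
            misses += 1
            cache.add(block)
            for b in range(block + 1, block + 1 + degree):
                if b not in cache:
                    cache.add(b)
                    prefetched.add(b)
                    issued += 1
        if adaptive and issued >= window:
            ratio = useful / issued
            if ratio > high:
                degree = min(max_degree, degree * 2)
            elif ratio < low:
                degree = max(1, degree // 2)
            issued = useful = 0
    return misses, degree


sequential = list(range(64))          # high spatial locality
strided = list(range(0, 640, 10))     # prefetched blocks are never used
print(simulate(sequential, adaptive=False))
print(simulate(sequential, adaptive=True))
print(simulate(strided, adaptive=True, degree=4))
```

On the sequential trace the adaptive variant increases the degree and takes fewer misses than the fixed variant; on the strided trace it backs off to the minimum degree, which limits useless prefetch traffic.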

International Conference on Parallel Processing | 1993

Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors

Fredrik Dahlgren; Michel Dubois; Per Stenström

To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show significant reductions of the read penalty and of the overall execution time.

High-Performance Computer Architecture | 2007

An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors

Haakon Dybdahl; Per Stenström

The significant speed-gap between processor and memory and the limited chip memory bandwidth make last-level cache performance crucial for future chip multiprocessors. To use the capacity of shared last-level caches efficiently and to allow for a short access time, proposed non-uniform cache architectures (NUCAs) are organized into per-core partitions. If a core runs out of cache space, blocks are typically relocated to nearby partitions, thus managing the cache as a shared cache. This uncontrolled sharing of all resources may unfortunately result in pollution that degrades performance. We propose a novel non-uniform cache architecture in which the amount of cache space that can be shared among the cores is controlled dynamically. The adaptive scheme estimates, continuously, the effect of increasing/decreasing the shared partition size on the overall performance. We show that our scheme outperforms a private and shared cache organization as well as a hybrid NUCA organization in which blocks in a local partition can spill over to neighbor core partitions.

Real-Time Systems | 1999

An Integrated Path and Timing Analysis Method based on Cycle-Level Symbolic Execution

Thomas Lundqvist; Per Stenström

Previously published methods for estimation of the worst-case execution time on high-performance processors with complex pipelines and multi-level memory hierarchies result in overestimations owing to insufficient path and/or timing analysis. This does not only give rise to poor utilization of processing resources but also reduces the schedulability in real-time systems. This paper presents a method that integrates path and timing analysis to accurately predict the worst-case execution time for real-time programs on high-performance processors. The unique feature of the method is that it extends cycle-level architectural simulation techniques to enable symbolic execution with unknown input data values; it uses alternative instruction semantics to handle unknown operands. We show that the method can exclude many infeasible (or non-executable) program paths and can calculate path information, such as bounds on number of loop iterations, without the need for manual annotations of programs. Moreover, the method is shown to accurately analyze timing properties of complex features in high-performance processors using multiple-issue pipelines and instruction and data caches. The combined path and timing analysis capability is shown to derive exact estimates of the worst-case execution time for six out of seven programs in our benchmark suite.
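
The notion of alternative instruction semantics for unknown operands can be sketched as follows. This is a toy model rather than the method from the paper: the `UNKNOWN` sentinel, the helper names, the example loop, and the cycle costs (10 for the slow branch, 3 for the fast branch, 1 for the loop overhead) are all assumptions. The sketch explores both outcomes of a branch whose condition is unknown and keeps the larger cycle count; it does nothing to limit the number of explored paths, which a practical analysis would have to address.

```python
UNKNOWN = object()   # stands for a value that depends on unknown input


def add(a, b):
    """Addition that yields UNKNOWN if either operand is unknown."""
    return UNKNOWN if a is UNKNOWN or b is UNKNOWN else a + b


def less_than(a, b):
    """Comparison that yields UNKNOWN if either operand is unknown."""
    return UNKNOWN if a is UNKNOWN or b is UNKNOWN else a < b


def wcet(i, n, data, cycles=0):
    """Worst-case cycles of the (hypothetical) loop:

        while i < n:
            if data[i] < 0: slow path (10 cycles) else fast path (3 cycles)
            i += 1                                   (1 cycle)
    """
    while True:
        cond = less_than(i, n)
        if cond is UNKNOWN:
            raise ValueError("loop bound depends on unknown input")
        if not cond:
            return cycles
        branch = less_than(data[i], 0)
        nxt = add(i, 1)
        if branch is UNKNOWN:
            # Both outcomes are feasible: follow each and keep the worst.
            return max(wcet(nxt, n, data, cycles + 10 + 1),
                       wcet(nxt, n, data, cycles + 3 + 1))
        # Known condition: only the feasible path is followed.
        cycles += (10 if branch else 3) + 1
        i = nxt


print(wcet(0, 3, [5, UNKNOWN, -1]))
```

Because the loop counter stays concrete, the iteration bound falls out of the execution itself without an annotation, and paths that contradict known data values are never explored.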

High-Performance Computer Architecture | 2000

A prefetching technique for irregular accesses to linked data structures

Magnus Karlsson; Fredrik Dahlgren; Per Stenström

Prefetching offers the potential to improve the performance of linked data structure (LDS) traversals. However, previously proposed prefetching methods only work well when there is enough work processing a node that the prefetch latency can be hidden, or when the LDS is long enough and the traversal path is known a priori. This paper presents a prefetching technique called prefetch arrays which can prefetch both short LDS, such as the lists found in hash tables, and trees when the traversal path is not known a priori. We offer two implementations, one software-only and one which combines software annotations with a prefetch engine in hardware. On a pointer-intensive benchmark suite, we show that our implementations reduce the memory stall time by 23% to 51% for the kernels with linked lists, while the other prefetching methods cause reductions that are substantially less. For binary trees, our hardware method manages to cut nearly 60% of the memory stall time even when the traversal path is not known a priori. However, when the branching factor of the tree is too high, our technique does not improve performance. Another contribution of the paper is that we quantify pointer-chasing found in interesting applications such as OLTP, Expert Systems, DSS, and JAVA codes and discuss which prefetching techniques are relevant to use in each case.
