Publication


Featured research published by Erik Hagersten.


European Conference on Parallel Processing | 2003

THROOM: Supporting POSIX multithreaded binaries on a cluster

Henrik Löf; Zoran Radovic; Erik Hagersten



IEEE Computer | 1992

DDM: a cache-only memory architecture

Erik Hagersten; Anders Landin; Seif Haridi

The Data Diffusion Machine (DDM), a cache-only memory architecture (COMA) that relies on a hierarchical network structure, is described. The key ideas behind DDM are introduced by describing a small machine, which could be a COMA on its own or a subsystem of a larger COMA, and its protocol. A large machine with hundreds of processors is also described. The DDM prototype project is discussed, and simulated performance results are presented.
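The core COMA idea can be sketched in a few lines of C: node memory is organized like a cache, with a tag and a state per entry, so a data item has no fixed home node and is attracted to the memories of the nodes that use it. The types and names below are illustrative only, not the DDM protocol itself:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Conceptual sketch of a COMA "attraction memory" (hypothetical
 * names, not the actual DDM protocol).  All of a node's memory acts
 * as a huge cache: each entry holds a tag identifying the item that
 * currently occupies it, plus a coherence state. */
typedef enum { INVALID, SHARED, EXCLUSIVE } item_state;

typedef struct {
    uint64_t   tag;        /* which data item occupies this entry  */
    item_state state;
    uint8_t    data[64];   /* one coherence unit ("item") of data  */
} am_entry;

/* A read hits if the item has been attracted to this node.  On a
 * miss, a DDM-style machine sends the request up its hierarchical
 * network until a directory that knows of a holder is found, and the
 * item migrates or is replicated into this node's memory. */
bool am_read(am_entry *e, uint64_t item, uint8_t out[64])
{
    if (e->state != INVALID && e->tag == item) {
        memcpy(out, e->data, sizeof e->data);  /* local hit */
        return true;
    }
    return false;  /* miss: request must travel up the hierarchy */
}
```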


International Parallel Processing Symposium | 1994

Queue locks on cache coherent multiprocessors

Peter S. Magnusson; Anders Landin; Erik Hagersten

Large-scale shared-memory multiprocessors typically have long latencies for remote data accesses. A key issue for execution performance of many common applications is the synchronization cost. The communication scalability of synchronization has been improved by the introduction of queue-based spin-locks instead of Test&(Test&Set). For architectures with long access latencies for global data, attention should also be paid to the number of global accesses that are involved in synchronization. We present a method to characterize the performance of proposed queue lock algorithms, and apply it to previously published algorithms. We also present two new queue locks, the LH lock and the M lock. We compare the locks in terms of performance, memory requirements, code size and required hardware support. The LH lock is the simplest of all the locks, yet requires only an atomic swap operation. The M lock is superior in terms of global accesses needed to perform synchronization and still competitive in all other criteria. We conclude that the M lock is the best overall queue lock for the class of architectures studied.
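The LH and M locks are not reproduced in the abstract, but a sketch of the earlier MCS queue lock (one of the previously published algorithms such a study compares against) shows the general idea: each waiter spins on a flag in its own queue node, so only the lock hand-off generates global traffic. As the abstract notes, the LH lock achieves a queue discipline of this kind using only an atomic swap; the MCS release below additionally needs a compare-and-swap.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* MCS-style queue lock (illustrative; not the LH or M lock). */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;              /* true while this waiter spins */
} mcs_node;

typedef _Atomic(mcs_node *) mcs_lock;  /* tail of the waiter queue */

void mcs_acquire(mcs_lock *lock, mcs_node *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* One atomic swap enqueues us and reveals our predecessor. */
    mcs_node *pred = atomic_exchange(lock, me);
    if (pred != NULL) {
        atomic_store(&pred->next, me);
        while (atomic_load(&me->locked))
            ;                        /* spin on our own node only */
    }
}

void mcs_release(mcs_lock *lock, mcs_node *me)
{
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expected = me;
        /* No visible successor: try to mark the lock free. */
        if (atomic_compare_exchange_strong(lock, &expected, NULL))
            return;
        /* A successor is mid-enqueue; wait for it to link in. */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);  /* hand the lock off */
}
```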


High-Performance Computer Architecture | 1999

WildFire: a scalable path for SMPs

Erik Hagersten; Michael J. Koster

Researchers have searched for scalable alternatives to the symmetric multiprocessor (SMP) architecture since it was first introduced in 1982. The paper introduces an alternative view of the relationship between scalable technologies and SMPs. Instead of replacing large SMPs with scalable technology, we propose new scalable techniques that allow large SMPs to be tied together efficiently, while maintaining the compatibility with, and performance characteristics of, an SMP. The trade-offs of such an architecture differ from those of traditional scalable cache-coherent non-uniform memory architecture (cc-NUMA) approaches. WildFire is a distributed shared memory (DSM) prototype implementation based on large SMPs. It relies on two techniques for creating application-transparent locality: Coherent Memory Replication (CMR), which is a variation of Simple COMA/Reactive NUMA, and Hierarchical Affinity Scheduling (HAS). These two optimizations create extra node locality, which blurs the node boundaries to an application such that SMP-like performance can be achieved with no NUMA-specific optimizations. We present a performance study of a large OLTP benchmark running on DSMs built from various sized nodes and with varying amounts of application-transparent locality. WildFire's measured performance is shown to be more than two times that of an unoptimized NUMA implementation built from small nodes and within 13% of the performance of the ideal implementation: a large SMP with the same access time to its entire shared memory as the local memory access time of WildFire.


International Symposium on Performance Analysis of Systems and Software | 2004

StatCache: a probabilistic approach to efficient and accurate data locality analysis

Erik J. Berg; Erik Hagersten

The widening memory gap reduces performance of applications with poor data locality. Therefore, there is a need for methods to analyze data locality and help application optimization. In this paper we present StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads. StatCache is based on a probabilistic model of the cache, rather than a functional cache simulator. It uses statistics from a single run to accurately estimate miss ratios of fully-associative caches of arbitrary sizes and generate working-set graphs. We evaluate StatCache using the SPEC CPU2000 benchmarks and show that StatCache gives accurate results with a sampling rate as low as 10⁻⁴. We also provide a proof-of-concept implementation, and discuss potentially very fast implementation alternatives.
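A sketch of where such a probabilistic model leads, assuming a fully associative cache with `lines` entries and random replacement (the exact StatCache equations and solver may differ, and the reuse distances below are hypothetical): with miss ratio R, a window of d references contains roughly R*d misses, each evicting any given line with probability 1/lines, so a sampled access with reuse distance d misses with probability 1 - (1 - 1/lines)^(R*d). Averaging over samples yields a fixed-point equation in R, and one run's samples then answer questions about caches of any size:

```c
#include <math.h>
#include <stdio.h>

/* Estimate the miss ratio of a fully associative, randomly replaced
 * cache from sampled reuse distances d[i] (memory references between
 * two accesses to the same cache line).  Fixed-point iteration on
 *   R = (1/n) * sum_i (1 - (1 - 1/lines)^(R * d[i]))
 * This is a sketch of the modeling idea, not StatCache itself. */
double estimate_miss_ratio(const double *d, int n, double lines)
{
    double r = 0.5;                          /* initial guess */
    for (int it = 0; it < 100; it++) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += 1.0 - pow(1.0 - 1.0 / lines, r * d[i]);
        r = sum / n;                         /* fixed-point update */
    }
    return r;
}

int main(void)
{
    /* Hypothetical reuse-distance samples from a single run. */
    double d[] = { 10, 120, 3000, 45000, 45000, 1e6 };
    /* The same samples predict miss ratios for arbitrary sizes. */
    for (double lines = 512; lines <= 65536; lines *= 4)
        printf("%6.0f lines -> est. miss ratio %.3f\n",
               lines, estimate_miss_ratio(d, 6, lines));
    return 0;
}
```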


Measurement and Modeling of Computer Systems | 2005

Fast data-locality profiling of native execution

Erik J. Berg; Erik Hagersten

Performance tools based on hardware counters can efficiently profile the cache behavior of an application and help software developers improve its cache utilization. Simulator-based tools can potentially provide more insights and flexibility and model many different cache configurations, but have the drawback of large run-time overhead. We present StatCache, a performance tool based on a statistical cache model. It has a small run-time overhead while providing much of the flexibility of simulator-based tools. A monitor process running in the background collects sparse memory access statistics about the analyzed application running natively on a host computer. Generic locality information is derived and presented in a code-centric and/or data-centric view. We evaluate the accuracy and performance of the tool using ten SPEC CPU2000 benchmarks. We also exemplify how the flexibility of the tool can be used to better understand the characteristics of cache-related performance problems.
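The sparse collection step can be illustrated over an in-memory trace (the real monitor watches a natively running process instead; the names, sampling period, and trace here are hypothetical). Every SAMPLE_PERIOD-th access is selected, and the references until the next touch of the same cache line are counted, yielding exactly the kind of reuse-distance samples a StatCache-style model consumes:

```c
#include <stdio.h>

#define SAMPLE_PERIOD 4   /* sample every 4th access (illustrative) */

/* For each sampled access, count the references until the same cache
 * line is touched again: one reuse-distance sample per "watchpoint". */
void collect_reuse_samples(const long *trace, int n)
{
    for (int i = 0; i < n; i += SAMPLE_PERIOD) {
        long watched = trace[i];      /* watch this cache line */
        for (int j = i + 1; j < n; j++) {
            if (trace[j] == watched) {
                printf("sample at %d: reuse distance %d\n",
                       i, j - i - 1);
                break;
            }
        }
    }
}

int main(void)
{
    /* Hypothetical trace of cache-line addresses. */
    long trace[] = { 1, 2, 3, 1, 4, 2, 5, 1, 3, 4, 2, 1 };
    collect_reuse_samples(trace, sizeof trace / sizeof trace[0]);
    return 0;
}
```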


International Symposium on Performance Analysis of Systems and Software | 2010

StatStack: Efficient modeling of LRU caches

David Eklov; Erik Hagersten

Efficient execution on modern architectures requires good data locality, which can be measured by the powerful stack distance abstraction. Based on this abstraction, the miss rate for LRU caches of any size can be predicted. However, measuring stack distance requires the number of unique memory objects to be counted between successive accesses to the same data object, which requires complex and inefficient data collection.
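To make that cost concrete, here is the naive exact computation the abstract alludes to: an LRU stack of every touched cache line is maintained, and each access must locate its line's depth in that stack, which is its stack distance (the number of unique lines touched since the previous access to it). The sketch below is O(trace length × footprint), which is precisely the inefficiency a sampling-based estimator like StatStack avoids; MAX_LINES and the trace are hypothetical:

```c
#include <stdio.h>

#define MAX_LINES 1024           /* capacity of the tracked footprint */

static long stack[MAX_LINES];    /* stack[0] = most recently used */
static int  depth = 0;

/* Return the stack distance of `line` (-1 on first touch) and move
 * it to the top of the LRU stack.  The linear search on every access
 * is what makes exact stack-distance collection expensive. */
long stack_distance(long line)
{
    int found = -1;
    for (int i = 0; i < depth; i++)
        if (stack[i] == line) { found = i; break; }

    int top = found;
    if (found < 0) {                      /* first touch of this line */
        if (depth < MAX_LINES) top = depth++;
        else                   top = MAX_LINES - 1;  /* drop the LRU */
    }
    for (int j = top; j > 0; j--)         /* move `line` to the top */
        stack[j] = stack[j - 1];
    stack[0] = line;
    return found;
}

int main(void)
{
    long trace[] = { 1, 2, 3, 1, 2, 1 };  /* hypothetical line trace */
    for (int i = 0; i < 6; i++)
        printf("line %ld -> stack distance %ld\n",
               trace[i], stack_distance(trace[i]));
    return 0;
}
```

With an LRU cache of S lines, an access hits exactly when its stack distance is below S, so the distribution of these distances gives the miss rate for every cache size at once.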


International Symposium on Computer Architecture | 1991

Race-free interconnection networks and multiprocessor consistency

Anders Landin; Erik Hagersten; Seif Haridi

Modern shared-memory multiprocessors require complex interconnection networks to provide sufficient communication bandwidth between processors. They also rely on advanced memory systems that allow multiple memory operations to be made in parallel. It is expensive to maintain a high consistency level in a machine based on a general network, but for special interconnection topologies, some of these costs can be reduced. We define and study one class of interconnection networks, race-free networks. New conditions for sequential consistency are presented which show that sequential consistency can be maintained if all accesses in a multiprocessor can be ordered in an acyclic graph. We show that this can be done in race-free networks without the need for a transaction to be globally performed before the next transaction can be issued. We also investigate what is required to maintain processor consistency in race-free networks. In a race-free network which maintains processor consistency, writes may be pipelined, and reads may bypass writes. The proposed methods reduce the latencies associated with processor write-misses to shared data.


Hawaii International Conference on System Sciences | 1994

Simple COMA node implementations

Erik Hagersten; Ashley Saulsbury; Anders Landin

Shared memory architectures often have caches to reduce the number of slow remote memory accesses. The largest possible caches exist in shared memory architectures called Cache-Only Memory Architectures (COMAs). In a COMA all the memory resources are used to implement large caches. Unfortunately, these large caches also have their price. Due to its lack of physically shared memory, COMA may suffer from a longer remote access latency than alternatives. Large COMA caches might also introduce an extra latency for local memory accesses, unless the node architecture is designed with care. The authors examine the implementation of COMAs, and consider how to move much of the complex functionality into software. They introduce the idea of a simple COMA architecture, a hybrid with hardware support only for the functionality frequently used. Such a system is expected to have good performance, and because of its simplicity it should be quick and cheap to develop and engineer.


High-Performance Computer Architecture | 2003

Memory system behavior of Java-based middleware

Martin Karlsson; Kevin E. Moore; Erik Hagersten; David A. Wood

In this paper, we present a detailed characterization of the memory system behavior of ECperf and SPECjbb using both commercial server hardware and Simics full-system simulation. We find that the memory footprint and primary working sets of these workloads are small compared to other commercial workloads (e.g. on-line transaction processing), and that a large fraction of the working sets are shared between processors. We observed two key differences between ECperf and SPECjbb that highlight the importance of isolating the behavior of the middle tier. First, ECperf has a larger instruction footprint, resulting in much higher miss rates for intermediate-size instruction caches. Second, SPECjbb's data set size increases linearly as the benchmark scales up, while ECperf's remains roughly constant. This difference can lead to opposite conclusions on the design of multiprocessor memory systems, such as the utility of moderate sized (i.e. 1 MB) shared caches in a chip multiprocessor.

Collaboration


Dive into Erik Hagersten's collaborations.

Top Co-Authors


Mark D. Hill

University of Wisconsin-Madison
