Håkan Zeffer
Uppsala University
Publications
Featured research published by Håkan Zeffer.
international symposium on microarchitecture | 2009
Shailender Chaudhry; Robert E. Cypher; Magnus Ekman; Martin Karlsson; Anders Landin; Sherman Yip; Håkan Zeffer; Marc Tremblay
Rock, Sun's third-generation chip-multithreading processor, contains 16 high-performance cores, each of which can support two software threads. Rock uses a novel checkpoint-based architecture to support automatic hardware scouting under a load miss, speculative out-of-order retirement of instructions, and aggressive dynamic hardware parallelization of a sequential instruction stream. It is also the first processor to support transactional memory in hardware.
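A checkpoint/commit style of hardware transactional memory like the one described above is typically exposed to software as a begin/commit pair with a software fallback path for aborted transactions. The minimal C sketch below illustrates that usage pattern; the `htm_begin()`/`htm_commit()` intrinsics and the spinlock fallback are hypothetical stand-ins, not Rock's actual instruction interface.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical HTM intrinsics -- stand-ins for the checkpoint/commit
 * support a processor like Rock provides; not a real API. */
extern bool htm_begin(void);   /* returns false if the transaction aborts */
extern void htm_commit(void);

static atomic_flag fallback_lock = ATOMIC_FLAG_INIT;

/* Atomically transfer 'amount' between two counters: try the hardware
 * transaction first, fall back to a conventional spinlock on abort. */
void transfer(long *from, long *to, long amount)
{
    if (htm_begin()) {              /* speculative region starts here  */
        *from -= amount;            /* stores are buffered by hardware */
        *to   += amount;
        htm_commit();               /* publish all updates atomically  */
        return;
    }
    /* Fallback path: the transaction aborted (conflict, capacity, ...). */
    while (atomic_flag_test_and_set_explicit(&fallback_lock,
                                             memory_order_acquire))
        ;                           /* spin */
    *from -= amount;
    *to   += amount;
    atomic_flag_clear_explicit(&fallback_lock, memory_order_release);
}
```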
international symposium on performance analysis of systems and software | 2006
Erik J. Berg; Håkan Zeffer; Erik Hagersten
The introduction of general-purpose microprocessors running multiple threads will put the focus on methods and tools that help programmers write efficient parallel applications. Such a tool should be fast enough to meet a software developer's need for short turn-around time, but also accurate and flexible enough to provide trend-correct and intuitive feedback. This paper presents a novel sample-based method for analyzing the data locality of a multithreaded application. Very sparse data is collected during a single execution of the studied application. The architecture-independent information collected during the execution is fed to a mathematical memory-system model that predicts the cache miss ratio. The sparse data can be used to characterize the application's data locality with respect to almost any possible memory system, such as complicated multiprocessor multilevel cache hierarchies. Any combination of cache size, cache-line size and degree of sharing can be modeled. Each modeled design point takes only a fraction of a second to evaluate, even though the application from which the sampled data was collected may have executed for hours. This makes the tool useful not just for software developers, but also for hardware developers who need to evaluate a huge memory-system design space. The accuracy of the method is evaluated using a large number of commercial and technical multithreaded applications. The results produced by the algorithm are shown to be consistent with results from a traditional (and much slower) architecture simulation.
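To make the "evaluate any cache from one set of samples" idea concrete, the sketch below estimates the miss ratio of a fully associative LRU cache from sampled stack distances. It is a deliberate simplification under stated assumptions: it presumes the stack distances are already available, whereas deriving such estimates from very sparse, architecture-independent samples is the statistical model the paper actually contributes.

```c
#include <stddef.h>

/* Estimate the miss ratio of a fully associative LRU cache from sampled
 * stack distances (number of distinct cache lines touched between two
 * accesses to the same line).  A reuse whose stack distance is >= the
 * number of cache lines misses under LRU; cold (first-touch) samples are
 * encoded as -1 and always miss. */
double lru_miss_ratio(const long *stack_dist, size_t nsamples,
                      size_t cache_size_bytes, size_t line_size_bytes)
{
    size_t lines  = cache_size_bytes / line_size_bytes;
    size_t misses = 0;

    for (size_t i = 0; i < nsamples; i++)
        if (stack_dist[i] < 0 || (size_t)stack_dist[i] >= lines)
            misses++;

    return nsamples ? (double)misses / (double)nsamples : 0.0;
}
```

Sweeping `cache_size_bytes` and `line_size_bytes` over a design space reuses the same sample set, which is why each modeled design point costs only a fraction of a second regardless of how long the original run was.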
international conference on supercomputing | 2006
Håkan Zeffer; Zoran Radovic; Martin Karlsson; Erik Hagersten
The advances in semiconductor technology have set the shared-memory server trend towards processors with multiple cores per die and multiple threads per core. We believe that this technology shift forces a reevaluation of how to interconnect multiple such chips to form larger systems. This paper argues that by adding support for coherence traps in future chip multiprocessors, large-scale server systems can be formed at a much lower cost, owing to shorter design time, verification effort and time to market compared to a traditional all-hardware counterpart. In the proposed trap-based memory architecture (TMA), software trap handlers are responsible for obtaining read/write permission, whereas the coherence trap hardware is responsible for the actual permission check. In this paper we evaluate a TMA implementation (called TMA Lite) with a minimal amount of hardware extensions, all contained within the processor. The proposed mechanisms for coherence trap processing should not affect the critical path and have a negligible cost in terms of area and power for most processor designs. Our evaluation is based on detailed full-system simulation using out-of-order processors with one or two dual-threaded cores per die as processing nodes. The results show that a TMA-based distributed shared memory system can perform on par with a highly optimized hardware-based design.
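The division of labor in a TMA-style design (hardware checks permission, software obtains it) can be pictured with the sketch below: a trap handler that runs when an access lacks permission. The state names, messaging functions and handler signature are illustrative assumptions, not the paper's implementation, and the actual protocol details (directory lookups, invalidations, races) are elided.

```c
#include <stdint.h>

/* Hypothetical per-line permission states checked by the coherence-trap
 * hardware; names and layout are illustrative only. */
enum perm { PERM_INVALID, PERM_READ_ONLY, PERM_READ_WRITE };

struct line_state {
    enum perm perm;       /* checked by hardware on every access      */
    int       home_node;  /* node holding the directory for this line */
};

/* Assumed inter-node messaging layer (not part of the paper). */
extern void fetch_line_shared(int home, uintptr_t line_addr);
extern void fetch_line_exclusive(int home, uintptr_t line_addr);

/* Sketch of a software trap handler: hardware raises a coherence trap
 * when an access lacks permission, and software obtains it. */
void coherence_trap_handler(struct line_state *st, uintptr_t line_addr,
                            int is_write)
{
    if (is_write) {
        fetch_line_exclusive(st->home_node, line_addr); /* invalidate sharers */
        st->perm = PERM_READ_WRITE;
    } else if (st->perm == PERM_INVALID) {
        fetch_line_shared(st->home_node, line_addr);    /* get a read copy    */
        st->perm = PERM_READ_ONLY;
    }
    /* Return from the trap; hardware re-executes the faulting access. */
}
```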
conference on high performance computing (supercomputing) | 2007
Håkan Zeffer; Erik Hagersten
Advances in semiconductor technology have driven shared-memory servers toward processors with multiple cores per die and multiple threads per core. This paper presents simple hardware primitives enabling flexible and low-complexity multi-chip designs supporting an efficient inter-node coherence protocol implemented in software. We argue that our primitives and the example design presented in this paper have lower hardware overhead, have easier (and later) verification requirements, and provide the opportunity for flexible coherence protocols and simpler protocol bug corrections than traditional designs. Our evaluation is based on detailed full-system simulations of modern chip-multiprocessors and both commercial and HPC workloads. We compare a low-complexity system based on the proposed primitives with aggressive hardware multi-chip shared-memory systems and show that the performance is competitive across a large design space.
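Because the inter-node coherence protocol runs in software, its home-node side can be an ordinary state machine in memory. The sketch below shows a minimal MSI-style directory handler under assumed names and message primitives; it is an illustration of why protocol changes and bug fixes become software patches, not the protocol proposed in the paper.

```c
#include <stdint.h>

#define MAX_NODES 64

/* Minimal MSI-style directory entry kept in software at the home node.
 * The encoding is illustrative; the paper's protocol is richer. */
struct dir_entry {
    uint64_t sharers;   /* bitmask of nodes holding a read-only copy */
    int      owner;     /* node holding an exclusive copy, or -1     */
};

/* Assumed messaging primitives between nodes. */
extern void send_invalidate(int node, uintptr_t line);
extern void send_writeback_request(int node, uintptr_t line);
extern void send_data(int node, uintptr_t line, int exclusive);

/* Handle a read or write request arriving at the home node. */
void home_handle_request(struct dir_entry *d, uintptr_t line,
                         int requester, int is_write)
{
    if (d->owner >= 0 && d->owner != requester) {
        send_writeback_request(d->owner, line);   /* retrieve dirty data */
        d->sharers |= 1ULL << d->owner;
        d->owner = -1;
    }
    if (is_write) {
        for (int n = 0; n < MAX_NODES; n++)       /* invalidate sharers  */
            if (((d->sharers >> n) & 1) && n != requester)
                send_invalidate(n, line);
        d->sharers = 0;
        d->owner = requester;
        send_data(requester, line, /*exclusive=*/1);
    } else {
        d->sharers |= 1ULL << requester;
        send_data(requester, line, /*exclusive=*/0);
    }
}
```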
international parallel and distributed processing symposium | 2006
Håkan Zeffer; Zoran Radovic; Erik Hagersten
No single coherence strategy suits all applications well. Many promising adaptive protocols and coherence predictors, capable of dynamically modifying the coherence strategy, have been suggested over the years. While most dynamic detection schemes rely on plenty of dedicated hardware, the customization technique suggested in this paper requires no extra hardware support for its per-application coherence strategy. Instead, each application is profiled using a low-overhead profiling tool, and the appropriate coherence flag setting, suggested by the profiling, is specified when the application is launched. We have compared the performance of a hardware distributed shared memory (DSM) system (Sun WildFire) to a software DSM built with identical interconnect hardware and coherence strategy. With no support for flexibility, the software DSM runs on average 45 percent slower than the hardware DSM on the 12 studied applications, while the flexibility brings the software DSM to within 11 percent. Our all-software system outperforms the hardware DSM on four applications.
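The launch-time selection itself needs nothing more than reading a flag before the runtime starts, as in the sketch below. The strategy names, the environment variable and the init hook are hypothetical; the real system exposes its own flags, and the interesting work is the offline profiling that suggests the setting.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical coherence strategies selectable at application launch. */
enum coherence_strategy { WRITE_INVALIDATE, WRITE_UPDATE };

static enum coherence_strategy strategy = WRITE_INVALIDATE;

/* Pick up the strategy suggested by the offline profiling run, e.g.
 *   COHERENCE_STRATEGY=update ./app
 * (environment-variable name is an assumption for illustration). */
void coherence_init(void)
{
    const char *s = getenv("COHERENCE_STRATEGY");
    if (s && strcmp(s, "update") == 0)
        strategy = WRITE_UPDATE;        /* e.g. producer/consumer sharing */
    else
        strategy = WRITE_INVALIDATE;    /* default, e.g. migratory sharing */
}
```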
Archive | 2007
Erik J. Berg; Erik Hagersten; Mats Nilsson; Mikael Petterson; Magnus Vesterlund; Håkan Zeffer
Archive | 2008
Erik J. Berg; Erik Hagersten; Håkan Zeffer; Magnus Vesterlund; Mats Nilsson; Mikael Petterson
ieee international symposium on workload characterization | 2006
Pavlos Petoumenos; Georgios Keramidas; Håkan Zeffer; Stefanos Kaxiras; Erik Hagersten
parallel and distributed computing systems (isca) | 2005
Dan Wallin; Håkan Zeffer; Martin Karlsson; Erik Hagersten
Archive | 2006
Pavlos Petoumenos; Georgios Keramidas; Håkan Zeffer; Stefanos Kaxiras; Erik Hagersten