Publications

Featured research published by Anders Landin.


IEEE Computer | 1992

DDM: a cache-only memory architecture

Erik Hagersten; Anders Landin; Seif Haridi

The Data Diffusion Machine (DDM), a cache-only memory architecture (COMA) that relies on a hierarchical network structure, is described. The key ideas behind DDM are introduced by describing a small machine, which could be a COMA on its own or a subsystem of a larger COMA, and its protocol. A large machine with hundreds of processors is also described. The DDM prototype project is discussed, and simulated performance results are presented.


International Parallel Processing Symposium | 1994

Queue locks on cache coherent multiprocessors

Peter S. Magnusson; Anders Landin; Erik Hagersten

Large-scale shared-memory multiprocessors typically have long latencies for remote data accesses. A key issue for execution performance of many common applications is the synchronization cost. The communication scalability of synchronization has been improved by the introduction of queue-based spin-locks instead of Test&(Test&Set). For architectures with long access latencies for global data, attention should also be paid to the number of global accesses that are involved in synchronization. We present a method to characterize the performance of proposed queue lock algorithms, and apply it to previously published algorithms. We also present two new queue locks, the LH lock and the M lock. We compare the locks in terms of performance, memory requirements, code size and required hardware support. The LH lock is the simplest of all the locks, yet requires only an atomic swap operation. The M lock is superior in terms of global accesses needed to perform synchronization and still competitive in all other criteria. We conclude that the M lock is the best overall queue lock for the class of architectures studied.
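The core idea of a queue lock of this kind is that each acquirer performs a single atomic swap on a shared tail pointer and then spins only on its predecessor's flag, so contended spinning stays local rather than hammering a global location. The following is a minimal C11 sketch of that scheme, not the paper's implementation; the names (`lh_lock`, `lh_acquire`, etc.) and the node-reuse convention are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One queue node per thread currently holding or waiting for the lock. */
typedef struct qnode {
    _Atomic bool locked;          /* true while its owner holds or waits */
} qnode;

typedef struct {
    _Atomic(qnode *) tail;        /* last node in the waiting queue */
} lh_lock;

/* The lock starts with a dummy node whose flag is already clear. */
void lh_init(lh_lock *l, qnode *dummy) {
    atomic_store(&dummy->locked, false);
    atomic_store(&l->tail, dummy);
}

/* Acquire: one atomic swap enqueues us; then spin on the predecessor only. */
qnode *lh_acquire(lh_lock *l, qnode *mine) {
    atomic_store(&mine->locked, true);
    qnode *pred = atomic_exchange(&l->tail, mine);
    while (atomic_load(&pred->locked))
        ;                         /* local spin on one flag, no global traffic */
    return pred;                  /* caller reuses the predecessor's node later */
}

/* Release: clear our own flag; the successor spinning on it proceeds. */
void lh_release(qnode *mine) {
    atomic_store(&mine->locked, false);
}
```

Note how acquisition costs exactly one global atomic operation (the swap), which is the property the paper's comparison of global access counts is about.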


High-Performance Computer Architecture | 1995

An argument for simple COMA

Ashley Saulsbury; Tim Wilkinson; John B. Carter; Anders Landin

We present design details and some initial performance results of a novel scalable shared memory multiprocessor architecture. This architecture features the automatic data migration and replication capabilities of cache-only memory architecture (COMA) machines, without the accompanying hardware complexity. A software layer manages cache space allocation at page granularity, similarly to distributed virtual shared memory (DVSM) systems, leaving simpler hardware to maintain shared memory coherence at cache-line granularity. By reducing the hardware complexity, the machine cost and development time are reduced. We call the resulting hybrid hardware and software multiprocessor architecture Simple COMA. Preliminary results indicate that the performance of Simple COMA is comparable to that of more complex contemporary all-hardware designs.


International Symposium on Microarchitecture | 2009

Rock: A High-Performance Sparc CMT Processor

Shailender Chaudhry; Robert E. Cypher; Magnus Ekman; Martin Karlsson; Anders Landin; Sherman Yip; Håkan Zeffer; Marc Tremblay

Rock, Sun's third-generation chip-multithreading processor, contains 16 high-performance cores, each of which can support two software threads. Rock uses a novel checkpoint-based architecture to support automatic hardware scouting under a load miss, speculative out-of-order retirement of instructions, and aggressive dynamic hardware parallelization of a sequential instruction stream. It is also the first processor to support transactional memory in hardware.


International Symposium on Computer Architecture | 1991

Race-free interconnection networks and multiprocessor consistency

Anders Landin; Erik Hagersten; Seif Haridi

Modern shared-memory multiprocessors require complex interconnection networks to provide sufficient communication bandwidth between processors. They also rely on advanced memory systems that allow multiple memory operations to be made in parallel. It is expensive to maintain a high consistency level in a machine based on a general network, but for special interconnection topologies, some of these costs can be reduced. We define and study one class of interconnection networks, race-free networks. New conditions for sequential consistency are presented which show that sequential consistency can be maintained if all accesses in a multiprocessor can be ordered in an acyclic graph. We show that this can be done in race-free networks without the need for a transaction to be globally performed before the next transaction can be issued. We also investigate what is required to maintain processor consistency in race-free networks. In a race-free network which maintains processor consistency, writes may be pipelined, and reads may bypass writes. The proposed methods reduce the latencies associated with processor write misses to shared data.
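The acyclicity condition can be illustrated with a small graph check: if the ordering constraints among memory accesses (program order plus the order imposed by conflicting accesses) form an acyclic graph, they can be extended to a single total order, which is what sequential consistency requires. The sketch below is a plain depth-first cycle check, assuming a tiny adjacency-matrix encoding invented here for illustration; it is not from the paper.

```c
#include <string.h>

#define MAX_N 16

/* Access-order graph: edge u->v means access u must be globally
 * ordered before access v. */
typedef struct {
    int n;                        /* number of accesses */
    int adj[MAX_N][MAX_N];        /* adj[u][v] != 0 for an edge u->v */
} order_graph;

/* DFS colouring: 0 = unvisited, 1 = on current path, 2 = finished. */
static int dfs(const order_graph *g, int u, int *color) {
    color[u] = 1;
    for (int v = 0; v < g->n; v++) {
        if (!g->adj[u][v]) continue;
        if (color[v] == 1) return 0;              /* back edge: cycle */
        if (color[v] == 0 && !dfs(g, v, color)) return 0;
    }
    color[u] = 2;
    return 1;
}

/* Returns 1 if the access ordering is acyclic, i.e. it can be extended
 * to one total order of all accesses, as sequential consistency needs. */
int sc_orderable(const order_graph *g) {
    int color[MAX_N];
    memset(color, 0, sizeof color);
    for (int u = 0; u < g->n; u++)
        if (color[u] == 0 && !dfs(g, u, color))
            return 0;
    return 1;
}
```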


Hawaii International Conference on System Sciences | 1994

Simple COMA node implementations

Erik Hagersten; Ashley Saulsbury; Anders Landin

Shared memory architectures often have caches to reduce the number of slow remote memory accesses. The largest possible caches exist in shared memory architectures called Cache-Only Memory Architectures (COMAs). In a COMA all the memory resources are used to implement large caches. Unfortunately, these large caches also have their price. Due to its lack of physically shared memory, a COMA may suffer from a longer remote access latency than alternatives. Large COMA caches might also introduce an extra latency for local memory accesses, unless the node architecture is designed with care. The authors examine the implementation of COMAs, and consider how to move much of the complex functionality into software. They introduce the idea of a simple COMA architecture, a hybrid with hardware support only for the functionality frequently used. Such a system is expected to have good performance, and because of its simplicity it should be quick and cheap to develop and engineer.
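The division of labour in Simple COMA is that software allocates attraction-memory space at page granularity (as a DVSM system would, via a page-fault handler), while hardware keeps per-cache-line coherence state inside each allocated page. A minimal C sketch of that split, assuming an invented page layout and names (`am_page`, `coma_page_fault`, etc.); the real coherence protocol is not modelled:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_LINES 64             /* cache lines per page (assumed) */

typedef enum { INVALID, SHARED, EXCLUSIVE } line_state;

/* One locally allocated attraction-memory page. The software layer
 * creates these on a page fault; hardware tracks per-line state. */
typedef struct {
    uintptr_t vpage;               /* virtual page number */
    line_state state[PAGE_LINES];  /* per-line coherence state */
} am_page;

/* Software fault handler: allocate local space for a page. All lines
 * start INVALID, so the first touch of each line still fetches data. */
void coma_page_fault(am_page *p, uintptr_t vpage) {
    p->vpage = vpage;
    for (int i = 0; i < PAGE_LINES; i++)
        p->state[i] = INVALID;
}

/* Hardware path on a load: hit only if the line holds valid data;
 * otherwise the line-level coherence protocol must fetch it. */
bool coma_load_hits(const am_page *p, int line) {
    return p->state[line] != INVALID;
}
```

The point of the split is that the rare, complex operation (finding space for a page) goes to software, while the frequent one (checking a line's state on every access) stays in simple hardware.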


High-Performance Computer Architecture | 1996

Bus-based COMA: reducing traffic in shared-bus multiprocessors

Anders Landin; Fredrik Dahlgren

A problem with bus-based shared-memory multiprocessors is that the shared bus rapidly becomes a bottleneck in the machine, effectively limiting the machine size to somewhere between ten and twenty processors. We propose a new architecture, the bus-based COMA (BB-COMA), that addresses this problem. Compared to the standard UMA architecture, the BB-COMA has lower requirements on bus bandwidth. We have used program-driven simulation to study the two architectures running applications from the SPLASH suite. We observed a traffic reduction of up to 70% for BB-COMA, with an average of 46%, for the programs studied. The results indicate that the BB-COMA is an interesting candidate architecture for future implementations of shared-bus multiprocessors.


International Parallel and Distributed Processing Symposium | 2003

The coherence predictor cache: a resource-efficient and accurate coherence prediction infrastructure

Jim Nilsson; Anders Landin; Per Stenström

Two-level coherence predictors have shown great promise to reduce coherence overhead in shared memory multiprocessors. However, to be accurate they require a memory overhead that, on e.g. a 64-processor machine, can be as high as 50%. Based on an application case study consisting of seven applications from SPLASH-2, a first observation made in this paper is that memory blocks subject to coherence activities usually constitute only a small fraction (around 10%) of the entire application footprint. Based on this, we contribute a new class of resource-efficient coherence predictors that is organized as a cache attached to each memory controller. We show that such a Coherence Predictor Cache (CPC) can provide nearly as effective predictions as if a predictor is associated with every memory block, but needs only 2-7% as many predictors.
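The resource saving comes from caching predictor state instead of dedicating it to every memory block: only blocks that actually see coherence activity occupy an entry. A minimal C sketch of that lookup/allocate path, assuming a direct-mapped organization; the entry layout and the names (`cpc_lookup`, `CPC_SETS`) are invented for illustration and stand in for the paper's two-level predictor tables:

```c
#include <stdint.h>
#include <stddef.h>

#define CPC_SETS 256              /* predictor-cache entries (assumed size) */

/* One predictor entry: a tag plus a small history field standing in
 * for the real per-block prediction state. */
typedef struct {
    uint64_t tag;                 /* block address owning this entry */
    uint8_t  valid;
    uint8_t  history;             /* recent coherence-history bits */
} cpc_entry;

typedef struct {
    cpc_entry set[CPC_SETS];
} cpc;

/* Look up the predictor for a memory block; allocate on a coherence
 * event, so only actively shared blocks consume predictor storage. */
cpc_entry *cpc_lookup(cpc *c, uint64_t block_addr, int allocate) {
    cpc_entry *e = &c->set[block_addr % CPC_SETS];
    if (e->valid && e->tag == block_addr)
        return e;                 /* hit: block already has a predictor */
    if (!allocate)
        return NULL;              /* cold block: no predictor state kept */
    e->tag = block_addr;          /* replace: the evicted block's state is lost */
    e->valid = 1;
    e->history = 0;
    return e;
}
```

Since roughly 10% of blocks see coherence activity, a cache sized for that active fraction can serve nearly all predictions while storing only a few percent as many entries as a per-block scheme.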


High-Performance Computer Architecture | 1997

Reducing the replacement overhead in bus-based COMA multiprocessors

Fredrik Dahlgren; Anders Landin

In a multiprocessor with a Cache-Only Memory Architecture (COMA) all available memory is used to form large cache memories called attraction memories. These large caches help to satisfy shared memory accesses locally, reducing the need for node-external communication. However, since a COMA has no back-up main memory, blocks replaced from one attraction memory must be relocated into another attraction memory. To keep memory overhead low, it is desirable to have most of the memory space filled with unique data. This leaves little space for replication of cache blocks, with the result that replacement traffic may become excessive. We have studied two schemes for removing the traditional demand for full inclusion between the lower-level caches and the attraction memory: the loose-inclusion and no-inclusion schemes. They differ in efficiency but also in implementation cost. Detailed simulation results show that the replacement traffic is reduced substantially for both approaches, indicating that breaking inclusion is an efficient way to bound the sensitivity to high memory pressure in COMA machines.


International Conference on Parallel Architectures and Languages Europe | 1993

Simulating the Data Diffusion Machine

Erik Hagersten; Mats Grindal; Anders Landin; Ashley Saulsbury; Bengt Werner; Seif Haridi

Large-scale multiprocessors suffer from long latencies for remote accesses. Caching is by far the most popular technique for hiding such delays. Caching not only hides the delay, but also decreases the network load. Cache-Only Memory Architectures (COMA) have no physically shared memory. Instead, all the memory resources are invested in caches, enabling caches of the largest possible size. A datum has no home, and is moved by a protocol between the caches according to its usage. Furthermore, it might exist in multiple caches. Even though no shared memory exists in the traditional sense, the architecture provides a shared memory view to a processor, and hence also to the programmer. The simulation results of large programs running on up to 128 processors indicate that the COMA adapts well to existing shared memory programs. They also show that an application with poor locality can benefit by adopting the COMA principle of no fixed home for data, resulting in a reduction of execution time by a factor of three.

Collaboration


Dive into Anders Landin's collaboration.

Top Co-Authors

Fredrik Dahlgren

Chalmers University of Technology

Per Stenström

Chalmers University of Technology
