Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Mark Heinrich is active.

Publication


Featured research published by Mark Heinrich.


International Symposium on Computer Architecture | 1994

The Stanford FLASH multiprocessor

Jeffrey S. Kuskin; David Ofelt; Mark Heinrich; John Heinlein; Richard T. Simoni; Kourosh Gharachorloo; John M. Chapin; David Nakahira; Joel Baxter; Mark Horowitz; Anoop Gupta; Mendel Rosenblum; John L. Hennessy

The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible (it can support a variety of different communication mechanisms) and simplifies the design and implementation. This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, which are derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASH's current status.
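
As an illustration of the handler-based style the abstract describes, the sketch below shows a read-miss handler for a simplified bit-vector directory protocol, roughly the kind of operation MAGIC's protocol processor executes in software. The types, field names, and protocol details are assumptions for illustration, not FLASH's actual handler code.

```c
/* A minimal sketch of the kind of handler a programmable protocol
 * processor such as MAGIC might run for a directory-based coherence
 * protocol. All types, field names, and the bit-vector directory
 * format are illustrative assumptions, not the FLASH protocol code. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t sharers;   /* bit-vector: one bit per node with a copy */
    int      dirty;     /* 1 if exactly one node holds a modified copy */
    int      owner;     /* valid only when dirty */
} dir_entry_t;

/* Handle a read miss from `requester` on the line described by `d`.
 * Returns the node that must supply the data (-1 == home memory). */
static int handle_read_miss(dir_entry_t *d, int requester)
{
    int supplier;
    if (d->dirty) {
        /* Line is modified remotely: fetch from the owner and
         * downgrade the owner to a shared copy. */
        supplier = d->owner;
        d->sharers |= 1u << d->owner;
        d->dirty = 0;
    } else {
        /* Clean at home: memory supplies the data. */
        supplier = -1;
    }
    d->sharers |= 1u << requester;
    return supplier;
}

int main(void)
{
    dir_entry_t line = { .sharers = 0, .dirty = 1, .owner = 3 };
    int src = handle_read_miss(&line, 7);
    printf("supplier=%d sharers=0x%x dirty=%d\n",
           src, line.sharers, line.dirty);
    return 0;
}
```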


Architectural Support for Programming Languages and Operating Systems | 1994

The performance impact of flexibility in the Stanford FLASH multiprocessor

Mark Heinrich; Jeffrey S. Kuskin; David Ofelt; John Heinlein; Joel Baxter; Jaswinder Pal Singh; Richard T. Simoni; Kourosh Gharachorloo; David Nakahira; Mark Horowitz; Anoop Gupta; Mendel Rosenblum; John L. Hennessy

A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%–12% slower than the idealized machine.
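
The finding that FLASH tracks the idealized machine when miss rates are low follows from simple arithmetic: the extra handler latency is paid only on the fraction of references that miss. A back-of-envelope model is sketched below; all latencies and rates are hypothetical values, not the paper's measurements.

```c
/* Back-of-envelope model of the cost of a programmable protocol
 * processor: added handler latency is paid only on misses, so low
 * miss rates dilute it. All numbers are made-up illustrative values. */
#include <stdio.h>

int main(void)
{
    double miss_rate     = 0.01;   /* misses per reference (hypothetical) */
    double hit_time      = 1.0;    /* cycles (hypothetical) */
    double hw_miss_time  = 100.0;  /* hardwired controller miss latency */
    double extra_handler = 25.0;   /* added cycles for software handlers */

    double t_hw   = hit_time + miss_rate * hw_miss_time;
    double t_flex = hit_time + miss_rate * (hw_miss_time + extra_handler);
    printf("slowdown = %.1f%%\n", 100.0 * (t_flex / t_hw - 1.0));
    return 0;
}
```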


Architectural Support for Programming Languages and Operating Systems | 2000

FLASH vs. (simulated) FLASH: closing the simulation loop

Jeff Gibson; Robert Kunz; David Ofelt; Mark Horowitz; John L. Hennessy; Mark Heinrich

Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and quantifying the resulting simulator error can be difficult. More importantly, architects often assume without proof that although their simulator may make inaccurate absolute performance predictions, it will still accurately predict architectural trends. This paper studies the source and magnitude of error in a range of architectural simulators by comparing the simulated execution time of several applications and microbenchmarks to their execution time on the actual hardware being modeled. The existence of a hardware gold standard allows us to find, quantify, and fix simulator inaccuracies. We then use the simulators to predict architectural trends and analyze the sensitivity of the results to the simulator configuration. We find that most of our simulators predict trends accurately, as long as they model all of the important performance effects for the application in question. Unfortunately, it is difficult to know what these effects are without having a hardware reference, as they can be quite subtle. This calls into question the value, for architectural studies, of highly detailed simulators whose characteristics are not carefully validated against a real hardware design.
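
The paper's distinction between absolute error and trend error can be made concrete with a small calculation. The sketch below compares simulated and measured execution times for two configurations; the numbers are invented, but they show how a simulator can be well off in absolute terms yet predict the speedup trend almost exactly.

```c
/* Sketch of quantifying simulator error against a hardware "gold
 * standard": compare absolute execution times and, separately, the
 * architectural trend (here, the predicted speedup from a design
 * change). All numbers below are hypothetical. */
#include <stdio.h>

int main(void)
{
    /* Execution time (seconds) for base and improved configurations. */
    double hw_base = 10.0, hw_new = 8.0;     /* measured on hardware */
    double sim_base = 12.0, sim_new = 9.5;   /* predicted by simulator */

    double abs_err   = 100.0 * (sim_base - hw_base) / hw_base;
    double hw_trend  = hw_base / hw_new;     /* true speedup */
    double sim_trend = sim_base / sim_new;   /* predicted speedup */

    printf("absolute error: %.0f%%\n", abs_err);
    printf("true speedup %.2fx, predicted %.2fx (trend error %.1f%%)\n",
           hw_trend, sim_trend, 100.0 * (sim_trend / hw_trend - 1.0));
    return 0;
}
```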


Measurement and Modeling of Computer Systems | 1996

Integrating performance monitoring and communication in parallel computers

Margaret Martonosi; David Ofelt; Mark Heinrich

A large and increasing gap exists between processor and memory speeds in scalable cache-coherent multiprocessors. To cope with this situation, programmers and compiler writers must increasingly be aware of the memory hierarchy as they implement software. Tools to support memory performance tuning have, however, been hobbled by the fact that it is difficult to observe the caching behavior of a running program. Little hardware support exists specifically for observing caching behavior; furthermore, what support does exist is often difficult to use for making fine-grained observations about program memory behavior. Our work observes that in a multiprocessor, the actions required for memory performance monitoring are similar to those required for enforcing cache coherence. In fact, we argue that on several machines, the coherence/communication system itself can be used as machine support for performance monitoring. We have demonstrated this idea by implementing the FlashPoint memory performance monitoring tool. FlashPoint is implemented as a special performance-monitoring coherence protocol for the Stanford FLASH multiprocessor. By embedding performance monitoring into a cache-coherence scheme based on a programmable controller, we can gather detailed, per-data-structure, memory statistics with less than a 10% slowdown compared to unmonitored program executions. We present results on the accuracy of the data collected, and on how FlashPoint performance scales with the number of processors.
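
The key idea, that a software coherence handler already sees every miss and can therefore attribute it to a data structure, can be sketched as below. The region table and handler are illustrative assumptions, not FlashPoint's implementation.

```c
/* Sketch of FlashPoint's idea: because every cache miss already
 * invokes a coherence handler, the handler can also attribute the
 * miss to a data structure by address range. The table and handler
 * below are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uintptr_t   lo, hi;   /* address range of one data structure */
    const char *name;
    long        misses;
} region_t;

static region_t regions[] = {
    { 0x10000, 0x20000, "matrix A", 0 },
    { 0x20000, 0x30000, "matrix B", 0 },
};

/* Called from the (software) coherence miss handler. */
static void count_miss(uintptr_t addr)
{
    for (size_t i = 0; i < sizeof regions / sizeof regions[0]; i++)
        if (addr >= regions[i].lo && addr < regions[i].hi)
            regions[i].misses++;
}

int main(void)
{
    count_miss(0x10040);
    count_miss(0x10080);
    count_miss(0x24000);
    for (size_t i = 0; i < 2; i++)
        printf("%-8s misses=%ld\n", regions[i].name, regions[i].misses);
    return 0;
}
```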


Proceedings of the IEEE | 1999

Cache-coherent distributed shared memory: perspectives on its development and future challenges

John L. Hennessy; Mark Heinrich; Anoop Gupta

Distributed shared memory is an architectural approach that allows multiprocessors to support a single shared address space that is implemented with physically distributed memories. Hardware-supported distributed shared memory is becoming the dominant approach for building multiprocessors with moderate to large numbers of processors. Cache coherence allows such architectures to use caching to take advantage of locality in applications without changing the programmer's model of memory. We review the key developments that led to the creation of cache-coherent distributed shared memory and describe the Stanford DASH multiprocessor, the first working implementation of hardware-supported scalable cache coherence. We then provide a perspective on such architectures and discuss important remaining technical challenges.


International Symposium on Computer Architecture | 2004

SMTp: An Architecture for Next-generation Scalable Multi-threading

Mainak Chaudhuri; Mark Heinrich

We introduce the SMTp architecture: an SMT processor augmented with a coherence protocol thread context that, together with a standard integrated memory controller, can enable the design of (among other possibilities) scalable cache-coherent hardware distributed shared memory (DSM) machines from commodity nodes. We describe the minor changes needed to a conventional out-of-order multi-threaded core to realize SMTp, discussing issues related to both deadlock avoidance and performance. We then compare SMTp performance to that of various conventional DSM machines with normal SMT processors both with and without integrated memory controllers. On configurations from 1 to 32 nodes, with 1 to 4 application threads per node, we find that SMTp delivers performance comparable to, and sometimes better than, machines with more complex integrated DSM-specific memory controllers. Our results also show that the protocol thread has extremely low pipeline overhead. Given the simplicity and the flexibility of the SMTp mechanism, we argue that next-generation multi-threaded processors with integrated memory controllers should adopt this mechanism as a way of building less complex high-performance DSM multiprocessors.
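
A rough sketch of the mechanism as described: the protocol thread is ordinary software running on a spare hardware context, looping over requests that the integrated memory controller posts. The queue layout and dispatch below are assumptions for illustration, not the SMTp microarchitecture.

```c
/* Sketch of the SMTp idea: one hardware thread context is dedicated
 * to running coherence-protocol handlers, dequeuing requests posted
 * by the integrated memory controller. Illustrative only. */
#include <stdio.h>

typedef enum { REQ_READ, REQ_WRITE, REQ_NONE } req_kind_t;

typedef struct { req_kind_t kind; unsigned long addr; } request_t;

/* Stand-in for the memory controller's request queue. */
static request_t queue[] = {
    { REQ_READ,  0x1000 },
    { REQ_WRITE, 0x2000 },
    { REQ_NONE,  0      },
};
static int head;

static request_t next_request(void) { return queue[head++]; }

/* The protocol thread: an ordinary software loop pinned to a spare
 * SMT context. In a real design it must never block on resources
 * the application threads can exhaust, or deadlock results. */
int main(void)
{
    for (;;) {
        request_t r = next_request();
        if (r.kind == REQ_NONE) break;   /* stand-in for "queue empty" */
        printf("protocol thread: %s for line 0x%lx\n",
               r.kind == REQ_READ ? "read handler" : "write handler",
               r.addr);
    }
    return 0;
}
```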


Design Automation Conference | 1998

Digital system simulation: methodologies and examples

Kunle Olukotun; Mark Heinrich; David Ofelt

Simulation serves many purposes during the design cycle of a digital system. In the early stages of design, high-level simulation is used for performance prediction and analysis. In the middle of the design cycle, simulation is used to develop the software algorithms and refine the hardware. In the later stages of design, simulation is used to make sure performance targets are reached and to verify the correctness of the hardware and software. The different simulation objectives require varying levels of modeling detail. To keep design time to a minimum, it is critical to structure the simulation environment to make it possible to trade off simulation performance for model detail in a flexible manner that allows concurrent hardware and software development. In this paper we describe the different simulation methodologies for developing complex digital systems, and give examples of one such simulation environment.
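
One common way to get the flexible speed-versus-detail trade-off the paper calls for is to make the model of each component pluggable, so a run can select a fast functional model or a detailed timing model. The sketch below is entirely illustrative, not the paper's environment.

```c
/* Sketch of trading simulation speed for model detail: the simulator
 * core dispatches to either a fast functional cache model or a
 * detailed timing model through a function pointer, so detail can be
 * chosen per run (or per phase). Entirely illustrative. */
#include <stdio.h>

typedef long (*cache_model_t)(unsigned long addr);

/* Fast model: fixed latency, no state. Good for software bring-up. */
static long fast_model(unsigned long addr)  { (void)addr; return 1; }

/* "Detailed" model stand-in: pretend some addresses miss. */
static long detailed_model(unsigned long addr)
{
    return (addr & 0x40) ? 100 : 1;   /* hypothetical miss latency */
}

int main(void)
{
    cache_model_t model = detailed_model;   /* pick detail level here */
    long cycles = 0;
    for (unsigned long a = 0; a < 0x200; a += 0x20)
        cycles += model(a);
    printf("simulated cycles: %ld\n", cycles);
    return 0;
}
```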


International Conference on Supercomputing | 2002

Leveraging cache coherence in active memory systems

Daehyun Kim; Mainak Chaudhuri; Mark Heinrich

Active memory systems help processors overcome the memory wall when applications exhibit poor cache behavior. They consist of either active memory elements that perform data parallel computations in the memory system itself, or an active memory controller that supports address re-mapping techniques that improve data locality. Both active memory approaches create coherence problems, even on uniprocessor systems, since there are either additional processors operating on the data directly, or the processor is allowed to refer to the same data via more than one address. While most active memory implementations require cache flushes, we propose a new technique to solve the coherence problem by extending the coherence protocol. Our active memory controller leverages and extends the coherence mechanism, so that re-mapping techniques work transparently on both uniprocessor and multiprocessor systems. We present a microarchitecture for an active memory controller with a programmable core and specialized hardware that accelerates cache line assembly and disassembly. We present detailed simulation results that show uniprocessor speedup from 1.3 to 7.6 on a range of applications and microbenchmarks. In addition to uniprocessor speedup, we show single-node multiprocessor speedup for parallel active memory applications and discuss how the same controller architecture supports coherent multi-node systems called active memory clusters.
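
One concrete re-mapping technique of the kind the abstract describes is a transposed "shadow" view of a matrix: a column walk in the original layout becomes a unit-stride walk in shadow space, with the controller gathering each shadow line from strided original addresses. The sketch below shows only the address arithmetic; the layout and names are assumptions for illustration.

```c
/* Sketch of one address-remapping technique an active memory
 * controller can support: a "shadow" address space that presents a
 * row-major matrix as if it were transposed. The controller gathers
 * each shadow cache line from strided original addresses. */
#include <stdio.h>

#define N 8   /* hypothetical N x N matrix of doubles */

/* Original address of element (i, j) in a row-major matrix. */
static unsigned long orig_addr(unsigned long base, int i, int j)
{
    return base + (unsigned long)(i * N + j) * sizeof(double);
}

/* The shadow space stores the transpose: shadow element (i, j)
 * aliases original element (j, i). The controller resolves this per
 * access, which is why coherence between the two names of the same
 * datum must be handled by the protocol. */
static unsigned long shadow_to_orig(unsigned long base, int i, int j)
{
    return orig_addr(base, j, i);
}

int main(void)
{
    unsigned long base = 0x100000;
    /* One shadow line = row i of the transpose = column i of the
     * original matrix: a strided gather in original space. */
    int i = 2;
    for (int j = 0; j < N; j++)
        printf("shadow(%d,%d) -> orig 0x%lx\n",
               i, j, shadow_to_orig(base, i, j));
    return 0;
}
```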


IEEE Transactions on Computers | 1999

A quantitative analysis of the performance and scalability of distributed shared memory cache coherence protocols

Mark Heinrich; Vijayaraghavan Soundararajan; John L. Hennessy; Anoop Gupta

Scalable cache coherence protocols have become the key technology for creating moderate to large-scale shared-memory multiprocessors. Although the performance of such multiprocessors depends critically on the performance of the cache coherence protocol, little comparative performance data is available. Existing commercial implementations use a variety of different protocols, including bit-vector/coarse-vector protocols, SCI-based protocols, and COMA protocols. Using the programmable protocol processor of the Stanford FLASH multiprocessor, we provide a detailed, implementation-oriented evaluation of four popular cache coherence protocols. In addition to measurements of the characteristics of protocol execution (e.g., memory overhead, protocol execution time, and message count) and of overall performance, we examine the effects of scaling the processor count from 1 to 128 processors. Surprisingly, the optimal protocol changes for different applications and can change with processor count even within the same application. These results help identify the strengths of specific protocols and illustrate the benefits of providing flexibility in the choice of cache coherence protocol.
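
The memory-overhead dimension of the comparison is easy to illustrate: a full bit-vector directory entry grows with the node count, while a coarse-vector entry stays fixed and trades precision instead. The arithmetic below, using a hypothetical 128-byte line and 16-bit coarse vector, is generic, not the paper's data.

```c
/* Directory memory overhead, bits of directory state per bits of
 * line data: a full bit-vector needs one bit per node per line,
 * while a coarse-vector keeps the width fixed and lets each bit
 * stand for a group of nodes. Parameters are hypothetical. */
#include <stdio.h>

int main(void)
{
    const double line_bits = 128 * 8;   /* 128-byte cache line */
    const int coarse_width = 16;        /* fixed coarse-vector width */

    for (int nodes = 16; nodes <= 128; nodes *= 2)
        printf("%3d nodes: full bit-vector %4.1f%%, "
               "coarse-vector %3.1f%% (1 bit ~ %d nodes)\n",
               nodes,
               100.0 * nodes / line_bits,
               100.0 * coarse_width / line_bits,
               nodes / coarse_width);
    return 0;
}
```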


IEEE Transactions on Computers | 2004

Architectural support for uniprocessor and multiprocessor active memory systems

Daehyun Kim; Mainak Chaudhuri; Mark Heinrich; Evan Speight

We introduce an architectural approach to improve memory system performance in both uniprocessor and multiprocessor systems. The architectural innovation is a flexible active memory controller backed by specialized cache coherence protocols that permit the transparent use of address remapping techniques. The resulting system shows significant performance improvement across a spectrum of machine configurations, from uniprocessors through single-node multiprocessors (SMPs) to distributed shared memory clusters (DSMs). Address remapping techniques exploit the data access patterns of applications to enhance their cache performance. However, they create coherence problems since the processor is allowed to refer to the same data via more than one address. While most active memory implementations require cache flushes, we present a new approach to solve the coherence problem. We leverage and extend the cache coherence protocol so that our techniques work transparently to efficiently support uniprocessor, SMP and DSM active memory systems. We detail the coherence protocol extensions to support our active memory techniques and present simulation results that show uniprocessor speedup from 1.3 to 7.6 on a range of applications and microbenchmarks. We also show remarkable performance improvement on small to medium-scale SMP and DSM multiprocessors, allowing some parallel applications to continue to scale long after their performance levels off on normal systems.
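
A sketch of the coherence hook that replaces whole-cache flushes: before the controller assembles a shadow line, it recalls any dirty cached copies of the original lines that the shadow line aliases, so the gathered data is current. The structures below are illustrative assumptions, not the papers' protocol.

```c
/* Sketch of keeping remapped (shadow) and original addresses
 * coherent without cache flushes: dirty originals are recalled via
 * the directory before a shadow line is assembled. Illustrative. */
#include <stdint.h>
#include <stdio.h>

#define WORDS_PER_LINE 4

/* Illustrative directory query: is the line holding this address
 * dirty in some cache? */
static int line_is_dirty(uintptr_t addr) { return (addr >> 6) & 1; }

static void recall_line(uintptr_t addr)
{
    printf("  recall dirty line holding 0x%lx\n", (unsigned long)addr);
}

/* Fill a shadow line: each word aliases a word at a strided
 * original address; dirty originals are recalled first so the
 * shadow copy observes the latest values. */
static void fill_shadow_line(uintptr_t orig_base, size_t stride)
{
    for (int w = 0; w < WORDS_PER_LINE; w++) {
        uintptr_t a = orig_base + (uintptr_t)w * stride;
        if (line_is_dirty(a))
            recall_line(a);
        /* ...then gather the word at address a into the shadow line. */
    }
}

int main(void)
{
    puts("assembling shadow line:");
    fill_shadow_line(0x1000, 0x40);
    return 0;
}
```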

Collaboration


Dive into Mark Heinrich's collaborations.

Top Co-Authors


Mainak Chaudhuri

Indian Institute of Technology Kanpur
