
Publication


Featured research published by David Ofelt.


International Symposium on Computer Architecture | 1994

The Stanford FLASH multiprocessor

Jeffrey S. Kuskin; David Ofelt; Mark Heinrich; John Heinlein; Richard T. Simoni; Kourosh Gharachorloo; John M. Chapin; David Nakahira; Joel Baxter; Mark Horowitz; Anoop Gupta; Mendel Rosenblum; John L. Hennessy

The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible (it can support a variety of different communication mechanisms) and simplifies the design and implementation. This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, which are derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASH's current status.
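To make the role of a programmable protocol processor concrete, here is a minimal sketch of a directory-based cache-coherence handler of the general kind a node controller like MAGIC executes in software. This is an illustration only, not FLASH's actual protocol code; the class, method names, and message tuples are all hypothetical.

```python
class Directory:
    """Illustrative directory: tracks, per cache line, which nodes hold
    a shared copy and which node (if any) holds an exclusive copy."""

    def __init__(self):
        self.sharers = {}   # line address -> set of node ids with a shared copy
        self.owner = {}     # line address -> node id with an exclusive copy

    def handle_read(self, line, requester):
        """Read miss: downgrade any exclusive owner, then record a sharer."""
        msgs = []
        if line in self.owner:
            msgs.append(("downgrade", self.owner.pop(line), line))
        self.sharers.setdefault(line, set()).add(requester)
        msgs.append(("data", requester, line))
        return msgs

    def handle_write(self, line, requester):
        """Write miss: invalidate all other copies, grant exclusive ownership."""
        msgs = [("invalidate", node, line)
                for node in self.sharers.pop(line, set()) if node != requester]
        if line in self.owner and self.owner[line] != requester:
            msgs.append(("invalidate", self.owner.pop(line), line))
        self.owner[line] = requester
        msgs.append(("data_exclusive", requester, line))
        return msgs
```

Because handlers like these run as code rather than fixed logic, swapping in a different protocol (or adding instrumentation) is a software change, which is the flexibility the abstract describes.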


Architectural Support for Programming Languages and Operating Systems | 1994

The performance impact of flexibility in the Stanford FLASH multiprocessor

Mark Heinrich; Jeffrey S. Kuskin; David Ofelt; John Heinlein; Joel Baxter; Jaswinder Pal Singh; Richard T. Simoni; Kourosh Gharachorloo; David Nakahira; Mark Horowitz; Anoop Gupta; Mendel Rosenblum; John L. Hennessy

A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%–12% slower than the idealized machine.


Architectural Support for Programming Languages and Operating Systems | 2000

FLASH vs. (simulated) FLASH: closing the simulation loop

Jeff Gibson; Robert Kunz; David Ofelt; Mark Horowitz; John L. Hennessy; Mark Heinrich

Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and quantifying the resulting simulator error can be difficult. More importantly, architects often assume without proof that although their simulator may make inaccurate absolute performance predictions, it will still accurately predict architectural trends. This paper studies the source and magnitude of error in a range of architectural simulators by comparing the simulated execution time of several applications and microbenchmarks to their execution time on the actual hardware being modeled. The existence of a hardware gold standard allows us to find, quantify, and fix simulator inaccuracies. We then use the simulators to predict architectural trends and analyze the sensitivity of the results to the simulator configuration. We find that most of our simulators predict trends accurately, as long as they model all of the important performance effects for the application in question. Unfortunately, it is difficult to know what these effects are without having a hardware reference, as they can be quite subtle. This calls into question the value, for architectural studies, of highly detailed simulators whose characteristics are not carefully validated against a real hardware design.
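The distinction the abstract draws between absolute error and trend accuracy can be sketched in a few lines. This is not the paper's methodology code, just an assumed illustration with made-up numbers: a simulator can be 10% off in absolute time yet still agree with the hardware on the direction of an architectural change.

```python
def pct_error(simulated, measured):
    """Signed simulator error relative to a hardware gold standard, in percent."""
    return 100.0 * (simulated - measured) / measured

def same_trend(sim_before, sim_after, hw_before, hw_after):
    """True if the simulator and the hardware agree on the direction of
    change (e.g. speedup vs. slowdown) across an architectural variation."""
    return (sim_after - sim_before) * (hw_after - hw_before) > 0
```

For example, a simulator reporting 110 s against a measured 100 s has a +10% absolute error, yet if both the simulator and the hardware get faster when a cache is enlarged, the trend prediction is still correct.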


Measurement and Modeling of Computer Systems | 1996

Integrating performance monitoring and communication in parallel computers

Margaret Martonosi; David Ofelt; Mark Heinrich

A large and increasing gap exists between processor and memory speeds in scalable cache-coherent multiprocessors. To cope with this situation, programmers and compiler writers must increasingly be aware of the memory hierarchy as they implement software. Tools to support memory performance tuning have, however, been hobbled by the fact that it is difficult to observe the caching behavior of a running program. Little hardware support exists specifically for observing caching behavior; furthermore, what support does exist is often difficult to use for making fine-grained observations about program memory behavior. Our work observes that in a multiprocessor, the actions required for memory performance monitoring are similar to those required for enforcing cache coherence. In fact, we argue that on several machines, the coherence/communication system itself can be used as machine support for performance monitoring. We have demonstrated this idea by implementing the FlashPoint memory performance monitoring tool. FlashPoint is implemented as a special performance-monitoring coherence protocol for the Stanford FLASH Multiprocessor. By embedding performance monitoring into a cache-coherence scheme based on a programmable controller, we can gather detailed, per-data-structure memory statistics with less than a 10% slowdown compared to unmonitored program executions. We present results on the accuracy of the data collected, and on how FlashPoint performance scales with the number of processors.
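The core idea, charging each miss seen by the coherence handler to the program data structure that owns the faulting address, can be sketched as follows. This is an assumed illustration, not FlashPoint's actual implementation; the region-registration interface and names are hypothetical.

```python
class MissProfiler:
    """Illustrative per-data-structure miss counter of the kind a
    monitoring coherence protocol could maintain on every miss."""

    def __init__(self, regions):
        # regions: list of (start, end, name) address ranges, one per
        # program data structure (hypothetically registered by an
        # instrumented allocator).
        self.regions = regions
        self.counts = {name: 0 for _, _, name in regions}
        self.counts["<unmapped>"] = 0

    def on_miss(self, addr):
        # Called from the software coherence handler on each miss; this
        # extra bookkeeping is why monitoring costs a modest slowdown.
        for start, end, name in self.regions:
            if start <= addr < end:
                self.counts[name] += 1
                return
        self.counts["<unmapped>"] += 1
```

Since the protocol handler already sees every miss in order to enforce coherence, adding the lookup-and-count step yields per-data-structure statistics without separate monitoring hardware.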


Design Automation Conference | 1998

Digital system simulation: methodologies and examples

Kunle Olukotun; Mark Heinrich; David Ofelt

Simulation serves many purposes during the design cycle of a digital system. In the early stages of design, high-level simulation is used for performance prediction and analysis. In the middle of the design cycle, simulation is used to develop the software algorithms and refine the hardware. In the later stages of design, simulation is used to make sure performance targets are reached and to verify the correctness of the hardware and software. The different simulation objectives require varying levels of modeling detail. To keep design time to a minimum, it is critical to structure the simulation environment to make it possible to trade off simulation performance for model detail in a flexible manner that allows concurrent hardware and software development. In this paper we describe the different simulation methodologies for developing complex digital systems, and give examples of one such simulation environment.


Proceedings of the IEEE | 1997

Hardware/software co-design of the Stanford FLASH multiprocessor

Mark Heinrich; David Ofelt; Mark Horowitz; John L. Hennessy

Hardware/software co-design is a methodology for solving design problems in systems with processors or embedded controllers where the design requirements mandate a functionality and performance level for the system, independent of the hardware and software boundary. In addition to the challenges of functional correctness and total system performance, design time is often a critical factor. To design MAGIC, the programmable memory and communication controller for the Stanford FLASH multiprocessor, the authors employed a hardware/software co-design methodology. This methodology allowed them to concurrently design the hardware and software, thereby reducing design time while simultaneously ensuring that the design would meet ambitious performance goals. Serializing the hardware and software design would have lengthened the design time and significantly increased the amount of redesign when the tradeoffs between the hardware and software implementations became clear late in the design process. The co-design approach led them to build a series of hierarchical simulators that allowed them to begin design verification early and to reduce the level of effort required to ensure a functional design.


Archive | 1999

Efficient performance prediction for modern microprocessors

David Ofelt; John L. Hennessy


Archive | 2013

Multi-link routing

David Ofelt; Steven Wilson Turner; Dennis C. Ferguson


Archive | 2008

Methods and apparatus for transmission of groups of cells via a switch fabric

Sarin Thomas; Srihari Vegesna; Pradeep Sindhu; Chi-Chung Kenny Chen; Jean-Marc Frailong; David Ofelt; Philip A. Thomas; Chang-Hong Wu


Archive | 2008

Systems and Methods for Recovering Memory

Debashis Basu; David Ofelt
