Kourosh Gharachorloo | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Kourosh Gharachorloo is active.

Explore More

Publication

Featured researches published by Kourosh Gharachorloo.

international symposium on computer architecture | 1990

Memory consistency and event ordering in scalable shared-memory multiprocessors

Kourosh Gharachorloo; Daniel E. Lenoski; James P. Laudon; Phillip B. Gibbons; Anoop Gupta; John L. Hennessy

Scalable shared-memory multiprocessors distribute memory among the processors and use scalable interconnection networks to provide high bandwidth and low latency communication. In addition, memory accesses are cached, buffered, and pipelined to bridge the gap between the slow shared memory and the fast processors. Unless carefully controlled, such architectural optimizations can cause memory accesses to be executed in an order different from what the programmer expects. The set of allowable memory access orderings forms the memory consistency model or event ordering model for an architecture. This paper introduces a new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models. A framework for classifying shared accesses and reasoning about event ordering is developed. The release consistency model is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization. Possible performance gains from the less strict constraints of the release consistency model are explored. Finally, practical implementation issues are discussed, concentrating on issues relevant to scalable architectures.

IEEE Computer | 1996

Shared memory consistency models: a tutorial

Sarita V. Adve; Kourosh Gharachorloo

The memory consistency model of a system affects performance, programmability, and portability. We aim to describe memory consistency models in a way that most computer professionals would understand. This is important if the performance-enhancing features being incorporated by system designers are to be correctly and widely used by programmers. Our focus is consistency models proposed for hardware-based shared memory systems. Most of these models emphasize the system optimizations they support, and we retain this system-centric emphasis. We also describe an alternative, programmer-centric view of relaxed consistency models that describes them in terms of program behavior, not system optimizations.

international symposium on computer architecture | 1994

The Stanford FLASH multiprocessor

Jeffrey S. Kuskin; David Ofelt; Mark Heinrich; John Heinlein; Richard T. Simoni; Kourosh Gharachorloo; John M. Chapin; David Nakahira; Joel Baxter; Mark Horowitz; Anoop Gupta; Mendel Rosenblum; John L. Hennessy

The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machines global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible --- it can support a variety of different communication mechanisms --- and simplifies the design and implementation.This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, which are derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASHs current status.

international symposium on computer architecture | 1990

The directory-based cache coherence protocol for the DASH multiprocessor

Daniel E. Lenoski; James P. Laudon; Kourosh Gharachorloo; Anoop Gupta; John L. Hennessy

DASH is a scalable shared-memory multiprocessor currently being developed at Stanfords Computer Systems Laboratory. The architecture consists of powerful processing nodes, each with a portion of the shared-memory, connected to a scalable interconnection network. A key feature of DASH is its distributed directory-based cache coherence protocol. Unlike traditional snoopy coherence protocols, the DASH protocol does not rely on broadcast; instead it uses point-to-point messages sent between the processors and memories to keep caches consistent. Furthermore, the DASH system does not contain any single serialization or control point. While these features provide the basis for scalability, they also force a reevaluation of many fundamental issues involved in the design of a protocol. These include the issues of correctness, performance and protocol complexity. In this paper, we present the design of the DASH coherence protocol and discuss how it addresses the above issues. We also discuss our strategy for verifying the correctness of the protocol and briefly compare our protocol to the IEEE Scalable Coherent Interface protocol.

international symposium on computer architecture | 2000

Piranha: a scalable architecture based on single-chip multiprocessing

Luiz André Barroso; Kourosh Gharachorloo; Robert McNamara; Andreas Nowatzyk; Shaz Qadeer; Barton Sano; Scott Smith; Robert J. Stets; Ben Verghese

This paper describes the Piranha system, a research prototype being developed at Compaq that aggressively exploits chip multiprocessing by integrating eight simple Alpha processor cores along with a two-level cache hierarchy onto a single chip. Piranha also integrates further on-chip functionality to allow for scalable multiprocessor configurations to be built in a glueless and modular fashion. The use of simple processor cores combined with an industry-standard ASIC design methodology allow us to complete our prototype within a short time-frame, with a team size and investment that are an order of magnitude smaller than that of a commercial microprocessor. Our detailed simulation results show that while each Piranha processor core is substantially slower than an aggressive next-generation processor, the integration of eight cores onto a single chip allows Piranha to outperform next-generation processors by up to 2.9 times (on a per chip basis) on important workloads such as OLTP. This performance advantage can approach a factor of five by using full-custom instead of ASIC logic. In addition to exploiting chip multiprocessing, the Piranha prototype incorporates several other unique design choices including a shared second-level cache with no inclusion, a highly optimized cache coherence protocol, and a novel I/O architecture.

international symposium on computer architecture | 1998

Memory system characterization of commercial workloads

Luiz André Barroso; Kourosh Gharachorloo; Edouard Bugnion

Commercial applications such as databases and Web servers constitute the largest and fastest-growing segment of the market for multiprocessor servers. Ongoing innovations in disk subsystems, along with the ever increasing gap between processor and memory speeds, have elevated memory system design as the critical performance factor for such workloads. However, most current server designs have been optimized to perform well on scientific and engineering workloads, potentially leading to design decisions that are non-ideal for commercial applications. The above problem is exacerbated by the lack of information on the performance requirements of commercial workloads, the lack of available applications for widespread study, and the fact that most representative applications are too large and complex to serve as suitable benchmarks for evaluating trade-offs in the design of processors and servers.This paper presents a detailed performance study of three important classes of commercial workloads: online transaction processing (OLTP), decision support systems (DSS), and Web index search. We use the Oracle commercial database engine for our OLTP and DSS workloads, and the AltaVista search engine for our Web index search workload. This study characterizes the memory system behavior of these workloads through a large number of architectural experiments on Alpha multiprocessors augmented with full system simulations to determine the impact of architectural trends. We also identify a set of simplifications that make these workloads more amenable to monitoring and simulation without affecting representative memory system behavior. We observe that systems optimized for OLTP versus DSS and index search workloads may lead to diverging designs, specifically in the size and speed requirements for off-chip caches.

architectural support for programming languages and operating systems | 1996

Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

Daniel J. Scales; Kourosh Gharachorloo; Chandramohan A. Thekkath

This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granularity. In addition, the system allows the coherence granularity to vary across different shared data structures in a single application. Shasta implements the shared address space by transparently rewriting the application executable to intercept loads and stores. For each shared load or store, the inserted code checks to see if the data is available locally and communicates with other processors if necessary. The system uses numerous techniques to reduce the run-time overhead of these checks. Since Shasta is implemented entirely in software, it also provides tremendous flexibility in supporting different types of cache coherence protocols. We have implemented an efficient cache coherence protocol that incorporates a number of optimizations, including support for multiple communication granularities and use of relaxed memory models. This system is fully functional and runs on a cluster of Alpha workstations.The primary focus of this paper is to describe the techniques used in Shasta to reduce the checking overhead for supporting fine granularity sharing in software. These techniques include careful layout of the shared address space, scheduling the checking code for efficient execution on modern processors, using a simple method that checks loads using only the value loaded, reducing the extra cache misses caused by the checking code, and combining the checks for multiple loads and stores. To characterize the effect of these techniques, we present detailed performance results for the SPLASH-2 applications running on an Alpha processor. Without our optimizations, the checking overheads are excessively high, exceeding 100% for several applications. However, our techniques are effective in reducing these overheads to a range of 5% to 35% for almost all of the applications. We also describe our coherence protocol and present some preliminary results on the parallel performance of several applications running on our workstation cluster. Our experience so far indicates that once the cost of checking memory accesses is reduced using our techniques, the Shasta approach is an attractive software solution for supporting a shared address space with fine-grain access to data.

international symposium on computer architecture | 1991

Comparative evaluation of latency reducing and tolerating techniques

Anoop Gupta; John L. Hennessy; Kourosh Gharachorloo; Todd C. Mowry; Wolf-Dietrich Weber

Techniques that can cope with the large latency of memory accesses are essential for achieving high processor utilization in large-scale shared-memory multiprocessors. In this paper, we consider four architectural techniques that address the latency problem: (i) hardware coherent caches, (ii) relaxed memory consistency, (iii) softwareconuolled prefetching, and (iv) multiple-context suppon. We some studies of benefits of the individual techniques have been done, no Study evaluates all of the techniques within a consistent framework. This paper attempts to remedy this by providing a comprehensive evaluation of the benefits of the four techniques, both individually and in combinations, using a consistent set of architectural assumptions. The results in this paper have been obtained using detailed simulations of a large-scale shared-memory multiprocessor. Our results show that caches and relaxed consistency UNformly improve performance. The improvements due to prefetching and multiple contexts are sizeable, but are much more applicationdependent. Combinations of the various techniques generally amin better performance than each one on its own. Overall, we show that using suitahle combinations of the techniques, performance can be improved by 4 to 7 dmes

architectural support for programming languages and operating systems | 1998

Performance of database workloads on shared-memory systems with out-of-order processors

Parthasarathy Ranganathan; Kourosh Gharachorloo; Sarita V. Adve; Luiz André Barroso

Database applications such as online transaction processing (OLTP) and decision support systems (DSS) constitute the largest and fastest-growing segment of the market for multiprocessor servers. However, most current system designs have been optimized to perform well on scientific and engineering workloads. Given the radically different behavior of database workloads (especially OLTP), it is important to re-evaluate key system design decisions in the context of this important class of applications.This paper examines the behavior of database workloads on shared-memory multiprocessors with aggressive out-of-order processors, and considers simple optimizations that can provide further performance improvements. Our study is based on detailed simulations of the Oracle commercial database engine. The results show that the combination of out-of-order execution and multiple instruction issue is indeed effective in improving performance of database workloads, providing gains of 1.5 and 2.6 times over an in-order single-issue processor for OLTP and DSS, respectively. In addition, speculative techniques enable optimized implementations of memory consistency models that significantly improve the performance of stricter consistency models, bringing the performance to within 10--15% of the performance of more relaxed models.The second part of our study focuses on the more challenging OLTP workload. We show that an instruction stream buffer is effective in reducing the remaining instruction stalls in OLTP, providing a 17% reduction in execution time (approaching a perfect instruction cache to within 15%). Furthermore, our characterization shows that a large fraction of the data communication misses in OLTP exhibit migratory behavior; our preliminary results show that software prefetch and writeback/flush hints can be used for this data to further reduce execution time by 12%.

architectural support for programming languages and operating systems | 2000

Architecture and design of AlphaServer GS320

Kourosh Gharachorloo; Madhu Sharma; Simon C. Steely; Stephen R. Van Doren

This paper describes the architecture and implementation of the AlphaServer GS320, a cache-coherent non-uniform memory access multiprocessor developed at Compaq. The AlphaServer GS320 architecture is specifically targeted at medium-scale multiprocessing with 32 to 64 processors. Each node in the design consists of four Alpha 21264 processors, up to 32GB of coherent memory, and an aggressive IO subsystem. The current implementation supports up to 8 such nodes for a total of 32 processors. While snoopy-based designs have been stretched to medium-scale multiprocessors by some vendors, providing sufficient snoop bandwidth remains a major challenge especially in systems with aggressive processors. At the same time, directory protocols targeted at larger scale designs lead to a number of inherent inefficiencies relative to snoopy designs. A key goal of the AlphaServer GS320 architecture has been to achieve the best-of-both-worlds, partly by exploiting the bounded scale of the target systems.This paper focuses on the unique design features used in the AlphaServer GS320 to efficiently implement coherence and consistency. The guiding principle for our directory-based protocol is to address correctness issues related to rare protocol races without burdening the common transaction flows. Our protocol exhibits lower occupancy and lower message counts compared to previous designs, and provides more efficient handling of 3-hop transactions. Furthermore, our design naturally lends itself to elegant solutions for deadlock, livelock, starvation, and fairness. The AlphaServer GS320 architecture also incorporates a couple of innovative techniques that extend previous approaches for efficiently implementing memory consistency models. These techniques allow us to generate commit events (which are used for ordering purposes) well in advance of formulating the reply to a transaction. Furthermore, the separation of the commit event allows time-critical replies to bypass inbound requests without violating ordering properties. Even though our design specifically targets medium-scale servers, many of the same techniques can be applied to larger-scale directory-based and smaller-scale snoopy-based designs. Finally, we evaluate the performance impact of some of the above optimizations and present a few competitive benchmark results.

Explore More