Hazim Shafi
Microsoft
Publications
Featured research published by Hazim Shafi.
International Conference on Supercomputing | 2005
Jaehyuk Huh; Changkyu Kim; Hazim Shafi; Lixin Zhang; Doug Burger; Stephen W. Keckler
We propose an organization for the on-chip memory system of a chip multiprocessor, in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support the spectrum of degrees of sharing: unshared, in which each processor has a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of two or four works best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
Measurement and Modeling of Computer Systems | 2004
Patrick J. Bohrer; James L. Peterson; Mootaz Elnozahy; Ram Rajamony; Ahmed Gheith; Ron Rockhold; Charles R. Lefurgy; Hazim Shafi; Tarun Nakra; Rick Simpson; Evan Speight; Kartik Sudeep; Eric Van Hensbergen; Lixin Zhang
Mambo is a full-system simulator for modeling PowerPC-based systems. It provides building blocks for creating simulators that range from purely functional to timing-accurate. Functional versions support fast emulation of individual PowerPC instructions and the devices necessary for executing operating systems. Timing-accurate versions add the ability to account for device timing delays, and support the modeling of the PowerPC processor microarchitecture. We describe our experience in implementing the simulator and its uses within IBM to model future systems, support early software development, and design new system software.
IEEE Transactions on Parallel and Distributed Systems | 2007
Jaehyuk Huh; Changkyu Kim; Hazim Shafi; Lixin Zhang; Doug Burger; Stephen W. Keckler
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.
International Symposium on Computer Architecture | 2005
Evan Speight; Hazim Shafi; Lixin Zhang; Ramakrishnan Rajamony
With the ability to place large numbers of transistors on a single silicon chip, manufacturers have begun developing chip multiprocessors (CMPs) containing multiple processor cores, varying amounts of level 1 and level 2 caching, and on-chip directory structures for level 3 caches and memory. The level 3 cache may be used as a victim cache for both modified and clean lines evicted from on-chip level 2 caches. Efficient area and performance management of this cache hierarchy is paramount given the projected increase in access latency to off-chip memory. This paper proposes simple architectural extensions and adaptive policies for managing the L2 and L3 cache hierarchy in a CMP system. In particular, we evaluate two mechanisms that improve cache effectiveness. First, we propose the use of a small history table to provide hints to the L2 caches as to which lines are resident in the L3 cache. We employ this table to eliminate some unnecessary clean write backs to the L3 cache, reducing pressure on the L3 cache and utilization of the on-chip bus. Second, we examine the performance benefits of allowing write backs from L2 caches to be placed in neighboring, on-chip L2 caches rather than forcing them to be absorbed by the L3 cache. This not only reduces the capacity pressure on the L3 cache but also makes subsequent accesses faster, since L2-to-L2 cache transfers typically have lower latencies than accesses to a large L3 cache array. We evaluate the performance improvement of these two designs, and their combined effect, on four commercial workloads and observe a reduction in the overall execution time of up to 13%.
IBM Journal of Research and Development | 2003
Hazim Shafi; Patrick J. Bohrer; James Michael Phelan; Cosmin Rusu; James L. Peterson
This paper describes the design and validation of a performance and power simulator that is part of the Mambo simulation environment for PowerPC® systems. One of the most notable features of the simulator, designated as Tempo, is the incorporation of an event-driven power model. Tempo satisfies an important need for fast and accurate performance and power simulation tools at the system level. The power and performance predictions from the simulated model of a PowerPC 405GP (or simply 405GP) were validated against a 405GP-based evaluation board instrumented for power measurements using 42 application/dataset combinations from the EEMBC benchmark suite. The average performance and energy-prediction errors were 0.6% and -4.1%, respectively. In addition to describing Tempo, we show examples of how well it can predict the runtime power consumption of a 405GP microprocessor during application execution.
Asia and South Pacific Design Automation Conference | 2006
Yukio Watanabe; Balazs Sallay; Brad W. Michael; Daniel Alan Brokenshire; Gavin B. Meil; Hazim Shafi; Daisuke Hiraoka
An instruction-set-level reference model was developed to support the development of the synergistic processing unit (SPU), one of the key components of the Cell processor [Pham, 2005][Flachs, 2005]. This reference model was used by the simulators to define the instruction set architecture (ISA), by the random test case generator, as the reference in the verification environment, and for software development. Using the same reference model for multiple purposes made it easier to keep up with architecture changes at the early stage of the microprocessor development. In addition, including the reference model in the simulation environment increased the robustness of random test execution and made it possible to find bugs that are usually difficult to catch.
Symposium on Reliable Distributed Systems | 2003
Hazim Shafi; Evan Speight; John K. Bennett
Software distributed shared-memory (SDSM) provides the abstraction necessary to run shared-memory applications on cost-effective parallel platforms such as clusters of workstations. However, problems such as cluster component reliability and cluster management, which are not directly related to performance, need to be addressed before SDSM solutions can be widely adopted. This paper presents Raptor, an SDSM cluster management system based on checkpoint/recovery and thread migration. Raptor checkpoints decouple the runtime system and application data from application threads, allowing efficient load balancing, resource allocation, and rollback recovery. There are two important features of the system. First, it reduces checkpoint overhead by saving only application-specific data that cannot be recreated at recovery time. Second, by integrating thread migration capability both at run time and during recovery, it allows computing resources to be added to or removed from a running application, while adding little or no additional burden on the SDSM application programmer.
Archive | 2003
Elmootazbellah Nabil Elnozahy; James L. Peterson; Ramakrishnan Rajamony; Hazim Shafi
Archive | 2008
Dilma Da Silva; Elmootazbellah Nabil Elnozahy; Orran Krieger; Hazim Shafi; Xiaowei Shen; Balaram Sinharoy; Robert B. Tremaine
Archive | 2005
Ramakrishnan Rajamony; Hazim Shafi; Derek Edward Williams; Kenneth L. Wright