Evan Speight
IBM
Publications
Featured research published by Evan Speight.
International Symposium on Computer Architecture | 2009
Xiaoxia Wu; Jian Li; Lixin Zhang; Evan Speight; Ramakrishnan Rajamony; Yuan Xie
Caching techniques have been an efficient mechanism for mitigating the effects of the processor-memory speed gap. Traditional multi-level SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core-to-cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D and 3D stacked chips. Caches fabricated in these technologies offer dramatically different power and performance characteristics when compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this paper, we propose to take advantage of the best characteristics that each technology offers through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: inter-cache-level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra-cache-level or cache-region-based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intra-cache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 7% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 25 workloads. A more aggressive RHCA-based design provides a 12% IPC improvement over the baseline. Finally, a 2-layer 3D cache stack (3DHCA) of high-density memory technology within the same chip footprint gives an 18% IPC improvement over the baseline. Furthermore, up to a 70% reduction in power consumption over a baseline SRAM-only design is achieved.
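The region-based (RHCA) idea lends itself to a compact illustration. The C++ sketch below is a simplified model, not the authors' simulator: lines are filled into a dense but slower region, and a per-line saturating counter promotes frequently accessed lines into a small fast (SRAM-like) region by swapping with that region's LRU victim. The class name, region sizes, and promotion threshold are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy model of one set of an intra-level hybrid cache (RHCA):
// a few fast (SRAM-like) ways plus many dense, slower (e.g. MRAM) ways.
struct Line {
    uint64_t tag = 0;
    bool     valid = false;
    uint8_t  hits = 0;   // saturating access counter
    uint32_t lru = 0;    // larger = more recently used
};

class HybridSet {
    std::vector<Line> fast_, slow_;
    uint32_t clock_ = 0;
    static constexpr uint8_t kPromoteAt = 3;  // assumed threshold

public:
    HybridSet(std::size_t fastWays, std::size_t slowWays)
        : fast_(fastWays), slow_(slowWays) {}

    // Returns true on hit; a slow-region hit that crosses the threshold
    // swaps the line with the LRU line of the fast region.
    bool access(uint64_t tag) {
        ++clock_;
        for (auto& l : fast_)
            if (l.valid && l.tag == tag) { l.lru = clock_; return true; }
        for (auto& l : slow_) {
            if (l.valid && l.tag == tag) {
                l.lru = clock_;
                if (++l.hits >= kPromoteAt) promote(l);
                return true;
            }
        }
        fill(tag);  // miss: install in the dense slow region
        return false;
    }

private:
    Line& lruOf(std::vector<Line>& region) {
        for (auto& l : region) if (!l.valid) return l;
        Line* v = &region[0];
        for (auto& l : region) if (l.lru < v->lru) v = &l;
        return *v;
    }
    void promote(Line& s) {  // swap hot slow line with fast-region LRU
        Line& f = lruOf(fast_);
        std::swap(f, s);
        f.hits = 0;          // restart counters after the swap
        s.hits = 0;
    }
    void fill(uint64_t tag) {
        lruOf(slow_) = Line{tag, true, 0, clock_};
    }
};
```

A real design would keep the counters in the cache tags and bound swap bandwidth; the sketch only shows the promotion-by-counter policy.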
Measurement and Modeling of Computer Systems | 2004
Patrick J. Bohrer; James L. Peterson; Mootaz Elnozahy; Ram Rajamony; Ahmed Gheith; Ron Rockhold; Charles R. Lefurgy; Hazim Shafi; Tarun Nakra; Rick Simpson; Evan Speight; Kartik Sudeep; Eric Van Hensbergen; Lixin Zhang
Mambo is a full-system simulator for modeling PowerPC-based systems. It provides building blocks for creating simulators that range from purely functional to timing-accurate. Functional versions support fast emulation of individual PowerPC instructions and the devices necessary for executing operating systems. Timing-accurate versions add the ability to account for device timing delays, and support the modeling of the PowerPC processor microarchitecture. We describe our experience in implementing the simulator and its uses within IBM to model future systems, support early software development, and design new system software.
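The functional-versus-timing-accurate split described above can be sketched in a few lines. This is not Mambo's code; it is a hypothetical C++ skeleton showing one common way to layer an optional timing model over a purely functional execute loop, so the same functional core can back both a fast emulator and a cycle-accounting simulator. All names and latencies here are assumptions.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical layered simulator skeleton: a functional core executes
// instructions; an optional timing model adds cycle accounting.
struct Inst { uint32_t opcode; /* decoded fields elided */ };

class TimingModel {            // timing-accurate builds plug one in
public:
    virtual ~TimingModel() = default;
    virtual uint64_t cyclesFor(const Inst& i) = 0;
};

class SimpleTiming : public TimingModel {
public:
    uint64_t cyclesFor(const Inst& i) override {
        return (i.opcode == 0x20 /* assumed load opcode */) ? 4 : 1;
    }
};

class Core {
    uint64_t pc_ = 0, cycles_ = 0;
    std::unique_ptr<TimingModel> timing_;  // null => purely functional
public:
    explicit Core(std::unique_ptr<TimingModel> t = nullptr)
        : timing_(std::move(t)) {}

    void step() {
        Inst i = fetchDecode();
        execute(i);                         // functional semantics only
        if (timing_) cycles_ += timing_->cyclesFor(i);
        ++pc_;
    }
private:
    Inst fetchDecode() { return Inst{0x20}; }  // stub
    void execute(const Inst&) {}               // stub
};

// Usage: Core fast;                                   // functional
//        Core timed(std::make_unique<SimpleTiming>()); // timing-accurate
```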
Design, Automation, and Test in Europe | 2009
Xiaoxia Wu; Jian Li; Lixin Zhang; Evan Speight; Yuan Xie
Caches made of non-volatile memory technologies, such as Magnetic RAM (MRAM) and Phase-change RAM (PRAM), offer dramatically different power-performance characteristics when compared with SRAM-based caches, particularly in the areas of static/dynamic power consumption, read and write access latency, and cell density. In this paper, we propose to take advantage of the best characteristics that each technology has to offer through the use of read-write aware Hybrid Cache Architecture (RWHCA) designs, where a single level of cache can be partitioned into read and write regions, each of a different memory technology with disparate read and write characteristics. We explore the potential of hardware support for intra-cache data movement within RWHCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an RWHCA design with a conservative setup can provide a geometric mean 55% power reduction and a 5% IPC improvement over a baseline SRAM cache design across a collection of 30 workloads. Furthermore, a 2-layer 3D cache stack (3DRWHCA) of high-density memory technology with the same chip footprint still gives a 10% power reduction and boosts performance by 16% IPC over the baseline.
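The read-write steering itself can be sketched separately from the promotion machinery shown earlier. The C++ fragment below is an assumed illustration, not the paper's design: lines that turn out to be write-heavy are migrated out of the dense non-volatile read region (where writes are slow and power-hungry) into a small SRAM write region. The field names and threshold are hypothetical.

```cpp
#include <cstdint>

// Sketch of read-write aware placement (RWHCA-style): count writes to a
// line living in the non-volatile read region and migrate it to the SRAM
// write region once writes dominate. Threshold is an assumed parameter.
enum class Region { ReadNVM, WriteSRAM };

struct LineState {
    Region  region = Region::ReadNVM;  // fills default to the dense region
    uint8_t writes = 0;                // saturating write counter
};

constexpr uint8_t kWriteMigrateAt = 2;

// Returns the region the line should occupy after this access.
Region onAccess(LineState& l, bool isWrite) {
    if (isWrite && l.region == Region::ReadNVM) {
        if (l.writes < 255) ++l.writes;
        if (l.writes >= kWriteMigrateAt)
            l.region = Region::WriteSRAM;  // writes are cheap here
    }
    return l.region;
}
```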
International Symposium on Computer Architecture | 2005
Evan Speight; Hazim Shafi; Lixin Zhang; Ramakrishnan Rajamony
With the ability to place large numbers of transistors on a single silicon chip, manufacturers have begun developing chip multiprocessors (CMPs) containing multiple processor cores, varying amounts of level 1 and level 2 caching, and on-chip directory structures for level 3 caches and memory. The level 3 cache may be used as a victim cache for both modified and clean lines evicted from on-chip level 2 caches. Efficient area and performance management of this cache hierarchy is paramount given the projected increase in access latency to off-chip memory. This paper proposes simple architectural extensions and adaptive policies for managing the L2 and L3 cache hierarchy in a CMP system. In particular, we evaluate two mechanisms that improve cache effectiveness. First, we propose the use of a small history table to provide hints to the L2 caches as to which lines are resident in the L3 cache. We employ this table to eliminate some unnecessary clean write backs to the L3 cache, reducing pressure on the L3 cache and utilization of the on-chip bus. Second, we examine the performance benefits of allowing write backs from L2 caches to be placed in neighboring, on-chip L2 caches rather than forcing them to be absorbed by the L3 cache. This not only reduces the capacity pressure on the L3 cache but also makes subsequent accesses faster, since L2-to-L2 cache transfers typically have lower latencies than accesses to a large L3 cache array. We evaluate the performance improvement of these two designs, and their combined effect, on four commercial workloads and observe a reduction in the overall execution time of up to 13%.
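The first mechanism, the residency-hint history table, is straightforward to sketch. The C++ below is a hypothetical rendering rather than the paper's hardware: a small direct-mapped array of partial tags tracks lines believed resident in L3, and a clean L2 victim that hits in the table is dropped instead of written back. Table size, line size, and tag hashing are assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of a residency-hint table: partial tags for lines believed to
// also reside in L3. On a clean L2 eviction, a hit here means the
// write-back can be elided, saving on-chip bus bandwidth.
class L3HintTable {
    static constexpr std::size_t kEntries = 1024;
    std::array<uint32_t, kEntries> tags_{};
    std::array<bool, kEntries> valid_{};

    // 64-byte lines assumed: drop the offset bits, then index and tag.
    static std::size_t idx(uint64_t addr) { return (addr >> 6) % kEntries; }
    static uint32_t tag(uint64_t addr) { return uint32_t(addr >> 16); }

public:
    // Called when a line is known to have been filled into L3.
    void noteResident(uint64_t addr) {
        valid_[idx(addr)] = true;
        tags_[idx(addr)] = tag(addr);
    }
    // Called when L3 evicts a line, so stale hints do not accumulate.
    void noteEvicted(uint64_t addr) {
        if (valid_[idx(addr)] && tags_[idx(addr)] == tag(addr))
            valid_[idx(addr)] = false;
    }
    // On a clean L2 eviction: true => skip the write-back to L3.
    bool canDropCleanVictim(uint64_t addr) const {
        return valid_[idx(addr)] && tags_[idx(addr)] == tag(addr);
    }
};
```

Because only clean victims are ever dropped, a stale or aliased hint costs at most an extra fetch from memory later; it can never lose modified data.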
International Conference on Supercomputing | 1999
Evan Speight; Hazim Abdel-Shafi; John K. Bennett
The Virtual Interface (VI) Architecture provides protected user-level communication with high delivered bandwidth and low per-message latency, particularly for small messages. The VI Architecture attempts to reduce latency by eliminating user/kernel transitions on routine data transfers and by allowing direct use of user memory for network buffering. This results in significantly lower latencies than those achieved by network protocols such as TCP/IP and UDP. In this paper we examine the low-level performance of two VI implementations, one implemented in hardware, the other implemented in device driver software. Using a set of low-level benchmarks, we measure bandwidth, latency, and processor utilization as a function of message size for the GigaNet cLAN and Tandem ServerNet VI implementations. We report that both VI implementations offer significant performance advantages relative to the corresponding UDP implementation on the same hardware. We also investigate the problems associated with delivering this performance to distributed applications running on clustered multiprocessor workstations. Using an existing MPI library implemented using UDP as a baseline, we explore several performance and implementation issues that arise when adapting this library to use VI instead of UDP. Among these issues are memory registration costs, polling vs. blocking, reliability of delivery, and memory-to-memory copying. By eliminating explicit acknowledgements, reducing memory-to-memory copying, and choosing the most appropriate synchronization primitives, we reduced the message latency seen by user applications by an average of 55% across all message sizes. We also identify several areas that offer the potential for further performance enhancement.
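One cost called out above, memory registration, is commonly amortized by reusing a pool of pre-registered buffers rather than registering user memory on every send. The C++ sketch below illustrates that generic technique; `registerMemory` and `deregisterMemory` are hypothetical stand-ins for a VI provider's registration calls, not the actual VIA API, and the placeholder implementations simply allocate.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Placeholder implementations: a real VI provider would pin the pages
// and register them with the NIC; here we just allocate on the heap.
void* registerMemory(std::size_t bytes) { return std::malloc(bytes); }
void  deregisterMemory(void* buf)       { std::free(buf); }

// Pool of pre-registered send buffers: the expensive registration step
// happens once per buffer at startup, not once per message.
class RegisteredPool {
    std::vector<void*> free_;
    std::size_t bufSize_;
public:
    RegisteredPool(std::size_t nBufs, std::size_t bufSize)
        : bufSize_(bufSize) {
        for (std::size_t i = 0; i < nBufs; ++i)
            free_.push_back(registerMemory(bufSize));
    }
    ~RegisteredPool() {
        for (void* b : free_) deregisterMemory(b);
    }
    // Message data is copied into a registered buffer before posting;
    // for large messages a real library might instead register the
    // user's buffer directly to avoid the copy.
    void* acquire() {
        if (free_.empty()) return registerMemory(bufSize_);  // grow pool
        void* b = free_.back();
        free_.pop_back();
        return b;
    }
    void release(void* b) { free_.push_back(b); }
};
```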
International Parallel and Distributed Processing Symposium | 2007
Vipin Sachdeva; Michael Kistler; Evan Speight; Tzy-Hwa Kathy Tzeng
This paper evaluates the performance of bioinformatics applications on the Cell Broadband Engine recently developed at IBM. In particular, we focus on two highly popular bioinformatics applications: FASTA and ClustalW. The characteristics of these bioinformatics applications, such as small critical time-consuming code size, regular memory accesses, existing vectorized code, and embarrassingly parallel computation, make them uniquely suitable for the Cell processing platform. The price and power advantages afforded by the Cell processor also make it an attractive alternative to general-purpose processors. We report preliminary performance results for these applications and contrast these results with those on state-of-the-art hardware.
ACM Transactions on Architecture and Code Optimization | 2010
Xiaoxia Wu; Jian Li; Lixin Zhang; Evan Speight; Ramakrishnan Rajamony; Yuan Xie
Traditional multilevel SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core-to-cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D and 3D stacked chips. Caches fabricated in these technologies offer dramatically different power-performance characteristics when compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this article, we propose to take advantage of the best characteristics that each technology has to offer through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: inter-cache-level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra-cache-level or cache-region-based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intra-cache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 6% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 30 workloads. A more aggressive RHCA-based design provides a 10% IPC improvement over the baseline. A 2-layer 3D cache stack (3DHCA) of high-density memory technology within the same chip footprint gives a 16% IPC improvement over the baseline. We also achieve up to a 72% reduction in power consumption over a baseline SRAM-only design. Energy-delay and thermal evaluations for 3DHCA are also presented. In addition to the fast-slow region-based RHCA, we further evaluate read-write region-based RHCA designs.
High-Performance Computer Architecture | 1998
Evan Speight; John K. Bennett
This paper examines the performance benefits of employing multicast communication and application-level multithreading in the Brazos software distributed shared memory (DSM) system. Application-level multithreading in Brazos allows programs to transparently take advantage of available local multiprocessing. Brazos uses multicast communication to reduce the number of consistency-related messages, and employs two adaptive mechanisms that reduce the detrimental side effects of using multicast communication. We compare three software DSM systems running on identical hardware: (1) a single-threaded point-to-point system, (2) a multithreaded point-to-point system, and (3) Brazos, which incorporates both multithreading and multicast communication. For the six applications studied, multicast and multithreading improve speedup on eight processors by an average of 38%.
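The multicast decision in a system like Brazos can be sketched abstractly. The fragment below is an assumed illustration of the general technique, not Brazos code: consistency updates go point-to-point while the copyset is small and switch to a single multicast once enough sharers would otherwise each need a separate message. The transport functions and crossover threshold are hypothetical.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdio>

constexpr std::size_t kMaxNodes = 32;
constexpr std::size_t kMulticastAt = 3;  // assumed crossover point

// Stub transports (assumed names): a real system would put the update
// on the wire; these just log what would be sent.
void unicastUpdate(std::size_t node, const void*, std::size_t len) {
    std::printf("unicast to node %zu (%zu bytes)\n", node, len);
}
void multicastUpdate(const std::bitset<kMaxNodes>& group,
                     const void*, std::size_t len) {
    std::printf("multicast to %zu nodes (%zu bytes)\n", group.count(), len);
}

// Send a consistency update for a page to its current sharers, choosing
// unicast or multicast by copyset size: multicast puts one message on
// the wire regardless of sharer count, unicast sends one per sharer.
void publishUpdate(const std::bitset<kMaxNodes>& copyset,
                   const void* diff, std::size_t len) {
    if (copyset.count() >= kMulticastAt) {
        multicastUpdate(copyset, diff, len);
    } else {
        for (std::size_t n = 0; n < kMaxNodes; ++n)
            if (copyset.test(n)) unicastUpdate(n, diff, len);
    }
}
```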
Parallel Computing | 2008
Vipin Sachdeva; Michael Kistler; Evan Speight; Tzy-Hwa Kathy Tzeng
This paper evaluates the performance of bioinformatics applications on the Cell Broadband Engine recently developed at IBM. In particular, we focus on two highly popular bioinformatics applications: FASTA and ClustalW. The characteristics of these bioinformatics applications, such as small critical time-consuming code size, regular memory accesses, existing vectorized code, and embarrassingly parallel computation, make them uniquely suitable for the Cell processing platform. The price and power advantages afforded by the Cell processor also make it an attractive alternative to general-purpose processors. We report preliminary performance results for these applications and contrast these results with those on state-of-the-art hardware.
Conference on High Performance Computing (Supercomputing) | 2004
Jian Ke; Martin Burtscher; Evan Speight
Communication-intensive parallel applications spend a significant amount of their total execution time exchanging data between processes, which leads to poor performance in many cases. In this paper, we investigate message compression in the context of large-scale parallel message-passing systems to reduce the communication time of individual messages and to improve the bandwidth of the overall system. We implement and evaluate the cMPI message-passing library, which quickly compresses messages on the fly with low enough overhead that a net reduction in execution time is obtained. Our results on six large-scale benchmark applications show that their execution speed improves by up to 98% when message compression is enabled.
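The guard that makes on-the-fly compression a net win can be sketched generically. The C++ below is an assumed illustration using a trivial run-length encoder, not cMPI's algorithm: the sender compresses into a scratch buffer and falls back to the raw payload whenever compression does not actually shrink the message, so a one-byte header tells the receiver which form arrived.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy run-length encoder standing in for a real fast compressor:
// each pair of output bytes is (run length, value), runs capped at 255.
std::vector<uint8_t> rleEncode(const std::vector<uint8_t>& in) {
    std::vector<uint8_t> out;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255)
            ++run;
        out.push_back(uint8_t(run));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}

// Frame a message for sending: header byte 1 = compressed, 0 = raw.
// Compression is used only when it actually shrinks the payload, so the
// worst case costs one extra byte, never a larger message.
std::vector<uint8_t> frameMessage(const std::vector<uint8_t>& payload) {
    std::vector<uint8_t> packed = rleEncode(payload);
    std::vector<uint8_t> msg;
    if (packed.size() < payload.size()) {
        msg.push_back(1);
        msg.insert(msg.end(), packed.begin(), packed.end());
    } else {
        msg.push_back(0);
        msg.insert(msg.end(), payload.begin(), payload.end());
    }
    return msg;
}
```

A production library would additionally overlap compression with transmission and pick the compressor per message size, but the shrink-or-fallback framing above is the part that guarantees compression never hurts.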