
Publication


Featured research published by Ashwini K. Nanda.


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 1998

Energy optimization of multilevel cache architectures for RISC and CISC processors

Uming Ko; Poras T. Balsara; Ashwini K. Nanda

In this paper, we present the characterization and design of energy-efficient, on-chip cache memories. The characterization of power dissipation in on-chip cache memories reveals that the memory peripheral interface circuits and the bit array dissipate comparable power. To optimize performance and power in a processor's cache, a multidivided module (MDM) cache architecture is proposed to conserve energy in the bit array as well as in the memory peripheral circuits. Compared to a conventional, nondivided, 16-kB cache, the latency and power of the MDM cache are reduced by factors of 1.9 and 4.6, respectively. Based on the MDM cache architecture, the energy efficiency of the complete memory hierarchy is analyzed with respect to cache parameters in a multilevel processor cache design. This analysis was conducted by executing the SPECint92 benchmark programs with miss ratios for reduced instruction set computer (RISC) and complex instruction set computer (CISC) machines.
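The abstract's key observation, that the bit array and the peripheral interface circuits dissipate comparable power, implies a simple first-order energy model for a divided cache. Below is a minimal sketch (not the paper's model; the 50/50 energy split and module counts are illustrative assumptions): dividing the bit array into modules activates only one module per access, so only the array term scales down.

```python
def relative_access_energy(modules: int, array_share: float = 0.5) -> float:
    """Relative per-access energy vs. an undivided cache (= 1.0).

    array_share: fraction of access energy spent in the bit array of the
    undivided design (assumed ~0.5, echoing the observation that the array
    and the peripheral circuits dissipate comparable power). Dividing the
    array into `modules` sub-arrays activates only one per access, so only
    the array term shrinks; the peripheral term is left unchanged here.
    """
    return array_share / modules + (1.0 - array_share)

if __name__ == "__main__":
    for m in (1, 2, 4, 8):
        print(f"{m} module(s): {relative_access_energy(m):.2f}x relative energy")
```

Under this toy split, even eight modules cannot push energy below the peripheral floor of 0.5x, which is consistent with the MDM design targeting the peripheral circuits as well as the bit array.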


Architectural Support for Programming Languages and Operating Systems | 2000

MemorIES: a programmable, real-time hardware emulation tool for multiprocessor server design

Ashwini K. Nanda; Kwok-Ken Mak; Krishnan Sugarvanam; Ramendra K. Sahoo; Vijayaraghavan Soundararajan; T. Basil Smith

Modern system design often requires multiple levels of simulation for design validation and performance debugging. However, while machines have gotten faster and simulators have become more detailed, simulation speeds have not tracked machine speeds. As a result, it is difficult to simulate realistic problem sizes and hardware configurations for a target machine. Instead, researchers have focused on developing scaling methodologies and running smaller problem sizes and configurations that attempt to represent the behavior of the real problem. Given the increasing size of problems today, it is unclear whether such an approach yields accurate results. Moreover, although commercial workloads are prevalent and important in today's marketplace, many simulation tools are unable to adequately profile such applications, let alone at realistic sizes.

In this paper we present a hardware-based emulation tool that can be used to aid memory system designers. Our focus is on the memory system because the ever-widening gap between processor and memory speeds means that optimizing the memory subsystem is critical for performance. We present the design of the Memory Instrumentation and Emulation System (MemorIES). MemorIES is a programmable tool designed using FPGAs and SDRAMs. It plugs into an SMP bus to perform on-line emulation of several cache configurations, structures, and protocols while the system is running real-life workloads in real time, without any slowdown in application execution speed. We demonstrate its usefulness in several case studies and find several important results. First, using traces to perform system evaluation can lead to incorrect results (off by 100% or more in some cases) if the trace size is not sufficiently large. Second, MemorIES is able to detect performance problems by profiling miss behavior over the entire course of a run, rather than relying on a small interval of time. Finally, we observe that previous studies of SPLASH-2 applications using scaled application sizes can report optimistic miss rates relative to real sizes on real machines, providing potentially misleading data when used for design evaluation.
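To make the on-line emulation idea concrete, here is a minimal software analogue (the cache model, configurations, and synthetic trace are illustrative assumptions; MemorIES itself does this in FPGA hardware on a live SMP bus with no application slowdown): a single reference stream drives several candidate cache configurations at once, each reporting its own miss ratio.

```python
from collections import OrderedDict
import random

class LRUCache:
    """Set-associative cache with LRU replacement; counts hits and misses."""
    def __init__(self, size_bytes, line_bytes, ways):
        self.line = line_bytes
        self.ways = ways
        self.n_sets = size_bytes // (line_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.n_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line
        s = self.sets[line % self.n_sets]
        tag = line // self.n_sets
        if tag in s:
            s.move_to_end(tag)                  # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            if len(s) >= self.ways:
                s.popitem(last=False)           # evict least recently used
            s[tag] = True

# One reference stream observed by several configurations simultaneously.
random.seed(0)
trace = [random.randrange(1 << 24) for _ in range(100_000)]
configs = {
    "1 MB, 4-way": LRUCache(1 << 20, 64, 4),
    "4 MB, 4-way": LRUCache(1 << 22, 64, 4),
    "4 MB, 8-way": LRUCache(1 << 22, 64, 8),
}
for addr in trace:
    for c in configs.values():
        c.access(addr)
for name, c in configs.items():
    print(f"{name}: miss ratio {c.misses / (c.hits + c.misses):.3f}")
```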


High-Performance Computer Architecture | 1999

Design and performance of directory caches for scalable shared memory multiprocessors

Maged M. Michael; Ashwini K. Nanda

Recent research shows that the occupancy of the coherence controllers is a major performance bottleneck for distributed cache-coherent shared-memory multiprocessors. A significant part of the occupancy is due to the latency of accessing the directory, which is usually kept in DRAM. Most coherence controller designs that use protocol processors for executing the coherence protocol handlers use the data cache of the protocol processor for caching directory entries along with protocol handler data. Analogously, a fast Directory Cache (DC) can be used by hardwired coherence controller designs to minimize directory access time. This paper studies the performance of directory caches using parallel applications from the SPLASH-2 suite. We demonstrate that using a directory cache can result in a 40% or greater improvement in the execution time of communication-intensive applications. We also investigate the directory cache design parameters: cache size, cache line size, and associativity. Experimental results show that directory cache size requirements grow sublinearly with the application's data set size. The results also show the performance advantage of multi-entry directory cache lines, a consequence of spatial locality and the absence of sharing of directory entries. The impact of directory cache associativity on performance is smaller than that of the size and the line size. We also find a linear relation between the directory cache miss ratio and the coherence controller occupancy, and between both measures and the execution time of the applications.
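The reported linear relation between directory cache miss ratio and controller occupancy suggests a back-of-the-envelope model like the sketch below (cycle counts are illustrative assumptions, not the paper's measurements): each miss adds a fixed DRAM directory penalty on top of the handler and SRAM access time, so occupancy rises linearly with the miss ratio.

```python
def controller_occupancy(dc_miss_ratio, handler_cycles=12,
                         sram_cycles=4, dram_penalty=40):
    """Average coherence-controller cycles held per protocol request.

    Every request pays the handler and directory-cache (SRAM) cost; a
    fraction dc_miss_ratio additionally pays the slow DRAM directory access.
    """
    return handler_cycles + sram_cycles + dc_miss_ratio * dram_penalty

for mr in (0.0, 0.1, 0.3, 0.6):
    print(f"miss ratio {mr:.1f} -> occupancy {controller_occupancy(mr):.0f} cycles")
```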


International Parallel Processing Symposium | 1997

Accuracy and speed-up of parallel trace-driven architectural simulation

Anthony-Trung Nguyen; Pradip Bose; Kattamuri Ekanadham; Ashwini K. Nanda; Maged M. Michael

Trace-driven simulation continues to be one of the main evaluation methods in the design of high-performance processor-memory subsystems. In this paper, we examine the speed-up opportunities available by processing a given trace in parallel on an IBM SP-2 machine. We also develop a simple yet effective method of correcting cold-start cache miss errors through the use of overlapped trace chunks, and report selected experimental results to validate our expectations. We show that it is possible to achieve near-perfect speedup without loss of accuracy. Next, to reduce simulation cost further, we combine uniform sampling methods with parallel trace processing, at a slight loss of accuracy for finite-cache timer runs. We then show that by using warm-start sequences from preceding trace chunks, it is possible to reduce the errors back to acceptable bounds.
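The overlapped-chunk correction can be sketched in a few lines (chunk and overlap sizes here are illustrative assumptions): each parallel worker first replays a warm-up window taken from the end of the preceding chunk to populate its simulated cache, then counts statistics only over its own chunk, which suppresses cold-start miss errors.

```python
def chunk_with_warmup(trace, n_workers, overlap):
    """Split a trace for parallel simulation.

    Returns one (warmup, measured) pair per worker: `warmup` references are
    simulated to warm the cache but excluded from miss statistics; `measured`
    references are both simulated and counted.
    """
    n = len(trace)
    size = n // n_workers
    jobs = []
    for w in range(n_workers):
        start = w * size
        end = n if w == n_workers - 1 else start + size
        warm = trace[max(0, start - overlap):start]   # tail of previous chunk
        jobs.append((warm, trace[start:end]))
    return jobs

jobs = chunk_with_warmup(list(range(100)), n_workers=4, overlap=10)
for i, (warm, measured) in enumerate(jobs):
    print(f"worker {i}: warmup {len(warm)} refs, measured {len(measured)} refs")
```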


IBM Journal of Research and Development | 2007

Cell/B.E. blades: building blocks for scalable, real-time, interactive, and digital media servers

Ashwini K. Nanda; James R. Moulic; R. E. Hanson; Gottfried Goldrian; M. N. Day; B. D. D'Amora; Sreenivasulu Kesavarapu

The Cell Broadband Engine™ (Cell/B.E.) processor, developed jointly by Sony, Toshiba, and IBM primarily for next-generation gaming consoles, packs a level of floating-point, vector, and integer streaming performance in one chip that is an order of magnitude greater than that of traditional commodity microprocessors. Cell/B.E. blades are server and supercomputer building blocks that use the Cell/B.E. processor, the high-volume IBM BladeCenter® server platform, high-speed commodity networks, and open-system software. In this paper we present the design of the Cell/B.E. blades and discuss several early application prototypes and results.


High-Performance Computer Architecture | 2000

High-throughput coherence controllers

Ashwini K. Nanda; Anthony-Trung Nguyen; Maged M. Michael; Douglas J. Joseph

Recent research shows that the occupancy of the coherence controllers is a major performance bottleneck for distributed cache-coherent shared-memory multiprocessors. In this paper we study three approaches to alleviating this problem in hardwired coherence controllers: multiple protocol engines, pipelined protocol engines, and split request-response streams. Split request-response streams are an innovative contribution of this paper; the performance of pipelining in the context of coherence controllers has not previously been presented in the literature, and multiple protocol engines have not been studied in the context of hardwired controllers except, to a limited extent, in an earlier study of ours. Using both commercial and scientific benchmarks on detailed simulation models, we present experimental results showing that each mechanism is highly effective, reducing controller occupancy by as much as 66% and improving execution time by as much as 51% for applications with high communication bandwidth requirements. A combination of mechanisms further reduces controller occupancy and execution time by as much as 78% and 61%, respectively. Our results show that applying any of the parallel mechanisms in the coherence controllers allows integrating four times as many processors per coherence controller, reducing system cost while maintaining or even exceeding the performance of systems with a larger number of coherence controllers.
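A toy queueing model illustrates why adding protocol engines cuts occupancy (arrival and service parameters are illustrative assumptions, and this ignores pipelining and request-response splitting): requests that would queue behind one busy engine are served in parallel, so each engine is busy a smaller fraction of the time.

```python
import random

def mean_engine_utilization(n_engines, n_requests=50_000,
                            mean_interarrival=10.0, service=8.0, seed=1):
    """Fraction of time each protocol engine is busy, toy open-queue model."""
    rng = random.Random(seed)
    free_at = [0.0] * n_engines        # time at which each engine becomes free
    now = 0.0
    for _ in range(n_requests):
        now += rng.expovariate(1.0 / mean_interarrival)  # next request arrives
        i = min(range(n_engines), key=free_at.__getitem__)
        start = max(now, free_at[i])   # wait if the chosen engine is busy
        free_at[i] = start + service
    makespan = max(free_at)
    return n_requests * service / (makespan * n_engines)

for k in (1, 2, 4):
    print(f"{k} engine(s): {mean_engine_utilization(k):.0%} busy on average")
```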


IBM Journal of Research and Development | 2001

High-throughput coherence control and hardware messaging in Everest

Ashwini K. Nanda; Anthony-Trung Nguyen; Maged M. Michael; Douglas J. Joseph

Everest is an architecture for high-performance cache coherence and message passing in partitionable distributed shared-memory systems that use commodity symmetric multiprocessors (SMPs) as building blocks. The Everest architecture is intended for use in designing future IBM servers using either PowerPC® or Intel® processors. Everest provides high-throughput protocol handling in three dimensions: multiple protocol engines, split request-response handling, and pipelined design. It employs an efficient directory subsystem design that matches the directory access throughput requirement of high-performance protocol engines. A new directory design called the complete and concise remote (CCR) directory, which requires roughly the same amount of memory as a sparse directory but retains the benefits of a full-map directory, is used. Everest also supports system partitioning and provides a tightly integrated facility for secure, high-performance communication between partitions. Simulation results for both technical and commercial applications exploring some of the Everest design space are presented. The results show that the features of the Everest architecture can have a significant impact on the performance of distributed shared-memory servers.
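One plausible reading of the CCR directory, sketched below with assumed field names and API: it keeps a full-map-accurate entry for every local line that currently has a remote copy, and no entry otherwise, so its footprint is bounded by remote cache capacity (like a sparse directory) while never losing sharer information (like a full map).

```python
class CCRDirectory:
    """Sketch of a complete and concise remote (CCR) directory.

    Complete: every remotely cached local line has an exact sharer set,
    as in a full-map directory. Concise: a line with no remote copies has
    no entry at all, so total size tracks remote cache capacity rather
    than local memory size, as in a sparse directory.
    """
    def __init__(self, n_nodes):
        self.n_nodes = n_nodes
        self.entries = {}                          # line address -> sharer bitmask

    def add_sharer(self, line, node):
        self.entries[line] = self.entries.get(line, 0) | (1 << node)

    def remove_sharer(self, line, node):
        mask = self.entries.get(line, 0) & ~(1 << node)
        if mask:
            self.entries[line] = mask
        else:
            self.entries.pop(line, None)           # no remote copies: reclaim entry

    def sharers(self, line):
        mask = self.entries.get(line, 0)
        return [n for n in range(self.n_nodes) if (mask >> n) & 1]

d = CCRDirectory(n_nodes=8)
d.add_sharer(0x40, 3); d.add_sharer(0x40, 5)
print(d.sharers(0x40))     # [3, 5]
d.remove_sharer(0x40, 3); d.remove_sharer(0x40, 5)
print(0x40 in d.entries)   # False: entry reclaimed once no remote copy exists
```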


International Parallel and Distributed Processing Symposium | 2000

Using switch directories to speed up cache-to-cache transfers in CC-NUMA multiprocessors

Ravi R. Iyer; Laxmi N. Bhuyan; Ashwini K. Nanda

In this paper we propose a novel hardware caching technique, called the switch directory, to reduce communication latency in CC-NUMA multiprocessors. The main idea is to implement small, fast directory caches in the crossbar switches of the interconnect medium to capture and store ownership information as data flows from the memory module to the requesting processor. Using the stored information, the switch directory re-routes subsequent requests for dirty blocks directly to the owner cache, thus avoiding home node processing latencies such as slow DRAM directory accesses and coherence controller occupancy. The design and implementation details of a DiRectory Embedded Switch ARchitecture (DRESAR) are presented. We explore the performance benefits of switch directories by modeling DRESAR in a detailed execution-driven simulator. Our results show that switch directories can improve performance by reducing home node cache-to-cache transfers by up to 60% for several scientific applications and commercial workloads.
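The re-routing mechanism can be sketched as a small LRU ownership cache sitting inside the switch (class and method names are assumptions for illustration): replies flowing through the switch populate it, and later requests for lines it knows to be dirty are forwarded straight to the owner instead of the home node.

```python
from collections import OrderedDict

class SwitchDirectory:
    """Small LRU ownership cache embedded in a crossbar switch."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.owner = OrderedDict()          # line address -> node holding it dirty

    def observe_reply(self, line, owner_node):
        """Snoop a data reply flowing through the switch; record the new owner."""
        self.owner[line] = owner_node
        self.owner.move_to_end(line)
        if len(self.owner) > self.capacity:
            self.owner.popitem(last=False)  # evict least recently seen line

    def route_request(self, line, home_node):
        """Forward straight to the dirty owner if known, else to the home node."""
        return self.owner.get(line, home_node)

sd = SwitchDirectory()
sd.observe_reply(line=0x80, owner_node=6)
print(sd.route_request(0x80, home_node=0))   # 6: skips the home DRAM directory
print(sd.route_request(0xC0, home_node=0))   # 0: unknown line goes to the home node
```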


Archive | 2003

High Performance Memory Systems

Haldun Hadimioglu; David R. Kaeli; Ashwini K. Nanda

* Introduction
* Coherence, synchronization, and allocation
* Power-aware, reliable, and reconfigurable memory
* Software-based memory tuning
* Architecture-based tuning
* Workload considerations
* Index


IBM Journal of Research and Development | 2007

Speech recognition systems on the Cell Broadband Engine processor

Yang Liu; Holger E. Jones; Sheila Vaidya; Michael P. Perrone; Borivoj Tydlitát; Ashwini K. Nanda

In this paper we describe our design, implementation, and initial results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.) processor. Automated speech recognition decodes speech samples into plaintext (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Architecture. Identifying and exploiting these parallelism opportunities is challenging and critical to improving system performance. From our initial performance timings, we observed that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time, a channel density that is orders of magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E. processor-based speech recognition and will likely lead to the development of production speech systems using Cell/B.E. processor clusters.
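As a rough illustration of the data parallelism described (the shapes, feature dimensionality, and scoring kernel are assumptions, not the paper's pipeline), the NumPy sketch below scores every Gaussian of an acoustic codebook against one frame from each of many voice channels in a single vectorized operation, the kind of batched kernel that maps naturally onto the Cell/B.E. SPEs.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, dims, n_gauss = 256, 39, 256             # 39-dim acoustic features (assumed)
frames = rng.standard_normal((channels, dims))     # one frame per voice channel
means = rng.standard_normal((n_gauss, dims))       # Gaussian codebook means
inv_vars = rng.uniform(0.5, 2.0, (n_gauss, dims))  # diagonal precisions

# Score all channels against all Gaussians in one vectorized step:
diff = frames[:, None, :] - means[None, :, :]      # (channels, n_gauss, dims)
log_like = -0.5 * np.einsum("cgd,gd,cgd->cg", diff, inv_vars, diff)
best = log_like.argmax(axis=1)                     # best-scoring Gaussian per channel
print(best.shape)                                  # (256,)
```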
