Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Zeshan Chishti is active.

Publication


Featured research published by Zeshan Chishti.


International Symposium on Computer Architecture | 2005

Optimizing Replication, Communication, and Capacity Allocation in CMPs

Zeshan Chishti; Michael D. Powell; T. N. Vijaykumar

Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that, compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit these changes: (1) Although placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases and instead obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as SMPs do, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors' caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing, in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed non-uniform access with replacement and placement using distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID. Our results show that for a 4-core CMP with an 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.
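
As a rough illustration of the controlled-replication idea, the sketch below models a shared last-level cache that serves the first remote reads from an existing on-chip copy and only creates a local replica after repeated reuse. The class, counters, and threshold are invented for illustration and are not the paper's hardware mechanism.

```python
# Minimal sketch of controlled replication: serve remote reads from an existing
# on-chip copy first, and replicate near the requestor only after repeated reuse.
# All names and the threshold are illustrative assumptions, not the paper's design.

REPLICATE_THRESHOLD = 2  # remote reads before a local replica is made (assumed policy)

class SharedL2Model:
    def __init__(self):
        self.holders = {}         # block address -> set of cores holding an on-chip copy
        self.remote_reads = {}    # (core, block) -> reads served from a remote copy

    def read(self, core, block):
        holders = self.holders.setdefault(block, {core})
        if core in holders:
            return "local hit"
        # Controlled replication: first serve the read from an existing copy
        # instead of immediately making a new one, saving on-chip capacity.
        key = (core, block)
        self.remote_reads[key] = self.remote_reads.get(key, 0) + 1
        if self.remote_reads[key] >= REPLICATE_THRESHOLD:
            holders.add(core)     # only now pay the capacity cost of a local replica
            return "replicated near requestor"
        return "served from remote on-chip copy"

if __name__ == "__main__":
    l2 = SharedL2Model()
    print(l2.read(0, 0x100))  # first touch: local
    print(l2.read(1, 0x100))  # remote read, no replica yet
    print(l2.read(1, 0x100))  # second remote read: replicate
```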


International Symposium on Microarchitecture | 2003

Distance associativity for high-performance energy-efficient non-uniform cache architectures

Zeshan Chishti; Michael D. Powell; T. N. Vijaykumar

Wire delays continue to grow as the dominant component of latency for large caches. A recent work proposed an adaptive, non-uniform cache architecture (NUCA) to manage large on-chip caches. By exploiting the variation in access time across widely-spaced subarrays, NUCA allows fast access to close subarrays while retaining slow access to far subarrays. While the idea of NUCA is attractive, NUCA does not employ design choices commonly used in large caches, such as sequential tag-data access for low power. Moreover, NUCA couples data placement with tag placement, foregoing the flexibility of data placement that is possible in a non-uniform access cache. Consequently, NUCA can place only a few blocks within a given cache set in the fastest subarrays, and must employ a high-bandwidth switched network to swap blocks within the cache for high performance. In this paper, we propose the non-uniform access with replacement and placement using distance associativity cache, or NuRAPID, which leverages sequential tag-data access to decouple data placement from tag placement. Distance associativity, the placement of data at a certain distance (and latency), is separated from set associativity, the placement of tags within a set. This decoupling enables NuRAPID to flexibly place the vast majority of frequently-accessed data in the fastest subarrays, with fewer swaps than NUCA. Distance associativity fundamentally changes the trade-offs made by NUCA's best-performing design, resulting in higher performance and substantially lower cache energy. A one-ported, non-banked NuRAPID cache improves performance by 3% on average and up to 15% compared to a multi-banked NUCA with an infinite-bandwidth switched network, while reducing L2 cache energy by 77%.
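
The toy model below illustrates the decoupling that distance associativity provides: the tag lookup returns a forward pointer to whichever distance group currently holds the data, so data can sit in the fastest group regardless of its set or way. The structures and the promote-on-hit policy are simplifying assumptions, not the NuRAPID design itself.

```python
# Toy illustration of decoupled tag/data placement (distance associativity).
# Groups model subarrays at increasing distance; the policy is invented here.

class ToyNuRAPID:
    def __init__(self, num_groups=4):
        self.num_groups = num_groups
        # Tag array: set-associative lookup in hardware; each entry keeps a
        # forward pointer (group index) to wherever the data is stored.
        self.tags = {}  # block address -> distance group (0 = closest/fastest)

    def access(self, block):
        # Sequential tag-data access: probe the tag array first, then read only
        # the one data subarray named by the forward pointer.
        group = self.tags.get(block)
        if group is None:
            # Miss: new data goes to the closest group regardless of which
            # set/way its tag occupies; this is the decoupling.
            self.tags[block] = 0
            return "miss, placed in group 0"
        # On a hit in a far group, promote the data one group closer.
        if group > 0:
            self.tags[block] = group - 1
        return f"hit in group {group} (latency grows with group index)"
```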


International Symposium on Computer Architecture | 2010

Reducing cache power with low-cost, multi-bit error-correcting codes

Chris Wilkerson; Alaa R. Alameldeen; Zeshan Chishti; Wei Wu; Dinesh Somasekhar; Shih-Lien Lu

Technology advancements have enabled the integration of large on-die embedded DRAM (eDRAM) caches. eDRAM is significantly denser than traditional SRAMs, but must be periodically refreshed to retain data. Like SRAM, eDRAM is susceptible to device variations, which play a role in determining refresh time for eDRAM cells. Refresh power potentially represents a large fraction of overall system power, particularly during low-power states when the CPU is idle. Future designs need to reduce cache power without incurring the high cost of flushing cache data when entering low-power states. In this paper, we show the significant impact of variations on refresh time and cache power consumption for large eDRAM caches. We propose Hi-ECC, a technique that incorporates multi-bit error-correcting codes to significantly reduce refresh rate. Multi-bit error-correcting codes usually have a complex decoder design and high storage cost. Hi-ECC avoids the decoder complexity by using strong ECC codes to identify and disable sections of the cache with multi-bit failures, while providing efficient single-bit error correction for the common case. Hi-ECC includes additional optimizations that allow us to amortize the storage cost of the code over large data words, providing the benefit of multi-bit correction at the same storage cost as a single-bit error-correcting (SECDED) code (2% overhead). Our proposal achieves a 93% reduction in refresh power vs. a baseline eDRAM cache without error-correcting capability, and a 66% reduction in refresh power vs. a system using SECDED codes.
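
A minimal sketch of the line-handling policy described above: at the relaxed refresh rate, lines with at most one failing bit stay in service under the cheap single-bit-correction path, while lines with multi-bit failures are disabled. The failure counts below are made-up inputs, not measurements from the paper.

```python
# Hedged sketch of the Hi-ECC line-handling idea. The input models how many
# bits per cache line fail when refreshed at the extended, power-saving rate.

def classify_lines(failing_bits_per_line):
    """Return (usable, disabled) line indices at the extended refresh interval."""
    usable, disabled = [], []
    for line, fails in enumerate(failing_bits_per_line):
        if fails <= 1:
            usable.append(line)      # common case: cheap single-bit correction suffices
        else:
            disabled.append(line)    # rare case: give up the line, keep the long refresh
    return usable, disabled

usable, disabled = classify_lines([0, 0, 1, 0, 3, 0, 2, 0])  # invented example counts
print(f"{len(usable)} lines kept, {len(disabled)} lines disabled")
```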


International Symposium on Microarchitecture | 2009

Improving cache lifetime reliability at ultra-low voltages

Zeshan Chishti; Alaa R. Alameldeen; Chris Wilkerson; Wei Wu; Shih-Lien Lu

Voltage scaling is one of the most effective mechanisms to reduce microprocessor power consumption. However, the increased severity of manufacturing-induced parameter variations at lower voltages limits voltage scaling to a minimum voltage, Vccmin, below which a processor cannot operate reliably. Memory cell failures in large memory structures (e.g., caches) typically determine the Vccmin for the whole processor. Memory failures can be persistent (i.e., failures at time zero which cause yield loss) or non-persistent (e.g., soft errors or erratic bit failures). Both types of failures increase as supply voltage decreases and both need to be addressed to achieve reliable operation at low voltages. In this paper, we propose a novel adaptive technique to improve cache lifetime reliability and enable low voltage operation. This technique, multi-bit segmented ECC (MS-ECC), addresses both persistent and non-persistent failures. Like previous work on mitigating persistent failures, MS-ECC trades off cache capacity for lower voltages. However, unlike previous schemes, MS-ECC does not rely on testing to identify and isolate defective bits, and therefore enables error tolerance for non-persistent failures like erratic bits and soft errors at low voltages. Furthermore, MS-ECC's design can allow the operating system to adaptively change the cache size and ECC capability to adjust to system operating conditions. Compared to current designs with single-bit correction, the most aggressive implementation of MS-ECC enables a 30% reduction in supply voltage, reducing power by 71% and energy per instruction by 42%.
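
The snippet below sketches the capacity-for-voltage trade-off as a mode table an operating system might consult; the voltages and ECC fractions are invented for illustration and do not correspond to the paper's actual configurations.

```python
# Illustrative operating-point table: lower supply voltage requires giving a
# larger share of the data array to ECC check bits, shrinking usable capacity.

MODES = [
    # (minimum supply voltage in volts, fraction of data array spent on ECC)
    (1.00, 0.00),   # nominal voltage: full capacity
    (0.85, 0.25),   # moderate scaling: a quarter of the array holds check bits
    (0.70, 0.50),   # aggressive scaling: half of the array holds check bits
]

def pick_mode(target_vcc):
    """Pick the highest-capacity mode whose minimum voltage is at or below target_vcc."""
    for vmin, ecc_fraction in MODES:
        if target_vcc >= vmin:
            return vmin, ecc_fraction
    raise ValueError("target voltage below the lowest supported operating point")

vmin, frac = pick_mode(0.78)
print(f"operate at >= {vmin} V, {int(frac * 100)}% of data array used for ECC")
```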


High-Performance Computer Architecture | 2014

Improving DRAM performance by parallelizing refreshes with accesses

Kevin Kai-Wei Chang; Donghyuk Lee; Zeshan Chishti; Alaa R. Alameldeen; Chris Wilkerson; Yoongu Kim; Onur Mutlu

Modern DRAM cells are periodically refreshed to prevent data loss due to leakage. Commodity DDR (double data rate) DRAM refreshes cells at the rank level. This degrades performance significantly because it prevents an entire DRAM rank from serving memory requests while being refreshed. DRAM designed for mobile platforms, LPDDR (low power DDR) DRAM, supports an enhanced mode, called per-bank refresh, that refreshes cells at the bank level. This enables a bank to be accessed while another bank in the same rank is being refreshed, alleviating part of the negative performance impact of refreshes. Unfortunately, there are two shortcomings of per-bank refresh employed in today's systems. First, we observe that the per-bank refresh scheduling scheme does not exploit the full potential of overlapping refreshes with accesses across banks because it restricts the banks to be refreshed in a sequential round-robin order. Second, accesses to a bank that is being refreshed have to wait. To mitigate the negative performance impact of DRAM refresh, we propose two complementary mechanisms, DARP (Dynamic Access Refresh Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal is to address the drawbacks of per-bank refresh by building more efficient techniques to parallelize refreshes and accesses within DRAM. First, instead of issuing per-bank refreshes in a round-robin order, as is done today, DARP issues per-bank refreshes to idle banks in an out-of-order manner. Furthermore, DARP proactively schedules refreshes during intervals when a batch of writes are draining to DRAM. Second, SARP exploits the existence of mostly-independent subarrays within a bank. With minor modifications to DRAM organization, it allows a bank to serve memory accesses to an idle subarray while another subarray is being refreshed. Extensive evaluations on a wide variety of workloads and systems show that our mechanisms improve system performance (and energy efficiency) compared to three state-of-the-art refresh policies, and the performance benefit increases as DRAM density increases.
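
A simplified sketch of DARP's out-of-order refresh scheduling: when a refresh is due, prefer a bank with no pending demand requests so that busy banks keep serving accesses. The queue model below is an assumption for illustration, not the paper's scheduler.

```python
# Sketch of out-of-order per-bank refresh: refresh an idle bank first instead
# of following a fixed round-robin order. Data structures are invented here.

from collections import deque

def pick_bank_to_refresh(pending_requests, banks_needing_refresh):
    """pending_requests: dict bank -> deque of outstanding reads/writes.
    banks_needing_refresh: set of banks whose refresh is due."""
    # Prefer a bank with no queued demand requests (refresh hides behind idleness).
    idle = [b for b in banks_needing_refresh if not pending_requests.get(b)]
    if idle:
        return idle[0]
    # Otherwise fall back to the bank with the shortest demand queue.
    return min(banks_needing_refresh,
               key=lambda b: len(pending_requests.get(b, deque())))

pending = {0: deque(["rd", "rd"]), 1: deque(), 2: deque(["wr"])}
print("refresh bank", pick_bank_to_refresh(pending, {0, 1, 2}))  # -> bank 1 (idle)
```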


International Symposium on Microarchitecture | 2014

Transparent Hardware Management of Stacked DRAM as Part of Memory

Jaewoong Sim; Alaa R. Alameldeen; Zeshan Chishti; Chris Wilkerson; Hyesoon Kim

Recent technology advancements allow for the integration of large memory structures on-die or as a die-stacked DRAM. Such structures provide higher bandwidth and faster access time than off-chip memory. Prior work has investigated using the large integrated memory as a cache, or using it as part of a heterogeneous memory system under management of the OS. Using this memory as a cache would waste a large fraction of total memory space, especially for systems where stacked memory could be as large as off-chip memory. An OS-managed heterogeneous memory system, on the other hand, requires costly usage-monitoring hardware to migrate frequently-used pages, and is often unable to capture pages that are highly utilized for short periods of time. This paper proposes a practical, low-cost architectural solution that seamlessly uses large, fast memory as Part-of-Memory (PoM) without involving the OS. Our PoM architecture effectively manages two different types of memory (slow and fast) combined to create a single physical address space. To achieve this, PoM implements the ability to dynamically remap regions of memory based on their access patterns and expected performance benefits. Our proposed PoM architecture improves performance by 18.4% over static mapping and by 10.5% over an ideal OS-based dynamic remapping policy.
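
The toy remapper below illustrates the PoM idea of hardware-managed migration within one flat address space: regions in slow memory that accumulate enough accesses swap places with cold regions in fast memory. The threshold and bookkeeping are assumptions made for this sketch, not the paper's mechanism.

```python
# Toy model of hardware-managed region remapping in a flat (Part-of-Memory)
# address space. Region size, counters, and threshold are invented values.

SWAP_THRESHOLD = 64  # accesses before a slow-memory region earns a remap (assumed)

class PoMRemapper:
    def __init__(self, fast_regions):
        self.in_fast = set(fast_regions)   # regions currently mapped to fast memory
        self.counts = {}                   # region -> recent access count

    def access(self, region):
        self.counts[region] = self.counts.get(region, 0) + 1
        if region not in self.in_fast and self.counts[region] >= SWAP_THRESHOLD:
            # Swap the hot slow-memory region with the coldest fast-memory region;
            # a hardware remapping table (not modeled) would redirect addresses.
            victim = min(self.in_fast, key=lambda r: self.counts.get(r, 0))
            self.in_fast.remove(victim)
            self.in_fast.add(region)
            self.counts[region] = 0
        return "fast" if region in self.in_fast else "slow"
```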


IEEE Transactions on Computers | 2011

Adaptive Cache Design to Enable Reliable Low-Voltage Operation

Alaa R. Alameldeen; Zeshan Chishti; Chris Wilkerson; Wei Wu; Shih-Lien Lu

The performance/energy trade-off is widely acknowledged as a primary design consideration for modern processors. A less discussed, though equally important, trade-off is the reliability/energy trade-off. Many design features that increase reliability (e.g., redundancy, error detection, and correction) have the side effect of consuming more energy. Many energy-saving features (e.g., voltage scaling) have the side effect of making systems less reliable. In this paper, we propose an adaptive cache design that enables the operating system to optimize for performance or energy efficiency without sacrificing reliability. Our proposed mechanism enables a cache with a wide operating range, where the cache can use a variable part of its data array to store error-correcting codes. A reliable, energy-efficient cache can use up to half of its data array to store error-correcting codes so that it can reliably operate at a low voltage to reduce energy. A reliable, high-performance cache uses its whole data array for data, but must operate at a higher voltage to maintain reliability, sacrificing energy savings. We propose a hardware mechanism that allows the operating system to choose different points within that operating range based on the desired levels of performance, energy, and reliability.
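
As a back-of-the-envelope view of this trade-off, the arithmetic below uses the standard approximation that dynamic energy scales with the square of the supply voltage, together with invented voltages; the numbers are illustrative, not results from the paper.

```python
# Rough arithmetic for the reliability/energy trade-off described above.
# Voltages are invented; dynamic energy per access ~ V^2 is a textbook approximation.
v_nominal, v_low = 1.0, 0.7
energy_ratio = (v_low / v_nominal) ** 2
print(f"per-access dynamic energy at low voltage: {energy_ratio:.2f}x nominal")
print("usable capacity in the reliable low-voltage mode: 0.50x nominal")
```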


IEEE Transactions on Computers | 2016

DRAM Refresh Mechanisms, Penalties, and Trade-Offs

Ishwar Bhati; Mu-Tien Chang; Zeshan Chishti; Shih-Lien Lu; Bruce Jacob

Ever-growing application data footprints demand faster main memory with larger capacity. DRAM has been the technology of choice for main memory due to its low latency and high density. However, DRAM cells must be refreshed periodically to preserve their content. Refresh operations negatively affect performance and power. Traditionally, the performance and power overhead of refresh has been insignificant. But as the size and speed of DRAM chips continue to increase, refresh becomes a dominant factor in DRAM performance and power dissipation. In this paper, we conduct a comprehensive study of the issues related to refresh operations in modern DRAMs. Specifically, we describe the difference in refresh operations between modern synchronous DRAM and traditional asynchronous DRAM; the refresh modes and timings; and variations in data retention time. Moreover, we quantify refresh penalties versus device speed, size, and total memory capacity. We also categorize refresh mechanisms based on command granularity, and summarize refresh techniques proposed in research papers. Finally, based on our experiments and observations, we propose guidelines for mitigating DRAM refresh penalties.
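
As a worked example of the refresh penalty being quantified, the calculation below estimates the fraction of time a rank is unavailable under all-bank auto-refresh as tRFC divided by tREFI; the timing values are ballpark DDR3/DDR4 datasheet numbers chosen only to illustrate the calculation, not figures from the paper.

```python
# Refresh-overhead arithmetic: with all-bank auto-refresh, a rank is busy for
# tRFC out of every tREFI interval. Values below are typical datasheet ballparks.

tREFI_ns = 7800          # average interval between refresh commands (~7.8 us)
for density, tRFC_ns in [("4 Gb", 260), ("8 Gb", 350), ("16 Gb", 550)]:
    overhead = tRFC_ns / tREFI_ns
    print(f"{density}: rank busy refreshing {overhead:.1%} of the time")
```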


High-Performance Computer Architecture | 2014

Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers

Seth H. Pugsley; Zeshan Chishti; Chris Wilkerson; Peng Fei Chuang; Robert L. Scott; Aamer Jaleel; Shih-Lien Lu; Kingsum Chow; Rajeev Balasubramonian

Memory latency is a major factor in limiting CPU performance, and prefetching is a well-known method for hiding memory latency. Overly aggressive prefetching can waste scarce resources such as memory bandwidth and cache capacity, limiting or even hurting performance. It is therefore important to employ prefetching mechanisms that use these resources prudently, while still prefetching required data in a timely manner. In this work, we propose Sandbox Prefetching, a new mechanism to determine at run time the appropriate prefetching mechanism for the currently executing program. Sandbox Prefetching evaluates simple, aggressive offset prefetchers at run time by adding the prefetch address to a Bloom filter, rather than actually fetching the data into the cache. Subsequent cache accesses are tested against the contents of the Bloom filter to see if the aggressive prefetcher under evaluation could have accurately prefetched the data, while simultaneously testing for the existence of prefetchable streams. Real prefetches are performed when the accuracy of evaluated prefetchers exceeds a threshold. This method combines the ideas of global pattern confirmation and immediate prefetching action to achieve high performance. Sandbox Prefetching improves performance across the tested workloads by 47.6% compared to not using any prefetching, and by 18.7% compared to the Feedback Directed Prefetching technique. Performance is also improved by 1.4% compared to the Access Map Pattern Matching Prefetcher, while incurring considerably lower logic and storage overhead.
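
A minimal sketch of the sandbox evaluation loop, assuming a small Bloom filter and an arbitrary accuracy threshold: the candidate offset prefetcher records the addresses it would have prefetched, and later demand accesses that hit in the filter count toward its accuracy score. Sizes, hashing, and the threshold are invented for this sketch.

```python
# Sandbox evaluation of one candidate offset prefetcher: no data is fetched,
# only the would-be prefetch addresses are recorded in a Bloom filter.

import hashlib

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = [False] * bits

    def _positions(self, value):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:4], "little") % self.bits

    def add(self, value):
        for pos in self._positions(value):
            self.array[pos] = True

    def maybe_contains(self, value):
        return all(self.array[pos] for pos in self._positions(value))

def evaluate_offset_prefetcher(offset, access_trace, accuracy_threshold=0.25):
    """Run one candidate offset prefetcher inside the sandbox (no real prefetches)."""
    sandbox = BloomFilter()
    hits = total = 0
    for addr in access_trace:
        # Would this access have been covered by an earlier sandboxed prefetch?
        if sandbox.maybe_contains(addr):
            hits += 1
        total += 1
        # Record the prefetch the candidate would have issued, without fetching data.
        sandbox.add(addr + offset)
    accuracy = hits / total if total else 0.0
    return accuracy >= accuracy_threshold, accuracy

trace = list(range(0, 640, 64))               # a simple streaming access pattern
print(evaluate_offset_prefetcher(64, trace))  # the +64 offset scores highly here
```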


International Symposium on Microarchitecture | 2008

Shapeshifter: Dynamically changing pipeline width and speed to address process variations

Eric Chun; Zeshan Chishti; T. N. Vijaykumar

Process variations are a manufacturing phenomenon that cause some parameters of the transistors in a real chip to differ from those specified in the design. One impact of these variations is that the affected circuits may perform faster or slower than the design target. Unfortunately, only a small fraction of chips speed up, whereas the vast majority incur slowdowns. While die-to-die variations have been addressed by clock binning, within-die variations are increasing in importance with scaling. Clock binning in the presence of within-die variations results in slow clock speeds for dies with many components that can operate at higher clock speeds. A recent paper addressing within-die variations proposes variable-latency functional units and register files, so that the fast instances of these components take fewer clock cycles to operate than the slower instances. However, in pipeline stages where the instances are interdependent, the fast instances would be held up by the slow instances. Also, variable latency may complicate timing-critical instruction scheduling. Instead of varying the number of clock cycles, we advocate varying the clock speed. Our scheme, called Shapeshifter, maintains high clock speeds during low-ILP program phases by using a narrower pipeline of only the faster instances, and reduces the clock speed only in the high-ILP phases which use all the instances. Shapeshifter simply turns off the slow instances, removing them from any interdependence among all the instances. Also, Shapeshifter requires minimal additions to the pipeline because almost all pipelines already support varying the clock speed for power management purposes. Using simulations, we show that Shapeshifter performs better than clock binning and the variable-latency approach.
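
The toy decision rule below captures the phase-based choice Shapeshifter makes between a narrower, faster-clocked pipeline and the full-width, slower-clocked one; the widths, clock frequencies, and ILP threshold are invented for illustration and are not the paper's parameters.

```python
# Phase-based width/clock selection: trade pipeline width for clock speed in
# low-ILP phases. All configuration numbers here are illustrative assumptions.

CONFIGS = {
    "narrow_fast": {"width": 2, "clock_ghz": 3.2},  # only fast instances enabled
    "wide_slow":   {"width": 4, "clock_ghz": 2.6},  # all instances, binned clock
}

def pick_config(measured_ipc_at_full_width):
    # If the program phase cannot use the extra width anyway, trade width for clock.
    return "narrow_fast" if measured_ipc_at_full_width < 2.0 else "wide_slow"

for ipc in (1.2, 3.1):
    cfg = pick_config(ipc)
    print(f"phase IPC {ipc}: use {cfg} -> {CONFIGS[cfg]}")
```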

Collaboration


Dive into Zeshan Chishti's collaborations.
