Publication


Featured research published by Niladrish Chatterjee.


International Symposium on Computer Architecture | 2010

Rethinking DRAM design and organization for energy-constrained multi-cores

Aniruddha N. Udipi; Naveen Muralimanohar; Niladrish Chatterjee; Rajeev Balasubramonian; Al Davis; Norman P. Jouppi

DRAM vendors have traditionally optimized the cost-per-bit metric, often making design decisions that incur energy penalties. A prime example is the overfetch feature in DRAM, where a single request activates thousands of bit-lines in many DRAM chips, only to return a single cache line to the CPU. The focus on cost-per-bit is questionable in modern-day servers where operating costs can easily exceed the purchase cost. Modern technology trends are also placing very different demands on the memory system: (i) queuing delays are a significant component of memory access time, (ii) there is a high energy premium for the level of reliability expected for business-critical computing, and (iii) the memory access stream emerging from multi-core systems exhibits limited locality. All of these trends necessitate an overhaul of DRAM architecture, even if it means a slight compromise in the cost-per-bit metric. This paper examines three primary innovations. The first is a modification to DRAM chip microarchitecture that retains the traditional DDRx SDRAM interface. Selective Bit-line Activation (SBA) waits for both RAS (row address) and CAS (column address) signals to arrive before activating exactly those bitlines that provide the requested cache line. SBA reduces energy consumption while incurring slight area and performance penalties. The second innovation, Single Subarray Access (SSA), fundamentally re-organizes the layout of DRAM arrays and the mapping of data to these arrays so that an entire cache line is fetched from a single subarray. It requires a different interface to the memory controller, reduces dynamic and background energy (by about 6X), incurs a slight area penalty (4%), and can even lead to performance improvements (54% on average) by reducing queuing delays. The third innovation further penalizes the cost-per-bit metric by adding a checksum feature to each cache line. This checksum error-detection feature can then be used to build stronger RAID-like fault tolerance, including chipkill-level reliability. Such a technique is especially crucial for the SSA architecture where the entire cache line is localized to a single chip. This DRAM chip microarchitectural change leads to a dramatic reduction in the energy and storage overheads for reliability. The proposed architectures will also apply to other emerging memory technologies (such as resistive memories) and will be less disruptive to standards, interfaces, and the design flow if they can be incorporated into first-generation designs.
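As a rough illustration of the overfetch problem that motivates SBA and SSA, the sketch below compares how many bytes and chips a conventional access versus a single-subarray access would touch for one cache line. The DRAM geometry constants are hypothetical round numbers, not figures from the paper.

```python
# Illustrative sketch of DRAM overfetch vs. SSA-style access (assumed geometry).

CACHE_LINE_BYTES = 64
CHIPS_PER_RANK = 8                  # x8 DIMM rank, assumed
ROW_BUFFER_BYTES_PER_CHIP = 1024    # 8 KB row spread across 8 chips, assumed

def conventional_access():
    """Overfetch: every chip in the rank opens a full row to serve one 64 B line."""
    return CHIPS_PER_RANK * ROW_BUFFER_BYTES_PER_CHIP, CHIPS_PER_RANK

def ssa_access():
    """SSA-style access: the entire line is fetched from one subarray of one chip."""
    return CACHE_LINE_BYTES, 1

for name, access in (("conventional", conventional_access), ("SSA", ssa_access)):
    activated_bytes, chips = access()
    print(f"{name:12s}: {activated_bytes:5d} bytes activated across {chips} chip(s)")
```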


International Symposium on Microarchitecture | 2012

Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access

Niladrish Chatterjee; Manjunath Shevgoor; Rajeev Balasubramonian; Al Davis; Zhen Fang; Ramesh Illikkal; Ravi R. Iyer

The DRAM main memory system in modern servers is largely homogeneous. In recent years, DRAM manufacturers have produced chips with vastly differing latency and energy characteristics. This provides the opportunity to build a heterogeneous main memory system where different parts of the address space can yield different latencies and energy per access. The limited prior work in this area has explored smart placement of pages with high activity. In this paper, we propose a novel alternative to exploit DRAM heterogeneity. We observe that the critical word in a cache line can be easily recognized beforehand and placed in a low-latency region of the main memory, while the other, non-critical words of the cache line can be placed in a low-energy region. We design an architecture that has low complexity and that can accelerate the transfer of the critical word by tens of cycles. For our benchmark suite, we show an average performance improvement of 12.9% and an accompanying memory energy reduction of 15%.
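The sketch below illustrates the placement idea described in the abstract: the predicted critical word of a line is mapped to a low-latency region and the remaining words to a low-energy region. The region latencies and word granularity are assumptions for illustration, not the paper's design parameters.

```python
# Hedged sketch of critical-word placement in a heterogeneous DRAM (assumed values).

WORDS_PER_LINE = 8            # 64 B line split into 8 B words (assumed)
FAST_LATENCY_CYCLES = 20      # low-latency DRAM region (assumed)
SLOW_LATENCY_CYCLES = 45      # low-energy DRAM region (assumed)

def place_line(critical_word):
    """Map the predicted critical word to the fast region, the rest to low-energy."""
    return {w: ("fast" if w == critical_word else "low-energy")
            for w in range(WORDS_PER_LINE)}

def critical_word_latency(placement, critical_word):
    return FAST_LATENCY_CYCLES if placement[critical_word] == "fast" else SLOW_LATENCY_CYCLES

placement = place_line(critical_word=3)
print(placement)
print("critical word latency:", critical_word_latency(placement, 3), "cycles")
```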


High Performance Computer Architecture | 2012

Staged Reads: Mitigating the impact of DRAM writes on DRAM reads

Niladrish Chatterjee; Naveen Muralimanohar; Rajeev Balasubramonian; Al Davis; Norman P. Jouppi

Main memory latencies have always been a concern for system performance. Given that reads are on the critical path for CPU progress, reads must be prioritized over writes. However, writes must eventually be processed, and they often delay pending reads. In fact, a single channel in the main memory system offers almost no parallelism between reads and writes. This is because a single off-chip memory bus is shared by reads and writes, and the direction of the bus has to be explicitly turned around when switching from writes to reads. This is an expensive operation, and its cost is amortized by carrying out a burst of writes or reads every time the bus direction is switched. As a result, no reads can be processed while a memory channel is busy servicing writes. This paper proposes a novel mechanism to boost read-write parallelism and perform useful components of read operations even when the memory system is busy performing writes. If some of the banks are busy servicing writes, we start issuing reads to the other idle banks. The results of these reads are stored in a few registers near the memory chip's I/O pads and are returned immediately after the bus turnaround. The process is referred to as a Staged Read because it decouples a single read operation into two stages, with the first stage performed in parallel with writes. This innovation can also be viewed as a form of prefetch that is internal to a memory chip. The proposed technique works best when there is bank imbalance in the write stream. We also introduce a write scheduling algorithm that artificially creates bank imbalance and allows useful read operations to be performed during the write drain. Across a suite of memory-intensive workloads, we show that Staged Reads can boost throughput by up to 33% (average 7%) with an average DRAM access latency improvement of 17%, while incurring a very small cost (0.25%) in terms of memory chip area. The throughput improvements are even greater when considering write-intensive workloads (average 11%) or future systems (average 12%).
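A minimal sketch of the Staged Read idea follows: during a write drain, reads targeting banks not occupied by writes are issued early and their data parked in a small register pool near the I/O pads, to be streamed out after the bus turnaround. The queue contents, bank count, and register-pool size are invented for illustration.

```python
# Hedged sketch of staging reads to idle banks during a write drain.

from collections import deque

NUM_BANKS = 8
STAGE_REGISTERS = 4   # assumed capacity of the near-pad register pool

write_queue = [("W", b) for b in (0, 0, 1, 1, 0)]        # writes hit banks 0 and 1
read_queue  = deque([("R", b) for b in (2, 5, 0, 7, 3)]) # pending reads

def drain_writes_with_staged_reads(writes, reads):
    busy_banks = {bank for _, bank in writes}
    staged, deferred = [], deque()
    while reads and len(staged) < STAGE_REGISTERS:
        op = reads.popleft()
        # Stage the read only if its bank is idle during the write drain.
        (staged if op[1] not in busy_banks else deferred).append(op)
    reads.extendleft(reversed(deferred))   # conflicting reads wait for the turnaround
    return staged

staged = drain_writes_with_staged_reads(write_queue, read_queue)
print("reads staged during write drain:", staged)
print("reads still waiting for turnaround:", list(read_queue))
```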


International Symposium on Computer Architecture | 2016

Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems

Kevin Hsieh; Eiman Ebrahimi; Gwangsun Kim; Niladrish Chatterjee; Mike O'Connor; Nandita Vijaykumar; Onur Mutlu; Stephen W. Keckler

Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high-bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers. A key unsolved challenge in such a system is how to enable computation offloading and data mapping to multiple 3D-stacked memories without burdening the programmer, such that any application can transparently benefit from near-data processing capabilities in the logic layer. This paper develops two new mechanisms to address this challenge. The first is a compiler-based technique that automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis. The second is a software/hardware cooperative mechanism that predicts which memory pages will be accessed by offloaded code and places those pages in the memory stack closest to the offloaded code, to minimize off-chip bandwidth consumption. We call the combination of these two programmer-transparent mechanisms TOM: Transparent Offloading and Mapping. Our extensive evaluations across a variety of modern memory-intensive GPU workloads show that, without requiring any program modification, TOM significantly improves performance (by 30% on average, and up to 76%) compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
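The snippet below sketches the kind of cost-benefit test an offloading compiler pass might apply to a candidate code block; the traffic-based criterion and the example numbers are assumptions for illustration, not the paper's exact formula.

```python
# Hedged sketch of a cost-benefit check for near-data offloading decisions.

def should_offload(bytes_loaded, bytes_stored, live_in_bytes, live_out_bytes):
    """Offload when the off-chip traffic saved exceeds the traffic needed to ship
    live-in/live-out values to and from the memory stack (illustrative rule)."""
    saved_offchip_traffic = bytes_loaded + bytes_stored   # stays inside the stack if offloaded
    offload_overhead = live_in_bytes + live_out_bytes     # must still cross the off-chip link
    return saved_offchip_traffic > offload_overhead

# Hypothetical candidate block: streams 8 KB of data but needs only a few scalar
# live-ins/outs, so offloading it to the logic layer is a clear win.
print(should_offload(bytes_loaded=4096, bytes_stored=4096,
                     live_in_bytes=32, live_out_bytes=8))   # True
```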


IEEE International Conference on High Performance Computing Data and Analytics | 2014

Managing DRAM latency divergence in irregular GPGPU applications

Niladrish Chatterjee; Mike O'Connor; Gabriel H. Loh; Nuwan Jayasena; Rajeev Balasubramonian

Memory controllers in modern GPUs aggressively reorder requests for high bandwidth utilization, often interleaving requests from different warps. This leads to high variance in the latency of the requests issued by the threads of a warp. Since a warp in a SIMT architecture can proceed only when all of its memory requests are returned by memory, such latency divergence causes significant slowdowns when running irregular GPGPU applications. To address this issue, we propose memory scheduling mechanisms that avoid inter-warp interference in the DRAM system to reduce the average memory stall latency experienced by warps. We further reduce latency divergence through mechanisms that coordinate scheduling decisions across multiple independent memory channels. Finally, we show that carefully orchestrating the memory scheduling policy can achieve low average latency for warps without compromising bandwidth utilization. Our combined scheme yields a 10.1% performance improvement for irregular GPGPU workloads relative to a throughput-optimized GPU memory controller.
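The following sketch conveys the warp-aware scheduling intuition: requests are grouped by warp and a shortest-warp-first heuristic services one warp's batch before another's, so a warp's last request is not delayed by unrelated traffic. The request list and the heuristic itself are illustrative, not the paper's actual policy.

```python
# Hedged sketch of warp-aware request batching for a GPU memory controller.

from collections import defaultdict

pending = [  # (warp_id, dram_bank), invented for illustration
    (0, 2), (1, 2), (0, 5), (2, 1), (1, 7), (0, 0),
]

def warp_batched_order(requests):
    """Group requests by warp, then service the warp with the fewest pending
    requests first (a shortest-warp-first heuristic)."""
    by_warp = defaultdict(list)
    for warp, bank in requests:
        by_warp[warp].append((warp, bank))
    order = []
    for warp in sorted(by_warp, key=lambda w: len(by_warp[w])):
        order.extend(by_warp[warp])
    return order

print(warp_batched_order(pending))
```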


Proceedings of the 2015 International Symposium on Memory Systems | 2015

Anatomy of GPU Memory System for Multi-Application Execution

Adwait Jog; Onur Kayiran; Tuba Kesten; Ashutosh Pattnaik; Evgeny Bolotin; Niladrish Chatterjee; Stephen W. Keckler; Mahmut T. Kandemir; Chita R. Das

As GPUs make headway in the computing landscape, spanning mobile platforms, supercomputers, cloud, and virtual desktop platforms, supporting concurrent execution of multiple applications in GPUs becomes essential for unlocking their full potential. However, unlike on CPUs, multi-application execution in GPUs is little explored. In this paper, we study the memory system of GPUs in a concurrently executing multi-application environment. We first present an analytical performance model for many-threaded architectures and show that the common use of misses-per-kilo-instruction (MPKI) as a proxy for performance is not accurate without considering the bandwidth usage of applications. We characterize the memory interference of applications and discuss the limitations of existing memory schedulers in mitigating this interference. We extend the analytical model to multiple applications and identify the key metrics that control various performance objectives. We conduct extensive simulations using an enhanced version of GPGPU-Sim targeted at concurrently executing multiple applications, and show that memory scheduling decisions based on MPKI and bandwidth information are more effective in enhancing throughput than the traditional FR-FCFS and the recently proposed RR FR-FCFS policies.
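The toy calculation below shows why MPKI alone can mislead: two applications with nearly identical MPKI can place very different demands on memory bandwidth. The miss counts and runtimes are invented solely to illustrate the point made in the abstract.

```python
# Toy comparison of MPKI vs. attained memory bandwidth (all numbers invented).

def mpki(misses, instructions):
    return misses / (instructions / 1000.0)

def bandwidth_gbps(misses, line_bytes, seconds):
    return misses * line_bytes / seconds / 1e9

apps = {
    # name: (misses, instructions, runtime_seconds)
    "appA": (2_000_000, 100_000_000, 0.50),
    "appB": (2_100_000, 100_000_000, 0.05),   # similar MPKI, ~10x the bandwidth demand
}

for name, (m, i, t) in apps.items():
    print(f"{name}: MPKI={mpki(m, i):.1f}, bandwidth={bandwidth_gbps(m, 64, t):.2f} GB/s")
```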


Measurement and Modeling of Computer Systems | 2017

Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms

Kevin Kai-Wei Chang; Abdullah Giray Yağlıkçı; Saugata Ghose; Aditya Agrawal; Niladrish Chatterjee; Abhijith Kashyap; Donghyuk Lee; Mike O'Connor; Hasan Hassan; Onur Mutlu

The energy consumption of DRAM is a critical concern in modern computing systems. Improvements in manufacturing process technology have allowed DRAM vendors to lower the DRAM supply voltage conservatively, which reduces some of the DRAM energy consumption. We would like to reduce the DRAM supply voltage more aggressively, to further reduce energy. Aggressive supply voltage reduction requires a thorough understanding of the effect voltage scaling has on DRAM access latency and DRAM reliability. In this paper, we take a comprehensive approach to understanding and exploiting the latency and reliability characteristics of modern DRAM when the supply voltage is lowered below the nominal voltage level specified by manufacturers.


2010 IEEE 4th International Conference on Internet Multimedia Services Architecture and Application | 2010

A scalable location-based services infrastructure combining GPS and Bluetooth based positioning for providing services in ubiquitous environment

Pampa Sadhukhan; Niladrish Chatterjee; Arijit Das; Pradip Kumar Das

Several methodologies for provisioning Location-Based Services (LBSs), either solely using GPS-based positioning or combining several positioning techniques such as GPS, GSM Cell-ID, Wi-Fi, Bluetooth, and RFID, have been proposed over the past few years. Most of these systems do not address the limited battery power and memory of mobile devices, the heterogeneity of mobile platforms, or proper authentication checks before providing LBSs to the mobile user. Our proposed system utilizes low-cost, low-power Bluetooth wireless technology as an indoor positioning technique and combines it with GPS-based positioning to accurately sense location in outdoor environments. The inclusion of an HTML-to-WML parser in the Middleware deployed on each Base Station (BS) enables devices with a micro browser to invoke the LBSs properly. The mobile client application, developed in Java, is portable and interoperable with a diverse set of mobile platforms; it can also run on devices without location API support. Scatternet support in our proposed LBS infrastructure makes it scalable to an increasing number of client devices. This paper also evaluates the performance of the Middleware in terms of connection setup time and service consumption time with respect to a varying number of mobile clients.
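The sketch below shows the outdoor/indoor positioning hand-off the abstract describes: the client prefers a GPS fix when one is available and otherwise inherits the known location of the Bluetooth base station it is paired with. The types and values are invented for illustration; the middleware itself is not modeled.

```python
# Hedged sketch of GPS/Bluetooth position resolution (invented types and values).

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class BluetoothAnchor:
    station_id: str
    location: Tuple[float, float]   # known (lat, lon) of the base station

def resolve_position(gps_fix: Optional[Tuple[float, float]],
                     anchor: Optional[BluetoothAnchor]) -> Optional[Tuple[float, float]]:
    if gps_fix is not None:          # outdoors: GPS gives the position directly
        return gps_fix
    if anchor is not None:           # indoors: inherit the paired station's position
        return anchor.location
    return None                      # no positioning source available

print(resolve_position(None, BluetoothAnchor("BS-12", (22.5726, 88.3639))))
```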


High-Performance Computer Architecture | 2017

Architecting an Energy-Efficient DRAM System for GPUs

Niladrish Chatterjee; Mike O'Connor; Donghyuk Lee; Daniel R. Johnson; Stephen W. Keckler; Minsoo Rhu; William J. Dally

This paper proposes an energy-efficient, high-throughput DRAM architecture for GPUs and throughput processors. In these systems, requests from thousands of concurrent threads compete for a limited number of DRAM row buffers. As a result, only a fraction of the data fetched into a row buffer is used, leading to significant energy overheads. Our proposed DRAM architecture exploits the hierarchical organization of a DRAM bank to reduce the minimum row activation granularity. To avoid significant incremental area with this approach, we partition the DRAM datapath into a number of semi-independent subchannels. These narrow subchannels increase data toggling energy, which we mitigate using a static data reordering scheme designed to lower the toggle rate. This design consumes 35% less energy than a conventional die-stacked DRAM while adding only 2.6% area overhead. The resulting architecture, when augmented with an improved memory access protocol, can support parallel operations across the semi-independent subchannels, improving system performance by 13% on average for a range of workloads.
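To illustrate the toggle-rate concern on narrow subchannels, the sketch below counts bit transitions between successive data beats and compares two static byte-to-beat assignments. The byte values and the particular shuffle are made up and are not the paper's actual reordering scheme.

```python
# Toy illustration of data-toggle energy on an 8-bit subchannel (invented data).

def toggles(beats):
    """Count bit transitions between consecutive beats on an 8-bit subchannel."""
    return sum(bin(a ^ b).count("1") for a, b in zip(beats, beats[1:]))

line = [0xFF, 0x00, 0xFF, 0x00, 0xAA, 0x55, 0xAA, 0x55]   # made-up cache-line bytes

natural_order  = line                                          # beat i carries byte i
shuffled_order = [line[i] for i in (0, 2, 4, 6, 1, 3, 5, 7)]   # hypothetical static shuffle

print("toggles, natural order :", toggles(natural_order))
print("toggles, shuffled order:", toggles(shuffled_order))
```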


International Symposium on Performance Analysis of Systems and Software | 2016

Addressing service interruptions in memory with thread-to-rank assignment

Manjunath Shevgoor; Rajeev Balasubramonian; Niladrish Chatterjee; Jung-Sik Kim

In future memory systems, some regions of memory will be periodically unavailable to the processor. In DRAM systems, this may happen because a rank is busy performing refresh. In non-volatile memory systems, this may happen because a rank is busy draining long-latency writes. Unfortunately, such service interruptions can introduce stalls in all running threads. This is because the operating system spreads the pages of a thread across all memory ranks, so the probability of a thread accessing data in an unavailable rank is high. This is a performance artifact that has not previously been carefully analyzed. To reduce these stalls, we propose a simple page coloring mechanism that tries to minimize the number of ranks over which a thread's pages are spread. This approach ensures that a service interruption in a single rank stalls only a subset of threads; non-stalled threads even have the potential to run faster at this time because of reduced bus contention. Our analysis shows that this approach is more effective than recent hardware-based mechanisms for dealing with such service interruptions. For example, when dealing with service interruptions caused by DRAM refresh, the proposed page coloring approach yields an execution time that is 15% lower than the best competing hardware approach.
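The following sketch illustrates the page-coloring idea: each thread is given a home rank and page allocations are biased toward it, so a service interruption in one rank stalls only the threads homed there. The rank count, home-rank assignment, and free-page pools are hypothetical.

```python
# Hedged sketch of thread-to-rank page coloring (parameters invented).

NUM_RANKS = 4

def home_rank(thread_id):
    """Assign each thread a preferred rank, round-robin across ranks."""
    return thread_id % NUM_RANKS

def allocate_page(thread_id, free_pages_by_rank):
    """Prefer a free page in the thread's home rank; spill elsewhere only if empty."""
    preferred = home_rank(thread_id)
    for rank in [preferred] + [r for r in range(NUM_RANKS) if r != preferred]:
        if free_pages_by_rank[rank]:
            return rank, free_pages_by_rank[rank].pop()
    raise MemoryError("no free pages")

free_pages = {r: list(range(r * 100, r * 100 + 3)) for r in range(NUM_RANKS)}
for tid in range(6):
    print("thread", tid, "->", allocate_page(tid, free_pages))
```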

Collaboration


Dive into Niladrish Chatterjee's collaborations.

Top Co-Authors

Donghyuk Lee

Carnegie Mellon University


Saugata Ghose

Carnegie Mellon University
