Reetuparna Das | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Reetuparna Das is active.

Explore More

Publication

Featured researches published by Reetuparna Das.

international symposium on computer architecture | 2007

A novel dimensionally-decomposed router for on-chip communication in 3D architectures

Jongman Kim; Chrysostomos Nicopoulos; Dongkook Park; Reetuparna Das; Yuan Xie; Vijaykrishnan Narayanan; Mazin S. Yousif; Chita R. Das

Much like multi-storey buildings in densely packed metropolises, three-dimensional (3D) chip structures are envisioned as a viable solution to skyrocketing transistor densities and burgeoning die sizes in multi-core architectures. Partitioning a larger die into smaller segments and then stacking them in a 3D fashion can significantly reduce latency and energy consumption. Such benefits emanate from the notion that inter-wafer distances are negligible compared to intra-wafer distances. This attribute substantially reduces global wiring length in 3D chips. The work in this paper integrates the increasingly popular idea of packet-based Networks-on-Chip (NoC) into a 3D setting. While NoCs have been studied extensively in the 2D realm, the microarchitectural ramifications of moving into the third dimension have yet to be fully explored. This paper presents a detailed exploration of inter-strata communication architectures in 3D NoCs. Three design options are investigated; a simple bus-based inter-wafer connection, a hop-by-hop standard 3D design, and a full 3D crossbar implementation. In this context, we propose a novel partially-connected 3D crossbar structure, called the 3D Dimensionally-Decomposed (DimDe) Router, which provides a good tradeoff between circuit complexity and performance benefits. Simulation results using (a) a stand-alone cycle-accurate 3D NoC simulator running synthetic workloads, and (b) a hybrid 3D NoC/cache simulation environment running real commercial and scientific benchmarks, indicate that the proposed DimDe design provides latency and throughput improvements of over 20% on average over the other 3D architectures, while remaining within 5% of the full 3D crossbar performance. Furthermore, based on synthesized hardware implementations in 90 nm technology, the DimDe architecture outperforms all other designs -- including the full 3D crossbar -- by an average of 26% in terms of the Energy-Delay Product (EDP).

high-performance computer architecture | 2009

Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs

Reetuparna Das; Soumya Eachempati; Asit K. Mishra; Vijaykrishnan Narayanan; Chita R. Das

Performance and power consumption of an on-chip interconnect that forms the backbone of Chip Multiprocessors (CMPs), are directly influenced by the underlying network topology. Both these parameters can also be optimized by application induced communication locality since applications mapped on a large CMP system will benefit from clustered communication, where data is placed in cache banks closer to the cores accessing it. Thus, in this paper, we design a hierarchical network topology that takes advantage of such communication locality. The two-tier hierarchical topology consists of local networks that are connected via a global network. The local network is a simple, high-bandwidth, low-power shared bus fabric, and the global network is a low-radix mesh. The key insight that enables the hybrid topology is that most communication in CMP applications can be limited to the local network, and thus, using a fast, low-power bus to handle local communication will improve both packet latency and power-efficiency. The proposed hierarchical topology provides up to 63% reduction in energy-delay-product over mesh, 47% over flattened butterfly, and 33% with respect to concentrated mesh across network sizes with uniform and non-uniform synthetic traffic. For real parallel workloads, the hybrid topology provides up to 14% improvement in system performance (IPC) and in terms of energy-delay-product, improvements of 70%, 22%, 30% over the mesh, flattened butterfly, and concentrated mesh, respectively, for a 32-way CMP. Although the hybrid topology scales in a power- and bandwidth-efficient manner with network size, while keeping the average packet latency low in comparison to high radix topologies, it has lower throughput due to high concentration. To improve the throughput of the hybrid topology, we propose a novel router micro-architecture, called XShare, which exploits data value locality and bimodal traffic characteristics of CMP applications to transfer multiple small flits over a single channel. This helps in enhancing the network throughput by 35%, providing a latency reduction of 14% with synthetic traffic, and improving IPC on an average 4% with application workloads.

international symposium on microarchitecture | 2012

Composite Cores: Pushing Heterogeneity Into a Core

Andrew Lukefahr; Shruti Padmanabha; Reetuparna Das; Faissal M. Sleiman; Ronald G. Dreslinski; Thomas F. Wenisch; Scott A. Mahlke

Heterogeneous multicore systems -- comprised of multiple cores with varying capabilities, performance, and energy characteristics -- have emerged as a promising approach to increasing energy efficiency. Such systems reduce energy consumption by identifying phase changes in an application and migrating execution to the most efficient core that meets its current performance requirements. However, due to the overhead of switching between cores, migration opportunities are limited to coarse-grained phases (hundreds of millions of instructions), reducing the potential to exploit energy efficient cores. We propose Composite Cores, an architecture that reduces switching overheads by bringing the notion of heterogeneity within a single core. The proposed architecture pairs big and little compute µEngines that together can achieve high performance and energy efficiency. By sharing much of the architectural state between the µEngines, the switching overhead can be reduced to near zero, enabling fine-grained switching and increasing the opportunities to utilize the little µEngine without sacrificing performance. An intelligent controller switches between the µEngines to maximize energy efficiency while constraining performance loss to a configurable bound. We evaluate Composite Cores using cycle accurate micro architectural simulations and a detailed power model. Results show that, on average, the controller is able to map 25% of the execution to the little µEngine, achieving an 18% energy savings while limiting performance loss to 5%.

international symposium on microarchitecture | 2009

A case for dynamic frequency tuning in on-chip networks

Asit K. Mishra; Reetuparna Das; Soumya Eachempati; Ravishankar R. Iyer; Narayanan Vijaykrishnan; Chita R. Das

Performance and power are the first order design metrics for network-on-chips (NoCs) that have become the de-facto standard in providing scalable communication backbones for multicores/CMPs. However, NoCs can be plagued by higher power consumption and degraded throughput if the network and router are not designed properly. Towards this end, this paper proposes a novel router architecture, where we tune the frequency of a router in response to network load to manage both performance and power. We propose three dynamic frequency tuning techniques, FreqBoost, FreqThrtl and FreqTune, targeted at congestion and power management in NoCs. As enablers for these techniques, we exploit Dynamic Voltage and Frequency Scaling (DVFS) and the imbalance in a generic router pipeline through time stealing. Experiments using synthetic workloads on a 8x8 wormhole-switched mesh interconnect show that FreqBoost is a better choice for reducing average latency (maximum 40%) while, FreqThrtl provides the maximum benefits in terms of power saving and energy delay product (EDP). The FreqTune scheme is a better candidate for optimizing both performance and power, achieving on an average 36% reduction in latency, 13% savings in power (up to 24% at high load), and 40% savings (up to 70% at high load) in EDP. With application benchmarks, we observe IPC improvement up to 23% using our design. The performance and power benefits also scale for larger NoCs.

international symposium on computer architecture | 2013

Catnap: energy proportional multiple network-on-chip

Reetuparna Das; Satish Narayanasamy; Sudhir Satpathy; Ronald G. Dreslinski

Multiple networks have been used in several processor implementations to scale bandwidth and ensure protocol-level deadlock freedom for different message classes. In this paper, we observe that a multiple-network design is also attractive from a power perspective and can be leveraged to achieve energy proportionality by effective power gating. Unlike a single-network design, a multiple-network design is more amenable to power gating, as its subnetworks (subnets) can be power gated without compromising the connectivity of the network. To exploit this opportunity, we propose the Catnap architecture which consists of synergistic subnet selection and power-gating policies. Catnap maximizes the number of consecutive idle cycles in a router, while avoiding performance loss due to overloading a subnet. We evaluate a 256-core processor with a concentrated mesh topology using synthetic traffic and 35 applications. We show that the average network power of a power-gating optimized multiple-network design with four subnets could be 44% lower than a bandwidth equivalent single-network design for an average performance cost of about 5%.

international conference on parallel architectures and compilation techniques | 2012

Application-to-core mapping policies to reduce memory interference in multi-core systems

Reetuparna Das; Rachata Ausavarungnirun; Onur Mutlu; Akhilesh Kumar; Mani Azimi

How applications running on a many-core system are mapped to cores largely determines the interference between these applications in critical shared resources. This paper proposes application-to-core mapping policies to improve system performance by reducing inter-application interference in the on-chip network and memory controllers. The major new ideas of our policies are to: 1) map network-latency-sensitive applications to separate parts of the network from network-bandwidth-intensive applications such that the former can make fast progress without heavy interference from the latter, 2) map those applications that benefit more from being closer to the memory controllers close to these resources. Our evaluations show that both ideas significantly improve system throughput, fairness and interconnect power efficiency.

high-performance computer architecture | 2008

Performance and power optimization through data compression in Network-on-Chip architectures

Reetuparna Das; Asit K. Mishra; Chrysostomos Nicopoulos; Dongkook Park; Vijaykrishnan Narayanan; Ravishankar R. Iyer; Mazin S. Yousif; Chita R. Das

The trend towards integrating multiple cores on the same die has accentuated the need for larger on-chip caches. Such large caches are constructed as a multitude of smaller cache banks interconnected through a packet-based network-on-chip (NoC) communication fabric. Thus, the NoC plays a critical role in optimizing the performance and power consumption of such non-uniform cache-based multicore architectures. While almost all prior NoC studies have focused on the design of router microarchitectures for achieving this goal, in this paper, we explore the role of data compression on NoC performance and energy behavior. In this context, we examine two different configurations that explore combinations of storage and communication compression: (1) Cache compression (CC) and (2) Compression in the NIC (NC). We also address techniques to hide the decompression latency by overlapping with NoC communication latency. Our simulation results with a diverse set of scientific and commercial benchmark traces reveal that CC can provide up to 33% reduction in network latency and up to 23% power savings. Even in the case of NC - where the data is compressed only when passing through the NoC fabric of the NUCA architecture and stored uncompressed - performance and power savings of up to 32% and 21%, respectively, can be obtained. These performance benefits in the interconnect translate up to 17% reduction in CPI. These benefits are orthogonal to any router architecture and make a strong case for utilizing compression for optimizing the performance and power envelope of NoC architectures. In addition, the study demonstrates the criticality of designing faster routers in shaping the performance behavior.

high performance interconnects | 2007

Design of a Dynamic Priority-Based Fast Path Architecture for On-Chip Interconnects

Dongkook Park; Reetuparna Das; Chrysostomos Nicopoulos; Jongman Kim; Narayanan Vijaykrishnan; Ravishankar R. Iyer; Chita R. Das

In modern multi-core system-on-chip (SoC) architectures, the design of innovative interconnection fabrics is indispensable. The concept of the network-on-chip (NoC) architecture has been proposed recently to better suit this requirement. Especially, the router architecture has a significant effect on the overall performance and energy consumption of the chip. We propose a dynamic path management scheme that exploits network traffic information during switch arbitration. Consequently, flits transferred across frequently used paths are expedited by traversing a reduced router pipeline. This technique, based on pipeline bypassing, is simulated and evaluated in terms of network latency and average power consumption. Simulation results with real-world application traces show that the architecture improves the performance up to 30% while incurring only minimal area/power overhead.

high-performance computer architecture | 2013

Application-to-core mapping policies to reduce memory system interference in multi-core systems

Reetuparna Das; Rachata Ausavarungnirun; Onur Mutlu; Akhilesh Kumar; Mani Azimi

Future many-core processors are likely to concurrently execute a large number of diverse applications. How these applications are mapped to cores largely determines the interference between these applications in critical shared hardware resources. This paper proposes new application-to-core mapping policies to improve system performance by reducing inter-application interference in the on-chip network and memory controllers. The major new ideas of our policies are to: 1) map network-latency-sensitive applications to separate parts of the network from network-bandwidth-intensive applications such that the former can make fast progress without heavy interference from the latter, 2) map those applications that benefit more from being closer to the memory controllers close to these resources. Our evaluations show that, averaged over 128 multiprogrammed workloads of 35 different benchmarks running on a 64-core system, our final application-to-core mapping policy improves system throughput by 16.7% over a state-of-the-art baseline, while also reducing system unfairness by 22.4% and average interconnect power consumption by 52.3%.

IEEE Journal on Emerging and Selected Topics in Circuits and Systems | 2012

Swizzle-Switch Networks for Many-Core Systems

Korey Sewell; Ronald G. Dreslinski; Thomas Manville; Sudhir Satpathy; Nathaniel Ross Pinckney; Geoffrey Blake; Michael Cieslak; Reetuparna Das; Thomas F. Wenisch; Dennis Sylvester; David T. Blaauw; Trevor N. Mudge

This work revisits the design of crossbar and high-radix interconnects in light of advances in circuit and layout techniques that improve crossbar scalability, obviating the need for deep multi-stage networks. We employ a new building block, the Swizzle-Switch-an energy and area-efficient switching element that can readily scale to radix 64-that has recently been validated via silicon test chips in 45 nm technology. We evaluate the Swizzle-Switch as both the high-radix building block of a Flattened Butterfly and as a single-stage interconnect, the Swizzle-Switch Network. In the process we address the architectural and layout challenges associated with centralized crossbar systems. Compared to a conventional Mesh, the Flattened Butterfly provides a 15% performance improvement with a 2.5× reduction in the standard deviation of on-chip access times. The Swizzle-Switch Network achieves further gains, providing a 21% improvement in performance, a 3× reduction in on-chip access variability, a 33% reduction in interconnect power, and a 25% reduction in total system energy while only increasing chip area by 7%. Finally, this paper details a 3-D integrated version of the Swizzle-Switch Network, showing up to a 30% gain in performance over the 2-D Swizzle-Switch Network for benchmarks sensitive to interconnect latency. One major concern with 3-D designs is thermal dissipation. We show through detailed thermal analysis that with the highly energy-efficient Swizzle-Switch Network design that the thermal budget is well within that of passive cooling solutions.

Explore More