Ranjani Narayan | Researchain

Publication

Featured research published by Ranjani Narayan.

ACM Transactions on Embedded Computing Systems | 2009

REDEFINE: Runtime reconfigurable polymorphic ASIC

Mythri Alle; Keshavan Varadarajan; Alexander Fell; C. Ramesh Reddy; Nimmy Joseph; Saptarsi Das; Prasenjit Biswas; Jugantor Chetia; Adarsh Rao; S. K. Nandy; Ranjani Narayan

Emerging embedded applications are based on evolving standards (e.g., MPEG2/4, H.264/265, IEEE802.11a/b/g/n). Since most of these applications run on handheld devices, there is an increasing need for a single chip solution that can dynamically interoperate between different standards and their derivatives. In order to achieve high resource utilization and low power dissipation, we propose REDEFINE, a polymorphic ASIC in which specialized hardware units are replaced with basic hardware units that can create the same functionality by runtime re-composition. It is a “future-proof” custom hardware solution for multiple applications and their derivatives in a domain. In this article, we describe a compiler framework and supporting hardware comprising compute, storage, and communication resources. Applications described in a high-level language (e.g., C) are compiled into application substructures. For each application substructure, a set of compute elements (CEs) on the hardware is interconnected during runtime to form a pattern that closely matches the communication pattern of that particular application. The advantage is that the bounded CEs are neither processor cores nor logic elements as in FPGAs. Hence, REDEFINE offers the power and performance advantage of an ASIC and the hardware reconfigurability and programmability of an FPGA/instruction set processor. In addition, the hardware supports custom instruction pipelining. Existing instruction-set extensible processors determine a sequence of instructions that repeatedly occurs within the application to create custom instructions at design time to speed up the execution of this sequence. We extend this scheme further, where a kernel is compiled into custom instructions that bear a strong producer-consumer relationship (and are not limited to frequently occurring sequences of instructions).
Custom instructions, realized as hardware compositions effected at runtime, allow several instances of the same to be active in parallel. A key distinguishing factor in the majority of the emerging embedded applications is stream processing. To reduce the overheads of data transfer between custom instructions, direct communication paths are employed among custom instructions. In this article, we present an overview of the hardware-aware compiler framework, which determines the NoC-aware schedule of transports of the data exchanged between the custom instructions on the interconnect. The results for the FFT kernel indicate a 25% reduction in the number of loads/stores, and throughput improves by log(n) for n-point FFT when compared to a sequential implementation. Overall, REDEFINE offers flexibility and runtime reconfigurability at the expense of 1.16× in power and 8× in area when compared to an ASIC. The REDEFINE implementation consumes 0.1× the power of an FPGA implementation. In addition, the configuration overhead of the FPGA implementation is 1,000× more than that of REDEFINE.

Applied Reconfigurable Computing | 2009

Compiling Techniques for Coarse Grained Runtime Reconfigurable Architectures

Mythri Alle; Keshavan Varadarajan; Alexander Fell; S. K. Nandy; Ranjani Narayan

In this paper we develop compilation techniques for the realization of applications described in a High Level Language (HLL) onto a Runtime Reconfigurable Architecture. The compiler determines Hyper Operations (HyperOps) that are subgraphs of a data flow graph (of an application) and comprise elementary operations that have a strong producer-consumer relationship. These HyperOps are hosted on computation structures that are provisioned on demand at runtime. We also report compiler optimizations that collectively reduce the overheads of data-driven computations in runtime reconfigurable architectures. On average, HyperOps offer a 44% reduction in total execution time and an 18% reduction in management overheads as compared to using basic blocks as coarse grained operations. We show that HyperOps formed using our compiler are suitable to support data flow software pipelining.

Application Specific Systems Architectures and Processors | 2008

RECONNECT: A NoC for polymorphic ASICs using a low overhead single cycle router

Joseph Nimmy; C. Ramesh Reddy; Keshavan Varadarajan; Mythri Alle; Alexander Fell; S. K. Nandy; Ranjani Narayan

A polymorphic ASIC is a runtime reconfigurable hardware substrate comprising compute and communication elements. It is a “future proof” custom hardware solution for multiple applications and their derivatives in a domain. Interoperability between application derivatives at runtime is achieved through hardware reconfiguration. In this paper we present the design of a single cycle Network on Chip (NoC) router that is responsible for effecting runtime reconfiguration of the hardware substrate. The router design is optimized to avoid FIFO buffers at the input port and loop back at the output crossbar. It provides virtual channels to emulate a non-blocking network and supports a simple X-Y relative addressing scheme to limit the control overhead to 9 bits per packet. The 8×8 honeycomb NoC (RECONNECT), implemented in a 130 nm UMC CMOS standard cell library, operates at 500 MHz and has a bisection bandwidth of 28.5 GBps. The network is characterized for random, self-similar and application specific traffic patterns that model the execution of multimedia and DSP kernels with varying network loads and virtual channels. Our implementation with 4 virtual channels has an average network latency of 24 clock cycles and a throughput of 62.5% of the network capacity for random traffic. For application specific traffic the latency is 6 clock cycles and the throughput is 87% of the network capacity.
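
The idea of X-Y relative addressing can be illustrated with a minimal sketch. It assumes a plain rectangular mesh rather than RECONNECT's honeycomb topology, and the `Packet` fields, the port names, and the `route_step` and `trace` helpers are hypothetical names chosen for illustration, not taken from the paper. Each packet carries signed X and Y offsets to its destination; a router forwards along X until that offset reaches zero, then along Y, and delivers locally once both are zero.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dx: int  # remaining hops along X, signed (hypothetical field)
    dy: int  # remaining hops along Y, signed (hypothetical field)
    payload: int = 0


def route_step(pkt):
    """Pick the output port for one hop and update the relative address in place."""
    if pkt.dx > 0:
        pkt.dx -= 1
        return "EAST"
    if pkt.dx < 0:
        pkt.dx += 1
        return "WEST"
    if pkt.dy > 0:
        pkt.dy -= 1
        return "NORTH"
    if pkt.dy < 0:
        pkt.dy += 1
        return "SOUTH"
    return "LOCAL"  # both offsets are zero: the packet has arrived


def trace(pkt):
    """Follow a packet hop by hop until it is delivered; return the ports taken."""
    ports = []
    while True:
        port = route_step(pkt)
        ports.append(port)
        if port == "LOCAL":
            return ports
```

Because the header holds only two small signed offsets, no router needs a global address table, which is one way such a scheme keeps per-packet control overhead low.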

Application Specific Systems Architectures and Processors | 2008

Synthesis of application accelerators on Runtime Reconfigurable Hardware

Mythri Alle; Keshavan Varadarajan; Ramesh C. Ramesh; Joseph Nimmy; Alexander Fell; Adarsha Rao; S. K. Nandy; Ranjani Narayan

Application accelerators are predominantly ASICs. The cost of ASIC solutions is orders of magnitude higher than that of programmable processing cores. Despite this, ASIC solutions are preferred when both high performance and low power are the target. ASICs offer no flexibility to cater to application derivatives, unless this has been provisioned for at the time of design. In this paper we define the architecture of Runtime Reconfigurable Hardware (RRH) as the platform for application acceleration. The proposed RRH is a homogeneous fabric comprising computing, storage and communicating resources. We also propose a synthesis methodology to realize applications written in a high level language (HLL) on the RRH. Applications described in HLL are compiled into application substructures. For each application substructure, a set of Compute Elements (CEs), interconnected in a manner that closely matches the communication pattern within it, is allocated. The CEs in such a configuration are called a hardware affine. Hardware affines are defined at compile time and are carved out (provisioned) on the RRH fabric at runtime. Because these hardware affines are not instruction set processor cores or Logic Elements as in FPGAs, the RRH offers the performance and power advantage of an ASIC, and the hardware reconfigurability/programmability of an FPGA/Instruction Set Processor.

Symposium on Cloud Computing | 2009

Generic routing rules and a scalable access enhancement for the Network-on-Chip RECONNECT

Alexander Fell; Prasenjit Biswas; Jugantor Chetia; S. K. Nandy; Ranjani Narayan

RECONNECT is a Network-on-Chip using a honeycomb topology. In this paper we focus on properties of general rules applicable to a variety of routing algorithms for the NoC which take into account the missing links of the honeycomb topology when compared to a mesh. We also extend the original proposal [5] and show a method to insert and extract data to and from the network. Access Routers at the boundary of the execution fabric establish connections to multiple periphery modules and create a torus to decrease the node distances. Our approach is scalable and ensures homogeneity among the compute elements in the NoC. We synthesized and evaluated the proposed enhancement in terms of power dissipation and area. Our results indicate that the impact of the necessary alterations to the fabric is negligible and affects the data transfer between the fabric and the periphery only marginally.

Compilers, Architecture, and Synthesis for Embedded Systems | 2009

Streaming FFT on REDEFINE-v2: an application-architecture design space exploration

Alexander Fell; Mythri Alle; Keshavan Varadarajan; Prasenjit Biswas; Saptarsi Das; Jugantor Chetia; S. K. Nandy; Ranjani Narayan

In this paper we explore an implementation of a high-throughput, streaming application on REDEFINE-v2, which is an enhancement of REDEFINE. REDEFINE is a polymorphic ASIC combining the flexibility of a programmable solution with the execution speed of an ASIC. In REDEFINE, Compute Elements are arranged in an 8×8 grid connected via a Network on Chip (NoC) called RECONNECT, to realize the various macrofunctional blocks of an equivalent ASIC. For a 1024-point FFT we carry out an application-architecture design space exploration by examining the various characterizations of Compute Elements in terms of the size of the instruction store. We further study the impact of using application specific, vectorized FUs. By setting up different partitions of the FFT algorithm for persistent execution on REDEFINE-v2, we derive the benefits of setting up pipelined execution for higher performance. The impact of the REDEFINE-v2 micro-architecture on any arbitrary N-point FFT (N > 4096) is also analyzed. We report the various algorithm-architecture tradeoffs in terms of area and execution speed relative to an ASIC implementation. In addition we compare the performance gain with respect to a general-purpose processor (GPP).

International Conference on VLSI Design | 2015

Micro-architectural Enhancements in Distributed Memory CGRAs for LU and QR Factorizations

Farhad Merchant; Arka Maity; Mahesh Mahadurkar; Kapil Vatwani; Ishan Munje; Madhava Krishna; S. Nalesh; Nandhini Gopalan; Soumyendu Raha; S. K. Nandy; Ranjani Narayan

LU and QR factorizations are the computationally expensive part of many applications, ranging from large scale simulations (e.g., computational fluid dynamics) to augmented reality. These factorizations exhibit a time complexity of O(n³) and are difficult to accelerate due to the presence of bandwidth bound kernels, BLAS-1 or BLAS-2 (level-1 or level-2 Basic Linear Algebra Subprograms), along with compute bound kernels (BLAS-3, level-3 BLAS). On the other hand, Coarse Grained Reconfigurable Architectures (CGRAs) have gained tremendous popularity as accelerators in embedded systems due to their flexibility and ease of use. Provisioning these accelerators in High Performance Computing (HPC) platforms is a research challenge that computer scientists are wrestling with. We consider a CGRA environment in which several Compute Elements (CEs) enhanced with Custom Functional Units (CFUs) are interconnected over a Network-on-Chip (NoC). In this paper, we carry out extensive micro-architectural exploration for accelerating core kernels like Matrix Multiplication (MM) (BLAS-3) for LU and QR factorizations. Our five design enhancements reduce the latency of BLAS-3 kernels. On a stand-alone CFU, we achieve up to 8x speed-up for MM. A commensurate improvement is observed for MM in a CGRA environment. We achieve better GFLOPS/mm² compared to recent implementations.
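
To illustrate why matrix multiplication is a core kernel of these factorizations, here is a minimal sketch of a right-looking blocked LU factorization in pure Python. It is a textbook formulation, not the authors' implementation: it assumes no pivoting is needed (all pivots are nonzero), and the function name `blocked_lu` and the block size parameter `bs` are hypothetical. Step 3, the trailing update, is a matrix multiply and is where most of the O(n³) work lands as the matrix grows.

```python
def blocked_lu(A, bs=2):
    """In-place LU without pivoting (assumption: nonzero pivots).

    On return, A holds U on and above the diagonal and the multipliers
    of the unit lower triangular L below the diagonal.
    """
    n = len(A)
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)
        # 1. Panel factorization: eliminate within columns k0..k1-1.
        for k in range(k0, k1):
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]
                for j in range(k + 1, k1):
                    A[i][j] -= A[i][k] * A[k][j]
        # 2. Triangular solve for the block row U12 (rows k0..k1-1, columns k1..n-1).
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    A[i][j] -= A[i][k] * A[k][j]
        # 3. Trailing update A22 -= L21 * U12: a matrix multiply (BLAS-3 style).
        for i in range(k1, n):
            for j in range(k1, n):
                s = 0.0
                for k in range(k0, k1):
                    s += A[i][k] * A[k][j]
                A[i][j] -= s
    return A
```

For example, calling `blocked_lu` on a small square list of lists of floats overwrites it with the packed L and U factors; the block size changes only the order of operations, not the result in exact arithmetic.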

International Conference on Embedded Computer Systems Architectures Modeling and Simulation | 2014

Synthesis of Instruction Extensions on HyperCell, a reconfigurable datapath

Kavitha T. Madhu; Saptarsi Das; C. Madhava Krishna; S. Nalesh; S. K. Nandy; Ranjani Narayan

In this paper we present HyperCell as a reconfigurable datapath for Instruction Extensions (IEs). HyperCell comprises an array of compute units laid over a switch network. We present an IE synthesis methodology that enables post-silicon realization of IE datapaths on HyperCell. The synthesis methodology optimally exploits hardware resources in HyperCell to enable software pipelined execution of IEs. Exploitation of temporal reuse of data in HyperCell results in significant reduction of input/output bandwidth requirements of HyperCell.

Journal of Systems Architecture | 2014

A framework for post-silicon realization of arbitrary instruction extensions on reconfigurable data-paths

Saptarsi Das; Kavitha T. Madhu; Madhav Krishna; Nalesh Sivanandan; Farhad Merchant; Santhi Natarajan; Ipsita Biswas; Adithya Pulli; S. K. Nandy; Ranjani Narayan

In this paper we present a framework for realizing arbitrary instruction set extensions (IEs) that are identified post-silicon. The proposed framework has two components, viz. an IE synthesis methodology and the architecture of a reconfigurable data-path for the realization of such IEs. The IE synthesis methodology ensures maximal utilization of resources on the reconfigurable data-path. In this context we present the techniques used to realize IEs for applications that demand high throughput or those that must process data streams. The reconfigurable hardware, called HyperCell, comprises a reconfigurable execution fabric. The fabric is a collection of interconnected compute units. A typical use case of HyperCell is where it acts as a co-processor with a host and accelerates execution of IEs that are defined post-silicon. We demonstrate the effectiveness of our approach by evaluating the performance of some well-known integer kernels that are realized as IEs on HyperCell. Our methodology for realizing IEs through HyperCells permits overlapping of potentially all memory transactions with computations. We show significant improvement in performance for streaming applications over general purpose processor based solutions, by fully pipelining the data-path.

International Conference on VLSI Design | 2015

Hardware Solution for Real-Time Face Recognition

Gopinath Mahale; Hamsika Mahale; Arnav Goel; S. K. Nandy; S. Bhattacharya; Ranjani Narayan

The objective of this paper is to come up with a scalable modular hardware solution for real-time Face Recognition (FR) on large databases. Existing hardware solutions use algorithms with low recognition accuracy suitable for real-time response. In addition, database size for these solutions is limited by on-chip resources, making them unsuitable for practical real-time applications. Because of their high computational complexity, we do not choose algorithms from the literature with superior recognition accuracy. Instead, we come up with a combination of Weighted Modular Principal Component Analysis (WMPCA) and Radial Basis Function Neural Network (RBFNN) which outperforms algorithms used in existing hardware solutions on highly illumination and pose variant face databases. We propose a hardware solution for real-time FR which uses parallel streams to perform independent modular computations. A salient feature of the proposed hardware solution is that we store a major part of the data in off-chip memory in a novel format, so that the latencies experienced when accessing off-chip memory do not impact performance. This enables us to work on databases of very large sizes. To test functional correctness, the proposed architecture is synthesized and tested on a Virtex-6 LX550T FPGA. This emulated system is able to perform 450 recognitions per second on images of size 128 × 128 with 450 classes.
