Publication


Featured research published by S. K. Nandy.


International Conference on VLSI Design | 2015

Micro-architectural Enhancements in Distributed Memory CGRAs for LU and QR Factorizations

Farhad Merchant; Arka Maity; Mahesh Mahadurkar; Kapil Vatwani; Ishan Munje; Madhava Krishna; S. Nalesh; Nandhini Gopalan; Soumyendu Raha; S. K. Nandy; Ranjani Narayan

LU and QR factorizations are the computationally expensive parts of many applications, ranging from large-scale simulations (e.g., computational fluid dynamics) to augmented reality. These factorizations exhibit time complexity of O(n³) and are difficult to accelerate due to the presence of bandwidth-bound kernels, BLAS-1 and BLAS-2 (level-1 and level-2 Basic Linear Algebra Subprograms), alongside compute-bound kernels (BLAS-3, level-3 BLAS). On the other hand, Coarse-Grained Reconfigurable Architectures (CGRAs) have gained tremendous popularity as accelerators in embedded systems due to their flexibility and ease of use. Provisioning these accelerators in High Performance Computing (HPC) platforms remains an open research challenge. We consider a CGRA environment in which several Compute Elements (CEs) enhanced with Custom Functional Units (CFUs) are interconnected over a Network-on-Chip (NoC). In this paper, we carry out extensive micro-architectural exploration for accelerating core kernels such as Matrix Multiplication (MM) (BLAS-3) for LU and QR factorizations. Our five design enhancements reduce the latency of BLAS-3 kernels. On a stand-alone CFU, we achieve up to 8x speed-up for MM, and a commensurate improvement is observed for MM in a CGRA environment. We achieve better GFLOPS/mm² than recent implementations.
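The BLAS-3 kernel at the heart of these factorizations is, in software terms, a blocked loop nest. A minimal pure-Python sketch of tiled matrix multiplication (the block size and matrix shapes here are illustrative, not the paper's CFU parameters):

```python
def blocked_matmul(A, B, bs=2):
    """Blocked (tiled) matrix multiply: C = A @ B.

    Tiling improves data reuse within each block -- the property
    BLAS-3 kernels exploit to stay compute-bound rather than
    bandwidth-bound.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for p0 in range(0, k, bs):
                # Multiply one (bs x bs) tile pair, accumulating into C.
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, m)):
                        acc = 0.0
                        for p in range(p0, min(p0 + bs, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] += acc
    return C
```

Each tile pair is loaded once and reused bs times, which is why increasing the block size (up to what local storage allows) raises arithmetic intensity.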


International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation | 2014

Synthesis of Instruction Extensions on HyperCell, a reconfigurable datapath

Kavitha T. Madhu; Saptarsi Das; C. Madhava Krishna; S. Nalesh; S. K. Nandy; Ranjani Narayan

In this paper, we present HyperCell, a reconfigurable datapath for Instruction Extensions (IEs). HyperCell comprises an array of compute units laid over a switch network. We present an IE synthesis methodology that enables post-silicon realization of IE datapaths on HyperCell. The methodology optimally exploits hardware resources in HyperCell to enable software-pipelined execution of IEs, and exploiting temporal reuse of data in HyperCell significantly reduces its input/output bandwidth requirements.


International Conference on VLSI Design | 2016

Efficient Realization of Table Look-Up Based Double Precision Floating Point Arithmetic

Farhad Merchant; Nimash Choudhary; S. K. Nandy; Ranjani Narayan

In this paper, we present optimization techniques for look-up-table-based algorithms for double precision floating point arithmetic. Based on our analysis of different look-up-table-based algorithms in the literature, we re-engineer the basic blocks of the algorithms (i.e., multipliers and adders) to obtain area and timing benefits and achieve higher performance. We propose different look-up table optimization techniques for these algorithms, and we analyze the trade-off of employing exact rounding to 0.5 ulp (unit in the last place) in the double precision floating point unit. Based on performance and extensibility criteria, we take the algorithms proposed by Wong and Goto as a base case to validate our optimization techniques and compare performance with other algorithms in the literature. We improve the performance (latency × area) of the Wong and Goto division algorithm by 26.94%.
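As a generic illustration of table-look-up-based arithmetic (not the paper's or Wong and Goto's specific scheme), a reciprocal can be seeded from a small table and refined by Newton-Raphson iteration, which is the general shape such division hardware takes:

```python
def recip(d, table_bits=6, iters=3):
    """Reciprocal of d in [1, 2) via table lookup + Newton-Raphson.

    A small table indexed by the top fraction bits of d gives an
    initial estimate accurate to roughly table_bits bits; each
    Newton step x' = x * (2 - d*x) roughly doubles the number of
    correct bits.
    """
    assert 1.0 <= d < 2.0
    size = 1 << table_bits
    # Table index: the leading fraction bits of the significand.
    idx = int((d - 1.0) * size)
    # Seed: reciprocal of the midpoint of the indexed interval
    # (in hardware this value would be read from ROM).
    x = 1.0 / (1.0 + (idx + 0.5) / size)
    for _ in range(iters):
        x = x * (2.0 - d * x)   # Newton-Raphson refinement
    return x
```

Trading table size against iteration count is exactly the area-versus-latency knob the paper's look-up table optimizations turn.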


International Conference on VLSI Design | 2016

Achieving Efficient QR Factorization by Algorithm-Architecture Co-design of Householder Transformation

Farhad Merchant; Tarun Vatwani; Anupam Chattopadhyay; Soumyendu Raha; S. K. Nandy; Ranjani Narayan

Householder Transformation (HT) is a prime building block of widely used numerical linear algebra primitives such as QR factorization. Despite years of intense research on HT, there remains scope to expose higher Instruction-Level Parallelism through algorithmic transforms. In this paper, we propose several novel algorithmic transformations of HT that do exactly that. Our propositions are backed by theoretical proofs and a series of experiments on commercial general-purpose processors. Finally, we show that algorithm-architecture co-design leads to the most efficient realization of HT, and we present a detailed experimental study with architectural modifications for a commercial CGRA. Benchmarking against some recent HT implementations shows a 30-40% improvement in performance.
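For reference, the textbook Householder QR that such transforms restructure can be sketched as follows (plain Python for clarity, not the paper's transformed variant):

```python
import math

def householder_qr(A):
    """QR factorization by Householder reflections (textbook form).

    Each step reflects a column of the working matrix R so that its
    subdiagonal entries become zero; accumulating the reflections
    yields the orthogonal factor Q. Row-major lists of lists.
    """
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        # Build the Householder vector v for column k.
        x = [R[i][k] for i in range(k, m)]
        alpha = -math.copysign(math.sqrt(sum(t * t for t in x)), x[0])
        v = x[:]
        v[0] -= alpha
        vnorm2 = sum(t * t for t in v)
        if vnorm2 == 0.0:
            continue
        # Apply H = I - 2 v v^T / (v^T v) from the left to R,
        # and accumulate the same reflection into Q.
        for j in range(n):
            s = 2.0 * sum(v[i - k] * R[i][j] for i in range(k, m)) / vnorm2
            for i in range(k, m):
                R[i][j] -= s * v[i - k]
        for j in range(m):
            s = 2.0 * sum(v[i - k] * Q[i][j] for i in range(k, m)) / vnorm2
            for i in range(k, m):
                Q[i][j] -= s * v[i - k]
    # Q currently holds H_k ... H_1; its transpose is the orthogonal Q.
    Qt = [[Q[j][i] for j in range(m)] for i in range(m)]
    return Qt, R
```

The serial dependence between successive reflections is what limits Instruction-Level Parallelism here, and it is precisely the structure the paper's algorithmic transforms rework.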


High Performance Computing and Communications | 2015

Compiling HPC Kernels for the REDEFINE CGRA

Kavitha T. Madhu; Saptarsi Das; S. Nalesh; S. K. Nandy; Ranjani Narayan

In this paper, we present a compilation flow for HPC kernels on the REDEFINE coarse-grained reconfigurable architecture (CGRA). REDEFINE is a scalable macro-dataflow machine in which the compute elements (CEs) communicate through messages, and it offers the ability to exploit a high degree of coarse-grain and pipeline parallelism. The CEs in REDEFINE are enhanced with reconfigurable macro datapaths called HyperCells that enable exploitation of fine-grain and pipeline parallelism at the level of basic instructions in static dataflow order. Application kernels that exhibit regularity in computations and memory accesses, such as affine loop nests, benefit from the architecture of HyperCell [1], [2]. The proposed compilation flow exposes a high degree of parallelism in the loop nests of HPC application kernels using polyhedral analysis and generates meta-data to effectively utilize the computational resources in HyperCells. Memory is explicitly managed with compiler assistance. We address compilation challenges such as partitioning with load balancing, mapping and scheduling computations, and managing operand data while targeting multiple HyperCells in the REDEFINE architecture. The proposed solution scales well, meeting the performance objectives of HPC applications.
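The static partitioning step can be illustrated with a toy example (a hypothetical helper, not the REDEFINE compiler's actual pass): splitting a rectangular affine iteration space into near-equal tiles, one work unit per compute element, so that load is balanced by construction:

```python
def partition_affine_loop(n, m, tiles_i, tiles_j):
    """Partition an n x m affine iteration space into rectangular
    tiles, one per compute element. Returns half-open tile bounds
    (i0, i1, j0, j1). Near-equal splits mean tile sizes differ by
    at most one row or column -- static load balancing.
    """
    def splits(total, parts):
        base, rem = divmod(total, parts)
        bounds, start = [], 0
        for p in range(parts):
            end = start + base + (1 if p < rem else 0)
            bounds.append((start, end))
            start = end
        return bounds
    return [(i0, i1, j0, j1)
            for i0, i1 in splits(n, tiles_i)
            for j0, j1 in splits(m, tiles_j)]
```

Real polyhedral partitioning must also honor data dependences across tile boundaries, which this sketch ignores.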


High Performance Embedded Architectures and Compilers | 2016

Flexible resource allocation and management for application graphs on ReNÉ MPSoC

Kavitha T. Madhu; Anuj Rao; Saptarsi Das; Krishna C. Madhava; S. K. Nandy; Ranjani Narayan

Performance of an application on a many-core machine primarily hinges on the ability of the architecture to exploit parallelism and to provide fast memory accesses. Exploiting parallelism in static application graphs on a multicore target is relatively easy, since compilers can map them onto an optimal set of processing elements and memory modules. Dynamic application graphs have computations and data dependencies that manifest at runtime and hence may not be schedulable statically. Load balancing of such graphs requires runtime support (such as support for work-stealing) but incurs overheads due to data and code movement. In this work, we use the ReNÉ MPSoC as an alternative to traditional many-core processing platforms for targeting application kernel graphs. ReNÉ is designed to be used as an accelerator to a host; it offers the ability to exploit massive parallelism at multiple granularities, supports work-stealing for dynamic load balancing, and provides handles to enable and disable work-stealing selectively. ReNÉ employs an explicitly managed global memory with minimal hardware support for the address translation required to relocate application kernels. We present a resource management methodology on the ReNÉ MPSoC that encompasses a lightweight resource management hardware module and a compilation flow. Our methodology identifies resource requirements at compile time and creates resource boundaries (per application kernel) to guarantee performance and maximize resource utilization. The approach offers flexibility in resource allocation similar to a dynamic scheduling runtime, but guarantees performance since locality of reference of data and code can be ensured.


International Symposium on Electronic System Design | 2014

Energy Efficient, Scalable, and Dynamically Reconfigurable FFT Architecture for OFDM Systems

S. Kala; S. Nalesh; S. K. Nandy; Ranjani Narayan

FFT is the most compute-intensive operation in an OFDM system and critically affects its performance. Supporting the various OFDM standards requires a scalable and reconfigurable FFT architecture. This paper presents an energy-efficient and scalable FFT architecture that can be dynamically reconfigured to meet the specifications of different standards. The proposed architecture is based on the radix-4³ algorithm and uses a parallel-pipelined, unrolled structure that can be scaled to support FFTs of up to 64K points. As a proof of concept, an FFT architecture for sizes from 64 to 4K points has been implemented in a UMC 65nm 1P10M CMOS process, with a maximum clock frequency of 125 MHz and an area of 1.05 mm². The power consumption at 40 MHz is 33.5 mW for the computation of a 4K-point FFT. The energy efficiency (FFTs per unit of energy) of the proposed architecture at 40 MHz is 1176 for 1K-point, 584 for 2K-point, and 291 for 4K-point FFTs. The proposed architecture shows better scalability and energy efficiency than existing implementations.
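The radix-4 decomposition underlying a radix-4³ design can be sketched recursively in software (the hardware unrolls these recursion stages into a pipeline; this sketch is for illustrating the decomposition, not the paper's architecture):

```python
import cmath
import math

def fft_radix4(x):
    """Recursive radix-4 decimation-in-time FFT (length a power of 4).

    Splits the input into four interleaved subsequences, transforms
    each, and combines the results with twiddle factors w^(j*k).
    The length-(N/4) sub-transforms are reused periodically in k.
    """
    N = len(x)
    if N == 1:
        return list(x)
    assert N % 4 == 0, "length must be a power of 4"
    subs = [fft_radix4(x[j::4]) for j in range(4)]
    w = cmath.exp(-2j * math.pi / N)
    return [sum((w ** (j * k)) * subs[j][k % (N // 4)] for j in range(4))
            for k in range(N)]
```

A radix-4 stage needs only a quarter as many stages as radix-2 for the same length, which is what makes high-radix decompositions attractive for unrolled pipelines.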


Applied Reconfigurable Computing | 2018

Achieving Efficient Realization of Kalman Filter on CGRA Through Algorithm-Architecture Co-design

Farhad Merchant; Tarun Vatwani; Anupam Chattopadhyay; Soumyendu Raha; S. K. Nandy; Ranjani Narayan

In this paper, we present an efficient realization of the Kalman Filter (KF) that achieves up to 65% of the theoretical peak performance of the underlying architecture platform. KF is realized using the Modified Faddeeva Algorithm (MFA) as a basic building block due to its versatility, and the REDEFINE Coarse-Grained Reconfigurable Architecture (CGRA) is used as the experimental platform since REDEFINE supports realization of a set of algorithmic compute structures at run-time on a Reconfigurable Data-path (RDP). We perform several hardware and software optimizations in the realization of KF to achieve a 116% improvement in Gflops over our first realization. Overall, the presented approach attains a 4-105x improvement in Gflops/watt over several academically and commercially available realizations of KF. On REDEFINE, we show that our implementation is scalable and that the performance attained is commensurate with the underlying hardware resources.
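The classical Faddeeva scheme that MFA modifies computes D + C·A⁻¹·B by Gaussian elimination on a compound matrix; a minimal sketch without pivoting (an illustration of the primitive, not the paper's MFA variant or its KF mapping):

```python
def faddeeva(A, B, C, D):
    """Faddeeva algorithm: compute D + C * A^-1 * B by eliminating
    the lower-left block of the compound matrix [[A, B], [-C, D]].

    After elimination, the lower-right block holds D + C A^-1 B --
    the single primitive from which Kalman filter update equations
    can be composed. No pivoting, for clarity.
    """
    n, m, p = len(A), len(B[0]), len(C)
    # Build the (n+p) x (n+m) compound matrix.
    M = [A[i][:] + B[i][:] for i in range(n)] + \
        [[-C[i][j] for j in range(n)] + D[i][:] for i in range(p)]
    for k in range(n):  # eliminate everything below A's diagonal
        piv = M[k][k]
        for i in range(k + 1, n + p):
            f = M[i][k] / piv
            for j in range(k, n + m):
                M[i][j] -= f * M[k][j]
    # The lower-right p x m block is the result.
    return [[M[n + i][n + j] for j in range(m)] for i in range(p)]
```

Choosing A, B, C, D appropriately yields matrix products, sums, and Schur complements from the same elimination datapath, which is why a single MFA block can serve the whole filter.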


International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation | 2016

AccuRA: Accurate alignment of short reads on scalable reconfigurable accelerators

Santhi Natarajan; N Krishna Kumar; Debnath Pal; S. K. Nandy

Classified as a big data problem, Short Read Mapping (SRM) within the Next Generation Sequencing (NGS) pipeline presents profound technical and computing challenges. Existing solutions handle the high volume of data by leveraging heuristics to reach notable performance standards. The results of SRM have a broad impact across fields, including medical diagnostics and drug discovery. In this context, we need precise, affordable, reliable, and actionable results from SRM that support any application with uncompromised accuracy and performance. Here, we present AccuRA, a massively parallel, scalable, high-performance reconfigurable accelerator for accurate alignment of short reads. Supplemented with a multithreaded firmware architecture, AccuRA precisely aligns short reads at fine-grained single-nucleotide resolution and offers full coverage of the genome. AccuRA's parallel dynamic programming kernels perform the traceback process in hardware simultaneously with the forward scan, thus achieving SRM in the minimum possible and deterministic time. The AccuRA prototype, hosting eight kernel units on a single reconfigurable device, aligns short reads with an alignment performance of 20.48 Giga Cell Updates Per Second (GCUPS). AccuRA also scales well at multiple levels of design granularity, successfully aligning genomes of various sizes, from small archaeal, bacterial, and fungal genomes to the large mammalian human genome.
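The forward-scan-plus-traceback kernel such accelerators implement is, in software form, Smith-Waterman local alignment; a minimal sketch with illustrative scoring parameters (not AccuRA's hardware scoring scheme):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment with traceback.

    Fills the DP score matrix H (the forward scan), then walks back
    from the best-scoring cell to recover the aligned substrings --
    the step AccuRA performs in hardware alongside the scan.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    # Traceback from the maximum-scoring cell until the score hits zero.
    out_a, out_b, i, j = [], [], bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i, j = i - 1, j - 1
        elif H[i][j] == H[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j-1]); j -= 1
    return best, ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

Each inner-loop cell update is one "cell update" in the GCUPS metric quoted above; anti-diagonal cells are independent, which is what the parallel hardware kernels exploit.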


Applied Reconfigurable Computing | 2016

Performance Evaluation of Feed-Forward Backpropagation Neural Network for Classification on a Reconfigurable Hardware Architecture

Mahnaz Mohammadi; Rohit Ronge; Sanjay S. Singapuram; S. K. Nandy

This paper evaluates the classification performance of a Feed-Forward Backpropagation Neural Network (FFBPNN) on a reconfigurable hardware architecture. The hardware architecture used for the FFBPNN implementation is a set of interconnected HyperCells that serve as reconfigurable datapaths for the network. The architecture is easily scalable and can implement networks with no limitation on the number of input and output dimensions. The performance of the FFBPNN implemented on a network of HyperCells, with a Xilinx Virtex-7 XC7V2000T as the target FPGA, is compared against software and GPU implementations of the FFBPNN. Results show speed-ups of 1.02x-3.49x over an equivalent software implementation on an Intel Core 2 Quad and 1.07x-6x over an NVIDIA GTX650 GPU.
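A minimal software FFBPNN of the kind being benchmarked: one hidden layer with sigmoid activations, trained by backpropagation on XOR (hyperparameters and network size are illustrative, not the paper's configuration):

```python
import math
import random

def train_xor(epochs=2000, lr=0.5, hidden=3, seed=1):
    """Tiny feed-forward backpropagation network trained on XOR.

    One hidden layer, sigmoid activations, per-sample gradient
    descent on squared error. Returns (initial_loss, final_loss).
    """
    rng = random.Random(seed)
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    # Weights: input->hidden (each row: w_x0, w_x1, bias), hidden->output.
    W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))

    def forward(x):
        h = [sig(w[0] * x[0] + w[1] * x[1] + w[2]) for w in W1]
        y = sig(sum(W2[i] * h[i] for i in range(hidden)) + W2[hidden])
        return h, y

    def loss():
        return sum((forward(x)[1] - t) ** 2 for x, t in data)

    first = loss()
    for _ in range(epochs):
        for x, t in data:
            h, y = forward(x)
            dy = 2 * (y - t) * y * (1 - y)               # output delta
            for i in range(hidden):
                dh = dy * W2[i] * h[i] * (1 - h[i])      # hidden delta
                W2[i] -= lr * dy * h[i]
                W1[i][0] -= lr * dh * x[0]
                W1[i][1] -= lr * dh * x[1]
                W1[i][2] -= lr * dh                      # hidden bias
            W2[hidden] -= lr * dy                        # output bias
    return first, loss()
```

The forward pass is dominated by multiply-accumulate chains, which is the structure that maps naturally onto HyperCell datapaths.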

Collaboration


Dive into S. K. Nandy's collaborations.

Top Co-Authors

Ranjani Narayan (Indian Institute of Science)
S. Nalesh (Indian Institute of Science)
Farhad Merchant (Indian Institute of Science)
Kavitha T. Madhu (Indian Institute of Science)
Saptarsi Das (Indian Institute of Science)
Soumyendu Raha (Indian Institute of Science)
Santhi Natarajan (Indian Institute of Science)
Anupam Chattopadhyay (Nanyang Technological University)
Tarun Vatwani (Nanyang Technological University)
Arka Maity (National Institute of Technology)