Farhad Merchant | Researchain

Publication

Featured research published by Farhad Merchant.

International Symposium on Electronic System Design | 2011

A Fully Pipelined Modular Multiple Precision Floating Point Multiplier with Vector Support

Alok Baluni; Farhad Merchant; S. K. Nandy; S. Balakrishnan

The rapid evolution of reconfigurable computing places great demand on Floating Point Multipliers (FPMs) capable of supporting a wide range of application domains, from scientific computing to multimedia applications. While the former needs support for higher-precision formats such as Double Precision (DP) and Extended Precision (EP), the latter needs a Single Instruction Multiple Data (SIMD) feature in Single Precision (SP) mode. This paper presents the design of an FPM catering to both needs using a hierarchical design approach. The FPM supports nine parallel SP multiplications every cycle with a latency of two cycles and one DP/EP multiplication every cycle with a latency of three cycles. The FPM is architected to support all four IEEE rounding modes. Compared to other FPMs that support multiple precision and SIMD processing, our FPM achieves 9x throughput for vectored SP mode without penalising the throughput for DP/EP modes. This improvement in performance is achieved at a modest cost of 30 percent more area and 11 percent more power. The modular architecture of the proposed FPM results in a significant power reduction of up to 80 percent for scalar SP mode.

International Conference on VLSI Design | 2015

Micro-architectural Enhancements in Distributed Memory CGRAs for LU and QR Factorizations

Farhad Merchant; Arka Maity; Mahesh Mahadurkar; Kapil Vatwani; Ishan Munje; Madhava Krishna; S. Nalesh; Nandhini Gopalan; Soumyendu Raha; S. K. Nandy; Ranjani Narayan

LU and QR factorizations are the computationally expensive part of many applications ranging from large-scale simulations (e.g. computational fluid dynamics) to augmented reality. These factorizations exhibit a time complexity of O(n^3) and are difficult to accelerate due to the presence of bandwidth-bound kernels, BLAS-1 or BLAS-2 (level-1 or level-2 Basic Linear Algebra Subprograms), along with compute-bound kernels (BLAS-3, level-3 BLAS). On the other hand, Coarse Grained Reconfigurable Architectures (CGRAs) have gained tremendous popularity as accelerators in embedded systems due to their flexibility and ease of use. Provisioning these accelerators in High Performance Computing (HPC) platforms is a research challenge that computer scientists are still wrestling with. We consider a CGRA environment in which several Compute Elements (CEs) enhanced with Custom Functional Units (CFUs) are interconnected over a Network-on-Chip (NoC). In this paper, we carry out extensive micro-architectural exploration for accelerating core kernels like Matrix Multiplication (MM) (BLAS-3) for LU and QR factorizations. Our five design enhancements reduce the latency of BLAS-3 kernels. On a stand-alone CFU, we achieve up to 8x speed-up for MM. A commensurate improvement is observed for MM in a CGRA environment. We achieve better GFLOPS/mm^2 compared to recent implementations.
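To make the mix of kernel types in the abstract above concrete, here is a minimal textbook sketch of right-looking LU factorization with partial pivoting in pure Python. It is not the paper's CGRA implementation; the function name `lu_factor` and the packed output layout are assumptions chosen for illustration. The comments mark which steps are vector-level (BLAS-1 style) and which step turns into a matrix multiplication when the loop is blocked.

```python
def lu_factor(a):
    """Right-looking LU with partial pivoting on a square list-of-lists matrix.

    Returns (perm, lu): lu packs the unit-lower factor L below the diagonal
    and U on and above it; perm[i] is the original row now stored in row i.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy, leave the input untouched
    perm = list(range(n))
    for k in range(n):
        # Pivot search down column k: a vector (level-1 style) operation.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        # Scale the column below the pivot: another vector operation.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
        # Rank-1 update of the trailing submatrix. In a blocked variant this
        # step becomes a matrix-matrix product (level-3 style).
        for i in range(k + 1, n):
            lik = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= lik * lu[k][j]
    return perm, lu
```

The trailing update accounts for the cubic operation count, which is why accelerating matrix multiplication pays off for the factorization as a whole.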

Application-Specific Systems, Architectures, and Processors | 2014

Efficient and scalable CGRA-based implementation of Column-wise Givens Rotation

Zoltán Endre Rákossy; Farhad Merchant; Axel Acosta-Aponte; S. K. Nandy; Anupam Chattopadhyay

Givens Rotation is a key computation-intensive block in embedded wireless applications. In order to achieve an efficient mapping which smoothly scales to the underlying architecture, we propose two new Column-based Givens Rotation algorithms, derived from the traditional Fast Givens and Square-root and Division Free Givens algorithms. These algorithms allow annihilation of multiple elements in a column of the input matrix simultaneously, without a dependency bottleneck, allowing increased parallelism, resource sharing and scalability. The ease of mapping and scalability have been tested on a layered coarse-grained reconfigurable architecture, reaching close to optimal results for highly parallel architectures.

Journal of Systems Architecture | 2014

A framework for post-silicon realization of arbitrary instruction extensions on reconfigurable data-paths

Saptarsi Das; Kavitha T. Madhu; Madhav Krishna; Nalesh Sivanandan; Farhad Merchant; Santhi Natarajan; Ipsita Biswas; Adithya Pulli; S. K. Nandy; Ranjani Narayan

In this paper we present a framework for realizing arbitrary instruction set extensions (IEs) that are identified post-silicon. The proposed framework has two components, viz., an IE synthesis methodology and the architecture of a reconfigurable data-path for realization of such IEs. The IE synthesis methodology ensures maximal utilization of resources on the reconfigurable data-path. In this context we present the techniques used to realize IEs for applications that demand high throughput or those that must process data streams. The reconfigurable hardware, called HyperCell, comprises a reconfigurable execution fabric. The fabric is a collection of interconnected compute units. A typical use case of HyperCell is where it acts as a co-processor with a host and accelerates execution of IEs that are defined post-silicon. We demonstrate the effectiveness of our approach by evaluating the performance of some well-known integer kernels that are realized as IEs on HyperCell. Our methodology for realizing IEs through HyperCells permits overlapping of potentially all memory transactions with computations. We show significant improvement in performance for streaming applications over general-purpose-processor-based solutions by fully pipelining the data-path.

2014 22nd International Conference on Very Large Scale Integration (VLSI-SoC) | 2014

Scalable and energy-efficient reconfigurable accelerator for column-wise givens rotation

Zoltán Endre Rákossy; Farhad Merchant; Axel Acosta-Aponte; S. K. Nandy; Anupam Chattopadhyay

A new layered reconfigurable architecture is proposed which exploits modularity, scalability and flexibility to achieve high energy efficiency and memory bandwidth. Using two flavors of Column-wise Givens Rotation, derived from the traditional Fast Givens and Square-root and Division Free Givens Rotation algorithms, the architecture is thoroughly evaluated for scalability, speed, area and energy. By combining an efficient mapping strategy for the highly parallel algorithms, which can annihilate multiple elements of a column of the input matrix, with the new features of the architecture, nine architectural variants were explored, achieving a clean trade-off of execution speed versus area while keeping energy relatively constant.

International Conference on Embedded Computer Systems Architectures Modeling and Simulation | 2014

Co-exploration of NLA kernels and specification of Compute Elements in distributed memory CGRAs

Mahesh Mahadurkar; Farhad Merchant; Arka Maity; Kapil Vatwani; Ishan Munje; Nandhini Gopalan; S. K. Nandy; Ranjani Narayan

Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as embedded application processing units in computing platforms for Exascale computing. Such CGRAs are distributed memory multi-core compute elements on a chip that communicate over a Network-on-Chip (NoC). Numerical Linear Algebra (NLA) kernels are key to several high performance computing applications. In this paper we propose a systematic methodology to obtain the specification of Compute Elements (CEs) for such CGRAs. We analyze block Matrix Multiplication and block LU Decomposition algorithms in the context of a CGRA, and obtain theoretical bounds on communication requirements and memory sizes for a CE. Support for high performance custom computations common to NLA kernels is provided through Custom Function Units (CFUs) in the CEs. We present results to justify the merits of such CFUs.
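As a rough illustration of the block Matrix Multiplication analyzed in the abstract above, the following pure-Python sketch tiles the three loops of a matrix product. It is a generic textbook formulation, not the paper's mapping; the function name `block_matmul` and the `block` parameter are assumptions. Each (i0, j0, k0) step touches only three tiles of at most block x block entries, which is the kind of working set one would size local memory against.

```python
def block_matmul(a, b, block=2):
    """Multiply list-of-lists matrices a (n x k) and b (k x m) tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # tile row of c
        for j0 in range(0, m, block):      # tile column of c
            for k0 in range(0, k, block):  # tiles along the shared dimension
                # Multiply one tile of a by one tile of b and accumulate the
                # result into the matching tile of c.
                for i in range(i0, min(i0 + block, n)):
                    for kk in range(k0, min(k0 + block, k)):
                        aik = a[i][kk]
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += aik * b[kk][j]
    return c
```

The `min(...)` bounds let the sketch handle dimensions that are not multiples of the tile size.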

International Conference on VLSI Design | 2014

Efficient QR Decomposition Using Low Complexity Column-wise Givens Rotation (CGR)

Farhad Merchant; Anupam Chattopadhyay; Ganesh Garga; S. K. Nandy; Ranjani Narayan; Nandhini Gopalan

QR decomposition (QRD) is a widely used Numerical Linear Algebra (NLA) kernel with applications ranging from SONAR beamforming to wireless MIMO receivers. In this paper, we propose a novel Givens Rotation (GR) based QRD (GR-QRD) in which we reduce the computational complexity of GR and exploit a higher degree of parallelism. This low-complexity Column-wise GR (CGR) can annihilate multiple elements of a column of a matrix simultaneously. The algorithm is first realized on a Two-Dimensional (2D) systolic array and then implemented on REDEFINE, which is a Coarse Grained run-time Reconfigurable Architecture (CGRA). We benchmark the proposed implementation against state-of-the-art implementations and report better throughput, convergence and scalability.
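For readers unfamiliar with the baseline that CGR improves on, here is a minimal sketch of classical Givens-rotation QR in pure Python. It annihilates one sub-diagonal element per rotation, so it is the conventional algorithm and not the column-wise variant proposed in the paper. The function name `givens_qr` is an assumption for illustration.

```python
import math


def givens_qr(a):
    """Return (q, r) with a = q * r, using one Givens rotation per element."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Work up column j from the bottom, zeroing r[i][j] against r[i-1][j].
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            if y == 0.0:
                continue  # already zero, no rotation needed
            h = math.hypot(x, y)
            c, s = x / h, y / h
            # Rotate rows i-1 and i of r so that r[i][j] becomes zero.
            for col in range(n):
                t1, t2 = r[i - 1][col], r[i][col]
                r[i - 1][col] = c * t1 + s * t2
                r[i][col] = -s * t1 + c * t2
            # Accumulate the transposed rotation into columns i-1 and i of q.
            for row in range(m):
                t1, t2 = q[row][i - 1], q[row][i]
                q[row][i - 1] = c * t1 + s * t2
                q[row][i] = -s * t1 + c * t2
    return q, r
```

Each rotation in a column depends on the row produced by the previous one, which is the sequential chain that a scheme annihilating several elements at once aims to shorten.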

International Conference on VLSI Design | 2016

Efficient Realization of Table Look-Up Based Double Precision Floating Point Arithmetic

Farhad Merchant; Nimash Choudhary; S. K. Nandy; Ranjani Narayan

In this paper we present different optimization techniques for look-up table based algorithms for double precision floating point arithmetic. Based on our analysis of different look-up table based algorithms in the literature, we re-engineer the basic blocks of the algorithms (i.e. multiplier(s) and adder(s)) to obtain area and timing benefits and achieve higher performance. We propose different look-up table optimization techniques for the algorithms. We also analyze the trade-off in employing exact rounding (0.5 ulp, unit in the last place) in the double precision floating point unit. Based on performance and extensibility criteria, we take the algorithms proposed by Wong and Goto as a base case to validate our optimization techniques and compare the performance with other algorithms in the literature. We improve the performance (latency × area) of the Wong and Goto division algorithm by 26.94%.

International Conference on VLSI Design | 2016

Achieving Efficient QR Factorization by Algorithm-Architecture Co-design of Householder Transformation

Farhad Merchant; Tarun Vatwani; Anupam Chattopadhyay; Soumyendu Raha; S. K. Nandy; Ranjani Narayan

Householder Transformation (HT) is a prime building block of widely used numerical linear algebra primitives such as QR factorization. Despite years of intense research on HT, there remains scope to expose higher Instruction-Level Parallelism in HT through algorithmic transforms. In this paper, we propose several novel algorithmic transformations in HT to expose higher Instruction-Level Parallelism. Our propositions are backed by theoretical proofs and a series of experiments using commercial general-purpose processors. Finally, we show that algorithm-architecture co-design leads to the most efficient realization of HT. A detailed experimental study with architectural modifications is presented for a commercial CGRA. The benchmarking results against some recent HT implementations show a 30-40% improvement in performance.
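As background for the abstract above, this is a minimal sketch of textbook Householder QR in pure Python. It shows the untransformed algorithm only; the paper's algorithmic transformations are not reproduced here, and the function name `householder_qr` is an assumption for illustration.

```python
import math


def householder_qr(a):
    """Return (q, r) with a = q * r, using textbook Householder reflections."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already zero from row k downwards
        # Pick the sign that avoids cancellation when forming v = x - alpha*e1.
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        # Apply H = I - 2 v v^T / (v^T v) to rows k.. of r: a dot product per
        # column followed by a scaled subtraction of v.
        for j in range(n):
            dot = sum(v[i] * r[k + i][j] for i in range(m - k))
            f = 2.0 * dot / vtv
            for i in range(m - k):
                r[k + i][j] -= f * v[i]
        # Accumulate q = q * H on columns k.. of q.
        for i in range(m):
            dot = sum(q[i][k + t] * v[t] for t in range(m - k))
            f = 2.0 * dot / vtv
            for t in range(m - k):
                q[i][k + t] -= f * v[t]
    return q, r
```

In this form every update waits on a dot product over the same vector, so the reflection is applied as two dependent phases per column.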

Parallel Processing Letters | 2017

Accelerating BLAS and LAPACK via Efficient Floating Point Architecture Design

Farhad Merchant; Anupam Chattopadhyay; Soumyendu Raha; S. K. Nandy; Ranjani Narayan

Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate performance of the...
