Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Mahesh Mehendale is active.

Publication


Featured research published by Mahesh Mehendale.


international conference on computer aided design | 1995

Synthesis of multiplier-less FIR filters with minimum number of additions

Mahesh Mehendale; Sunil D. Sherlekar; G. Venkatesh

In this paper we present optimizing transformations to minimize the number of additions and subtractions in both the direct form (Σ A_i X_{n-i} based) and the transposed form (Multiple Constant Multiplication based) implementations of FIR filters. These transformations are based on the iterative elimination of 2-bit common subexpressions in the coefficients' binary representations. We give a detailed description of the algorithms and present results for eight low-pass FIR filters with the number of coefficients ranging from 16 to 128. The results show up to 35% reduction in the number of additions and subtractions for Σ A_i X_{n-i} based FIR filter structures and up to 38% reduction for MCM-based structures.
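
The 2-bit common-subexpression idea can be pictured with a small sketch. This is a simplified illustration, not the paper's algorithm: it treats each coefficient as a set of nonzero bit positions, finds the single most frequent pattern of two set bits at a fixed relative shift, and counts each reuse of that pattern (beyond the one addition needed to build it) as an addition saved. The paper iterates the elimination and also handles signed-digit (add/subtract) coefficient representations.

# Simplified sketch: one round of 2-bit common-subexpression extraction (illustrative only).
from collections import Counter

def set_bits(c):
    """Bit positions of the 1s in a non-negative coefficient."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]

def adds_without_sharing(coeffs):
    """Shift-and-add cost with no sharing: one add per extra nonzero bit."""
    return sum(max(len(set_bits(c)) - 1, 0) for c in coeffs)

def best_2bit_pattern(coeffs):
    """Most frequent distance d between two set bits, i.e. the pattern x + (x << d)."""
    hist = Counter()
    for c in coeffs:
        bits = set_bits(c)
        hist.update(hi - lo for i, lo in enumerate(bits) for hi in bits[i + 1:])
    return hist.most_common(1)[0] if hist else (None, 0)

def savings_one_round(coeffs):
    """Adds saved by building the best 2-bit pattern once and reusing it everywhere else."""
    d, _ = best_2bit_pattern(coeffs)
    if d is None:
        return 0
    uses = 0
    for c in coeffs:
        bits = set_bits(c)
        used = set()
        for lo in bits:
            if lo not in used and lo + d in bits and lo + d not in used:
                used.update((lo, lo + d))
                uses += 1
    return max(uses - 1, 0)  # the first occurrence builds the pattern, the rest reuse it

coeffs = [0b101101, 0b011011, 0b100101]   # toy "filter coefficients"
print(adds_without_sharing(coeffs), savings_one_round(coeffs))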


IEEE Transactions on Very Large Scale Integration Systems | 1998

Low-power realization of FIR filters on programmable DSPs

Mahesh Mehendale; Sunil D. Sherlekar; G. Venkatesh

This paper addresses the problem of reducing the power dissipation of finite impulse response (FIR) filters implemented on programmable digital signal processors (DSPs). We describe a generic DSP architecture and identify the main sources of power dissipation during FIR filtering. We present seven transformations to reduce the power dissipated in one or more of these sources. These transformations complement each other and together operate at the algorithmic, architectural, logic, and layout levels of design abstraction. Each transformation is discussed in detail, and results are presented to highlight its effectiveness. We show that power dissipation can be reduced by more than 40% using these transformations. The transformations have been encapsulated in a framework that provides a comprehensive solution to the low-power realization of FIR filters on programmable DSPs.
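
As a rough illustration of the kind of switching-activity measure such transformations target, the sketch below counts bit toggles (Hamming distance between successive words) on the coefficient data and address buses during one pass of an FIR inner loop. It is a toy proxy under assumed 8-tap coefficients, not the paper's power model, which also accounts for capacitance, voltage, and other sources.

# Toy switching-activity model for one pass over the FIR coefficients (illustrative only).

def hamming(a, b):
    """Number of bit positions that toggle between successive bus words."""
    return bin(a ^ b).count("1")

def bus_toggles(words):
    """Total bit toggles on a bus that carries `words` back to back."""
    return sum(hamming(x, y) for x, y in zip(words, words[1:]))

# 8-tap example: coefficients stored at consecutive addresses and fetched in order.
coeffs = [23, 45, 96, 127, 127, 96, 45, 23]
addresses = list(range(len(coeffs)))

print("coefficient data-bus toggles:", bus_toggles(coeffs))
print("coefficient address-bus toggles:", bus_toggles(addresses))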


IEEE Transactions on Circuits and Systems for Video Technology | 2011

Memory Bandwidth and Power Reduction Using Lossy Reference Frame Compression in Video Encoding

Ajit Deepak Gupte; Bharadwaj Amrutur; Mahesh Mehendale; Ajit Venkat Rao; Madhukar Budagavi

A large external memory bandwidth requirement leads to increased system power dissipation and cost in video coding applications. The majority of external memory traffic in a video encoder is due to reference data accesses. We describe a lossy reference frame compression technique that can be used in video coding with minimal impact on quality while significantly reducing power and bandwidth requirements. The low-cost, transform-less compression technique uses a lossy reference for motion estimation to reduce memory traffic, and a lossless reference for motion compensation (MC) to avoid drift; it is therefore compatible with all existing video standards. We calculate the quantization error bound and show that, by storing the quantization error separately, the bandwidth overhead due to MC can be reduced significantly. The technique meets key requirements specific to video encoding. A 24-39% reduction in peak bandwidth and a 23-31% reduction in total average power consumption are observed for IBBP sequences.
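
A minimal sketch of the store-the-error idea, with assumed details rather than the paper's actual compressor: quantize reference pixels by dropping LSBs for the lossy copy used in motion estimation, and keep the dropped bits separately so that motion compensation can reconstruct the exact pixel and avoid drift.

# Toy "lossy reference + separately stored error" scheme (illustrative only;
# the paper uses a more elaborate low-cost, transform-less compressor).

DROP_BITS = 2  # quantization step = 2**DROP_BITS

def compress(pixels):
    """Split each 8-bit pixel into a lossy part and the quantization error."""
    lossy = [(p >> DROP_BITS) << DROP_BITS for p in pixels]   # used for motion estimation
    error = [p & ((1 << DROP_BITS) - 1) for p in pixels]      # fetched only for motion compensation
    return lossy, error

def reconstruct(lossy, error):
    """Lossless reconstruction for motion compensation: no drift."""
    return [l + e for l, e in zip(lossy, error)]

pixels = [17, 130, 255, 64, 3]
lossy, error = compress(pixels)
assert reconstruct(lossy, error) == pixels
assert all(abs(p - l) < (1 << DROP_BITS) for p, l in zip(pixels, lossy))  # bounded quantization error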


VLSI Signal Processing, VIII | 1995

Coefficient optimization for low power realization of FIR filters

Mahesh Mehendale; S.D. Sherlekar; G. Venkatesh

In this paper we present an algorithm for optimizing the coefficients of a Finite Impulse Response (FIR) filter so as to reduce the power dissipation of its implementation on a programmable Digital Signal Processor. We first identify the sources of power dissipation and show that the power dissipation depends on the total Hamming distance between successive coefficient values. We then present an algorithm that optimizes the coefficients so as to minimize this measure. Experimental results on six FIR filter examples show that the coefficient optimization algorithm yields up to a 36% reduction in the total Hamming distance. This directly translates into a reduction in the power dissipation in the coefficient memory data bus and the multiplier.
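
A minimal sketch of the idea, not the paper's algorithm: perturb each coefficient within a small tolerance so that it lands closer, in Hamming distance, to the coefficient fetched before it. The paper presumably optimizes under constraints on the filter's frequency response, which this toy version ignores.

# Illustrative greedy coefficient tweak: each coefficient may move by at most
# `tol` LSBs; pick the candidate value closest in Hamming distance to the
# previously fetched coefficient.

def hamming(a, b):
    return bin(a ^ b).count("1")

def optimize_coefficients(coeffs, tol=2):
    out = [coeffs[0]]
    for c in coeffs[1:]:
        candidates = range(max(c - tol, 0), c + tol + 1)
        out.append(min(candidates, key=lambda v: hamming(v, out[-1])))
    return out

coeffs = [23, 45, 96, 127, 127, 96, 45, 23]
tweaked = optimize_coefficients(coeffs)
before = sum(hamming(a, b) for a, b in zip(coeffs, coeffs[1:]))
after = sum(hamming(a, b) for a, b in zip(tweaked, tweaked[1:]))
print(before, after)  # total Hamming distance before vs. after tweaking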


international symposium on microarchitecture | 2014

Bi-Modal DRAM Cache: A Scalable and Effective Die-Stacked DRAM Cache

Nagendra Gulur; Mahesh Mehendale; R. Manikantan; R. Govindarajan

In this paper, we present the Bi-Modal Cache, a flexible stacked DRAM cache organization which simultaneously achieves several objectives: (i) improved cache hit ratio, (ii) moving the tag storage overhead to DRAM, (iii) lower cache hit latency than tags-in-SRAM, and (iv) reduction in off-chip bandwidth wastage. The Bi-Modal Cache addresses the miss rate versus off-chip bandwidth dilemma by organizing the data in a bi-modal fashion: blocks with high spatial locality are organized as large blocks and those with little spatial locality as small blocks. By adaptively selecting the right granularity of storage for individual blocks at run time, the proposed DRAM cache organization makes judicious use of the available DRAM cache capacity and reduces off-chip memory bandwidth consumption. The Bi-Modal Cache improves cache hit latency, despite moving the metadata to DRAM, by means of a small SRAM-based Way Locator. Further, by leveraging the tremendous internal bandwidth and capacity that stacked DRAM organizations provide, the Bi-Modal Cache enables efficient concurrent accesses to tags and data to reduce hit time. Through detailed simulations, we demonstrate that the Bi-Modal Cache achieves an overall performance improvement (in terms of Average Normalized Turnaround Time (ANTT)) of 10.8%, 13.8% and 14.0% in 4-core, 8-core and 16-core workloads, respectively.
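
The granularity decision can be caricatured as follows. This is a toy sketch under assumed parameters (512 B large blocks, 64 B small blocks, a 50% touch threshold), not the paper's policy: track which small sub-blocks of a large region were actually touched, and cache the region as one large block only if its spatial locality is high enough.

# Toy granularity selector for a bi-modal DRAM cache (illustrative only; all parameters assumed).

LARGE_BLOCK = 512
SMALL_BLOCK = 64
SUBBLOCKS = LARGE_BLOCK // SMALL_BLOCK
THRESHOLD = 0.5

def touched_subblocks(addresses, region_base):
    """Which 64 B sub-blocks of the 512 B region at region_base were accessed."""
    return {
        (a - region_base) // SMALL_BLOCK
        for a in addresses
        if region_base <= a < region_base + LARGE_BLOCK
    }

def pick_granularity(addresses, region_base):
    """'large' if the region shows enough spatial locality, else 'small'."""
    used = len(touched_subblocks(addresses, region_base))
    return "large" if used / SUBBLOCKS >= THRESHOLD else "small"

trace = [0x1000, 0x1040, 0x1080, 0x10C0, 0x1100, 0x1140]   # dense accesses
print(pick_granularity(trace, 0x1000))                      # -> 'large'
print(pick_granularity([0x2000, 0x2040], 0x2000))           # -> 'small'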


international solid-state circuits conference | 2012

A true multistandard, programmable, low-power, full HD video-codec engine for smartphone SoC

Mahesh Mehendale; Subrangshu Das; Mohit Sharma; Mihir Mody; Ratna M. V. Reddy; Joseph Patrick Meehan; Hideo Tamama; Brian Carlson; Mike Polley

In this paper, we present IVA-HD, a true multistandard, programmable, full HD video coding engine which adopts optimal hardware-software partitioning to achieve the low-power and area requirements of the OMAP 4 processor. Unlike the approach of using separate IPs for the encoder and decoder, IVA-HD uses an integrated codec engine which is area efficient, as most of the decoder logic is reused for the encoder. IVA-HD is architected to perform stream-rate and pixel-rate processing in a single pipeline (processing one 16x16 macroblock at a time), so as to support the latency requirements of video conferencing.


international conference on computer aided design | 2001

Area and power reduction of embedded DSP systems using instruction compression and re-configurable encoding

Subash G. Chandar; Mahesh Mehendale; R. Govindarajan

In embedded control applications, system cost and power/energy consumption are key considerations. In such applications, program memory forms a significant part of the chip area, so reducing code size reduces the system cost significantly. A significant part of the total power is consumed in fetching instructions from the program memory, so reducing instruction fetch power has been a key target for reducing power consumption. To reduce cost and power consumption, embedded systems in these applications use application-specific processors that are fine-tuned to provide better solutions in terms of code density and power consumption. Further fine-tuning to suit each particular application in the targeted class can be achieved through reconfigurable architectures. In this paper, we propose a reconfiguration mechanism, called the Instruction Re-map Table, to re-map instructions to shorter codewords. Using this mechanism, frequently used sets of instructions can be compressed, which reduces code size and hence cost. Secondly, we use the same mechanism to target power reduction by encoding frequently used instruction sequences as Gray codes. Such encodings, along with instruction compression, reduce the instruction fetch power. We enhance the Texas Instruments TMS320C27x DSP core to incorporate this mechanism and evaluate the improvements in code size and instruction fetch energy using real-life embedded control application programs as benchmarks. Our scheme reduces the code size by over 10% and the energy consumed by over 40%.
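
The two mechanisms can be sketched together. This is a simplified illustration with assumed widths and table size, not the TMS320C27x implementation: a small re-map table assigns short codewords to the most frequent instructions, and Gray encoding of those codewords keeps successive fetches of neighbouring table entries a single bit apart.

# Toy instruction re-map table plus Gray-coded short codewords (illustrative only;
# widths, table size, and the frequency statistics are assumptions, and the size
# estimate ignores the escape/mode bits a real design needs to mix 3-bit and 16-bit words).

from collections import Counter

TABLE_SIZE = 8          # number of re-mappable instructions
SHORT_BITS = 3          # codeword width for re-mapped instructions
FULL_BITS = 16          # assumed native instruction width

def gray(n):
    """Standard binary-reflected Gray code."""
    return n ^ (n >> 1)

def build_remap_table(program):
    """Map the most frequent instructions to Gray-coded short codewords."""
    most_common = [instr for instr, _ in Counter(program).most_common(TABLE_SIZE)]
    return {instr: gray(i) for i, instr in enumerate(most_common)}

def compressed_size_bits(program, table):
    """Code size if re-mapped instructions use SHORT_BITS and the rest stay full width."""
    return sum(SHORT_BITS if instr in table else FULL_BITS for instr in program)

program = [0x1A2B, 0x1A2B, 0x0F00, 0x1A2B, 0x0F00, 0x3C44, 0x1A2B]
table = build_remap_table(program)
print(compressed_size_bits(program, table), "bits vs", FULL_BITS * len(program), "bits uncompressed")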


asia and south pacific design automation conference | 1995

Techniques for low power realization of FIR filters

Mahesh Mehendale; Sunil D. Sherlekar; G. Venkatesh

We propose techniques for the low power realization of FIR filters on programmable DSPs. We first analyse the FIR implementation to arrive at useful measures for reducing power and present techniques that exploit these measures. We then identify limitations of existing DSP architectures in implementing these techniques and propose simple architectural extensions to overcome them. Finally, we present experimental results on real FIR filter examples that show up to 88% reduction in coefficient memory data bus power and up to 49% reduction in coefficient memory address bus power.
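
One way to picture a coefficient-ordering transformation of this kind is the greedy nearest-neighbour sketch below, which is not necessarily the heuristic used in the paper: fetch coefficients in an order that keeps consecutive values close in Hamming distance, which is legal because the taps of the convolution can be accumulated in any order.

# Greedy nearest-neighbour ordering of FIR coefficients to reduce toggling on the
# coefficient memory data bus (illustrative only).

def hamming(a, b):
    return bin(a ^ b).count("1")

def total_toggles(seq):
    return sum(hamming(x, y) for x, y in zip(seq, seq[1:]))

def reorder(coeffs):
    remaining = list(coeffs)
    order = [remaining.pop(0)]                 # start from the first coefficient
    while remaining:
        nxt = min(remaining, key=lambda c: hamming(c, order[-1]))
        remaining.remove(nxt)
        order.append(nxt)
    return order

coeffs = [23, 45, 96, 127, 127, 96, 45, 23]
reordered = reorder(coeffs)
print(total_toggles(coeffs), "->", total_toggles(reordered))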


international conference on vlsi design | 2000

Low power realization of residue number system based FIR filters

M. N. Mahesh; Mahesh Mehendale

In this paper, we present algorithmic and architectural transforms for the low power realization of Residue Number System (RNS) based FIR filters. These transforms have been systematically derived so as to achieve power reduction through voltage scaling, switched capacitance reduction and reduction in signal activity. We show how some of the existing techniques can be suitably adapted to RNS-based implementations and also propose new techniques that exploit the specific properties of RNS-based computation. We present results to show the effectiveness of our techniques. The results for modulo-5 and modulo-7 indicate that, using just two of these techniques (coefficient encoding and coefficient ordering), a power reduction of up to 33% can be achieved.
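
A minimal sketch of RNS-based FIR computation, using the moduli set {5, 7, 8} purely as an assumption: each sample and coefficient is reduced modulo each modulus, the multiply-accumulate runs independently per modulus on small operands, and the final result is recovered with the Chinese Remainder Theorem.

# Toy RNS-based FIR dot product (illustrative only; the moduli set and the tiny example
# are assumptions, and real designs also handle dynamic range and signed values).

MODULI = (5, 7, 8)   # pairwise coprime -> dynamic range 5 * 7 * 8 = 280

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_dot(coeffs, samples):
    """Multiply-accumulate carried out independently in each residue channel."""
    acc = [0] * len(MODULI)
    for c, x in zip(coeffs, samples):
        for i, m in enumerate(MODULI):
            acc[i] = (acc[i] + (c % m) * (x % m)) % m
    return tuple(acc)

def from_rns(residues):
    """Chinese Remainder Theorem reconstruction."""
    M = 1
    for m in MODULI:
        M *= m
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)    # modular inverse of Mi mod m
    return total % M

coeffs, samples = [1, 2, 3], [4, 5, 6]
assert from_rns(rns_dot(coeffs, samples)) == sum(c * x for c, x in zip(coeffs, samples))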


international conference on vlsi design | 1997

Area-delay tradeoff in distributed arithmetic based implementation of FIR filters

Mahesh Mehendale; Sunil D. Sherlekar; G. Venkatesh

In this paper we present the tradeoff between coefficient memory size and the number of additions in distributed arithmetic (DA) based implementations of FIR filters. Such a capability is key to exploring a wider search space during system-level design. We present two techniques, based on multiple memory banks and multirate architectures, to achieve this tradeoff. These techniques, along with 1-bit-at-a-time and 2-bits-at-a-time data access mechanisms, enable as many as 16 different data points in the area-delay space. We present analytical expressions to compute the coefficient memory size and number of additions for these implementations. We present results for all 16 DA-based implementations of three FIR filters with two values of input data precision, along with the resultant area-delay curves for these filters.
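
The 1-bit-at-a-time DA scheme can be sketched as follows, in a simplified unsigned form; the paper also covers 2-bits-at-a-time access, multiple memory banks, and multirate variants. A LUT of 2^N entries stores every partial sum of the N coefficients, and the output is assembled by shift-adding the LUT entries selected by one bit-plane of the input samples at a time.

# Distributed-arithmetic FIR, 1-bit-at-a-time, unsigned inputs (illustrative only;
# practical DA implementations use offset-binary coding for signed data and may
# split the LUT across multiple memory banks).

N_TAPS = 4
INPUT_BITS = 8

def build_lut(coeffs):
    """2**N_TAPS entries: entry b holds the sum of coefficients selected by the bits of b."""
    return [sum(c for i, c in enumerate(coeffs) if (b >> i) & 1)
            for b in range(1 << N_TAPS)]

def da_output(samples, lut):
    """One FIR output: shift-add the LUT entries addressed by each input bit-plane."""
    y = 0
    for bit in range(INPUT_BITS):
        address = sum(((samples[i] >> bit) & 1) << i for i in range(N_TAPS))
        y += lut[address] << bit
    return y

coeffs = [3, -1, 4, 2]
samples = [10, 200, 55, 7]          # the N most recent input samples
lut = build_lut(coeffs)
assert da_output(samples, lut) == sum(c * x for c, x in zip(coeffs, samples))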

Collaboration


Dive into Mahesh Mehendale's collaborations.

Top Co-Authors

R. Govindarajan

Indian Institute of Science

G. Venkatesh

Indian Institute of Technology Bombay
