Mehul Tikekar
Massachusetts Institute of Technology
Publication
Featured research published by Mehul Tikekar.
IEEE Journal of Solid-state Circuits | 2014
Mehul Tikekar; Chao-Tsung Huang; Chiraag Juvekar; Vivienne Sze; Anantha P. Chandrakasan
High Efficiency Video Coding, the latest video standard, uses larger and variable-sized coding units and longer interpolation filters than H.264/AVC to better exploit redundancy in video signals. These algorithmic techniques enable a 50% decrease in bitrate at the cost of computational complexity, external memory bandwidth, and, for ASIC implementations, on-chip SRAM of the video codec. This paper describes architectural optimizations for an HEVC video decoder chip. The chip uses a two-stage subpipelining scheme to reduce on-chip SRAM by 56 kbytes, a 32% reduction. A high-throughput read-only cache combined with DRAM-latency-aware memory mapping reduces DRAM bandwidth by 67%. The chip is built for HEVC Working Draft 4 Low Complexity configuration and occupies 1.77 mm² in 40-nm CMOS. It performs 4K Ultra HD 30-fps video decoding at 200 MHz while consuming 1.19 nJ/pixel of normalized system power.
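The figures in this abstract can be related to each other with simple arithmetic. The sketch below is only a back-of-the-envelope check, not taken from the paper: it assumes that "pixel" means one luma pixel of a 3840×2160 frame and that the 32% reduction is measured against the SRAM size before the optimization. The function names are made up for illustration.

```python
# Figures quoted in the abstract above.
WIDTH, HEIGHT, FPS = 3840, 2160, 30   # 4K Ultra HD at 30 fps
ENERGY_PER_PIXEL_J = 1.19e-9          # 1.19 nJ/pixel, normalized system power
SRAM_SAVED_KBYTES = 56                # on-chip SRAM saved by subpipelining
SRAM_SAVED_FRACTION = 0.32            # stated as a 32% reduction


def pixel_rate(width, height, fps):
    """Pixels decoded per second (assumption: luma pixels only)."""
    return width * height * fps


def implied_power_watts(rate, energy_per_pixel_j):
    """Power implied by an energy-per-pixel figure at a given pixel rate."""
    return rate * energy_per_pixel_j


def implied_baseline_sram_kbytes(saved_kbytes, saved_fraction):
    """SRAM size before the reduction, assuming the percentage refers to it."""
    return saved_kbytes / saved_fraction


rate = pixel_rate(WIDTH, HEIGHT, FPS)
power_w = implied_power_watts(rate, ENERGY_PER_PIXEL_J)
baseline_kb = implied_baseline_sram_kbytes(SRAM_SAVED_KBYTES, SRAM_SAVED_FRACTION)

print(f"pixel rate: {rate / 1e6:.1f} Mpixel/s")
print(f"implied normalized system power: {power_w * 1e3:.0f} mW")
print(f"implied SRAM before reduction: {baseline_kb:.0f} kbytes")
```

Under these assumptions the quoted energy corresponds to roughly 0.3 W, and the SRAM budget before the reduction would have been about 175 kbytes.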
design, automation, and test in europe | 2010
Masood Qazi; Mehul Tikekar; Lara Dolecek; Devavrat Shah; Anantha P. Chandrakasan
The impact of process variation in deep-submicron technologies is especially pronounced for SRAM architectures which must meet demands for higher density and higher performance at increased levels of integration. Due to the complex structure of SRAM, estimating the effect of process variation accurately has become very challenging. In this paper, we address this challenge in the context of estimating SRAM timing variation. Specifically, we introduce a method called loop flattening that demonstrates how the evaluation of the timing statistics in the complex, highly structured circuit can be reduced to that of a single chain of component circuits. To then very quickly evaluate the timing delay of a single chain, we employ a statistical method based on importance sampling augmented with targeted, high-dimensional, spherical sampling. Overall, our methodology provides an accurate estimation with 650X or greater speed-up over the nominal Monte Carlo approach.
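To illustrate why importance sampling speeds up rare-event estimation, here is a minimal sketch on a toy problem: the probability that a standard normal variable exceeds a threshold. It uses a plain mean-shifted Gaussian proposal, which is an assumption made for the example; it implements neither the loop flattening step nor the spherical sampling described in the abstract, and the function names are hypothetical.

```python
import math
import random


def tail_probability_exact(threshold):
    """Exact P(X > threshold) for a standard normal X."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


def tail_probability_importance(threshold, n_samples, rng):
    """Estimate P(X > threshold) by sampling from N(threshold, 1).

    Each sample that lands in the failure region is weighted by the
    likelihood ratio N(x; 0, 1) / N(x; shift, 1) = exp(-shift*x + shift^2/2).
    """
    shift = threshold
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(shift, 1.0)
        if x > threshold:
            total += math.exp(-shift * x + 0.5 * shift * shift)
    return total / n_samples


rng = random.Random(0)
exact = tail_probability_exact(4.0)
estimate = tail_probability_importance(4.0, 20000, rng)
print(f"exact:    {exact:.3e}")
print(f"estimate: {estimate:.3e}")
```

With a threshold of 4 the event has a probability of about 3e-5, so plain Monte Carlo with 20,000 samples would usually see no failures at all, while the shifted proposal places roughly half of its samples in the failure region.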
international solid-state circuits conference | 2013
Chao-Tsung Huang; Mehul Tikekar; Chiraag Juvekar; Vivienne Sze; Anantha P. Chandrakasan
The latest video coding standard High Efficiency Video Coding (HEVC) [1] provides 50% improvement in coding efficiency compared to H.264/AVC, to meet the rising demand for video streaming, better video quality and higher resolutions. The coding gain is achieved using more complex tools such as larger and variable-size coding units (CU) in a hierarchical structure, larger transforms and longer interpolation filters. This paper presents an integrated circuit which supports Quad Full HD (QFHD, 3840×2160) video decoding for the HEVC draft standard. It addresses new design challenges for HEVC (“H.265”) with three primary contributions: 1) a system pipelining scheme which adapts to the variable-size largest coding unit (LCU) and provides a two-stage sub-pipeline for memory optimization; 2) unified processing engines to address the hierarchical coding structure and many prediction and transform block sizes in area-efficient ways; 3) a motion compensation (MC) cache which reduces DRAM bandwidth for the LCU and meets the high throughput requirements which are due to the long filters.
international conference on image processing | 2014
Mehul Tikekar; Chao-Tsung Huang; Vivienne Sze; Anantha P. Chandrakasan
High Efficiency Video Coding (HEVC) inverse transform for residual coding uses 2-D 4×4 to 32×32 transforms with higher precision as compared to H.264/AVC's 4×4 and 8×8 transforms, resulting in an increased hardware complexity. In this paper, an energy and area-efficient VLSI architecture of an HEVC-compliant inverse transform and dequantization engine is presented. We implement a pipelining scheme to process all transform sizes at a minimum throughput of 2 pixel/cycle with zero-column skipping for improved throughput. We use data-gating in the 1-D Inverse Discrete Cosine Transform engine to improve energy-efficiency for smaller transform sizes. A high-density SRAM-based transpose memory is used for an area-efficient design. This design supports decoding of 4K Ultra-HD (3840×2160) video at 30 frame/sec. The inverse transform engine takes 98.1 kgate logic, 16.4 kbit SRAM and 10.82 pJ/pixel while the dequantization engine takes 27.7 kgate logic, 8.2 kbit SRAM and 1.10 pJ/pixel in 40 nm CMOS technology. Although larger transforms require more computation per coefficient, they typically contain a smaller proportion of non-zero coefficients. Due to this trade-off, larger transforms can be more energy-efficient.
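The idea behind zero-column skipping can be shown with a separable 2-D inverse transform: the inverse transform of an all-zero column is all zero, so the column pass can skip it without changing the result. The sketch below is an illustration only. It assumes a floating-point orthonormal inverse DCT rather than HEVC's integer transform matrices, and it does not model the paper's pipeline, data-gating, or transpose memory.

```python
import math


def idct_1d(coeffs):
    """Orthonormal 1-D inverse DCT (DCT-III) of a list of coefficients."""
    n = len(coeffs)
    out = []
    for x in range(n):
        s = coeffs[0] / math.sqrt(n)
        for k in range(1, n):
            s += (math.sqrt(2.0 / n) * coeffs[k]
                  * math.cos(math.pi * (2 * x + 1) * k / (2 * n)))
        out.append(s)
    return out


def idct_2d(block, skip_zero_columns=True):
    """Column pass followed by row pass; returns (result, columns processed)."""
    n = len(block)
    intermediate = [[0.0] * n for _ in range(n)]
    columns_processed = 0
    for c in range(n):
        column = [block[r][c] for r in range(n)]
        if skip_zero_columns and not any(column):
            continue  # all-zero column: its inverse transform is all zero
        columns_processed += 1
        transformed = idct_1d(column)
        for r in range(n):
            intermediate[r][c] = transformed[r]
    result = [idct_1d(row) for row in intermediate]
    return result, columns_processed


# Hypothetical 8x8 coefficient block with energy only in the top-left corner.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[1][1] = 64, -12, 9, 3

skipped, cols_skipping = idct_2d(block, skip_zero_columns=True)
full, cols_full = idct_2d(block, skip_zero_columns=False)
print(f"columns processed with skipping: {cols_skipping} of {cols_full}")
```

For this block only two of the eight columns carry non-zero coefficients, so the column pass does a quarter of the work, which is consistent with the abstract's remark that sparse large transforms can be cheap per pixel.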
IEEE Transactions on Very Large Scale Integration Systems | 2014
Chao-Tsung Huang; Mehul Tikekar; Anantha P. Chandrakasan
This paper presents a high-throughput and area-efficient VLSI architecture for intra prediction in the emerging high efficiency video coding standard. Three design techniques are proposed to address the complexity systematically: 1) a hierarchical memory deployment that stores neighboring samples in 4.9 Kb of static RAM (SRAM) instead of 43.2-k gates of registers and increases throughput by processing reference samples in registers; 2) a mode-adaptive scheduling scheme for all prediction units, which provides at least 2 samples/cycle throughput while using low-throughput SRAM and can achieve 2.46 samples/cycle on the average based on the experimental results; and 3) resource sharing for multipliers and the readout circuits of reference sample registers, which can save 2.5-k gates. These techniques can efficiently reduce area by 40% but induce more power because of additional signal transitions. Signal-gating circuits are then applied to reduce 69% of SRAM power and 32% of logic power, which cost only 1.0-k gates. When synthesized at 200 MHz with 40-nm process, the proposed architecture needs only 27.0-k gates and 4.9 Kb of single-port SRAM. The layout core area is 0.036 mm², and the power consumption is 2.11 mW in the postlayout simulation. The corresponding performance can support quad full high-definition (HD) (3840 × 2160) video decoding at 30 frames/s.
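The link between the 2 samples/cycle guarantee and the 3840 × 2160 at 30 frames/s target can be checked with a short calculation. The sketch assumes 4:2:0 chroma subsampling, so that each pixel carries 1.5 samples; the abstract does not state the chroma format, so treat this as an assumption rather than the authors' derivation.

```python
WIDTH, HEIGHT, FPS = 3840, 2160, 30   # quad full HD at 30 frames/s (from the abstract)
CLOCK_HZ = 200_000_000                # 200 MHz synthesis target (from the abstract)
SAMPLES_PER_PIXEL_420 = 1.5           # assumption: 1 luma + 0.5 chroma samples per pixel
GUARANTEED_SAMPLES_PER_CYCLE = 2.0    # worst-case throughput (from the abstract)
AVERAGE_SAMPLES_PER_CYCLE = 2.46      # average throughput (from the abstract)


def required_samples_per_cycle(width, height, fps, samples_per_pixel, clock_hz):
    """Samples that must be produced each clock cycle to keep up with the video."""
    return width * height * fps * samples_per_pixel / clock_hz


required = required_samples_per_cycle(
    WIDTH, HEIGHT, FPS, SAMPLES_PER_PIXEL_420, CLOCK_HZ
)
print(f"required:  {required:.3f} samples/cycle")
print(f"worst-case margin: {GUARANTEED_SAMPLES_PER_CYCLE / required:.2f}x")
print(f"average margin:    {AVERAGE_SAMPLES_PER_CYCLE / required:.2f}x")
```

Under the 4:2:0 assumption the requirement is about 1.87 samples/cycle, just below the guaranteed 2 samples/cycle.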
visual communications and image processing | 2013
Chao-Tsung Huang; Chiraag Juvekar; Mehul Tikekar; Anantha P. Chandrakasan
In this paper, an area-efficient and high-throughput interpolation filter architecture is presented for the latest video coding standard, High Efficiency Video Coding. A unified filter design is first proposed for the 8-tap luma and 4-tap chroma filters to optimize area, which uses only 13 adders. A 2D filter architecture is then devised with adaptive scheduling that supports all symmetric prediction partitions with a throughput of at least two samples/cycle. Experimental results also show that this architecture can achieve 2.58 samples/cycle on average. The total gate count is 45.2k when synthesized at 200 MHz in a 40 nm process, and the corresponding performance can support at least 3840×2160 video at 30 fps.
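For readers unfamiliar with the filters being unified, the sketch below applies the HEVC half-sample luma (8-tap) and chroma (4-tap) coefficients to a 1-D row of samples in software. It is a simplified illustration: edge samples are clamped, and the result is normalized in a single step with 8-bit clipping, whereas the standard keeps higher-precision intermediates for 2-D filtering. It does not represent the 13-adder hardware design in the paper.

```python
# HEVC half-sample interpolation taps; each set sums to 64.
LUMA_HALF_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)
CHROMA_HALF_TAPS = (-4, 36, 36, -4)


def interpolate_half_sample(row, taps, max_value=255):
    """Half-sample value between row[i] and row[i + 1] for every i.

    Samples outside the row are replaced by the nearest edge sample
    (a simplification chosen for this example).
    """
    left = len(taps) // 2 - 1   # integer samples used to the left of row[i]
    out = []
    for i in range(len(row)):
        acc = 0
        for k, coeff in enumerate(taps):
            j = min(max(i - left + k, 0), len(row) - 1)
            acc += coeff * row[j]
        value = (acc + 32) >> 6                     # divide by 64 with rounding
        out.append(min(max(value, 0), max_value))   # clip to the sample range
    return out


ramp = [10 * i for i in range(12)]
print(interpolate_half_sample(ramp, LUMA_HALF_TAPS))
print(interpolate_half_sample(ramp, CHROMA_HALF_TAPS))
```

On a linear ramp both filters return the midpoint of neighbouring samples away from the edges, which is a quick sanity check that the taps are symmetric and normalized.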
european solid state circuits conference | 2016
Priyanka Raina; Mehul Tikekar; Anantha P. Chandrakasan
Camera shake is the leading cause of blur in cell-phone camera images. Removing blur requires deconvolving the blurred image with a kernel which is typically unknown and needs to be estimated from the blurred image. This kernel estimation is computationally intensive and takes several minutes on a CPU, which makes it unsuitable for mobile devices. This work presents the first hardware accelerator for kernel estimation for image deblurring applications. Our approach, using a multi-resolution IRLS deconvolution engine with DFT-based matrix multiplication, a high-throughput image correlator and a high-speed selective-update-based gradient projection solver, achieves a 78× reduction in kernel estimation runtime, and a 56× reduction in total deblurring time for a 1920×1080 image, enabling quick feedback to the user. Configurability in kernel size and number of iterations gives up to 10× energy scalability, allowing the system to trade off runtime against image quality. The test chip, fabricated in 40 nm CMOS, consumes 105 mJ for kernel estimation running at 83 MHz and 0.9 V, making it suitable for integration into mobile devices.
symposium on vlsi circuits | 2017
Mehul Tikekar; Vivienne Sze; Anantha P. Chandrakasan
Data movement to and from off-chip memory dominates energy consumption in most video decoders, with DRAM accesses consuming 2.8x–6x more energy than the processing itself. We present an H.265/HEVC video decoder with embedded DRAM (eDRAM) as main memory. We propose the following techniques to optimize data movement and reduce the power consumption of eDRAM: 1) lossless compression is used to store reference frames in 2x fewer eDRAM banks, reducing refresh power by 33%; 2) eDRAM banks are powered up on-demand to further reduce refresh power by 33%; 3) syntax elements are distributed to four decoder cores in a partially compressed form to reduce decoupling buffer power by 4x. These approaches reduce eDRAM power by 2x in a fully-integrated H.265/HEVC decoder with the lowest reported system power. The decoder chip requires no external components and consumes 24.9–30.6 mW for 1920×1080 video at 24–50 fps.
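The power figures in this abstract can be turned into energy per pixel, and the two refresh-power reductions can be combined, with a few lines of arithmetic. Both steps rest on assumptions the abstract does not confirm: that the lower power figure pairs with 24 fps and the higher with 50 fps, and that the two 33% refresh reductions compound multiplicatively. The sketch is therefore only a rough consistency check.

```python
WIDTH, HEIGHT = 1920, 1080


def energy_per_pixel_nj(power_w, fps, width=WIDTH, height=HEIGHT):
    """Energy per decoded pixel in nJ at a given system power and frame rate."""
    return power_w / (width * height * fps) * 1e9


def compounded_reduction_factor(*fractions):
    """Overall reduction factor if each fractional saving applies to what remains."""
    remaining = 1.0
    for fraction in fractions:
        remaining *= 1.0 - fraction
    return 1.0 / remaining


# Assumed pairing of the range endpoints: 24.9 mW at 24 fps, 30.6 mW at 50 fps.
low_rate_nj = energy_per_pixel_nj(24.9e-3, 24)
high_rate_nj = energy_per_pixel_nj(30.6e-3, 50)
refresh_factor = compounded_reduction_factor(0.33, 0.33)

print(f"energy at 24 fps: {low_rate_nj:.3f} nJ/pixel")
print(f"energy at 50 fps: {high_rate_nj:.3f} nJ/pixel")
print(f"compounded refresh power reduction: {refresh_factor:.2f}x")
```

Under these assumptions the decoder spends roughly 0.3 to 0.5 nJ per pixel, and two compounded 33% savings give a reduction of a little over 2x in refresh power, in line with the 2x eDRAM power reduction the abstract reports.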
IEEE Journal of Solid-state Circuits | 2017
Priyanka Raina; Mehul Tikekar; Anantha P. Chandrakasan
Camera shake is a common cause of blur in cell-phone camera images. Removing blur requires deconvolving the blurred image with a kernel, which is typically unknown and needs to be estimated from the blurred image. This kernel estimation is computationally intensive and takes several minutes on a CPU, which makes it unsuitable for mobile devices. This paper presents the first hardware accelerator for kernel estimation for image deblurring applications. Our approach, using a multi-resolution iteratively reweighted least squares deconvolution engine with DFT-based matrix multiplication, a high-throughput image correlator, and a high-speed selective update-based gradient projection solver, achieves a 78x reduction in kernel estimation runtime, and a 56x reduction in total deblurring time for a 1920×1080
Sze | 2014
Mehul Tikekar; Chao-Tsung Huang; Chiraag Juvekar; Vivienne Sze; Anantha P. Chandrakasan