Basant K. Mohanty

Jaypee University of Engineering and Technology

Publication

Featured research published by Basant K. Mohanty.

IEEE Transactions on Circuits and Systems for Video Technology | 2014

Efficient Integer DCT Architectures for HEVC

Pramod Kumar Meher; Sang Yoon Park; Basant K. Mohanty; Khoon Seong Lim; Chuohao Yeo

In this paper, we present area- and power-efficient architectures for the implementation of integer discrete cosine transform (DCT) of different lengths to be used in High Efficiency Video Coding (HEVC). We show that an efficient constant matrix-multiplication scheme can be used to derive parallel architectures for 1-D integer DCT of different lengths. We also show that the proposed structure could be reusable for DCT of lengths 4, 8, 16, and 32 with a throughput of 32 DCT coefficients per cycle irrespective of the transform size. Moreover, the proposed architecture could be pruned to reduce the complexity of implementation substantially with only a marginal effect on the coding performance. We propose power-efficient structures for folded and full-parallel implementations of 2-D DCT. From the synthesis result, it is found that the proposed architecture involves nearly 14% less area-delay product (ADP) and 19% less energy per sample (EPS) compared to the direct implementation of the reference algorithm, on average, for integer DCT of lengths 4, 8, 16, and 32. Also, an additional 19% saving in ADP and 20% saving in EPS can be achieved by the proposed pruning algorithm with nearly the same throughput rate. The proposed architecture is found to support ultrahigh-definition 7680 × 4320 video at 60 frames/s, which is one of the applications of HEVC.
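To make the idea of a constant matrix multiplication concrete, the following minimal sketch applies the 4-point HEVC core transform matrix to one input vector, first as a direct matrix product and then with an even-odd (butterfly) decomposition that shares the additions. It is a software illustration only and not the architecture proposed in the paper; the function names are hypothetical, and the scaling and rounding shifts of a real codec are left out.

```python
# 4-point HEVC core transform matrix (integer approximation of the DCT).
C4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def dct4_direct(x):
    """Direct constant matrix-vector product: 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in C4]


def dct4_butterfly(x):
    """Even-odd decomposition: the same result with 6 multiplications."""
    e0, e1 = x[0] + x[3], x[1] + x[2]   # even part (symmetric sums)
    o0, o1 = x[0] - x[3], x[1] - x[2]   # odd part (antisymmetric differences)
    return [
        64 * (e0 + e1),
        83 * o0 + 36 * o1,
        64 * (e0 - e1),
        36 * o0 - 83 * o1,
    ]


if __name__ == "__main__":
    sample = [1, 2, 3, 4]
    print(dct4_direct(sample))
    print(dct4_butterfly(sample))
```

Both functions return the same coefficients; the butterfly form simply exposes the symmetry of the matrix rows, which is the kind of structure a hardware design can exploit.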

IEEE Transactions on Circuits and Systems II: Express Briefs | 2014

Area–Delay–Power Efficient Carry-Select Adder

Basant K. Mohanty; Sujit Kumar Patel

In this brief, the logic operations involved in the conventional carry-select adder (CSLA) and the binary to excess-1 converter (BEC)-based CSLA are analyzed to study the data dependence and to identify redundant logic operations. We have eliminated all the redundant logic operations present in the conventional CSLA and proposed a new logic formulation for CSLA. In the proposed scheme, the carry-select (CS) operation is scheduled before the calculation of the final sum, which is different from the conventional approach. Bit patterns of two anticipating carry words (corresponding to cin = 0 and 1) and fixed cin bits are used for logic optimization of the CS and generation units. An efficient CSLA design is obtained using optimized logic units. The proposed CSLA design involves significantly less area and delay than the recently proposed BEC-based CSLA. Due to the small carry-output delay, the proposed CSLA design is a good candidate for square-root (SQRT) CSLA. A theoretical estimate shows that the proposed SQRT-CSLA involves nearly 35% less area-delay product (ADP) than the BEC-based SQRT-CSLA, which is the best among the existing SQRT-CSLA designs, on average, for different bit-widths. The application-specific integrated circuit (ASIC) synthesis result shows that the BEC-based SQRT-CSLA design involves 48% more ADP and consumes 50% more energy than the proposed SQRT-CSLA, on average, for different bit-widths.
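For readers unfamiliar with the baseline, the sketch below models the conventional carry-select principle at bit level: each group of bits is added twice, once assuming cin = 0 and once assuming cin = 1, and the incoming carry then selects one of the two results. It illustrates only the conventional CSLA, not the optimized logic formulation of the brief; the group size, word width, and function names are assumptions made for the example.

```python
def to_bits(value, width):
    """Little-endian list of the low `width` bits of a non-negative int."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_add(a_bits, b_bits, cin):
    """Ripple-carry addition of two equal-length bit lists."""
    total, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        total.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return total, carry


def carry_select_add(a, b, width=16, group=4, cin=0):
    """Conventional carry-select addition; returns (sum, carry_out)."""
    a_bits, b_bits = to_bits(a, width), to_bits(b, width)
    out, carry = [], cin
    for lo in range(0, width, group):
        ga, gb = a_bits[lo:lo + group], b_bits[lo:lo + group]
        sum0, cout0 = ripple_add(ga, gb, 0)   # anticipated result for cin = 0
        sum1, cout1 = ripple_add(ga, gb, 1)   # anticipated result for cin = 1
        # The actual group carry-in acts as the multiplexer select line.
        out.extend(sum1 if carry else sum0)
        carry = cout1 if carry else cout0
    return from_bits(out), carry


if __name__ == "__main__":
    print(carry_select_add(12345, 54321))
```

In hardware the two anticipated additions of every group run in parallel, so the critical path is one group addition plus a chain of multiplexers rather than a full-width ripple.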

IEEE Transactions on Signal Processing | 2011

Memory Efficient Modular VLSI Architecture for High-Throughput and Low-Latency Implementation of Multilevel Lifting 2-D DWT

Basant K. Mohanty; Pramod Kumar Meher

In this paper, we present a modular and pipeline architecture for lifting-based multilevel 2-D DWT, without using line-buffer and frame-buffer. Overall area-delay product is reduced in the proposed design by appropriate partitioning and scheduling of the computation of individual decomposition-levels. The processing for different levels is performed by a cascaded pipeline structure to maximize the hardware utilization efficiency (HUE). Moreover, the proposed structure is scalable for high-throughput and area-constrained implementation. We have removed all the redundancies resulting from decimated wavelet filtering to maximize the HUE. The proposed design involves L pyramid algorithm (PA) units and one recursive pyramid algorithm (RPA) unit, where R = N/P, L = ⌈log_4 P⌉, P is the input block size, and M and N are, respectively, the height and width of the image. The entire multilevel DWT is computed by the proposed structure in MR cycles. The proposed structure has O(8R × 2^L) cycles of output latency, which is very small compared to the latency of the existing structures. Interestingly, the proposed structure does not require any line-buffer or frame-buffer, unlike the existing folded structures, which otherwise require a line-buffer of size O(N) and a frame-buffer of size O(M/2 × N/2) for multilevel 2-D computation. Instead of those buffers, the proposed structure involves only local registers and RAM of size O(N). The saving of line-buffer and frame-buffer achieved by the proposed design is an important advantage, since the image size could very often be as large as 512 × 512. From the simulation results, we find that the proposed scalable structure offers better slice-delay product (SDP) for higher throughput of implementation, since the on-chip memory of this structure remains almost unchanged with input block size. It has 17% less SDP than the best of the corresponding existing structures on average, for different input-block sizes and image sizes. It involves 1.92 times more transistors, but offers 12.2 times higher throughput and consumes 52% less power per output (PPO) compared to the other, on average for different input sizes.
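As background on what a lifting-based multilevel DWT computes, here is a minimal 1-D sketch: one predict step and one update step per level, with the approximation output of each level fed to the next. The abstract does not name the wavelet filter, so the reversible 5/3 filter with symmetric boundary extension is an assumption chosen for brevity; the function names are likewise hypothetical, and nothing here models the proposed hardware structure.

```python
def lift53_forward(x):
    """One level of 1-D reversible 5/3 lifting on an even-length int list."""
    assert len(x) % 2 == 0 and len(x) >= 2
    even, odd = x[0::2], x[1::2]
    half = len(even)
    # Predict: detail = odd sample minus the mean of its even neighbours.
    d = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
         for i in range(half)]
    # Update: approximation = even sample plus a quarter of nearby details.
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
         for i in range(half)]
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward by running the two steps in reverse order."""
    half = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    odd = [d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
           for i in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x


def multilevel_forward(x, levels):
    """Apply the lifting step repeatedly to the approximation band."""
    approx, details = list(x), []
    for _ in range(levels):
        approx, d = lift53_forward(approx)
        details.append(d)
    return approx, details


if __name__ == "__main__":
    signal = [3, 7, 1, 8, 2, 9, 4, 6]
    print(multilevel_forward(signal, 2))
```

A 2-D transform applies the same step along rows and then columns (or the reverse), which is where the line-buffer and frame-buffer requirements discussed in the abstract arise.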

IEEE Transactions on Signal Processing | 2013

A High-Performance Energy-Efficient Architecture for FIR Adaptive Filter Based on New Distributed Arithmetic Formulation of Block LMS Algorithm

Basant K. Mohanty; Pramod Kumar Meher

In this paper, we present an efficient distributed-arithmetic (DA) formulation for the implementation of the block least mean square (BLMS) algorithm. The proposed DA-based design uses a novel look-up table (LUT)-sharing technique for the computation of filter outputs and weight-increment terms of the BLMS algorithm. Besides, it offers significant saving of adders, which constitute a major component of DA-based structures. Also, we have suggested a novel LUT-based weight updating scheme for the BLMS algorithm, where only one set of LUTs out of M sets needs to be modified in every iteration, where N = ML, and N and L are, respectively, the filter length and input block size. Based on the proposed DA formulation, we have derived a parallel architecture for the implementation of the BLMS adaptive digital filter (ADF). Compared with the best of the existing DA-based LMS structures, the proposed one involves nearly L/6 times the adders and L times the LUT words, and offers nearly L times the throughput of the other. It requires nearly 25% more flip-flops and does not involve variable shifters like those of existing structures. It involves less LUT access per output (LAPO) than the existing structure for block sizes higher than 4. For block size 8 and filter length 64, the proposed structure involves 2.47 times more adders, 15% more flip-flops, and 43% less LAPO than the best of the existing structures, and offers 5.22 times higher throughput. The number of adders of the proposed structure does not increase proportionately with block size, and the number of flip-flops is independent of block size. This is a major advantage of the proposed structure for reducing its area-delay product (ADP), particularly when a large-order ADF is implemented for higher block sizes. ASIC synthesis result shows that the proposed structure for filter length 64 has almost 14% and 30% less ADP and 25% and 37% less EPO than the best of the existing structures for block sizes 4 and 8, respectively.
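The role of the LUTs is easier to see in a textbook distributed-arithmetic inner product, sketched below: the multiplications are replaced by table look-ups addressed with one bit from every input sample, followed by shift-and-accumulate. This is the generic DA scheme only, under the assumption of two's-complement samples of a fixed word length; it does not reproduce the LUT-sharing or weight-update techniques of the paper, and all names are illustrative.

```python
def build_lut(weights):
    """LUT entry `addr` holds the sum of the weights whose address bit is 1."""
    n = len(weights)
    return [sum(w for k, w in enumerate(weights) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_inner_product(weights, samples, bits=8):
    """Compute sum(w * x) bit-serially for `bits`-bit two's-complement x."""
    lut = build_lut(weights)
    acc = 0
    for b in range(bits):
        # Form the LUT address from bit b of every sample.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        # The most significant bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc


if __name__ == "__main__":
    w = [3, -1, 4, 2]
    x = [5, -7, 100, -128]
    print(da_inner_product(w, x), sum(wi * xi for wi, xi in zip(w, x)))
```

The LUT grows as 2 to the power of the number of taps it covers, which is why practical DA designs split a long filter into several small LUT sets.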

IEEE Transactions on Circuits and Systems for Video Technology | 2013

Memory-Efficient High-Speed Convolution-Based Generic Structure for Multilevel 2-D DWT

Basant K. Mohanty; Pramod Kumar Meher

In this paper, we have proposed a design strategy for the derivation of memory-efficient architecture for multilevel 2-D DWT. Using the proposed design scheme, we have derived a convolution-based generic architecture for the computation of three-level 2-D DWT based on Daubechies (Daub) as well as biorthogonal filters. The proposed structure does not involve a frame-buffer. It involves line-buffers of size 3(K-2)M/4, which is independent of throughput rate, where K is the order of the Daubechies/biorthogonal wavelet filter and M is the image height. This is a major advantage when the structure is implemented for higher throughput. The structure has regular data-flow, small cycle period T_M, and 100% hardware utilization efficiency. As per theoretical estimate, for image size 512 × 512, the proposed structure for the Daub-4 filter requires 152 more multipliers and 114 more adders, but involves 82,412 fewer memory words and takes 10.5 times less time to compute three-level 2-D DWT than the best of the existing convolution-based folded structures. Similarly, compared with the best of the existing lifting-based folded structures, the proposed structure for the 9/7 filter involves 93 more multipliers and 166 more adders, but uses 85,317 fewer memory words and requires 2.625 times less computation time for the same image size. It involves 90 (nearly 47.6%) more multipliers and 118 (nearly 40.1%) more adders, but requires 2723 fewer memory words than the recently proposed parallel structure and performs the computation in nearly half the time of the other. In spite of having more arithmetic components than the lifting-based structures, the proposed structure offers significant saving of area and power over the other due to substantial reduction in memory size and smaller clock period. ASIC synthesis result shows that the proposed structure for Daub-4 involves 1.7 times less area-delay product (ADP) and consumes 1.21 times less energy per image (EPI) than the corresponding best available convolution-based structure. It involves 2.6 times less ADP and consumes 1.48 times less EPI than the parallel lifting-based structure.

IEEE Transactions on Circuits and Systems II: Express Briefs | 2012

Area- and Power-Efficient Architecture for High-Throughput Implementation of Lifting 2-D DWT

Basant K. Mohanty; Anurag Mahajan; Pramod Kumar Meher

We have suggested a new data-access scheme for the computation of lifting two-dimensional (2-D) discrete wavelet transform (DWT) without using data transposition. We have derived a linear systolic array directly from the dependence graph (DG) and a 2-D systolic array from a suitably segmented DG for parallel and pipeline implementation of 1-D DWT. These two systolic arrays are used as building blocks to derive the proposed transposition-free structure for lifting 2-D DWT. The proposed structure requires only a small on-chip memory of (4N + 8P) words and processes a block of P samples in every cycle, where N is the image width. Moreover, it has a small output latency of nine cycles and does not require control signals, which are commonly used in most of the existing DWT structures. Compared with the best of the existing high-throughput structures, the proposed structure requires the same arithmetic resources but involves 1.5N less on-chip memory and offers the same throughput rate. ASIC synthesis result shows that the proposed structure for block size 8 and image size 512 × 512 involves 28% less area, 35% less area-delay product, and 27% less energy per image than the best of the corresponding existing structures. Apart from that, the proposed structure is regular and modular, and it can be easily configured for different block sizes.

IEEE Transactions on Very Large Scale Integration Systems | 2015

FPGA Implementation of Orthogonal Matching Pursuit for Compressive Sensing Reconstruction

Hassan Rabah; Abbes Amira; Basant K. Mohanty; Somaya Al-Maadeed; Pramod Kumar Meher

In this paper, we present a novel architecture based on field-programmable gate arrays (FPGAs) for the reconstruction of compressively sensed signals using the orthogonal matching pursuit (OMP) algorithm. We have analyzed the computational complexities and data dependence between different stages of the OMP algorithm to design an architecture that provides higher throughput with less area consumption. Since the solution of the least-squares problem takes a large part of the overall computation time, we have suggested a parallel low-complexity architecture for the solution of the linear system. We have further modeled the proposed design using Simulink and carried out the implementation on FPGA using the Xilinx System Generator tool. We have presented here a methodology to optimize both area and execution time in the Simulink environment. The execution time of the proposed design is reduced by maximizing parallelism through an appropriate level of unfolding, while the FPGA resources are reduced by sharing the hardware for matrix-vector multiplication across the data-dependent sections of the algorithm. The hardware implementation on the Virtex6 FPGA provides significantly superior performance in terms of resource utilization, measured in the number of occupied slices, and maximum usable frequency compared with the existing implementations. Compared with the existing similar design, the proposed structure involves 328 more DSP48s, but it involves 25,802 fewer slices and 1.85 times less computation time for signal reconstruction with N = 1024, K = 256, and m = 36, where N is the number of samples, K is the size of the measurement vector, and m is the sparsity. It also provides a higher peak signal-to-noise ratio value of 38.9 dB with a reconstruction time of 0.34 μs, which is twice as fast as the existing design. In addition, we have presented a performance metric to implement the OMP algorithm in resource-constrained FPGAs for better quality of signal reconstruction.
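The two stages the abstract distinguishes, column selection by correlation and the least-squares solve, can be seen in the following plain-Python sketch of the standard OMP iteration. It is a reference-style illustration, not the FPGA design: the least-squares step is solved through the normal equations with Gaussian elimination purely to keep the example self-contained, the measurement matrix is passed as a list of columns, and every name is an assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def omp(columns, y, sparsity):
    """Recover a sparse x with sum_j x[j] * columns[j] approximately y."""
    residual, support, coeffs = list(y), [], []
    for _ in range(sparsity):
        # Stage 1: pick the unused column most correlated with the residual.
        candidates = [j for j in range(len(columns)) if j not in support]
        support.append(max(candidates,
                           key=lambda j: abs(dot(columns[j], residual))))
        # Stage 2: least squares on the selected columns (normal equations).
        gram = [[dot(columns[i], columns[j]) for j in support]
                for i in support]
        coeffs = solve(gram, [dot(columns[i], y) for i in support])
        # Stage 3: update the residual with the new estimate.
        residual = [yi - sum(c * columns[j][r]
                             for c, j in zip(coeffs, support))
                    for r, yi in enumerate(y)]
    x = [0.0] * len(columns)
    for c, j in zip(coeffs, support):
        x[j] = c
    return x


if __name__ == "__main__":
    cols = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.6, 0.8, 0.0]]
    print(omp(cols, [3.0, 0.0, 2.0], sparsity=2))
```

The Gram matrix grows by one row and column per iteration, so the least-squares stage dominates the cost as the support grows, which matches the motivation given in the abstract for accelerating the linear solve.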

IEEE Transactions on Circuits and Systems II: Express Briefs | 2008

Hardware-Efficient Systolic-Like Modular Design for Two-Dimensional Discrete Wavelet Transform

Pramod Kumar Meher; Basant K. Mohanty; Jagdish Chandra Patra

A systolic-like modular architecture is presented for hardware-efficient implementation of two-dimensional (2-D) discrete wavelet transform (DWT). The overall computation is decomposed into two distinct stages, where column processing is performed in stage-1, while row processing is performed in stage-2. Using a new data-access scheme and a novel folding technique, the computation of both stages is performed concurrently for transposition-free implementation of 2-D DWT. The proposed design can offer nearly the same throughput rate, and requires the same or fewer adders and multipliers than the best of the existing structures. The storage space is found to occupy most of the area in the existing 2-D DWT structures, but the proposed structure does not require any on-chip or off-chip storage of input samples or storage/transposition of intermediate output. The proposed one, therefore, involves considerably less hardware complexity compared with the existing structures. Apart from that, it has a shorter cycle period than the existing structures, and has a latency of cycles while all the existing structures have latency of cycles, the filter order being small compared to the input size .

IEEE Transactions on Circuits and Systems | 2014

Memory Footprint Reduction for Power-Efficient Realization of 2-D Finite Impulse Response Filters

Basant K. Mohanty; Pramod Kumar Meher; Somaya Al-Maadeed; Abbes Amira

We have analyzed memory footprint and combinational complexity to arrive at a systematic design strategy to derive area-delay-power-efficient architectures for two-dimensional (2-D) finite impulse response (FIR) filters. We have presented novel block-based structures for separable and non-separable filters with less memory footprint by memory sharing and memory reuse along with appropriate scheduling of computations and design of storage architecture. The proposed structures involve L times less storage per output (SPO), and nearly L times less energy consumption per output (EPO) compared with the existing structures, where L is the input block size. They involve L times more arithmetic resources than the best of the corresponding existing structures, and produce L times more throughput with less memory bandwidth (MBW) than others. We have also proposed separate generic structures for separable and non-separable filter banks, and a unified structure of filter bank constituting symmetric and general filters. The proposed unified structure for 6 parallel filters involves nearly 3.6L times more multipliers, 3L times more adders, and (N^2 - N + 2) fewer registers than the similar existing unified structure, and computes 6L times more filter outputs per cycle with 6L times less MBW than the existing design, where N is the FIR filter size in each dimension. ASIC synthesis result shows that for filter size (4 × 4), input block size L = 4, and image size (512 × 512), the proposed block-based non-separable and generic non-separable structures, respectively, involve 5.95 times and 11.25 times less area-delay product (ADP), and 5.81 times and 15.63 times less EPO than the corresponding existing structures. The proposed unified structure involves 4.64 times less ADP and 9.78 times less EPO than the corresponding existing structure.

IEEE Transactions on Very Large Scale Integration Systems | 2016

A High-Performance FIR Filter Architecture for Fixed and Reconfigurable Applications

Basant K. Mohanty; Pramod Kumar Meher

Transpose-form finite-impulse response (FIR) filters are inherently pipelined and support the multiple constant multiplication (MCM) technique, which results in significant saving of computation. However, the transpose-form configuration does not directly support block processing, unlike the direct-form configuration. In this paper, we explore the possibility of realizing a block FIR filter in transpose-form configuration for area-delay efficient realization of large-order FIR filters for both fixed and reconfigurable applications. Based on a detailed computational analysis of the transpose-form configuration of the FIR filter, we have derived a flow graph for the transpose-form block FIR filter with optimized register complexity. A generalized block formulation is presented for the transpose-form FIR filter. We have derived a general multiplier-based architecture for the proposed transpose-form block filter for reconfigurable applications. A low-complexity design using the MCM scheme is also presented for the block implementation of fixed FIR filters. The proposed structure involves significantly less area-delay product (ADP) and less energy per sample (EPS) than the existing block implementation of the direct-form structure for medium or large filter lengths, while for short-length filters, the block implementation of the direct-form FIR structure has less ADP and less EPS than the proposed structure. Application-specific integrated circuit synthesis result shows that the proposed structure for block size 4 and filter length 64 involves 42% less ADP and 40% less EPS than the best available FIR filter structure proposed for reconfigurable applications. For the same filter length and the same block size, the proposed structure involves 13% less ADP and 12.8% less EPS than that of the existing direct-form block FIR structure.
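The difference between the two configurations can be sketched in a few lines. In the model below, the direct form sums delayed inputs, whereas the transpose form multiplies each incoming sample by all coefficients at once (the step that MCM optimizes) and accumulates the products through a chain of registers. This is a sample-by-sample behavioural sketch of the ordinary transpose form, assumed here only for illustration; it does not implement the block formulation derived in the paper, and the function names are hypothetical.

```python
def fir_direct(h, x):
    """Direct form: y[n] = sum over k of h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_transpose(h, x):
    """Transpose form: one input feeds every multiplier each cycle."""
    taps = len(h)
    regs = [0] * (taps - 1)          # registers sit between the adders
    out = []
    for sample in x:
        # All coefficients multiply the same input sample.
        products = [coeff * sample for coeff in h]
        out.append(products[0] + (regs[0] if regs else 0))
        # Clock edge: each register takes its product plus the next register.
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < taps - 1 else 0)
                for i in range(taps - 1)]
    return out


if __name__ == "__main__":
    coeffs = [1, 2, 3]
    data = [1, 0, 0, 4, 5]
    print(fir_direct(coeffs, data))
    print(fir_transpose(coeffs, data))
```

Because the registers already separate the adders, the transpose form is pipelined by construction, which is the property the abstract refers to.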
