Keshab K. Parhi | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Keshab K. Parhi is active.

Explore More

Publication

Featured researches published by Keshab K. Parhi.

IEEE Transactions on Very Large Scale Integration Systems | 1993

VLSI architectures for discrete wavelet transforms

Keshab K. Parhi; Takao Nishitani

A folded architecture and a digit-serial architecture are proposed for implementation of one- and two-dimensional discrete wavelet transforms. In the one-dimensional folded architecture, the computations of all wavelet levels are folded to the same low-pass and high-pass filters. The number of registers in the folded architecture is minimized by the use of a generalized life time analysis. The converter units are synthesized with a minimum number of registers using forward-backward allocation. The advantage of the folded architecture is low latency and its drawbacks are increased hardware area, less than 100% hardware utilization, and the complex routing and interconnection required by the converters used. These drawbacks are eliminated in the alternate digit-serial architecture at the expense of an increase in the system latency and some constraints on the wordlength. In latency-critical applications, the use of the folded architecture is suggested. If latency is not so critical, the digit-serial architecture should be used. The use of a combined folded and digit-serial architecture is proposed for implementation of two-dimensional discrete wavelet transforms. >

IEEE Transactions on Computers | 1991

Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding

Keshab K. Parhi; David G. Messerschmitt

Rate-optimal compile-time multiprocessor scheduling of iterative dataflow programs suitable for real-time signal processing applications is discussed. It is shown that recursions or loops in the programs lead to an inherent lower bound on the achievable iteration period, referred to as the iteration bound. A multiprocessor schedule is rate-optimal if the iteration period equals the iteration bound. Systematic unfolding of iterative dataflow programs is proposed, and properties of unfolded dataflow programs are studied. Unfolding increases the number of tasks in a program, unravels the hidden concurrently in iterative dataflow programs, and can reduce the iteration period. A special class of iterative dataflow programs, referred to as perfect-rate programs, is introduced. Each loop in these programs has a single register. Perfect-rate programs can always be scheduled rate optimally (requiring no retiming or unfolding transformation). It is also shown that unfolding any program by an optimum unfolding factor transforms any arbitrary program to an equivalent perfect-rate program, which can then be scheduled rate optimally. This optimum unfolding factor for any arbitrary program is the least common multiple of the number of registers (or delays) in all loops and is independent of the node execution times. An upper bound on the number of processors for rate-optimal scheduling is given. >

IEEE Transactions on Acoustics, Speech, and Signal Processing | 1989

Pipeline interleaving and parallelism in recursive digital filters. I. Pipelining using scattered look-ahead and decomposition

Keshab K. Parhi; David G. Messerschmitt

A look-ahead approach (referred to as scattered look-ahead) to pipeline recursive loops is introduced in a way that guarantees stability. A decomposition technique is proposed to implement the nonrecursive portion (generated due to the scattered look-ahead process) in a decomposed manner to obtain concurrent stable pipelined realizations of logarithmic implementation complexity with respect to the number of loop pipeline stages (as opposed to linear). The upper bound on the roundoff error in these pipelined filters is shown to improve with an increase in the number of loop pipeline stages. Efficient pipelined realizations are studied of both direct-form and state-space-form recursive digital filters. Based on the scattered look-ahead technique, fully pipelined and fully hardware efficient linear bidirectional systolic arrays for recursive digital filters are presented. The decomposition technique is extended to time-varying recursive systems. >

IEEE Transactions on Very Large Scale Integration Systems | 2004

High-speed VLSI architectures for the AES algorithm

Xinmiao Zhang; Keshab K. Parhi

This paper presents novel high-speed architectures for the hardware implementation of the Advanced Encryption Standard (AES) algorithm. Unlike previous works which rely on look-up tables to implement the SubBytes and InvSubBytes transformations of the AES algorithm, the proposed design employs combinational logic only. As a direct consequence, the unbreakable delay incurred by look-up tables in the conventional approaches is eliminated, and the advantage of subpipelining can be further explored. Furthermore, composite field arithmetic is employed to reduce the area requirements, and different implementations for the inversion in subfield GF(2/sup 4/) are compared. In addition, an efficient key expansion architecture suitable for the subpipelined round units is also presented. Using the proposed architecture, a fully subpipelined encryptor with 7 substages in each round unit can achieve a throughput of 21.56 Gbps on a Xilinx XCV1000 e-8 bg560 device in non-feedback modes, which is faster and is 79% more efficient in terms of equivalent throughput/slice than the fastest previous FPGA implementation known to date.

application specific systems architectures and processors | 1998

Low-Energy Digit-Serial/Parallel Finite Field Multipliers

Leilei Song; Keshab K. Parhi

Digit-serial architectures are best suited for systems requiring moderate sample rate and where area and power consumption are critical. This paper presents a new approach for designing digit-serial/parallel finite field multipliers. This approach combines both array-type and parallel multiplication algorithms, where the digit-level array-type algorithm minimizes the latency for one multiplication operation and the parallel architecture inside of each digit cell reduces both the cycle-time as well as the switching activities, hence power consumption. By appropriately constraining the feasible primitive polynomials, the mod p(x) operation involved in finite field multiplication can be performed in a more efficient way. As a result, the computation delay and energy consumption of one finite field multiplication using the proposed digit-serial/parallel architectures are significantly less than of those obtained by folding the parallel semi-systolic multipliers. Furthermore, their energy-delay products are reduced by a even larger percentage. Therefore, the proposed digit-serial/parallel architectures are attractive for both low-energy and high-performance applications.

international conference on computer communications | 1989

Distributed scheduling of broadcasts in a radio network

Rajiv Ramaswami; Keshab K. Parhi

A distributed algorithm is presented for obtaining an efficient and conflict-free broadcasting schedule in a multi-hop packet radio network. The inherent broadcast nature of the radio channel enables a nodes transmission to be received by all other nodes within range. Multiple transmissions can be scheduled simultaneously because of the multi-hop nature of the network. It is first shown that the construction of a broadcasting schedule of minimum length is NP-complete, and then a centralized algorithm based on a sequential graph-coloring heuristic is presented to construct minimal-length schedules. A distributed implementation of this algorithm is then proposed, which is based on circulating a token through the nodes in the network.<<ETX>>

Proceedings of the IEEE | 1989

Algorithm transformation techniques for concurrent processors

Keshab K. Parhi

Progress in supercomputing technology has led to two major trends. First, many existing algorithms need to be redesigned for efficient concurrent implementation using supercomputers. Second, a continuous increase will be apparent in the number of application-specific VLSI integrated circuits, which can provide the performance of supercomputers using single chips or chipsets (at the expense of design time for algorithm and architecture development). Both of these approaches require considerable effort in the development of algorithms for specific applications. Four independent algorithm transformation methodologies-program unfolding, retiming, look-ahead algorithms, and index mapping transformations-are reviewed. These transformation techniques exploit the available parallelism in iterative data-flow programs and create additional parallelism if necessary. >

IEEE Transactions on Circuits and Systems | 1991

A systematic approach for design of digit-serial signal processing architectures

Keshab K. Parhi

A systematic unfolding transformation technique for transforming bit-serial architecture into equivalent digit-serial ones is presented. The novel feature of the unfolding technique lies in the generation of functionally correct control circuits in the digit-serial architectures. For some applications bit-serial architectures may be too slow, and bit-parallel architectures may be faster than necessary and may require too much hardware. The desired sample rate in these applications can be achieved using the digit-serial approach, where multiple bits of a sample are processed in a single clock cycle. The number of bits processed in one clock cycle in the digit-serial systems is referred to as the digit size; the digit size can be any arbitrary integer (the digit size was restricted to be a divisor of wordlength in past ad hoc designs). Digit-serial implementation of twos complement adders and multipliers is described. Least-significant-bit-first bit-serial implementation of twos complement division, square-root, and compare-select operations are presented, and the corresponding digit-serial architectures for these operations are obtained using the unfolding algorithm. Unfolding of multiple-rate operations (such as interpolators and decimators) is also addressed. >

IEEE Journal of Solid-state Circuits | 1992

Synthesis of control circuits in folded pipelined DSP architectures

Keshab K. Parhi; Ching Yi Wang; Andrew P. Brown

A systematic folding transformation technique to fold any arbitrary signal processing algorithm data-flow graph to a hardware data-flow architecture, for a specified folding set and specified technology constraints, is presented. The folding set specifies the processor and the time partition at which the task is executed and is typically obtained by performing scheduling and resource allocation for the algorithm data-flow graph and the specified iteration period. The constraints imposed on the hardware architecture are also assumed to be known. The technique is used to derive the control circuitry of the hardware architecture. The authors derive conditions for the validity of a specified folding set, and present approaches to generate the dedicated architecture using systematic folding of tasks to operators. They propose automatic retiming and pipelining of algorithms described by data-flow graphs for folding. The folding algorithm is applied after preprocessing the data-flow graph for automated pipelining and retiming. >

IEEE Transactions on Circuits and Systems | 2004

Overlapped message passing for quasi-cyclic low-density parity check codes

Yanni Chen; Keshab K. Parhi

In this paper, a systematic approach is proposed to develop a high throughput decoder for quasi-cyclic low-density parity check (LDPC) codes, whose parity check matrix is constructed by circularly shifted identity matrices. Based on the properties of quasi-cyclic LDPC codes, the two stages of belief propagation decoding algorithm, namely, check node update and variable node update, could be overlapped and thus the overall decoding latency is reduced. To avoid the memory access conflict, the maximum concurrency of the two stages is explored by a novel scheduling algorithm. Consequently, the decoding throughput could be increased by about twice assuming dual-port memory is available.

Explore More