Kalyana C. Bollapalli

Publication

Featured research published by Kalyana C. Bollapalli.

design, automation, and test in europe | 2010

Implementing digital logic with sinusoidal supplies

Kalyana C. Bollapalli; Sunil P. Khatri; Laszlo B. Kish

In this paper, a new type of combinational logic circuit realization is presented. Logic values are implemented as sinusoidal signals. Sinusoidal signals of the same frequency are phase shifted by π to destructively interfere with each other, and represent the logic 0 and 1 values of Boolean logic. These properties of sinusoids can be used to identify a signal without ambiguity. Thus, representing logic values as sinusoidal signals yields a realizable system of logic. The paper presents a logic gate family that can operate using the sinusoidal signals for logic 0 and logic 1 values. Due to the orthogonality of sinusoidal signals with different frequencies, multiple sinusoids could be transmitted on a single wire. This provides a natural way of implementing multilevel logic. Signals traveling long distances could take advantage of this fact and can share interconnect lines. Recent research in circuit design has made it possible to harvest sinusoidal signals of the same frequency and 180° phase difference from a single resonant clock ring in a distributed manner. Another advantage of such a logic family is its immunity from external additive noise. The experiments in this paper indicate that this paradigm, when used to implement binary valued logic, yields an improvement in switching (dynamic) power.
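The two signal properties the abstract relies on can be checked numerically. The sketch below is not the paper's circuit; it only illustrates, under assumed values, that two sinusoids of the same frequency and a phase difference of π are anti-correlated, while sinusoids of different frequencies are uncorrelated over a whole number of periods. The reference frequency `REF_FREQ`, the helper names, and the mapping of phase 0 to logic 1 are all assumptions made for illustration.

```python
import math

REF_FREQ = 5.0  # hypothetical reference frequency in Hz (not from the paper)


def correlate(f1, phase1, f2, phase2, samples=10000, window=1.0):
    """Normalized correlation of two unit-amplitude sinusoids over `window` seconds.

    Integer frequencies and a 1 s window keep the sum over whole periods.
    """
    total = 0.0
    for n in range(samples):
        t = window * n / samples
        a = math.sin(2 * math.pi * f1 * t + phase1)
        b = math.sin(2 * math.pi * f2 * t + phase2)
        total += a * b
    # The mean of sin^2 over whole periods is 1/2, so scale by 2 to land in [-1, 1].
    return 2.0 * total / samples


def decode(freq, phase):
    """Assumed convention: in phase with the reference -> 1, shifted by pi -> 0."""
    return 1 if correlate(freq, phase, REF_FREQ, 0.0) > 0 else 0


same = correlate(5.0, 0.0, 5.0, 0.0)          # same frequency, same phase
opposite = correlate(5.0, math.pi, 5.0, 0.0)  # same frequency, shifted by pi
other = correlate(7.0, 0.0, 5.0, 0.0)         # different frequency
print(same, opposite, other)
```

The near-zero correlation between different frequencies is what would let several such signals share one wire, since a receiver correlating against its own reference ignores the others.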

international conference on computer design | 2009

A robust pulsed flip-flop and its use in enhanced scan design

Rajesh Kumar; Kalyana C. Bollapalli; Rajesh Garg; Tarun Soni; Sunil P. Khatri

Delay faults are frequently encountered in nanometer technologies. Therefore, it is critical to detect these faults during factory test. Testing for a delay fault requires the application of a pair of test vectors in an at-speed manner. To maximize the delay fault detection capability, it is desired that the vectors in this pair are independent. Independent vector pairs cannot always be applied to a circuit implemented with standard scan design approaches. However, this can be achieved by using enhanced scan flip-flops, which store two bits of data. This paper has two contributions. First, we develop a pulsed flip-flop (PFF) design. Second, we present an enhanced scan flip-flop design, based on our PFF circuit. We have compared the performance of our pulse based flip-flop with recently published pulse based flip-flop designs, as well as a traditional master-slave D flip-flop. Our PFF shows significant improvements in power and timing compared to the other designs. Our pulse based enhanced scan flip-flop (PESFF) has 13% lower power dissipation and 26% better timing than a conventional D flip-flop based enhanced scan flip-flop (DESFF). The layout area of our PESFF is 5.2% smaller than the DESFF. Monte Carlo simulations demonstrate that our design is more robust to process variations than the DESFF.

allerton conference on communication, control, and computing | 2009

Highly parallel decoding of space-time codes on graphics processing units

Kalyana C. Bollapalli; Yiyue Wu; Kanupriya Gulati; Sunil P. Khatri; A. Robert Calderbank

Graphics Processing Units (GPUs) with a few hundred extremely simple processors represent a paradigm shift for highly parallel computations. We use this emergent GPU architecture to provide a first demonstration of the feasibility of real time ML decoding (in software) of a high rate space-time block code that is representative of codes incorporated in 4th generation wireless standards such as WiMAX and LTE. The decoding algorithm is conditional optimization which reduces to a parallel calculation that is a natural fit to the architecture of low cost GPUs. Experimental results demonstrate that asymptotically the GPU implementation is more than 700 times faster than a standard serial implementation. These results suggest that GPU architectures have the potential to improve the cost / performance tradeoff of 4th generation wireless base stations. Additional benefits might include reducing the time required for system development and the time required for configuration and testing of wireless base stations.

international conference on computer design | 2009

A PLL design based on a standing wave resonant oscillator

Vinay Karkala; Kalyana C. Bollapalli; Rajesh Garg; Sunil P. Khatri

In this paper, we present a new continuously variable high frequency standing wave oscillator, and demonstrate its use in generating the phase locked clock signal of a digital IC. The ring based standing wave resonant oscillator is implemented with a plurality of wires connected in a Möbius configuration, with a cross coupled inverter pair connected across the wires. The oscillation frequency can be modulated by two means. Coarse modification is achieved by altering the number of wires in the ring that participate in the oscillation, by driving a digital word to a set of passgates which are connected to each wire in the ring. Fine tuning of the oscillation frequency is achieved by varying the body bias voltage of both PMOS transistors in the cross coupled inverter pair which sustains the oscillations in the resonant ring. We have validated our PLL design in a 90 nm process technology. 3D parasitic RLCs for our oscillator simulations were extracted, with skin effect accounted for. Our PLL has been implemented to provide a frequency locking range from ∼6 GHz to ∼9 GHz, with a center frequency of 7.5 GHz. The oscillator alone consumes about 25 mW of power, and the complete PLL consumes a power of 28.5 mW. The observed jitter of the PLL is 2.56%.

international conference on computer design | 2009

On-chip bidirectional wiring for heavily pipelined systems using network coding

Kalyana C. Bollapalli; Rajesh Garg; Kanupriya Gulati; Sunil P. Khatri

In this paper, we describe a low-area, reduced-power on-chip point-to-point bidirectional communication scheme for heavily pipelined systems. When data needs to be transmitted bidirectionally between two on-chip locations, the traditional approach resorts to either using two unidirectional wires, or to using a single wire (with a unidirectional transfer at any given time instant). In contrast, our bidirectional communication scheme allows data to be transmitted simultaneously between two on-chip locations, with a single wire performing the bidirectional data transfer. Our approach borrows ideas from the emerging area of network coding (in the field of communication). By utilizing coding units (which also serve the purpose of buffering the signals) along the wire between the two endpoints, we are able to achieve the same throughput as a traditional approach, while reducing the total area utilization by about 49.8% (thereby reducing routing congestion), and the total power consumption by about 11.5%. The area and power results include the contribution of routing wires, coding units, drivers, the clock distribution network and the required reset wire. Our bidirectional communication approach is ideally suited for heavily pipelined data intensive systems.
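The abstract does not say which coding operation the coding units perform. As a minimal sketch of the general network coding idea it borrows from, the example below uses the textbook XOR scheme: a shared unit combines the bits arriving from both endpoints, and each endpoint recovers the other's bit by XORing the coded value with the bit it sent. The function names and the single-coding-unit model are hypothetical and do not represent the paper's circuit or its pipelining.

```python
def coding_unit(bit_from_a, bit_from_b):
    """Combine the two bits travelling in opposite directions (assumed XOR coding)."""
    return bit_from_a ^ bit_from_b


def recover(own_bit, coded_bit):
    """An endpoint cancels its own contribution to obtain the other side's bit."""
    return own_bit ^ coded_bit


def exchange(stream_a, stream_b):
    """Simulate one coded value per cycle shared by both endpoints."""
    received_at_a, received_at_b = [], []
    for a, b in zip(stream_a, stream_b):
        coded = coding_unit(a, b)  # one shared value instead of two separate wires
        received_at_a.append(recover(a, coded))
        received_at_b.append(recover(b, coded))
    return received_at_a, received_at_b


at_a, at_b = exchange([1, 0, 1, 1], [0, 0, 1, 0])
print(at_a, at_b)
```

Each endpoint ends up with the other endpoint's stream even though only one coded value is carried per cycle, which is the property that allows a single wire to serve both directions.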

great lakes symposium on vlsi | 2009

Low power and high performance SRAM design using bank-based selective forward body bias

Kalyana C. Bollapalli; Rajesh Garg; Kanupriya Gulati; Sunil P. Khatri

Leakage power consumption is a large fraction of the total power consumption in contemporary VLSI designs. Since memories occupy a large portion of the total area of many high-performance ICs, it is crucial to reduce the leakage energy of memories. This problem is particularly aggravated for memories implemented in the 45 nm technology node, since these processes exhibit significantly higher leakage power. For these memories, leakage is a significant problem not only from a power point of view, but also from a performance degradation standpoint. In this paper, we quantify this problem and provide a solution, using a 512 KByte SRAM implemented in a 45 nm bulk process as a design example. We show that implementing the SRAM as a monolithic memory results in increased delay as well as power. We illustrate a methodology to optimally reduce leakage power and improve performance in memories by splitting the memory array into word line groups (WLGs) which are selectively forward body biased when accessed. We present a derivation of the optimal number of WLGs and the forward body bias voltage value, and show that our approach results in a 9.2% access time reduction, and a 53.4% reduction in power during a read operation. Our approach also achieves an 18% reduction in power during a write operation and a 69% leakage power improvement. The area overhead of our scheme is 7.2% compared to a monolithic memory.

international conference on computer design | 2013

A low-jitter phase-locked resonant clock generation and distribution scheme

Ayan Mandal; Kalyana C. Bollapalli; Nikhil Jayakumar; Sunil P. Khatri; Rabi N. Mahapatra

Clock distribution networks have traditionally been optimized to minimize end-to-end delay of the distribution network. However, since most digital ICs have an on-chip PLL, a more relevant design goal is to minimize cycle-to-cycle jitter. In this paper, we present a novel low-jitter phase-locked clock generation and distribution methodology which uses resonant standing wave oscillators (SWOs). In contrast to traveling wave oscillator rings (TWOs or “rotary” clocks), our SWO achieves the same phase at every point in the ring, making it amenable to a synchronous design methodology. The standing wave oscillator is controlled by coarse as well as fine tuning. Coarse tuning is achieved by varying the ring inductance, while fine tuning is accomplished by varying the ring capacitance. Clock distribution is done by routing the resonant ring chip-wide in a “comb” like manner. Experimental results demonstrate that the cycle-to-cycle jitter and skew of our approach are dramatically lower than those of existing schemes, while the power consumption is significantly lower as well. These benefits occur due to the resonant nature of our SWO-based clock generation and distribution approach.

international conference on vlsi design | 2011

An Automated Approach for Minimum Jitter Buffered H-Tree Construction

Ayan Mandal; Nikhil Jayakumar; Kalyana C. Bollapalli; Sunil P. Khatri; Rabi N. Mahapatra

In recent fabrication technologies, buffered clock distribution networks have become increasingly popular due to increasing on-chip wiring delays. Traditionally, clock distribution networks have been optimized to minimize end-to-end skew of the distribution network. However, since most ICs have an on-chip PLL, we argue that the design goal of minimizing end-to-end jitter is more relevant. In this paper, we present a dynamic programming based approach to synthesize a minimum cost buffered H-tree clock distribution network. Our cost functions are a weighted sum of power and jitter, and a weighted sum of power and end-to-end delay of the distribution network. Our approach is based on precharacterizing the delay, jitter and power of buffered segments of different lengths, topologies, buffer sizes and wire-codes. Using this information, a dynamic programming (DP) engine automatically generates the optimal H-tree that minimizes the appropriate cost function. Compared to a manually constructed buffered H-tree network, our approaches are able to reduce both jitter (by as much as 28%) and power (by as much as 46%). When optimizing for minimum jitter, the DP engine generates an H-tree with lower jitter than when optimizing for minimum delay, thereby validating our approach, and proving its usefulness.
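The abstract describes a DP engine that picks precharacterized buffered segments to minimize a weighted sum of power and jitter. The sketch below shows one way such a level-by-level DP could look; it is an illustration under stated assumptions, not the authors' engine. Assumed here: one segment type is chosen per H-tree level, level k contains 2**k copies of that segment, jitter adds up along a root-to-leaf path, and adjacent levels may not differ in buffer size by more than `max_size_ratio` (a made-up constraint that couples the levels). The `SEGMENTS` table holds invented toy numbers, not characterization data from the paper.

```python
# Toy precharacterization table: all values are invented for illustration.
SEGMENTS = [
    {"name": "small", "buffer_size": 1, "power": 1.0, "jitter": 3.0},
    {"name": "medium", "buffer_size": 4, "power": 2.0, "jitter": 2.0},
    {"name": "large", "buffer_size": 16, "power": 5.0, "jitter": 1.0},
]


def compatible(prev, cur, max_size_ratio):
    """Hypothetical rule: adjacent levels may not differ too much in buffer size."""
    big = max(prev["buffer_size"], cur["buffer_size"])
    small = min(prev["buffer_size"], cur["buffer_size"])
    return big <= max_size_ratio * small


def synthesize_htree(num_levels, segments, w_power, w_jitter, max_size_ratio=4.0):
    """Return (cost, segment names per level) minimizing the weighted cost."""
    inf = float("inf")
    # best[i]: cheapest partial tree whose deepest level uses segments[i].
    best = [(w_power * s["power"] + w_jitter * s["jitter"], [s["name"]])
            for s in segments]
    for level in range(1, num_levels):
        copies = 2 ** level  # assumed number of segments at this level
        new_best = []
        for cur in segments:
            step = w_power * copies * cur["power"] + w_jitter * cur["jitter"]
            candidates = [
                (cost + step, path + [cur["name"]])
                for (cost, path), prev in zip(best, segments)
                if cost < inf and compatible(prev, cur, max_size_ratio)
            ]
            if candidates:
                new_best.append(min(candidates, key=lambda c: c[0]))
            else:
                new_best.append((inf, []))
        best = new_best
    return min(best, key=lambda c: c[0])


cost, path = synthesize_htree(3, SEGMENTS, w_power=1.0, w_jitter=1.0)
print(cost, path)
```

Changing the weights shifts the result between power-lean and jitter-lean trees, which mirrors the abstract's comparison of different cost functions.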

Archive | 2011

Digital Logic Using Non-DC Signals

Kalyana C. Bollapalli; Sunil P. Khatri; Laszlo B. Kish

In this chapter, a new type of combinational logic circuit realization is presented. These logic circuits are based on non-DC representation of logic values. In this scheme, logic values can be represented by signals that are uncorrelated, for example, distinct values represented by independent stochastic processes (noise from independent sources). This provides a natural way of implementing multivalued logic. Signals driven over long distances could take advantage of this fact and can share interconnect lines. Alternatively, sinusoidal signals can be used to represent logic values. Sinusoidal signals of different frequencies are uncorrelated. This property of sinusoids can be used to identify a signal without ambiguity. This chapter presents a logic family that uses sinusoidal signals to represent logic 0 and logic 1 values. We present sinusoidal gates which exploit the anti-correlation of sinusoidal signals, as opposed to uncorrelated noise signals. This is achieved by employing a pair of sinusoidal signals of the same frequency, but with a phase difference of 180°. Recent research in circuit design has made it possible to harvest sinusoidal signals of the same frequency and 180° phase difference from a single resonant clock ring, in a distributed manner. Another advantage of such a logic family is its immunity from external additive noise. The experiments in this chapter indicate that this paradigm, when used to implement binary valued logic, yields an improvement in switching (dynamic) power.

Journal of Low Power Electronics | 2009

Selective Forward Body Bias for High Speed and Low Power SRAMs

Kalyana C. Bollapalli; Rajesh Garg; Kanupriya Gulati; Sunil P. Khatri

Leakage power consumption is a large fraction of the total power consumption in contemporary VLSI designs. Since memories occupy a large fraction of the total area of many high-performance ICs, it is crucial to reduce the leakage energy of memories. This problem is particularly aggravated for memories implemented in the 45 nm technology node, since these processes have significantly higher leakage power. For these memories, leakage is a significant problem not only from a power point of view, but also from a performance degradation standpoint. In this paper, we quantify this problem and provide a solution, using a 45 nm 512 KByte SRAM as a design example. We show that implementing the SRAM as a monolithic memory results in increased delay as well as power. We illustrate a methodology to optimally reduce leakage power and improve performance in memories by splitting the memory array into word line groups (WLGs) and selectively forward body biasing a WLG when it is accessed. We present a derivation of the optimal number of WLGs and the forward body bias voltage value, and show that our approach results in a 9.2% access time reduction, and a 53.4% reduction in power during a read operation. Our approach also achieves an 18% reduction in power during a write operation and a 69% leakage power improvement. The area overhead of our scheme is 7.2% compared to a monolithic memory. Our approach does not compromise on the static noise margin (SNM). From an architectural standpoint, we also present a strategy to decide when a WLG should be left forward biased between non-successive accesses, in order to minimize the total power consumption of the memory.
