Sunil P. Khatri | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Sunil P. Khatri is active.

Explore More

Publication

Featured researches published by Sunil P. Khatri.

computer aided verification | 1996

VIS: A System for Verification and Synthesis

Robert K. Brayton; Gary D. Hachtel; Alberto L. Sangiovanni-Vincentelli; Fabio Somenzi; Adnan Aziz; Szu-Tsung Cheng; Stephen A. Edwards; Sunil P. Khatri; Yuji Kukimoto; Abelardo Pardo; Shaz Qadeer; Rajeev K. Ranjan; Shaker Sarwary; Thomas R. Shiple; Gitanjali Swamy; Tiziano Villa

ion Manual abstraction can be performed by giving a file containing the names of variables to abstract. For each variable appearing in the file, a new primary input node is created to drive all the nodes that were previously driven by the variable. Abstracting a net effectively allows it to take any value in its range, at every clock cycle. Fair CTL model checking and language emptiness check VIS performs fair CTL model checking under Buchi fairness constraints. In addition, VIS can perform language emptiness checking by model checking the formula EG true. The language of a design is given by sequences over the set of reachable states that do not violate the fairness constraint. The language emptiness check can be used to perform language containment by expressing the set of bad behaviors as another component of the system. If model checking or language emptiness fail, VIS reports the failure with a counterexample, (i.e., behavior seen in the system that does not satisfy the property for model checking, or valid behavior seen in the system for language emptiness). This is called the “debug” trace. Debug traces list a set of states that are on a path to a fair cycle and fail the CTL formula. Equivalence checking VIS provides the capability to check the combinational equivalence of two designs. An important usage of combinational equivalence is to provide a sanity check when re-synthesizing portions of a network. VIS also provides the capability to test the sequential equivalence of two designs. Sequential verification is done by building the product finite state machine, and checking whether a state where the values of two corresponding outputs differ, can be reached from the set of initial states of the product machine. If this happens, a debug trace is provided. Both combinational and sequential verification are implemented using BDD-based routines. Simulation VIS also provides traditionaldesign verification in the form of a cycle-based simulator that uses BDD techniques. Since VIS performs both formal verification and simulation using the same data structures, consistency between them is ensured. VIS can generate random input patterns or accept user-specified input patterns. Any subtree of the specified hierarchy may be simulated.

international conference on computer aided design | 2004

A metal and via maskset programmable VLSI design methodology using PLAs

Nikhil Jayakumar; Sunil P. Khatri

In recent times there has been a substantial increase in the cost and complexity of fabricating a VLSI chip. The lithography masks themselves can cost between /spl epsi/ and /spl ges/. It is conjectured that due to these increasing costs, the number of ASIC starts in the last few years has declined. We address this problem by using an array of dynamic PLAs which require only metal and via mask customization in order to implement a new design. This would allow several similar-sized designs to share the same base set of masks (right up to the metal layers) and only have different metal and via masks. We have implemented our methodology for both combinational and sequential designs, and demonstrate that our approach strikes a reasonable compromise between ASIC and field programmable design methodologies in terms of placed-and-routed area and delay. Our method has a 2.89/spl times/ (3.58/spl times/) delay overhead and a 4.96/spl times/ (3.44/spl times/) area overhead compared to standard cells for combinational (sequential) designs.

high performance interconnects | 2001

Analysis and avoidance of cross-talk in on-chip buses

Chunjie Duan; Anup Tirumala; Sunil P. Khatri

We present techniques to analyze and alleviate cross-talk in on-chip buses. With rapidly shrinking process feature sizes, wire delay is becoming a large fraction of the overall delay of a circuit. Additionally, the increasing cross-coupling capacitances between wires on the same metal layer create a situation where the delay of a wire is strongly dependent on the electrical state of its neighboring wires. The delay of a wire can vary widely depending on whether its neighbors perform a like or unlike transition. This effect is acute for long on-chip buses. In this work, we classify cross-talk interactions between the wires of an on-chip bus. We present encoding techniques which can help a designer trade off cross-talk against area overhead. Our experimental results show that the proposed techniques result in reduced delay variation due to cross-talk. As a result, the overall delay of a bus actually decreases even after the use of the encoding scheme.

design automation conference | 2008

Towards acceleration of fault simulation using graphics processing units

Kanupriya Gulati; Sunil P. Khatri

In this paper, we explore the implementation of fault simulation on a graphics processing unit (GPU). In particular, we implement a fault simulator that exploits thread level parallelism. Fault simulation is inherently parallelizable, and the large number of threads that can be computed in parallel on a GPU results in a natural fit for the problem of fault simulation. Our implementation fault- simulates all the gates in a particular level of a circuit, including good and faulty circuit simulations, for all patterns, in parallel. Since GPUs have an extremely large memory bandwidth, we implement each of our fault simulation threads (which execute in parallel with no data dependencies) using memory lookup. Fault injection is also done along with gate evaluation, with each thread using a different fault injection mask. All threads compute identical instructions, but on different data, as required by the Single Instruction Multiple Data (SIMD) programming semantics of the GPU. Our results, implemented on a NVIDIA GeForce GTX 8800 GPU card, indicate that our approach is on average 35 x faster when compared to a commercial fault simulation engine. With the recently announced Tesla GPU servers housing up to eight GPUs, our approach would be potentially 238 times faster. The correctness of the GPU based fault simulator has been verified by comparing its result with a CPU based fault simulator.

IEEE Transactions on Very Large Scale Integration Systems | 2009

Efficient On-Chip Crosstalk Avoidance CODEC Design

Chunjie Duan; Victor H. Cordero Calle; Sunil P. Khatri

Interconnect delay has become a limiting factor for circuit performance in deep sub-micrometer designs. As the crosstalk in an on-chip bus is highly dependent on the data patterns transmitted on the bus, different crosstalk avoidance coding schemes have been proposed to boost the bus speed and/or reduce the overall energy consumption. Despite the availability of the codes, no systematic mapping of data words to codewords has been proposed for CODEC design. This is mainly due to the nonlinear nature of the crosstalk avoidance codes (CAC). The lack of practical CODEC construction schemes has hampered the use of such codes in practical designs. This work presents guidelines for the CODEC design of the ldquoforbidden pattern free crosstalk avoidance coderdquo (FPF-CAC). We analyze the properties of the FPF-CAC and show that mathematically, a mapping scheme exists based on the representation of numbers in the Fibonacci numeral system. Our first proposed CODEC design offers a near-optimal area overhead performance. An improved version of the CODEC is then presented, which achieves theoretical optimal performance. We also investigate the implementation details of the CODECs, including design complexity and the speed. Optimization schemes are provided to reduce the size of the CODEC and improve its speed.

asia and south pacific design automation conference | 2009

Fast circuit simulation on graphics processing units

Kanupriya Gulati; John F. Croix; Sunil P. Khatri; Rahm Shastry

SPICE based circuit simulation is a traditional workhorse in the VLSI design process. Given the pivotal role of SPICE in the IC design flow, there has been significant interest in accelerating SPICE. Since a large fraction (on average 75%) of the SPICE runtime is spent in evaluating transistor model equations, a significant speedup can be availed if these evaluations are accelerated. This paper reports on our early efforts to accelerate transistor model evaluations using a Graphics Processing Unit (GPU). We have integrated this accelerator with a commercial fast SPICE tool. Our experiments demonstrate that significant speedups (2.36× on average) can be obtained. The asymptotic speedup that can be obtained is about 4×. We demonstrate that with circuits consisting of as few as about 1000 transistors, speedups in the neighborhood of this asymptotic value can be obtained. By utilizing the recently announced (but not currently available) quad GPU systems, this speedup could be enhanced further, especially for larger designs.

IEEE Transactions on Very Large Scale Integration Systems | 2009

A Fast Hardware Approach for Approximate, Efficient Logarithm and Antilogarithm Computations

Suganth Paul; Nikhil Jayakumar; Sunil P. Khatri

The realization of functions such as log() and antilog() in hardware is of considerable relevance, due to their importance in several computing applications. In this paper, we present an approach to compute log() and antilog() in hardware. Our approach is based on a table lookup, followed by an interpolation step. The interpolation step is implemented in combinational logic, in a field-programmable gate array (FPGA), resulting in an area-efficient, fast design. The novelty of our approach lies in the fact that we perform interpolation efficiently, without the need to perform multiplication or division, and our method performs both the log() and antilog() operation using the same hardware architecture. We compare our work with existing methods, and show that our approach results in significantly lower memory resource utilization, for the same approximation errors. Also our method scales very well with an increase in the required accuracy, compared to existing techniques.

international conference on computer aided design | 2005

Practical techniques to reduce skew and its variations in buffered clock networks

Ganesh Venkataraman; Nikhil Jayakumar; Jiang Hu; Peng Li; Sunil P. Khatri; Anand Rajaram; Patrick McGuinness; Charles J. Alpert

Clock skew is becoming increasingly difficult to control due to variations. Link based non-tree clock distribution is a cost-effective technique for reducing clock skew variations. However, previous works based on this technique were limited to unbuffered clock networks and neglected spatial correlations in the experimental validation. In this work, we overcome these shortcomings and make the link based non-tree approach feasible for realistic designs. The short circuit risk and multi-driver delay issues in buffered non-tree clock networks are investigated. Our approach is validated with SPICE based Monte Carlo simulations, considering spatial correlations among variations. The experimental results show that our approach can reduce the maximal skew by 47%, improve the skew yield from 15% to 73% on average with a decrease on the total wire and buffer capacitance.

design, automation, and test in europe | 2004

Exploiting crosstalk to speed up on-chip buses

Chunjie Duan; Sunil P. Khatri

In modern VLSI processes, the cross-coupling capacitance between adjacent neighboring wires on the same metal layer is a very large fraction of the total wire capacitance. This leads to problems of delay variation due to crosstalk and reduced noise immunity, arguably one of the biggest obstacles in the design ICs in recent times. This problem is particularly severe in long on-chip buses, since bus signals are routed at minimum pitch for long distances. In this work, we propose to solve this problem by the use of crosstalk canceling CODECs. We only utilize memoryless CODECs, to reduce the logical complexity and enhance the robustness of our techniques. Bus data patterns can be classified (as 4/spl middot/C, 3/spl middot/C, 2/spl middot/C, 1/spl middot/C or 0/spl middot/C patterns) based on the maximum amount of crosstalk that they can exhibit. Crosstalk avoidance CODECs which eliminate 4/spl middot/C and 3/spl middot/C patterns have been reported. In this paper, we describe crosstalk avoidance techniques which eliminate 2/spl middot/C and 1/spl middot/C patterns. We describe an analytical methodology to accurately characterize the bus area overhead 2/spl middot/C pattern CODECs. Using these results, we characterize the area overhead versus crosstalk immunity achieved. A similar exercise is performed for 1/spl middot/C patterns. Our experimental results show that by using 2/spl middot/C crosstalk canceling techniques, buses can be sped up by up to a factor of 6 with an area overhead of about 200%, and that 1/spl middot/C techniques are not very robust.

asia and south pacific design automation conference | 2009

Accelerating statistical static timing analysis using graphics processing units

Kanupriya Gulati; Sunil P. Khatri

We explore the implementation of Monte Carlo based statistical static timing analysis (SSTA) on a Graphics Processing Unit (GPU). SSTA via Monte Carlo simulations is a computationally expensive, but important step required to achieve design timing closure. It provides an accurate estimate of delay variations and their impact on design yield. The large number of threads that can be computed in parallel on a GPU suggests a natural fit for the problem of Monte Carlo based SSTA to the GPU platform. Our implementation performs multiple delay simulations at a single gate in parallel. A parallel implementation of the Mersenne Twister pseudo-random number generator on the GPU, followed by Box-Muller transformations (also implemented on the GPU) is used for generating gate delay numbers from a normal distribution. The µ and σ of the pin-to-output delay distributions for all inputs and for every gate, are obtained using a memory lookup, which benefits from the large memory bandwidth of the GPU. Threads which execute in parallel have no data/control dependencies on each other. All threads compute identical instructions, but on different data, as required by the Single Instruction Multiple Data (SIMD) programming semantics of the GPU. Our approach is implemented on a NVIDIA GeForce GTX 8800 GPU card. Our results indicate that our approach can obtain an average speedup of about 260× as compared to a serial CPU implementation. With the recently announced quad 8800 GPU cards, we estimate that our approach would attain a speedup of over 785×. The correctness of the Monte Carlo based SSTA implemented on a GPU has been verified by comparing its results with a CPU based implementation.

Explore More