
Publications


Featured research published by Kanupriya Gulati.


Design Automation Conference | 2008

Towards acceleration of fault simulation using graphics processing units

Kanupriya Gulati; Sunil P. Khatri

In this paper, we explore the implementation of fault simulation on a graphics processing unit (GPU). In particular, we implement a fault simulator that exploits thread-level parallelism. Fault simulation is inherently parallelizable, and the large number of threads that can be computed in parallel on a GPU makes it a natural fit for the problem of fault simulation. Our implementation fault-simulates all the gates in a particular level of a circuit, including good and faulty circuit simulations, for all patterns, in parallel. Since GPUs have an extremely large memory bandwidth, we implement each of our fault simulation threads (which execute in parallel with no data dependencies) using memory lookup. Fault injection is also done along with gate evaluation, with each thread using a different fault injection mask. All threads compute identical instructions, but on different data, as required by the Single Instruction Multiple Data (SIMD) programming semantics of the GPU. Our results, obtained on an NVIDIA GeForce 8800 GTX GPU card, indicate that our approach is on average 35× faster than a commercial fault simulation engine. With the recently announced Tesla GPU servers housing up to eight GPUs, our approach would potentially be 238× faster. The correctness of the GPU-based fault simulator has been verified by comparing its results with those of a CPU-based fault simulator.
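
The paper does not include source code; the following CUDA sketch (illustrative names, and simplified to a single 2-input gate) only shows the structure the abstract describes: one thread per test pattern, lookup-table gate evaluation, and a per-thread fault-injection mask. Host-side setup is omitted.

    #include <cuda_runtime.h>

    // One thread per (test pattern, fault) pair for a single 2-input gate.
    // in_a / in_b hold 0/1 input values per pattern. The gate is evaluated with
    // a 4-entry truth table packed into one byte: bit (a << 1 | b) holds the
    // output for inputs (a, b); e.g. NAND2 = 0x7. The fault is injected by
    // XOR-ing a per-thread mask into the gate output.
    __global__ void gate_fault_sim(const unsigned char* in_a,
                                   const unsigned char* in_b,
                                   const unsigned char* fault_mask,
                                   unsigned char truth_table,
                                   unsigned char* good_out,
                                   unsigned char* faulty_out,
                                   int num_patterns)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= num_patterns) return;

        // Lookup-based evaluation: a table read instead of branching logic.
        int idx = (in_a[tid] << 1) | in_b[tid];
        unsigned char good = (truth_table >> idx) & 1;

        good_out[tid]   = good;                     // fault-free response
        faulty_out[tid] = good ^ fault_mask[tid];   // response with fault injected
    }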


Asia and South Pacific Design Automation Conference | 2009

Fast circuit simulation on graphics processing units

Kanupriya Gulati; John F. Croix; Sunil P. Khatri; Rahm Shastry

SPICE-based circuit simulation is a traditional workhorse in the VLSI design process. Given the pivotal role of SPICE in the IC design flow, there has been significant interest in accelerating SPICE. Since a large fraction (on average 75%) of the SPICE runtime is spent evaluating transistor model equations, a significant speedup can be obtained if these evaluations are accelerated. This paper reports on our early efforts to accelerate transistor model evaluations using a Graphics Processing Unit (GPU). We have integrated this accelerator with a commercial fast SPICE tool. Our experiments demonstrate that significant speedups (2.36× on average) can be obtained. The asymptotic speedup that can be obtained is about 4×. We demonstrate that with circuits consisting of as few as about 1,000 transistors, speedups in the neighborhood of this asymptotic value can be obtained. By utilizing the recently announced (but not currently available) quad-GPU systems, this speedup could be enhanced further, especially for larger designs.
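
The commercial tool's actual model equations and interfaces are not public and are not reproduced here; purely as an illustration of the parallel structure (one thread per device-model evaluation), here is a hedged CUDA sketch using a simple square-law MOSFET model.

    #include <cuda_runtime.h>

    // One thread per transistor-instance evaluation. A real fast-SPICE engine
    // evaluates far more elaborate models (e.g. BSIM); the square-law model
    // below only illustrates the parallel structure: independent, branch-light
    // arithmetic over large arrays of device operating points.
    __global__ void eval_square_law(const float* vgs, const float* vds,
                                    float vth, float k,
                                    float* ids, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float vov = vgs[i] - vth;                                   // overdrive voltage
        if (vov <= 0.0f) {
            ids[i] = 0.0f;                                          // cutoff
        } else if (vds[i] < vov) {
            ids[i] = k * (vov * vds[i] - 0.5f * vds[i] * vds[i]);   // triode
        } else {
            ids[i] = 0.5f * k * vov * vov;                          // saturation
        }
    }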


Asia and South Pacific Design Automation Conference | 2009

Accelerating statistical static timing analysis using graphics processing units

Kanupriya Gulati; Sunil P. Khatri

We explore the implementation of Monte Carlo based statistical static timing analysis (SSTA) on a Graphics Processing Unit (GPU). SSTA via Monte Carlo simulations is a computationally expensive, but important, step required to achieve design timing closure. It provides an accurate estimate of delay variations and their impact on design yield. The large number of threads that can be computed in parallel on a GPU suggests a natural fit for the problem of Monte Carlo based SSTA on the GPU platform. Our implementation performs multiple delay simulations at a single gate in parallel. A parallel implementation of the Mersenne Twister pseudo-random number generator on the GPU, followed by Box-Muller transformations (also implemented on the GPU), is used for generating gate delay numbers from a normal distribution. The µ and σ of the pin-to-output delay distributions for all inputs and for every gate are obtained using a memory lookup, which benefits from the large memory bandwidth of the GPU. Threads which execute in parallel have no data/control dependencies on each other. All threads compute identical instructions, but on different data, as required by the Single Instruction Multiple Data (SIMD) programming semantics of the GPU. Our approach is implemented on an NVIDIA GeForce 8800 GTX GPU card. Our results indicate that our approach can obtain an average speedup of about 260× compared to a serial CPU implementation. With the recently announced quad 8800 GPU cards, we estimate that our approach would attain a speedup of over 785×. The correctness of the Monte Carlo based SSTA implemented on a GPU has been verified by comparing its results with those of a CPU-based implementation.
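
The following CUDA sketch illustrates only the per-thread delay-sampling step described above. It uses cuRAND's default per-thread generator as a stand-in for the paper's Mersenne Twister implementation and applies a Box-Muller transform to draw each gate delay from N(µ, σ); all identifiers are illustrative, and host setup is omitted.

    #include <curand_kernel.h>

    // One thread per gate-delay sample. mu[] and sigma[] would come from the
    // memory lookup of pin-to-output delay distributions described above.
    __global__ void sample_gate_delays(const float* mu, const float* sigma,
                                       float* delay, int n,
                                       unsigned long long seed)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        curandState st;
        curand_init(seed, i, 0, &st);            // independent stream per thread

        // Box-Muller: two uniform samples in (0, 1] -> one standard-normal sample.
        float u1 = curand_uniform(&st);
        float u2 = curand_uniform(&st);
        float z  = sqrtf(-2.0f * logf(u1)) * cosf(6.2831853f * u2);

        delay[i] = mu[i] + sigma[i] * z;         // N(mu, sigma) gate delay
    }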


ACM Transactions on Design Automation of Electronic Systems | 2009

FPGA-based hardware acceleration for Boolean satisfiability

Kanupriya Gulati; Suganth Paul; Sunil P. Khatri; Srinivas Patil; Abhijit Jas

We present an FPGA-based hardware solution to the Boolean satisfiability (SAT) problem, with the main goals of scalability and speedup. In our approach, the traversal of the implication graph as well as conflict clause generation are performed in hardware, in parallel. The experimental results, their analysis, and the associated performance models are discussed. We show that an order-of-magnitude improvement in runtime can be obtained over MiniSAT (the best-in-class software-based approach) by using a Virtex-4 (XC4VFX140) FPGA device. The resulting system can handle instances with as many as 10K variables and 280K clauses.
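
The implication engine in the paper is realized directly in FPGA hardware; purely as a software illustration of the implication (unit-propagation) step that the hardware parallelizes, here is a minimal C++ sketch over a clause database. The data layout and names are assumptions for illustration only, not the paper's architecture.

    #include <cstdlib>
    #include <vector>

    // Unit propagation over a clause database. Literals use DIMACS-style signed
    // integers (+v / -v); assign[v] is 0 (unassigned), +1 (true) or -1 (false).
    // Returns false on a conflict -- the point at which the FPGA engine in the
    // paper would generate a conflict clause.
    static bool propagate(const std::vector<std::vector<int> >& clauses,
                          std::vector<int>& assign)
    {
        bool changed = true;
        while (changed) {
            changed = false;
            for (const std::vector<int>& clause : clauses) {
                int unassigned = 0, last_free = 0;
                bool satisfied = false;
                for (int lit : clause) {
                    int v = std::abs(lit), sign = lit > 0 ? +1 : -1;
                    if (assign[v] == 0)         { ++unassigned; last_free = lit; }
                    else if (assign[v] == sign) { satisfied = true; break; }
                }
                if (satisfied) continue;
                if (unassigned == 0) return false;   // all literals false: conflict
                if (unassigned == 1) {               // unit clause: imply its literal
                    assign[std::abs(last_free)] = (last_free > 0) ? +1 : -1;
                    changed = true;
                }
            }
        }
        return true;
    }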


Journal of Electronic Testing | 2010

Fault Table Computation on GPUs

Kanupriya Gulati; Sunil P. Khatri

In this paper, we explore the implementation of fault table generation on a Graphics Processing Unit (GPU). A fault table is essential for fault diagnosis and fault detection in VLSI testing and debug. Generating a fault table requires extensive fault simulation with no fault dropping, and is extremely expensive from a computational standpoint. Fault simulation is inherently parallelizable, and the large number of threads that a GPU can operate on in parallel can be employed to accelerate fault simulation, and thereby accelerate fault table generation. Our approach, called GFTABLE, employs a pattern-parallel approach which utilizes both bit-parallelism and thread-level parallelism. Our implementation is a significantly modified version of FSIM, a pattern-parallel fault simulation approach for single-core processors. Like FSIM, GFTABLE utilizes critical path tracing and the dominator concept to reduce runtime. Further modifications to FSIM allow us to maximally harness the GPU’s huge memory bandwidth and high computational power. Our approach does not store the circuit (or any part of the circuit) on the GPU. Efficient parallel reduction operations are employed in GFTABLE. We compare our performance to FSIM*, which is FSIM modified to generate a fault table on a single-core processor. Our experiments indicate that GFTABLE, implemented on a single NVIDIA Quadro FX 5800 GPU card, can generate a fault table for 0.5 million test patterns on average 15.68× faster than FSIM*. With the NVIDIA Tesla server, our approach would be potentially 89.57× faster.
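
The CUDA sketch below illustrates only the combination of bit-parallelism and thread-level parallelism described above, for a single 2-input AND gate with a stuck-at-0 fault on one input: each 32-bit word packs 32 test patterns, and each thread processes one word. GFTABLE's actual machinery (critical path tracing, dominators, parallel reductions) is not reproduced here.

    #include <cuda_runtime.h>

    // Bit-parallel, thread-parallel fault detection for one AND2 gate with a
    // stuck-at-0 fault on input 'a'. Each word of a[]/b[] packs 32 patterns.
    __global__ void and2_sa0_detect(const unsigned int* a,
                                    const unsigned int* b,
                                    unsigned int* detect,
                                    int num_words)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_words) return;

        unsigned int a_good   = a[i];
        unsigned int a_faulty = 0u;               // input 'a' stuck-at-0 for every pattern
        unsigned int good     = a_good   & b[i];  // fault-free AND2, 32 patterns in one op
        unsigned int faulty   = a_faulty & b[i];  // faulty AND2

        // Patterns whose good and faulty responses differ detect the fault:
        // this word corresponds to one (fault, pattern-block) fault-table entry.
        detect[i] = good ^ faulty;
    }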


International Conference on Computer-Aided Design | 2006

Network coding for routability improvement in VLSI

Nikhil Jayakumar; Kanupriya Gulati; Sunil P. Khatri; Alex Sprintson

With the standard approach for establishing multicast connections over a network, network nodes are utilized to forward and duplicate the packets received over the incoming links. Recently, there has been significant interest in the novel paradigm of network coding. Network coding generalizes the traditional routing approach by allowing network nodes to generate new packets by performing algebraic operations on packets received over the incoming links. It has been shown that network coding can increase the throughput of multicast communication. In this paper, we explore the benefits of network coding for improving the routing characteristics of VLSI designs. We demonstrate that when data has to be routed across the IC, it is often beneficial to perform network coding. Initial results demonstrate that network coding can result in a healthy reduction in wire length, wire area, and interconnect power, as well as the active area associated with the interconnects. This comes at a small delay penalty.
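
A minimal sketch of the coding idea, assuming the classic butterfly configuration: two signals that must each reach two sinks can share one coded wire carrying their XOR, and each sink recovers the missing signal from the signal it already receives directly. The code below only checks the encode/decode logic; it is not the paper's routing formulation.

    #include <assert.h>

    /* Butterfly-style illustration: signals 'a' and 'b' must each reach two
     * sinks. Sink 1 receives 'a' directly, sink 2 receives 'b' directly, and
     * one shared wire carries a ^ b, replacing two dedicated wires -- the
     * source of the wirelength and area savings. */
    int main(void)
    {
        for (int a = 0; a <= 1; ++a) {
            for (int b = 0; b <= 1; ++b) {
                int coded      = a ^ b;        /* single shared (coded) wire    */
                int b_at_sink1 = a ^ coded;    /* sink 1 recovers b by decoding */
                int a_at_sink2 = b ^ coded;    /* sink 2 recovers a by decoding */
                assert(b_at_sink1 == b && a_at_sink2 == a);
            }
        }
        return 0;
    }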


Book | 2010

Hardware Acceleration of EDA Algorithms

Kanupriya Gulati; Sunil P. Khatri

This book presents the acceleration of EDA algorithms on hardware platforms such as custom ICs, FPGAs, and graphics processing units (GPUs), drawing on the authors' work on applications including Boolean satisfiability, fault simulation and fault table generation, circuit simulation, and statistical static timing analysis.


Allerton Conference on Communication, Control, and Computing | 2009

Highly parallel decoding of space-time codes on graphics processing units

Kalyana C. Bollapalli; Yiyue Wu; Kanupriya Gulati; Sunil P. Khatri; A. Robert Calderbank

Graphics Processing Units (GPUs) with a few hundred extremely simple processors represent a paradigm shift for highly parallel computations. We use this emergent GPU architecture to provide a first demonstration of the feasibility of real-time ML decoding (in software) of a high rate space-time block code that is representative of codes incorporated in 4th generation wireless standards such as WiMAX and LTE. The decoding algorithm is conditional optimization, which reduces to a parallel calculation that is a natural fit to the architecture of low cost GPUs. Experimental results demonstrate that asymptotically the GPU implementation is more than 700 times faster than a standard serial implementation. These results suggest that GPU architectures have the potential to improve the cost/performance tradeoff of 4th generation wireless base stations. Additional benefits might include reducing the time required for system development and the time required for configuration and testing of wireless base stations.
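
The abstract describes reducing ML decoding to a parallel calculation; the hedged CUDA sketch below shows only the "one thread per candidate" structure, for a toy two-symbol, single-receive-antenna model with a squared-Euclidean-distance metric. The paper's actual space-time block code and decoding metric are more involved.

    #include <cuda_runtime.h>

    // One thread per candidate symbol pair (s1, s2): each thread computes the
    // squared Euclidean distance between the received sample and its hypothesis.
    // A reduction over 'metric' (or a host-side scan) then yields the ML decision.
    // Toy model: y = h1*s1 + h2*s2 + noise, m-point complex constellation.
    __device__ float2 cmul(float2 a, float2 b)
    {
        return make_float2(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x);
    }

    __global__ void ml_metrics(float2 y, float2 h1, float2 h2,
                               const float2* constellation, int m,
                               float* metric)   // m * m candidate metrics
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= m * m) return;

        float2 s1 = constellation[tid / m];
        float2 s2 = constellation[tid % m];

        float2 t1 = cmul(h1, s1);
        float2 t2 = cmul(h2, s2);
        float  dx = y.x - (t1.x + t2.x);
        float  dy = y.y - (t1.y + t2.y);

        metric[tid] = dx * dx + dy * dy;         // smaller is better
    }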


Great Lakes Symposium on VLSI | 2008

Improving FPGA routability using network coding

Kanupriya Gulati; Sunil P. Khatri

With current technology trends, FPGA routing is an important problem, since routing in FPGAs contributes significantly to delay and resource utilization compared to the logic portion of FPGAs. In this paper we improve FPGA routing characteristics by applying the technique of network coding. This relatively new technique was developed in the context of communication networks and has been shown to improve network throughput, reliability, and more. To the best of our knowledge, this paper is the first to apply network coding to improve FPGA routing. Our preliminary results are implemented in the VPR 4.30 tool suite. We demonstrate (on average) a 14% reduction in worst-case delay, a 3% reduction in wirelength, and a healthy reduction in routing track count on several MCNC benchmark circuits, over the current best known results. By using carefully generated cost models for applying network coding, we show that this routability improvement is accompanied by zero CLB utilization overhead and a runtime penalty of less than 0.5%. Our approach is orthogonal to existing routing algorithms, and can therefore be applied in tandem with them.


Great Lakes Symposium on VLSI | 2010

Boolean satisfiability on a graphics processor

Kanupriya Gulati; Sunil P. Khatri

Boolean Satisfiability (SAT) is a core NP-complete problem. Several heuristic software and hardware approaches have been proposed to solve this problem. In this paper we present a Boolean satisfiability approach with a new GPU-enhanced variable ordering heuristic. Our results demonstrate that over several satisfiable and unsatisfiable benchmarks, our technique (MESP) performs better than MiniSAT. We show a 2.35× speedup on average over 68 benchmarks from the SAT Race (2008) competition.
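
The abstract does not detail MESP's GPU-enhanced variable ordering heuristic; purely as an illustration of offloading a variable-scoring pass to the GPU, here is a CUDA sketch that counts literal occurrences per variable over a flattened clause database. This occurrence-based score is a generic stand-in assumed for illustration, not necessarily the heuristic used by MESP.

    #include <cuda_runtime.h>

    // Generic variable-scoring pass: count how often each variable occurs in the
    // clause database. 'literals' is the flattened clause list with DIMACS-style
    // signed literals; 'score' has one zero-initialized counter per variable.
    __global__ void count_occurrences(const int* literals, int num_literals,
                                      int* score)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_literals) return;

        int lit = literals[i];
        int var = (lit < 0) ? -lit : lit;   // variable index of this literal
        atomicAdd(&score[var], 1);          // several threads may update one variable
    }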
