Srinidhi Kestur
Pennsylvania State University
Publication
Featured research published by Srinidhi Kestur.
IEEE Computer Society Annual Symposium on VLSI | 2010
Srinidhi Kestur; John D. Davis; Oliver Williams
High Performance Computing (HPC) or scientific codes are being executed across a wide variety of computing platforms from embedded processors to massively parallel GPUs. We present a comparison of the Basic Linear Algebra Subroutines (BLAS) using double-precision floating point on an FPGA, CPU and GPU. On the CPU and GPU, we utilize standard libraries on state-of-the-art devices. On the FPGA, we have developed parameterized modular implementations for the dot-product and Gaxpy or matrix-vector multiplication. In order to obtain optimal performance for any aspect ratio of the matrices, we have designed a high-throughput accumulator to perform an efficient reduction of floating point values. To support scalability to large data-sets, we target the BEE3 FPGA platform. We use performance and energy efficiency as metrics to compare the different platforms. Results show that FPGAs offer comparable performance as well as 2.7 to 293 times better energy efficiency for the test cases that we implemented on all three platforms.
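The two BLAS kernels named in this abstract can be illustrated in software. The following is a minimal sketch, not the authors' FPGA design: the function names are hypothetical, and the pairwise (tree) summation is only an assumed software analogue of a hardware reduction, chosen to show why reducing many floating-point products is the central step in both kernels.

```python
def tree_reduce(values):
    """Sum values pairwise, level by level (an assumed analogue of a reduction tree)."""
    if not values:
        return 0.0
    level = list(values)
    while len(level) > 1:
        # Add neighbouring pairs; carry an unpaired last element to the next level.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]


def dot(x, y):
    """Dot product: element-wise multiply, then reduce."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return tree_reduce([a * b for a, b in zip(x, y)])


def gaxpy(A, x, y):
    """Gaxpy: return y + A x, with A given as a list of rows."""
    if len(A) != len(y):
        raise ValueError("A must have one row per element of y")
    return [yi + dot(row, x) for row, yi in zip(A, y)]


if __name__ == "__main__":
    print(gaxpy([[1, 2], [3, 4]], [1, 1], [10, 20]))
```

Each output element of the matrix-vector product is one dot product, so the cost of the reduction scales with the row length, which is why the aspect ratio of the matrix matters for throughput.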
Field-Programmable Custom Computing Machines | 2012
Srinidhi Kestur; Mi Sun Park; Jagdish Sabarad; Dharav Dantara; Vijaykrishnan Narayanan; Yang Chen; Deepak Khosla
A significant challenge in creating machines with artificial vision is designing systems which can process visual information as efficiently as the brain. To address this challenge, we identify key algorithms which model the process of attention and recognition in the visual cortex of mammals. This paper presents Cover, an FPGA framework for generating systems which can potentially emulate the visual cortex. We have designed accelerators for models of attention and recognition in the cortex and integrated them to realize an end-to-end attention-recognition system. Evaluation of our system on a Dinigroup multi-FPGA platform shows high performance and accuracy for attention and recognition systems and speedups over existing CPU, GPU and FPGA implementations. Results show that our end-to-end system which emulates the cortex can achieve near real-time speeds for high resolution images. This system can be applied to many artificial vision applications such as augmented virtual reality and autonomous vehicle navigation.
Asia and South Pacific Design Automation Conference | 2012
Jagdish Sabarad; Srinidhi Kestur; Mi Sun Park; Dharav Dantara; Vijaykrishnan Narayanan; Yang Chen; Deepak Khosla
Advances in neuroscience have enabled researchers to develop computational models of auditory, visual and learning perceptions in the human brain. HMAX, which is a biologically inspired model of the visual cortex, has been shown to outperform standard computer vision approaches for multi-class object recognition. HMAX, while computationally demanding, can be potentially applied in various applications such as autonomous vehicle navigation, unmanned surveillance and robotics. In this paper, we present a reconfigurable hardware accelerator for the time-consuming S2 stage of the HMAX model. The accelerator leverages spatial parallelism, dedicated wide data buses with on-chip memories to provide an energy efficient solution to enable adoption into embedded systems. We present a systolic array-based architecture which includes a run-time reconfigurable convolution engine which can perform multiple variable-sized convolutions in parallel. An automation flow is described for this accelerator which can generate optimal hardware configurations for a given algorithmic specification and also perform run-time configuration and execution seamlessly. Experimental results on Virtex-6 FPGA platforms show 5X to 11X speedups and 14X to 33X higher performance-per-Watt over a CNS-based implementation on a Tesla GPU.
Design, Automation, and Test in Europe | 2011
Srinidhi Kestur; Dharav Dantara; Vijaykrishnan Narayanan
Reconfigurable hardware such as FPGAs is being increasingly employed for accelerating compute-intensive applications. While recent advances in technology have increased the capacity of FPGAs, the lack of standard models for developing custom accelerators creates issues with scalability and compatibility. We present SHARC (Streaming Hardware Accelerator with Run-time Configurability), a model for FPGA-based accelerators. This model is at a lower level than existing stream processing models and provides the hardware designer with a flexible platform for developing custom accelerators. The SHARC model provides a generic interface for each hardware module and a hierarchical structure for parallelism at multiple levels in an accelerator. It also includes a parameterization and hierarchical run-time reconfiguration framework to enable hardware reuse for flexible yet high-throughput design. This model is very well suited for compute-intensive applications in areas such as real-time vision and signal processing, where stream processing provides enormous performance benefits. We present a case study by implementing a bio-inspired saliency-based visual attention system using the proposed model and demonstrate the benefits of run-time reconfiguration. Experimental results show about 5X speedup over an existing CPU implementation and up to 14X higher performance-per-watt over a relevant GPU implementation.
Field-Programmable Custom Computing Machines | 2010
Srinidhi Kestur; Sungho Park; Kevin M. Irick; Vijaykrishnan Narayanan
We present an FPGA accelerator for the Non-uniform Fast Fourier Transform, which is a technique to reconstruct images from arbitrarily sampled data. We accelerate the compute-intensive interpolation step of the NuFFT Gridding algorithm by implementing it on an FPGA. In order to ensure efficient memory performance, we present a novel FPGA implementation for Geometric Tiling based sorting of the arbitrary samples. The convolution is then performed by a novel Data Translation architecture which is composed of a multi-port local memory, dynamic coordinate-generator and a plug-and-play kernel pipeline. Our implementation is in single-precision floating point and has been ported onto the BEE3 platform. Experimental results show that our FPGA implementation can generate fairly high performance without sacrificing flexibility for various data-sizes and kernel functions. We demonstrate up to 8X speedup and up to 27 times higher performance-per-watt over a comparable CPU implementation and up to 20% higher performance-per-watt when compared to a relevant GPU implementation.
Design Automation Conference | 2011
Srinidhi Kestur; Kevin M. Irick; Sungho Park; Ahmed Al Maashri; Vijaykrishnan Narayanan; Chaitali Chakrabarti
Gridding is a method of interpolating irregularly sampled data on to a uniform grid and is a critical image reconstruction step in several applications which operate on non-Cartesian sampled data. In this paper, we present an algorithm-architecture co-design framework for accelerating gridding using FPGAs. We present a parameterized hardware library for accelerating gridding to support both arbitrary and regular trajectories. We further describe our kernel automation framework which supports several kernel functions through look-up-table (LUT) based Taylor polynomial evaluation. This framework is integrated using an in-house multi-FPGA development platform which provides hardware infrastructure for integrating custom accelerators. Design-space exploration is enabled by an automation flow which allows system generation from an algorithm specification. We further provide several case studies by realizing systems for nonuniform fast Fourier transform (NuFFT) with different parameter sets and porting them on to the BEE3 platform. Results show speedups of more than 16X and 2X over existing CPU and FPGA implementations respectively, and up to 5.5 times higher performance-per-watt over a comparable GPU implementation.
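As an illustration of the gridding operation defined at the start of this abstract, the sketch below spreads irregularly positioned one-dimensional samples onto a uniform integer grid using a small interpolation kernel. It is a software sketch under stated assumptions, not the paper's hardware library: the function names, the triangular kernel and the `half_width` parameter are hypothetical choices made for readability.

```python
import math


def triangle_kernel(distance, half_width):
    """Hypothetical interpolation kernel: weight falls linearly to zero at half_width."""
    return max(0.0, 1.0 - abs(distance) / half_width)


def grid_1d(samples, grid_size, half_width=1.0, kernel=triangle_kernel):
    """Accumulate (position, value) samples onto grid points 0..grid_size-1."""
    grid = [0.0] * grid_size
    for pos, val in samples:
        # Only grid points inside the kernel's support receive a contribution.
        lo = max(0, math.ceil(pos - half_width))
        hi = min(grid_size - 1, math.floor(pos + half_width))
        for g in range(lo, hi + 1):
            grid[g] += val * kernel(g - pos, half_width)
    return grid


if __name__ == "__main__":
    # A sample halfway between grid points 1 and 2 is split evenly between them.
    print(grid_1d([(1.5, 2.0)], grid_size=4))
```

Because each sample updates a small neighbourhood at a data-dependent location, the memory access pattern is irregular, which is the part of the computation that benefits most from dedicated hardware. Swapping in a different kernel only requires passing another function with the same two arguments.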
Design, Automation, and Test in Europe | 2012
Mi Sun Park; Srinidhi Kestur; Jagdish Sabarad; Vijaykrishnan Narayanan; Mary Jane Irwin
Recently, significant advances have been achieved in understanding visual information processing in the human brain. The focus of this work is on the design of an architecture to support HMAX, a widely accepted model of the human visual pathway. The computationally intensive nature of HMAX and its wide applicability in real-time visual analysis applications make the design of hardware accelerators a key necessity. In this work, we propose a configurable accelerator mapped efficiently onto an FPGA to realize real-time feature extraction for vision-based classification algorithms. Our innovations include the efficient mapping of the proposed architecture onto the FPGA as well as the design of an efficient memory structure. Our evaluation shows that the proposed approach is significantly faster than other contemporary solutions on different platforms.
Proceedings of SPIE | 2009
Kevin M. Irick; Michael DeBole; Sungho Park; Ahmed Al Maashri; Srinidhi Kestur; Chi Li Yu; Narayanan Vijaykrishnan
FPGAs have emerged as the preferred platform for implementing real-time signal processing applications. In the sub-45nm technologies, FPGAs offer significant cost and design-time advantages over application-specific custom chips and consume significantly less power than general-purpose processors while maintaining or improving performance. Moreover, FPGAs are more advantageous than GPUs in their support for control-intensive applications, custom bit-precision operations, and diverse system interface protocols. Nonetheless, a significant inhibitor to the widespread adoption of FPGAs has been the expertise required to effectively realize functional designs that maximize application performance. While there have been several academic and commercial efforts to improve the usability of FPGAs, they have primarily focused on easing the tasks of an expert FPGA designer rather than increasing the usability offered to an application developer. In this work, the design of a scalable algorithmic-level design framework for FPGAs, AlgoFLEX, is described. AlgoFLEX offers rapid algorithmic-level composition and exploration while maintaining the performance realizable from a fully custom, albeit difficult and laborious, design effort. The framework masks aspects of accelerator implementation, mapping, and communication while exposing appropriate algorithm tuning facilities to developers and system integrators. The effectiveness of the AlgoFLEX framework is demonstrated by rapidly mapping a class of image and signal processing applications to a multi-FPGA platform.
Design Automation Conference | 2013
Mi Sun Park; Chuanjun Zhang; Michael DeBole; Srinidhi Kestur; Vijaykrishnan Narayanan; Mary Jane Irwin
Video and image content has begun to play a growing role in many applications, ranging from video games to autonomous self-driving vehicles. In this paper, we present accelerators for gist-based scene recognition, saliency-based attention, and HMAX-based object recognition that have multiple uses and are based on the current understanding of the vision systems found in the visual cortex of the mammalian brain. By integrating them into a two-level hierarchical system, we improve recognition accuracy and reduce computational time. Results of our accelerator prototype on a multi-FPGA system show real-time performance and high recognition accuracy with large speedups over existing CPU, GPU and FPGA implementations.
Field Programmable Gate Arrays | 2011
Srinidhi Kestur; Dharav Dantara; Vijaykrishnan Narayanan
Oriented filters are used in many early vision and image processing tasks for feature extraction at arbitrary orientations. Steerable filters are a class of filters in which a filter of arbitrary orientation is synthesized as a linear combination of a set of basis filters. In this work, we describe a streaming implementation of a steerable filter on FPGAs, which includes a two-dimensional convolution filter and a modulator for modulation by a set of oriented sine waves. We present a highly configurable streaming 2D convolution implementation and a novel separable look-up-table based implementation of the modulation step. This steerable filter has been extended to multiple resolutions to realize a steerable pyramid filter. Experimental results on a Virtex-6 FPGA show that the steerable pyramid filter provides up to 14X and 21X speedups over related FPGA and CPU implementations respectively. This work was supported in part by NSF grant 0916887 and the DARPA Neovision2 program.
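The steerability property stated in this abstract, that a filter at any orientation is a linear combination of basis filters, can be shown with the textbook case of the first derivative of a Gaussian. The sketch below is only an illustration of that general property and is not the convolution-plus-modulation architecture the paper describes; the function names, `radius` and `sigma` are hypothetical.

```python
import math


def gaussian_derivative_basis(radius, sigma):
    """Return the x- and y-derivative-of-Gaussian kernels as lists of rows."""
    gx, gy = [], []
    for y in range(-radius, radius + 1):
        row_x, row_y = [], []
        for x in range(-radius, radius + 1):
            g = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row_x.append(-x / sigma**2 * g)  # d/dx of the (unnormalized) Gaussian
            row_y.append(-y / sigma**2 * g)  # d/dy of the (unnormalized) Gaussian
        gx.append(row_x)
        gy.append(row_y)
    return gx, gy


def steer(gx, gy, theta):
    """Synthesize the kernel oriented at angle theta (radians) from the two basis kernels."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * a + s * b for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]


if __name__ == "__main__":
    gx, gy = gaussian_derivative_basis(radius=2, sigma=1.0)
    kernel_45 = steer(gx, gy, math.pi / 4)
    print(kernel_45[2])  # middle row of the 45-degree kernel
```

In this case two basis kernels are enough for every orientation, so an image needs to be convolved only twice and the oriented responses can then be formed by weighting those two results.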