Arvind Sudarsanam
Utah State University
Publications
Featured research published by Arvind Sudarsanam.
IET Computers and Digital Techniques | 2010
Arvind Sudarsanam; Robert Collier Barnes; J. Carver; Ramachandra Kallam; Aravind Dasu
Field programmable gate arrays (FPGAs) are increasingly being adopted as the primary on-board computing system for autonomous deep space vehicles. There is a need to support several complex applications for navigation and image processing in a rapidly responsive on-board FPGA-based computer. Developing such a computer requires the designer to explore and combine several design concepts such as systolic array (SA) design, hardware-software partitioning and partial dynamic reconfiguration (PDR). In this study, a microprocessor/co-processor design that can simultaneously accelerate multiple single-precision floating-point algorithms is proposed. Two such algorithms are the extended Kalman filter (EKF) and the discrete wavelet transform (DWT). Key contributions include (i) a polymorphic systolic array (PolySA), comprising partially reconfigurable regions that can accelerate algorithms amenable to being mapped onto linear SAs, and (ii) a performance model to predict the overall execution time of the EKF algorithm on the proposed PolySA architecture. When implemented on a low-end Xilinx Virtex4 SX35 FPGA, the design provides a speed-up of at least 4.18× and 6.61× over a state-of-the-art microprocessor used in spacecraft systems for the EKF and DWT algorithms, respectively. The performance of the EKF algorithm on the proposed PolySA architecture was compared against its performance on two types of conventional (non-polymorphic) hardware architectures, and the results showed that the proposed architecture outperformed the other two in most of the test cases.
IEEE Computer Architecture Letters | 2009
Arvind Sudarsanam; Ramachandra Kallam; Aravind Dasu
Partial bitstream relocation (PBR) on FPGAs has been gaining attention in recent years as a potentially promising technique to scale parallelism of accelerator architectures at run time, enhance fault tolerance, etc. PBR techniques to date have focused on reading inactive bitstreams stored in memory, on-chip or off-chip, whose contents are generated for a specific partial reconfiguration region (PRR) and modified on demand for configuration into a PRR at a different location. As an alternative, we propose a PRR-PRR relocation technique to generate source and destination addresses, read the bitstream from an active PRR (source) in a non-intrusive manner, and write it to the destination PRR. We describe two options for realizing this on Xilinx Virtex 4 FPGAs: (a) a hardware-based accelerated relocation circuit (ARC) and (b) a software solution executed on MicroBlaze. A comparative performance analysis highlighting the speed-up obtained using ARC is presented. For real test cases, the performance of our implementations is compared to the estimated performance of two state-of-the-art methods.
International Conference on Parallel and Distributed Systems | 2004
Arvind Sudarsanam; Mayur Srinivasan; Sethuraman Panchanathan
Reconfigurable computing is an emerging paradigm of research that offers cost-effective solutions for computationally intensive applications through hardware reuse. There is a growing need in this domain for techniques to exploit parallelism inherent in the target application and to schedule the parallelized application. This paper proposes a method to estimate the optimal number of resources through critical path analysis while keeping resource utilization near optimal. We also propose an algorithm to optimally schedule the parallel threads of execution in linear time. Our algorithm is based on the idea of enhanced partial critical path (ePCP) and handles memory latencies and reconfiguration overheads. Results obtained show the effectiveness of our approach over other critical path based methods.
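The idea of estimating a resource count from critical path analysis can be illustrated with a minimal sketch. This is not the ePCP algorithm from the paper, and it ignores memory latencies and reconfiguration overheads. It only computes the critical path length of a task graph and the standard lower bound ceil(total work / critical path length) on the number of resources needed to finish within that time. The task names, durations, and function names are hypothetical.

```python
from collections import deque
from math import ceil


def critical_path_length(durations, edges):
    """Return the length of the longest duration-weighted path in a task DAG."""
    successors = {task: [] for task in durations}
    indegree = {task: 0 for task in durations}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Earliest finish time of each task, refined in topological order.
    finish = dict(durations)
    ready = deque(task for task in durations if indegree[task] == 0)
    visited = 0
    while ready:
        task = ready.popleft()
        visited += 1
        for nxt in successors[task]:
            finish[nxt] = max(finish[nxt], finish[task] + durations[nxt])
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if visited != len(durations):
        raise ValueError("task graph contains a cycle")
    return max(finish.values())


def resource_lower_bound(durations, edges):
    """Fewest resources that could finish all work within the critical path time."""
    total_work = sum(durations.values())
    return ceil(total_work / critical_path_length(durations, edges))


# Hypothetical task graph: a -> b -> d, a -> c -> d, and an independent task e.
durations = {"a": 2, "b": 3, "c": 1, "d": 4, "e": 2}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(critical_path_length(durations, edges))  # 9
print(resource_lower_bound(durations, edges))  # 2
```

The bound is only a starting point: a real scheduler may need more resources than this because of dependency structure.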
International Parallel and Distributed Processing Symposium | 2005
Dasu Aravind; Arvind Sudarsanam
This paper proposes a novel common subgraph extraction algorithm which aims to minimize the total number of gates (reconfiguration area overhead) involved in implementing compute-intensive scientific and media applications using reconfigurable architectures. The motivation behind the proposed research is illustrated using an example from the Biochemical Algorithms Library (BALL). The design of novel context adaptable architectures to implement common subgraphs is also proposed with an example from the video warping functions of the MPEG-4 standard. Three different models of mapping such architectures onto hybrid/pure FPGA systems are proposed. Estimates obtained by applying these techniques and architectures for various media and scientific functions are shown.
IET Computers and Digital Techniques | 2009
Jonathan Phillips; Arvind Sudarsanam; Harikrishna Samala; Ramachandra Kallam; J. Carver; Aravind Dasu
The configurable nature of field-programmable gate arrays (FPGAs) has allowed designers to take advantage of various data flow characteristics in application kernels to create custom architecture implementations, by optimising instruction-level parallelism (ILP) and pipelining at the register transfer level. However, not all applications are composed of pure data flow kernels. Intermingling of control and data flows in applications offers more interesting challenges in creating custom architectures. The authors present one possible way to take advantage of correlations that may be present among data flow graphs (DFGs) embedded in control flow graphs. In certain cases, where there is sufficient correlation and ILP, the proposed context adaptable architecture (CAA) design methodology results in an interesting and useful custom architecture for such embedded DFGs. Certain other application characteristics may demand the use of alternative methodologies such as partial and dynamic reconfiguration (PDR) and a mixture of PDR and common sub-graph methods (PDR-CSG). The authors present a rigorous analysis, combined with some benchmarking efforts, to showcase the differences, advantages and disadvantages of the CAA methodology relative to other methodologies. The authors also present an analysis of how the core underlying algorithm in their methodology compares with other published algorithms and the differences in resulting designs on an FPGA for a sample set of test cases.
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2010
Shant Chandrakar; Abraham Clements; Arvind Sudarsanam; Aravind Dasu
Fast Block Matching (FBM) algorithms for video compression are well suited for acceleration using a parallel data-path architecture on Field Programmable Gate Arrays (FPGAs). However, designing an efficient on-chip memory subsystem to provide the required throughput to this parallel data-path architecture is a complex problem. This paper proposes a memory architecture template that is explored using a Bounded Set algorithm to design efficient on-chip memory subsystems for FBM algorithms. The resulting memory subsystems are compared with three existing memory subsystems. Results show that our memory subsystems can provide full parallelism in the majority of test cases and can process integer pixels of a 1080p video sequence at a rate of up to 275 frames per second.
International Journal of Computers and Applications | 2010
Arvind Sudarsanam; Thomas Hauser; Aravind Dasu; Seth Young
This paper presents an approach to explore a commercial multi field programmable gate array (FPGA) system as a high performance accelerator, and addresses the problem of solving an LU-decomposed linear system of equations using forward and back substitution. A block-based right-hand-side solver algorithm is described, and novel data flow and memory architectures that can support arbitrary data types, block sizes and matrix sizes are proposed. These architectures have been implemented on a multi-FPGA system. The capabilities of the accelerator system are pushed to their limits by implementing the problem for double-precision complex floating-point data. Detailed timing data is presented and augmented with data from a performance model proposed in this paper. The performance of the accelerator system is evaluated against that of a state-of-the-art low power Beowulf cluster node running an optimized LAPACK implementation. Both systems are compared using the power efficiency (performance/watt) metric. The FPGA system is about eleven times more power efficient than the compute node of a cluster.
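As a point of reference for the forward and back substitution step named above, here is a minimal unblocked sketch in plain Python. It is not the block-based solver or the FPGA data flow from the paper. It assumes the factors L (lower triangular) and U (upper triangular) of A = LU are already available with nonzero diagonals, and the function names are hypothetical.

```python
def forward_substitution(L, b):
    """Solve L y = b for a lower-triangular L with a nonzero diagonal."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        partial = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - partial) / L[i][i]
    return y


def back_substitution(U, y):
    """Solve U x = y for an upper-triangular U with a nonzero diagonal."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        partial = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - partial) / U[i][i]
    return x


def lu_solve(L, U, b):
    """Solve (L U) x = b by forward then back substitution."""
    return back_substitution(U, forward_substitution(L, b))


# Example factors of A = [[4, 3], [6, 3]].
L = [[1.0, 0.0], [1.5, 1.0]]
U = [[4.0, 3.0], [0.0, -1.5]]
x = lu_solve(L, U, [10.0, 12.0])
print(x)  # [1.0, 2.0]
```

Because the code uses only arithmetic operators, the same functions also accept Python complex values, which loosely mirrors the complex data type the paper targets.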
International Journal of Reconfigurable Computing | 2009
Arvind Sudarsanam; Aravind Dasu; Karthik Vaithianathan
Design of flexible multimedia accelerators that can cater to multiple algorithms is being aggressively pursued in the media processors community. Such an approach is justified in the era of sub-45 nm technology, where an increasingly dominant leakage power component is forcing designers to make the best possible use of on-chip resources. In this paper we present an analysis of two commonly used window-based operations (sum of absolute differences and mean squared error) across a variety of search patterns and block sizes (2 × 3, 5 × 5, etc.). We propose a context adaptable architecture that has (i) a configurable 2D systolic array and (ii) a 2D Configurable Register Array (CRA). The CRA can cater to variable pixel access patterns while reusing fetched pixels across search windows. The benefits of the proposed architecture when compared to 15 other published architectures are adaptability, high throughput, and low latency, at a cost of increased footprint when ported to a Xilinx FPGA.
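The two window-based operations named above can be written out as a short software sketch. It shows only the arithmetic, not the systolic array or CRA hardware, and it uses an exhaustive search in place of the search patterns analysed in the paper. Blocks are assumed to be lists of equal-length rows of integer pixels, and all function names are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def mse(block_a, block_b):
    """Mean squared error between two equally sized pixel blocks."""
    count = sum(len(row) for row in block_a)
    squared = sum(
        (a - b) ** 2
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )
    return squared / count


def extract(frame, top, left, height, width):
    """Copy a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in frame[top:top + height]]


def best_match(frame, block, cost=sad):
    """Exhaustively search the frame; return (cost, top, left) of the best window."""
    height, width = len(block), len(block[0])
    best = None
    for top in range(len(frame) - height + 1):
        for left in range(len(frame[0]) - width + 1):
            c = cost(extract(frame, top, left, height, width), block)
            if best is None or c < best[0]:
                best = (c, top, left)
    return best


frame = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
block = [[7, 8], [11, 12]]
print(sad([[1, 2], [3, 4]], [[2, 2], [1, 5]]))  # 4
print(mse([[1, 2], [3, 4]], [[2, 2], [1, 5]]))  # 1.5
print(best_match(frame, block))                 # (0, 1, 2)
```

Adjacent windows in the search overlap heavily, which is the pixel reuse opportunity that a register array such as the CRA is described as exploiting.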
Field-Programmable Technology | 2006
Seth Young; Arvind Sudarsanam; Aravind Dasu; Thomas Hauser
LU matrix decomposition is a linear algebra algorithm used to reduce the complexity required to solve a large system of linear equations. Large systems of equations frequently need to be solved in physics, engineering, and computational chemistry. In the hardware implementation of such LU algorithms, supporting modules must be included which handle the transfer of memory between the disk and processing nodes. This paper looks at the data transfer hardware which supports an implementation of a block-based LU algorithm on a multi-FPGA system. Preliminary results are provided which show the required areas and latencies of these designs.
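For readers unfamiliar with the factorization itself, a minimal unblocked sketch follows. It is a textbook Doolittle elimination for real-valued matrices without row pivoting, not the block-based multi-FPGA algorithm or its data transfer hardware, and the function name is hypothetical.

```python
def lu_decompose(A):
    """Factor a square matrix A into unit-lower-triangular L and upper-triangular U.

    Assumption: no row pivoting is performed, so every pivot must be nonzero.
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[float(value) for value in row] for row in A]
    for k in range(n):
        if U[k][k] == 0:
            raise ZeroDivisionError("zero pivot; this sketch does no row pivoting")
        for i in range(k + 1, n):
            factor = U[i][k] / U[k][k]
            L[i][k] = factor  # record the multiplier used to eliminate U[i][k]
            for j in range(k, n):
                U[i][j] -= factor * U[k][j]
    return L, U


L, U = lu_decompose([[4, 3], [6, 3]])
print(L)  # [[1.0, 0.0], [1.5, 1.0]]
print(U)  # [[4.0, 3.0], [0.0, -1.5]]
```

Once A is factored, any right-hand side can be solved with two triangular solves instead of a fresh elimination, which is where the reduction in work comes from.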
Field-Programmable Technology | 2005
Arvind Sudarsanam; Dasu Aravind
This paper addresses the problem of solving a system of linear interval equations (an NP-hard problem), wherein the coefficients on the LHS and the RHS are all represented using intervals. This problem is transformed into a global optimization problem, and a modified branch and bound algorithm suited for an FPGA-based implementation is proposed. This algorithm is modified to extract parallelism, and further speed-up is achieved by pipelining the implementation. The implementation was designed using Xilinx ISE 6.1, with VHDL as the design entry language. A speed-up of 14× for a Xilinx Virtex 2P30 FPGA over a 1.5 GHz Intel Centrino processor-based implementation was obtained.