Publication


Featured research published by Keshavan Varadarajan.


ACM Transactions on Embedded Computing Systems | 2009

REDEFINE: Runtime reconfigurable polymorphic ASIC

Mythri Alle; Keshavan Varadarajan; Alexander Fell; C. Ramesh Reddy; Nimmy Joseph; Saptarsi Das; Prasenjit Biswas; Jugantor Chetia; Adarsh Rao; S. K. Nandy; Ranjani Narayan

Emerging embedded applications are based on evolving standards (e.g., MPEG2/4, H.264/265, IEEE 802.11a/b/g/n). Since most of these applications run on handheld devices, there is an increasing need for a single-chip solution that can dynamically interoperate between different standards and their derivatives. In order to achieve high resource utilization and low power dissipation, we propose REDEFINE, a polymorphic ASIC in which specialized hardware units are replaced with basic hardware units that can create the same functionality by runtime re-composition. It is a “future-proof” custom hardware solution for multiple applications and their derivatives in a domain. In this article, we describe a compiler framework and supporting hardware comprising compute, storage, and communication resources. Applications described in a high-level language (e.g., C) are compiled into application substructures. For each application substructure, a set of compute elements (CEs) on the hardware is interconnected at runtime to form a pattern that closely matches the communication pattern of that particular application. The advantage is that the CEs so composed are neither processor cores nor logic elements as in FPGAs. Hence, REDEFINE offers the power and performance advantage of an ASIC and the hardware reconfigurability and programmability of an FPGA or instruction-set processor. In addition, the hardware supports custom instruction pipelining. Existing instruction-set extensible processors determine a sequence of instructions that repeatedly occurs within the application and create custom instructions at design time to speed up the execution of this sequence. We extend this scheme further: a kernel is compiled into custom instructions whose operations bear a strong producer-consumer relationship (not limited to frequently occurring sequences of instructions). Custom instructions, realized as hardware compositions effected at runtime, allow several instances of the same custom instruction to be active in parallel. A key distinguishing factor in the majority of emerging embedded applications is stream processing. To reduce the overheads of data transfer between custom instructions, direct communication paths are employed among custom instructions. In this article, we present an overview of the hardware-aware compiler framework, which determines the NoC-aware schedule of transports of the data exchanged between the custom instructions on the interconnect. The results for the FFT kernel indicate a 25% reduction in the number of loads/stores, and throughput improves by log(n) for an n-point FFT when compared to a sequential implementation. Overall, REDEFINE offers flexibility and runtime reconfigurability at the expense of 1.16× in power and 8× in area when compared to an ASIC. The REDEFINE implementation consumes 0.1× the power of an FPGA implementation. In addition, the configuration overhead of the FPGA implementation is 1,000× more than that of REDEFINE.
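To make the custom-instruction formation described above concrete, the following is a minimal Python sketch that groups dataflow operations connected by heavy producer-consumer traffic into clusters. The toy graph, traffic weights, threshold, and union-find helpers are all illustrative assumptions; this is not the REDEFINE compiler's actual algorithm.

# Illustrative sketch only: a toy heuristic that fuses dataflow operations
# with heavy producer-consumer traffic into "custom instruction" clusters.
# The graph, weights, and threshold are made up for demonstration.

from collections import defaultdict

# Toy dataflow graph: (producer, consumer) -> number of values forwarded
edges = {
    ("load_a", "mul"): 4,
    ("load_b", "mul"): 4,
    ("mul", "add"): 4,
    ("bias", "add"): 1,
    ("add", "store"): 1,
}

THRESHOLD = 2  # fuse producer and consumer if traffic exceeds this

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Greedily fuse heavy producer-consumer pairs into one cluster.
for (prod, cons), traffic in sorted(edges.items(), key=lambda kv: -kv[1]):
    if traffic > THRESHOLD:
        union(prod, cons)

clusters = defaultdict(list)
for node in {n for e in edges for n in e}:
    clusters[find(node)].append(node)

for i, ops in enumerate(clusters.values()):
    print(f"custom instruction {i}: {sorted(ops)}")

Operations ending up in the same cluster would communicate over direct paths rather than through loads and stores, which is the source of the load/store reduction claimed for the FFT kernel.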


International Symposium on Microarchitecture | 2006

Molecular Caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions

Keshavan Varadarajan; S. K. Nandy; Vishal Sharda; Amrutur Bharadwaj; Ravi R. Iyer; Srihari Makineni; Donald Newell

CMPs enable simultaneous execution of multiple applications on the same platform, where the applications share cache resources. Diversity in the cache access patterns of these simultaneously executing applications can potentially trigger inter-application interference, leading to cache pollution. Whereas a large cache can ameliorate this problem, the increase in power consumption with increasing cache size, amplified at sub-100nm technologies, makes this solution prohibitive. In this paper, in order to address the issues relating to power-aware performance of caches, we propose a caching structure that provides the following: 1) definition of application-specific cache partitions as an aggregation of caching units (molecules), where the parameters of each molecule, namely size, associativity, and line size, are chosen so that its power consumption and access time are optimal for the given technology; 2) application-specific resizing of cache partitions with variable and adaptive associativity per cache line, way size, and variable line size; 3) a replacement policy that is transparent to the partition in terms of size, heterogeneity in associativity, and line size. Through simulation studies we establish the superiority of molecular caches (caches built as aggregations of molecules), which offer a 29% power advantage over an equivalently performing traditional cache.
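A minimal sketch of the partition-as-aggregation idea follows, assuming hypothetical molecule parameters and application names; it only models the bookkeeping (aggregating fixed-parameter molecules into per-application regions), not the paper's replacement policy or power model.

# Illustrative sketch only: modelling a cache partition as an aggregation of
# fixed-parameter caching units ("molecules"). Parameter values are invented
# for demonstration and are not taken from the paper.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Molecule:
    size_kb: int        # capacity of one molecule
    associativity: int  # ways per set
    line_size: int      # bytes per line

@dataclass
class Partition:
    app: str
    molecules: list = field(default_factory=list)

    def resize(self, molecule: Molecule, count: int) -> None:
        """Grow this application's partition by `count` identical molecules."""
        self.molecules.extend([molecule] * count)

    @property
    def total_kb(self) -> int:
        return sum(m.size_kb for m in self.molecules)

# A hypothetical technology-optimal molecule shared by all partitions.
base = Molecule(size_kb=16, associativity=4, line_size=64)

streaming = Partition("video_decode")
streaming.resize(base, 2)          # streaming app needs little capacity
database = Partition("db_scan")
database.resize(base, 8)           # cache-hungry app gets a larger region

print(streaming.total_kb, database.total_kb)   # 32 128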


Applied Reconfigurable Computing | 2009

Compiling Techniques for Coarse Grained Runtime Reconfigurable Architectures

Mythri Alle; Keshavan Varadarajan; Alexander Fell; S. K. Nandy; Ranjani Narayan

In this paper we develop compilation techniques for the realization of applications described in a High-Level Language (HLL) onto a Runtime Reconfigurable Architecture. The compiler determines Hyper Operations (HyperOps), which are subgraphs of an application's data flow graph comprising elementary operations that have a strong producer-consumer relationship. These HyperOps are hosted on computation structures that are provisioned on demand at runtime. We also report compiler optimizations that collectively reduce the overheads of data-driven computations in runtime reconfigurable architectures. On average, HyperOps offer a 44% reduction in total execution time and an 18% reduction in management overheads as compared to using basic blocks as coarse-grained operations. We show that HyperOps formed using our compiler are suitable for supporting data flow software pipelining.


Application-Specific Systems, Architectures and Processors | 2008

RECONNECT: A NoC for polymorphic ASICs using a low overhead single cycle router

Joseph Nimmy; C. Ramesh Reddy; Keshavan Varadarajan; Mythri Alle; Alexander Fell; S. K. Nandy; Ranjani Narayan

A polymorphic ASIC is a runtime reconfigurable hardware substrate comprising compute and communication elements. It is a “future-proof” custom hardware solution for multiple applications and their derivatives in a domain. Interoperability between application derivatives at runtime is achieved through hardware reconfiguration. In this paper we present the design of a single-cycle Network on Chip (NoC) router that is responsible for effecting runtime reconfiguration of the hardware substrate. The router design is optimized to avoid FIFO buffers at the input port and loop-back at the output crossbar. It provides virtual channels to emulate a non-blocking network and supports a simple X-Y relative addressing scheme to limit the control overhead to 9 bits per packet. The 8×8 honeycomb NoC (RECONNECT), implemented in a 130 nm UMC CMOS standard cell library, operates at 500 MHz and has a bisection bandwidth of 28.5 GBps. The network is characterized for random, self-similar, and application-specific traffic patterns that model the execution of multimedia and DSP kernels with varying network loads and virtual channels. Our implementation with 4 virtual channels has an average network latency of 24 clock cycles and a throughput of 62.5% of the network capacity for random traffic. For application-specific traffic the latency is 6 clock cycles and the throughput is 87% of the network capacity.
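The sketch below shows one plausible way an X-Y relative address can be encoded and consumed hop by hop on an 8×8 array. The field widths, port names, and dimension-ordered routing below are assumptions for illustration; the actual 9-bit RECONNECT header layout is not reproduced here.

# Illustrative sketch only: a possible relative X-Y addressing scheme for an
# 8x8 array. Field widths and port names are assumptions for demonstration,
# not the RECONNECT packet format.

def encode(dx: int, dy: int) -> int:
    """Pack signed hop offsets (-7..7 each) into 4-bit two's-complement fields."""
    assert -7 <= dx <= 7 and -7 <= dy <= 7
    return ((dx & 0xF) << 4) | (dy & 0xF)

def decode(header: int):
    def signed4(v):            # interpret a 4-bit field as two's complement
        return v - 16 if v & 0x8 else v
    return signed4((header >> 4) & 0xF), signed4(header & 0xF)

def route_one_hop(header: int) -> tuple[str, int]:
    """Dimension-ordered (X then Y) routing: pick a port, update the header."""
    dx, dy = decode(header)
    if dx:
        step = -1 if dx < 0 else 1
        return ("WEST" if dx < 0 else "EAST"), encode(dx - step, dy)
    if dy:
        step = -1 if dy < 0 else 1
        return ("SOUTH" if dy < 0 else "NORTH"), encode(dx, dy - step)
    return "LOCAL", header     # reached the destination router

hdr = encode(2, -1)            # destination: 2 hops east, 1 hop south
while True:
    port, hdr = route_one_hop(hdr)
    print(port)
    if port == "LOCAL":
        break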


Application-Specific Systems, Architectures and Processors | 2008

Synthesis of application accelerators on Runtime Reconfigurable Hardware

Mythri Alle; Keshavan Varadarajan; Ramesh C. Ramesh; Joseph Nimmy; Alexander Fell; Adarsha Rao; S. K. Nandy; Ranjani Narayan

Application accelerators are predominantly ASICs. The cost of ASIC solutions is orders of magnitude higher than that of programmable processing cores. Despite this, ASIC solutions are preferred when both high performance and low power are the targets. ASICs offer no flexibility to cater to application derivatives unless this has been provisioned for at design time. In this paper we define the architecture of Runtime Reconfigurable Hardware (RRH) as the platform for application acceleration. The proposed RRH is a homogeneous fabric comprising computing, storage, and communication resources. We also propose a synthesis methodology to realize applications written in a high-level language (HLL) on the RRH. Applications described in the HLL are compiled into application substructures. For each application substructure, a set of Compute Elements (CEs), interconnected in a manner that closely matches the communication pattern within it, is allocated. CEs in such a configuration are called a hardware affine. Hardware affines are defined at compile time and are carved out (provisioned) on the RRH fabric at runtime. Because these hardware affines are not instruction-set processor cores or logic elements as in FPGAs, we retain the performance and power advantage of an ASIC and the hardware reconfigurability/programmability of an FPGA or instruction-set processor.
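As a hedged illustration of carving out a hardware affine, the sketch below greedily places the operations of a small substructure onto a patch of CEs so that communicating operations land on nearby cells. The grid size, substructure graph, and placement heuristic are invented for demonstration and do not reflect the paper's synthesis methodology.

# Illustrative sketch only: a toy greedy placement that allocates a set of
# nearby compute elements (CEs) to one application substructure, so that
# communicating operations sit close together. All names and the heuristic
# are assumptions for demonstration.

from itertools import product

GRID = 4  # a 4x4 patch of the fabric, purely for illustration
free_cells = set(product(range(GRID), range(GRID)))

# Substructure communication graph: op -> ops it sends data to.
talks_to = {
    "ld": ["mac0"],
    "mac0": ["mac1"],
    "mac1": ["st"],
    "st": [],
}

placement = {}

def cost(cell, op):
    """Total Manhattan distance to already-placed communication partners."""
    partners = set(talks_to[op]) | {p for p, cs in talks_to.items() if op in cs}
    placed = [placement[p] for p in partners if p in placement]
    return sum(abs(cell[0] - x) + abs(cell[1] - y) for x, y in placed)

for op in talks_to:                       # place ops one by one
    best = min(sorted(free_cells), key=lambda c: cost(c, op))
    placement[op] = best
    free_cells.remove(best)

print(placement)   # the CEs allocated to this hardware affine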


Field-Programmable Logic and Applications | 2007

REDEFINE: Architecture of a SoC Fabric for Runtime Composition of Computation Structures

A.N. Satrawala; Keshavan Varadarajan; M. Lie; S. K. Nandy; R. Narayan

In this paper we propose the architecture of a SoC fabric onto which applications described in an HLL are synthesized. The fabric is a homogeneous layout of computation, storage, and communication resources on silicon. Through a process of composition of resources (as opposed to decomposition of applications), application-specific computational structures are defined on the fabric at runtime to realize different modules of the applications in hardware. Applications synthesized on this fabric offer performance comparable to ASICs while retaining the programmability of processing cores. We outline the application synthesis methodology through examples, and compare our results with software implementations on traditional platforms with unbounded resources.


Compilers, Architecture, and Synthesis for Embedded Systems | 2009

Streaming FFT on REDEFINE-v2: an application-architecture design space exploration

Alexander Fell; Mythri Alle; Keshavan Varadarajan; Prasenjit Biswas; Saptarsi Das; Jugantor Chetia; S. K. Nandy; Ranjani Narayan

In this paper we explore an implementation of a high-throughput, streaming application on REDEFINE-v2, an enhancement of REDEFINE. REDEFINE is a polymorphic ASIC combining the flexibility of a programmable solution with the execution speed of an ASIC. In REDEFINE, Compute Elements are arranged in an 8×8 grid connected via a Network on Chip (NoC) called RECONNECT to realize the various macrofunctional blocks of an equivalent ASIC. For a 1024-point FFT we carry out an application-architecture design space exploration by examining various characterizations of the Compute Elements in terms of the size of the instruction store. We further study the impact of using application-specific, vectorized FUs. By setting up different partitions of the FFT algorithm for persistent execution on REDEFINE-v2, we derive the benefits of pipelined execution for higher performance. The impact of the REDEFINE-v2 micro-architecture for an arbitrary N-point FFT (N > 4096) is also analyzed. We report the various algorithm-architecture tradeoffs in terms of area and execution speed against an ASIC implementation. In addition, we compare the performance gain with respect to a GPP.
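The kind of stage-wise FFT partitioning that lends itself to pipelined (streaming) execution can be sketched in plain Python as below: a radix-2 FFT written as a chain of log2(N) butterfly stages, where each stage could in principle become a pipeline step. This is a software reference, demonstrated on a small input, and makes no claim about the actual REDEFINE-v2 mapping.

# Illustrative sketch only: a radix-2 FFT expressed as log2(N) butterfly
# stages; each stage is a candidate pipeline step in a streaming realization.

import cmath

def bit_reverse(x):
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]

def fft_stage(x, stage):
    """Apply one butterfly stage (stage = 1..log2(N)) in place."""
    n = len(x)
    m = 1 << stage                      # butterfly span at this stage
    w_m = cmath.exp(-2j * cmath.pi / m)
    for start in range(0, n, m):
        w = 1.0
        for k in range(m // 2):
            a = x[start + k]
            b = x[start + k + m // 2] * w
            x[start + k] = a + b
            x[start + k + m // 2] = a - b
            w *= w_m
    return x

def fft(x):
    x = bit_reverse(list(x))
    for stage in range(1, len(x).bit_length()):
        x = fft_stage(x, stage)          # each stage could be a pipeline step
    return x

# Quick check on a small 8-point input.
data = [complex(i, 0) for i in range(8)]
print([round(abs(v), 3) for v in fft(data)])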


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2010

Design space exploration of systolic realization of QR factorization on a runtime reconfigurable platform

Prasenjit Biswas; Keshavan Varadarajan; Mythri Alle; S. K. Nandy; Ranjani Narayan

In the world of high-performance computing, considerable effort has been put into accelerating Numerical Linear Algebra (NLA) kernels like QR Decomposition (QRD), with the added advantages of reconfigurability and scalability. While popular custom hardware solutions in the form of systolic arrays can deliver high performance, they are not scalable, and hence not commercially viable. In this paper, we show how systolic solutions of QRD can be realized efficiently on REDEFINE, a scalable runtime reconfigurable hardware platform. We propose various enhancements to REDEFINE to meet the custom needs of accelerating NLA kernels. We further perform a design space exploration of the proposed solution for an arbitrary application of size n×n. We determine the right size of the sub-array in accordance with the optimal pipeline depth of the core execution units and the number of such units to be used per sub-array.
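Systolic QRD arrays are classically built around Givens rotations; the sketch below is a plain-Python Givens QR, given only as a functional reference for the kernel being accelerated, not as the REDEFINE realization or the paper's systolic schedule.

# Illustrative sketch only: QR decomposition via Givens rotations, the style
# of computation classically mapped onto triangular systolic arrays.

import math

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] applied to (a, b) zeroes b."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r

def qr_givens(A):
    """Givens QR of a square matrix A (list of lists)."""
    n = len(A)
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(n)] for i in range(n)]
    for j in range(n):                 # zero the sub-diagonal of column j
        for i in range(n - 1, j, -1):
            c, s = givens(R[i - 1][j], R[i][j])
            for k in range(n):         # rotate rows i-1 and i of R and Q
                R[i - 1][k], R[i][k] = c * R[i - 1][k] + s * R[i][k], \
                                       -s * R[i - 1][k] + c * R[i][k]
                Q[i - 1][k], Q[i][k] = c * Q[i - 1][k] + s * Q[i][k], \
                                       -s * Q[i - 1][k] + c * Q[i][k]
    return Q, R                        # A = Q^T R with this convention

A = [[4.0, 1.0, 2.0],
     [2.0, 3.0, 0.0],
     [1.0, 2.0, 5.0]]
Q, R = qr_givens(A)
print([round(R[i][0], 6) for i in range(3)])   # sub-diagonal of column 0 is ~0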


IEEE Computer Society Annual Symposium on VLSI | 2010

Accelerating Numerical Linear Algebra Kernels on a Scalable Run Time Reconfigurable Platform

Prasenjit Biswas; Pramod Udupa; Rajdeep Mondal; Keshavan Varadarajan; Mythri Alle; S. K. Nandy; Ranjani Narayan

Numerical Linear Algebra (NLA) kernels are at the heart of all computational problems. These kernels require hardware acceleration for increased throughput. NLA solvers for dense and sparse matrices differ in the way the matrices are stored and operated upon, although they exhibit similar computational properties. While ASIC solutions for NLA solvers can deliver high performance, they are not scalable, and hence are not commercially viable. In this paper, we show how NLA kernels can be accelerated on REDEFINE, a scalable runtime reconfigurable hardware platform. Compared to a software implementation, the Direct Solver (Modified Faddeev's algorithm) on REDEFINE shows a 29X improvement on average, and the Iterative Solver (Conjugate Gradient algorithm) shows a 15-20% improvement. We further show that the solution on REDEFINE scales to larger problem sizes without any notable degradation in performance.
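For reference, the Conjugate Gradient iterative solver named above can be sketched in a few lines of plain Python. This is the textbook algorithm as a software baseline, with an invented 2x2 test system; it is not the accelerated REDEFINE implementation.

# Illustrative sketch only: a plain-Python Conjugate Gradient solver for
# symmetric positive-definite systems, used here as a functional reference.

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A (lists of lists)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                               # residual r = b - A x (x = 0)
    p = r[:]
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# 2x2 SPD system with exact solution x = [1, 2].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [6.0, 7.0]
print([round(v, 6) for v in conjugate_gradient(A, b)])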


Applied Reconfigurable Computing | 2011

Dataflow graph partitioning for optimal spatio-temporal computation on a coarse grain reconfigurable architecture

Ratna Krishnamoorthy; Keshavan Varadarajan; Masahiro Fujita; Mythri Alle; S. K. Nandy; Ranjani Narayan

Coarse Grain Reconfigurable Architectures (CGRAs) support spatial and temporal computation to speed up execution and reduce reconfiguration time. Compilation thus involves partitioning instructions spatially and scheduling them temporally. We extend the Edge-Betweenness Centrality scheme, originally used for detecting community structures in social and biological networks, to partition the instructions of a dataflow graph. Comparisons of execution time for several applications, run on a simulator for REDEFINE, a CGRA proposed in the literature, indicate that the Centrality scheme outperforms several other schemes, with a 2-18% speedup in execution time.
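The edge-betweenness idea (Girvan-Newman) can be tried on a toy dataflow graph with networkx's generic implementation, as in the hedged sketch below. It assumes networkx is installed, and the graph and "bridge" edge are invented for demonstration; the REDEFINE compiler's adaptation is not reproduced.

# Illustrative sketch only: partitioning a toy dataflow graph with the
# edge-betweenness (Girvan-Newman) scheme via networkx.

import networkx as nx
from networkx.algorithms.community import girvan_newman

# Toy dataflow graph: two tightly coupled operation clusters joined by one edge.
G = nx.Graph()
G.add_edges_from([
    ("ld0", "mul0"), ("mul0", "add0"), ("ld1", "add0"), ("ld0", "add0"),
    ("add0", "mul1"),                                    # the "bridge" edge
    ("mul1", "add1"), ("ld2", "mul1"), ("add1", "st0"), ("mul1", "st0"),
])

# Edges with the highest betweenness are the natural cut points.
ebc = nx.edge_betweenness_centrality(G)
print(max(ebc, key=ebc.get))          # expected: the bridge ('add0', 'mul1')

# Remove high-betweenness edges until two communities (spatial partitions) remain.
partitions = next(girvan_newman(G))
print([sorted(p) for p in partitions])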

Collaboration


Dive into Keshavan Varadarajan's collaborations.

Top Co-Authors

S. K. Nandy, Indian Institute of Science
Mythri Alle, Indian Institute of Science
Ranjani Narayan, Indian Institute of Science
Alexander Fell, Indraprastha Institute of Information Technology
Saptarsi Das, Indian Institute of Science
Prasenjit Biswas, Indian Institute of Science
Ganesh Garga, Indian Institute of Science
C. Ramesh Reddy, Indian Institute of Science