Publication


Featured research published by Mythri Alle.


ACM Transactions on Embedded Computing Systems | 2009

REDEFINE: Runtime reconfigurable polymorphic ASIC

Mythri Alle; Keshavan Varadarajan; Alexander Fell; C. Ramesh Reddy; Nimmy Joseph; Saptarsi Das; Prasenjit Biswas; Jugantor Chetia; Adarsh Rao; S. K. Nandy; Ranjani Narayan

Emerging embedded applications are based on evolving standards (e.g., MPEG2/4, H.264/265, IEEE802.11a/b/g/n). Since most of these applications run on handheld devices, there is an increasing need for a single chip solution that can dynamically interoperate between different standards and their derivatives. In order to achieve high resource utilization and low power dissipation, we propose REDEFINE, a polymorphic ASIC in which specialized hardware units are replaced with basic hardware units that can create the same functionality by runtime re-composition. It is a “future-proof” custom hardware solution for multiple applications and their derivatives in a domain. In this article, we describe a compiler framework and supporting hardware comprising compute, storage, and communication resources. Applications described in a high-level language (e.g., C) are compiled into application substructures. For each application substructure, a set of compute elements (CEs) on the hardware is interconnected at runtime to form a pattern that closely matches the communication pattern of that particular application. The advantage is that these bound CEs are neither processor cores nor logic elements as in FPGAs. Hence, REDEFINE offers the power and performance advantage of an ASIC together with the hardware reconfigurability and programmability of an FPGA or instruction-set processor. In addition, the hardware supports custom instruction pipelining. Existing instruction-set extensible processors determine a sequence of instructions that repeatedly occurs within the application to create custom instructions at design time to speed up the execution of this sequence. We extend this scheme further: a kernel is compiled into custom instructions that bear a strong producer-consumer relationship (and are not limited to frequently occurring sequences of instructions). Custom instructions, realized as hardware compositions effected at runtime, allow several instances of the same custom instruction to be active in parallel. A key distinguishing factor in the majority of emerging embedded applications is stream processing. To reduce the overheads of data transfer between custom instructions, direct communication paths are employed among them. In this article, we present an overview of the hardware-aware compiler framework, which determines the NoC-aware schedule of transports of the data exchanged between the custom instructions on the interconnect. The results for the FFT kernel indicate a 25% reduction in the number of loads/stores, and throughput improves by log(n) for an n-point FFT when compared to a sequential implementation. Overall, REDEFINE offers flexibility and runtime reconfigurability at the expense of 1.16× in power and 8× in area when compared to an ASIC. The REDEFINE implementation consumes 0.1× the power of an FPGA implementation. In addition, the configuration overhead of the FPGA implementation is 1,000× more than that of REDEFINE.
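
As a rough, back-of-the-envelope illustration of the throughput claim above (not a figure from the paper): an n-point radix-2 FFT has log2(n) butterfly stages, and keeping all stages resident and pipelined lets a new block enter every stage-time, whereas a sequential implementation finishes all stages before the next block starts. The small Python sketch below, with assumed function and parameter names, just works through that arithmetic.

    import math

    def fft_throughput_gain(n: int, blocks: int) -> float:
        """Steady-state throughput gain of pipelining the log2(n) FFT stages
        over running them sequentially, ignoring fill/drain detail."""
        stages = int(math.log2(n))              # butterfly stages in an n-point radix-2 FFT
        sequential_time = blocks * stages       # each block occupies all stages in turn
        pipelined_time = stages + (blocks - 1)  # fill once, then one block per stage-time
        return sequential_time / pipelined_time

    # For a 1024-point FFT the gain approaches log2(1024) = 10 as the stream grows.
    print(fft_throughput_gain(1024, blocks=10_000))   # ~9.99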


Application-Specific Systems, Architectures, and Processors | 2006

High Performance VLSI Architecture Design for H.264 CAVLC Decoder

Mythri Alle; Jayanta Biswas; S. K. Nandy

The H.264 video standard achieves high quality video along with high data compression when compared to other existing video standards. H.264 uses context-based adaptive variable length coding (CAVLC) to code residual data in the baseline profile. In this paper, the authors describe a novel architecture for a CAVLC decoder, including the coeff_token decoder, level decoder, total_zeros decoder and run_before decoder. A UMC library in 0.13 µm CMOS technology is used to synthesize the proposed design. The proposed design reduces chip area and improves the critical path performance of the CAVLC decoder in comparison with (Chang et al., 2005). Macroblock-level (including luma and chroma) pipeline processing for CAVLC is implemented with an average of 141 cycles (including pipeline buffering) per macroblock at a 250 MHz clock frequency. To compare our results with (Chang et al., 2005), the clock frequency is constrained to 125 MHz. The area required for the proposed architecture is 17586 gates, which is a 22.1% improvement in comparison to (Chang et al., 2005). The authors obtain a throughput of 1.73 × 10⁶ macroblocks/second, which is 28% higher than that reported in [1]. The proposed design meets the processing requirement of 1080HD video at 30 frames/second.
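
For context on the decode order that the coeff_token, level, total_zeros and run_before decoders implement, the Python sketch below shows only the final step of CAVLC residual decoding, placing already-decoded symbols back into the scan; the bitstream parsing is omitted, and the function name and example values are illustrative assumptions rather than details from the paper.

    def place_cavlc_coefficients(levels, runs, max_coeffs=16):
        """Rebuild the residual scan from CAVLC symbols.
        levels[0] is the highest-frequency nonzero coefficient (decoded first);
        runs[i] is the run_before value for levels[i] (zeros preceding it), with
        runs[-1] holding the zeros remaining before the lowest-frequency coefficient."""
        coeffs = [0] * max_coeffs
        pos = len(levels) + sum(runs) - 1      # scan position of the first decoded level
        for lvl, run in zip(levels, runs):
            coeffs[pos] = lvl
            pos -= 1 + run                     # skip the zero run down to the next coefficient
        return coeffs

    # Example: coeff_token gave TotalCoeff=3 (one trailing one), levels decoded as
    # [+1, -2, 5] from high to low frequency, total_zeros=4 split as run_before=[1, 0, 3].
    print(place_cavlc_coefficients([1, -2, 5], [1, 0, 3]))
    # -> [0, 0, 0, 5, -2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]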


Applied Reconfigurable Computing | 2009

Compiling Techniques for Coarse Grained Runtime Reconfigurable Architectures

Mythri Alle; Keshavan Varadarajan; Alexander Fell; S. K. Nandy; Ranjani Narayan

In this paper we develop compilation techniques for the realization of applications described in a High Level Language (HLL) onto a Runtime Reconfigurable Architecture. The compiler determines Hyper Operations (HyperOps), which are subgraphs of the data flow graph of an application and comprise elementary operations that have a strong producer-consumer relationship. These HyperOps are hosted on computation structures that are provisioned on demand at runtime. We also report compiler optimizations that collectively reduce the overheads of data-driven computations in runtime reconfigurable architectures. On average, HyperOps offer a 44% reduction in total execution time and an 18% reduction in management overheads compared to using basic blocks as coarse grained operations. We show that HyperOps formed using our compiler are suitable to support data flow software pipelining.
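
To make the HyperOp idea concrete, the sketch below greedily groups a toy dataflow graph into clusters along producer-consumer edges, up to a fixed cluster size. This is a deliberately simplified illustration of the principle, with assumed names and a made-up example graph, not the compiler's actual HyperOp-formation algorithm.

    def form_hyperops(edges, max_size=4):
        """Greedy grouping of a dataflow graph (producer -> consumer edges)
        into HyperOp-like clusters that keep producer-consumer pairs together."""
        cluster_of, clusters = {}, []

        def new_cluster(*ops):
            c = set(ops)
            clusters.append(c)
            for op in ops:
                cluster_of[op] = c
            return c

        for producer, consumer in edges:        # edges assumed in topological order
            pc = cluster_of.get(producer)
            if pc is None:
                pc = new_cluster(producer)
            if consumer in cluster_of:
                continue                        # consumer already placed by another producer
            if len(pc) < max_size:
                pc.add(consumer)                # keep the producer-consumer pair together
                cluster_of[consumer] = pc
            else:
                new_cluster(consumer)           # start a fresh cluster when this one is full
        return clusters

    # Toy DFG: a small expression tree; ops that feed each other end up together.
    dfg = [("ld_a", "mul1"), ("ld_b", "mul1"), ("mul1", "add1"),
           ("ld_c", "add1"), ("add1", "st_r")]
    print([sorted(c) for c in form_hyperops(dfg)])
    # -> [['add1', 'ld_a', 'mul1', 'st_r'], ['ld_b'], ['ld_c']]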


Application-Specific Systems, Architectures, and Processors | 2008

RECONNECT: A NoC for polymorphic ASICs using a low overhead single cycle router

Joseph Nimmy; C. Ramesh Reddy; Keshavan Varadarajan; Mythri Alle; Alexander Fell; S. K. Nandy; Ranjani Narayan

A polymorphic ASIC is a runtime reconfigurable hardware substrate comprising compute and communication elements. It is a “future-proof” custom hardware solution for multiple applications and their derivatives in a domain. Interoperability between application derivatives at runtime is achieved through hardware reconfiguration. In this paper we present the design of a single cycle Network on Chip (NoC) router that is responsible for effecting runtime reconfiguration of the hardware substrate. The router design is optimized to avoid FIFO buffers at the input port and loopback at the output crossbar. It provides virtual channels to emulate a non-blocking network and supports a simple X-Y relative addressing scheme to limit the control overhead to 9 bits per packet. The 8×8 honeycomb NoC (RECONNECT), implemented in a 130 nm UMC CMOS standard cell library, operates at 500 MHz and has a bisection bandwidth of 28.5 GBps. The network is characterized for random, self-similar and application-specific traffic patterns that model the execution of multimedia and DSP kernels with varying network loads and virtual channels. Our implementation with 4 virtual channels has an average network latency of 24 clock cycles and a throughput of 62.5% of the network capacity for random traffic. For application-specific traffic the latency is 6 clock cycles and the throughput is 87% of the network capacity.
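
The 9-bit control overhead quoted above comes from carrying only a relative offset in each packet header. The sketch below illustrates dimension-ordered routing on such a relative (Δx, Δy) offset; the four-port mesh-style router, port names and per-axis field split are assumptions made purely for illustration, since the paper's honeycomb routers and exact header layout differ.

    def route_step(dx, dy):
        """One hop of X-then-Y routing on a relative offset carried in the header.
        Returns the output port and the updated offset to forward with the packet."""
        if dx > 0:
            return "EAST", (dx - 1, dy)
        if dx < 0:
            return "WEST", (dx + 1, dy)
        if dy > 0:
            return "NORTH", (dx, dy - 1)
        if dy < 0:
            return "SOUTH", (dx, dy + 1)
        return "LOCAL", (0, 0)                 # offset exhausted: eject at this router

    # A packet addressed (+3, -2) relative to its source takes 3 east hops, then 2 south.
    dx, dy, hops = 3, -2, []
    while True:
        port, (dx, dy) = route_step(dx, dy)
        hops.append(port)
        if port == "LOCAL":
            break
    print(hops)   # ['EAST', 'EAST', 'EAST', 'SOUTH', 'SOUTH', 'LOCAL']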


Application-Specific Systems, Architectures, and Processors | 2008

Synthesis of application accelerators on Runtime Reconfigurable Hardware

Mythri Alle; Keshavan Varadarajan; C. Ramesh Reddy; Joseph Nimmy; Alexander Fell; Adarsha Rao; S. K. Nandy; Ranjani Narayan

Application accelerators are predominantly ASICs. The cost of ASIC solutions is orders of magnitude higher than that of programmable processing cores. Despite this, ASIC solutions are preferred when both high performance and low power are the target. ASICs offer no flexibility to cater to application derivatives unless this has been provisioned for at design time. In this paper we define the architecture of Runtime Reconfigurable Hardware (RRH) as the platform for application acceleration. The proposed RRH is a homogeneous fabric comprising computing, storage and communication resources. We also propose a synthesis methodology to realize applications written in a high level language (HLL) on the RRH. Applications described in the HLL are compiled into application substructures. For each application substructure, a set of Compute Elements (CEs), interconnected in a manner that closely matches the communication pattern within the substructure, is allocated. The CEs in such a configuration are called a hardware affine. Hardware affines are defined at compile time and are carved out (provisioned) on the RRH at runtime. Because these hardware affines are not instruction-set processor cores or Logic Elements as in FPGAs, we retain the performance and power advantage of an ASIC and the hardware reconfigurability/programmability of an FPGA/instruction-set processor.


International Conference on Consumer Electronics | 2007

High performance VLSI implementation for H.264 Inter/Intra prediction

Mythri Alle; Jayanta Biswas; S. K. Nandy

We provide a hardware realization of the motion compensation and reconstruction (MCR) module for the H.264 baseline profile. We synthesize the MCR module using a UMC library in 0.13 µm CMOS technology. Our implementation occupies an area of 94756 gates and operates at a frequency of 250 MHz.


Compilers, Architecture, and Synthesis for Embedded Systems | 2009

Streaming FFT on REDEFINE-v2: an application-architecture design space exploration

Alexander Fell; Mythri Alle; Keshavan Varadarajan; Prasenjit Biswas; Saptarsi Das; Jugantor Chetia; S. K. Nandy; Ranjani Narayan

In this paper we explore an implementation of a high-throughput, streaming application on REDEFINE-v2, which is an enhancement of REDEFINE. REDEFINE is a polymorphic ASIC combining the flexibility of a programmable solution with the execution speed of an ASIC. In REDEFINE, Compute Elements are arranged in an 8×8 grid connected via a Network on Chip (NoC) called RECONNECT, to realize the various macrofunctional blocks of an equivalent ASIC. For a 1024-point FFT we carry out an application-architecture design space exploration by examining various characterizations of the Compute Elements in terms of the size of the instruction store. We further study the impact of using application-specific, vectorized FUs. By setting up different partitions of the FFT algorithm for persistent execution on REDEFINE-v2, we derive the benefits of setting up pipelined execution for higher performance. The impact of the REDEFINE-v2 micro-architecture for an arbitrary N-point FFT (N > 4096) is also analyzed. We report the various algorithm-architecture tradeoffs in terms of area and execution speed compared with those of an ASIC implementation. In addition we compare the performance gain with respect to a GPP.
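
To ground what "partitions of the FFT algorithm" refers to, the sketch below is a plain iterative radix-2 FFT whose outer loop walks the log2(N) butterfly stages one at a time; in the paper, groups of such stages are set up for persistent, pipelined execution on the fabric. The code is a generic textbook formulation, not the REDEFINE-v2 kernel.

    import cmath

    def fft_radix2(x):
        """Iterative radix-2 decimation-in-time FFT; len(x) must be a power of two.
        The outer while loop visits the log2(N) butterfly stages one after another."""
        n = len(x)
        # Bit-reversal permutation of the input.
        a = [x[int(format(i, f"0{n.bit_length() - 1}b")[::-1], 2)] for i in range(n)]
        size = 2
        while size <= n:                       # one iteration per butterfly stage
            half = size // 2
            w_step = cmath.exp(-2j * cmath.pi / size)
            for start in range(0, n, size):    # independent butterfly groups in a stage
                w = 1.0
                for k in range(half):
                    t = w * a[start + k + half]
                    a[start + k + half] = a[start + k] - t
                    a[start + k] = a[start + k] + t
                    w *= w_step
            size *= 2
        return a

    print([round(abs(v), 3) for v in fft_radix2([1, 0, 0, 0, 1, 0, 0, 0])])
    # Impulse train of period 4 -> spikes every second bin: [2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0]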


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2010

Design space exploration of systolic realization of QR factorization on a runtime reconfigurable platform

Prasenjit Biswas; Keshavan Varadarajan; Mythri Alle; S. K. Nandy; Ranjani Narayan

In the world of high performance computing, huge efforts have been put into accelerating Numerical Linear Algebra (NLA) kernels like QR Decomposition (QRD), with the added advantages of reconfigurability and scalability. While popular custom hardware solutions in the form of systolic arrays can deliver high performance, they are not scalable, and hence not commercially viable. In this paper, we show how systolic solutions of QRD can be realized efficiently on REDEFINE, a scalable runtime reconfigurable hardware platform. We propose various enhancements to REDEFINE to meet the custom needs of accelerating NLA kernels. We further carry out a design space exploration of the proposed solution for an arbitrary problem of size n×n. We determine the right size of the sub-array in accordance with the optimal pipeline depth of the core execution units and the number of such units to be used per sub-array.
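
For reference, systolic QRD arrays of the kind discussed here are classically built from Givens rotations, where each cell annihilates one subdiagonal entry. The NumPy sketch below is the plain sequential form of that computation, included only to show the elementary operation a systolic cell performs; it is not the REDEFINE mapping, and the example matrix is made up.

    import numpy as np

    def givens_qr(A):
        """QR factorization by Givens rotations: each rotation zeroes one
        subdiagonal entry, the same elementary step a systolic QRD cell performs."""
        R = A.astype(float)
        m, n = R.shape
        Q = np.eye(m)
        for j in range(n):
            for i in range(m - 1, j, -1):          # annihilate R[i, j] against R[i-1, j]
                a, b = R[i - 1, j], R[i, j]
                r = np.hypot(a, b)
                if r == 0.0:
                    continue
                c, s = a / r, b / r
                G = np.array([[c, s], [-s, c]])    # 2x2 rotation acting on rows i-1, i
                R[[i - 1, i], :] = G @ R[[i - 1, i], :]
                Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T
        return Q, R

    A = np.array([[4.0, 1.0], [2.0, 3.0], [1.0, 2.0]])
    Q, R = givens_qr(A)
    print(np.allclose(Q @ R, A), np.allclose(np.tril(R, -1), 0))   # True True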


IEEE Computer Society Annual Symposium on VLSI | 2010

Accelerating Numerical Linear Algebra Kernels on a Scalable Run Time Reconfigurable Platform

Prasenjit Biswas; Pramod Udupa; Rajdeep Mondal; Keshavan Varadarajan; Mythri Alle; S. K. Nandy; Ranjani Narayan

Numerical Linear Algebra (NLA) kernels are at the heart of many computational problems. These kernels require hardware acceleration for increased throughput. NLA solvers for dense and sparse matrices differ in the way the matrices are stored and operated upon, although they exhibit similar computational properties. While ASIC solutions for NLA solvers can deliver high performance, they are not scalable, and hence are not commercially viable. In this paper, we show how NLA kernels can be accelerated on REDEFINE, a scalable runtime reconfigurable hardware platform. Compared to a software implementation, the Direct Solver (Modified Faddeev's algorithm) on REDEFINE shows a 29× improvement on average and the Iterative Solver (Conjugate Gradient algorithm) shows a 15-20% improvement. We further show that the solution on REDEFINE is scalable over larger problem sizes without any notable degradation in performance.
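
As a reminder of what the Iterative Solver computes, the sketch below is textbook Conjugate Gradient for a symmetric positive-definite system in NumPy; each iteration costs one matrix-vector product plus a few vector operations, which is the kernel being accelerated. The code and test system are illustrative, not the REDEFINE implementation.

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
        """Textbook CG for symmetric positive-definite A: one matrix-vector
        product plus a few dot products and vector updates per iteration."""
        x = np.zeros_like(b, dtype=float)
        r = b - A @ x                      # initial residual
        p = r.copy()                       # initial search direction
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p  # conjugate new direction against old ones
            rs_old = rs_new
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])   # small SPD test system
    b = np.array([1.0, 2.0])
    print(conjugate_gradient(A, b))           # ~[0.0909, 0.6364]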


International Journal of Computer Applications | 2010

A Generic Graph-Oriented Mapping Strategy for a Honeycomb Topology

Gaurav Singh; Mythri Alle; Keshavan Varadarajan; S. K. Nandy; Ranjani Narayan

REDEFINE [3] is a polymorphic ASIC in which arbitrary computational structures on hardware are defined at runtime. The REDEFINE execution fabric comprises Compute Elements (CEs) interconnected by a honeycomb network, which also serves as the distributed Network-on-Chip. Each computational structure is dynamically assigned to a subset of the CEs on the execution fabric by the REDEFINE support logic. An HLL specification of the application is compiled into Hyper Operations (HyperOps) by the REDEFINE compiler [3], where each HyperOp is a set of interacting operations. The compiler also determines partitions of the HyperOps (pHyperOps) that can be assigned to CEs so as to meet the structural constraints of the execution fabric. In this paper we propose an algorithm to map HyperOps onto computational structures. A pHyperOp communication graph (PCG) captures the communication between the various pHyperOps. Through a sequence of transformations, the PCG is transformed into a Cayley tree. The Cayley tree is then overlaid on the Cayley graph to form a computational structure. The proposed mapping algorithm offers a solution that incurs a penalty of 18% on average over that of the optimal.
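
As a small illustration of why a Cayley-tree intermediate form suits a honeycomb fabric, whose routers have degree 3, the sketch below builds a BFS spanning tree of a toy pHyperOp communication graph and checks that degree bound. The actual transformation sequence in the paper is more involved; this code, its names and the example graph are assumed simplifications.

    from collections import defaultdict, deque

    def spanning_tree_fits_honeycomb(edges, root, max_degree=3):
        """BFS spanning tree of a pHyperOp communication graph, then check that
        no node exceeds the degree available on a honeycomb (3 links per router)."""
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        tree_deg = defaultdict(int)
        seen, queue = {root}, deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    tree_deg[u] += 1          # tree edge u-v
                    tree_deg[v] += 1
                    queue.append(v)
        return max(tree_deg.values(), default=0) <= max_degree

    # Toy PCG: a hub talking to four partitions exceeds the degree-3 bound.
    pcg = [("p0", "p1"), ("p0", "p2"), ("p0", "p3"), ("p0", "p4")]
    print(spanning_tree_fits_honeycomb(pcg, "p0"))   # False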

Collaboration


Dive into Mythri Alle's collaboration.

Top Co-Authors


S. K. Nandy

Indian Institute of Science


Alexander Fell

Indraprastha Institute of Information Technology


Adarsha Rao

Indian Institute of Science


Prasenjit Biswas

Indian Institute of Science


Nimmy Joseph

Indian Institute of Science


Saptarsi Das

Indian Institute of Science
