Alexander Fell
Indraprastha Institute of Information Technology
Publications
Featured research published by Alexander Fell.
ACM Transactions on Embedded Computing Systems | 2009
Mythri Alle; Keshavan Varadarajan; Alexander Fell; C. Ramesh Reddy; Nimmy Joseph; Saptarsi Das; Prasenjit Biswas; Jugantor Chetia; Adarsh Rao; S. K. Nandy; Ranjani Narayan
Emerging embedded applications are based on evolving standards (e.g., MPEG2/4, H.264/265, IEEE 802.11a/b/g/n). Since most of these applications run on handheld devices, there is an increasing need for a single-chip solution that can dynamically interoperate between different standards and their derivatives. In order to achieve high resource utilization and low power dissipation, we propose REDEFINE, a polymorphic ASIC in which specialized hardware units are replaced with basic hardware units that can create the same functionality by runtime re-composition. It is a “future-proof” custom hardware solution for multiple applications and their derivatives in a domain. In this article, we describe a compiler framework and supporting hardware comprising compute, storage, and communication resources. Applications described in a high-level language (e.g., C) are compiled into application substructures. For each application substructure, a set of compute elements (CEs) on the hardware is interconnected during runtime to form a pattern that closely matches the communication pattern of that particular application. The advantage is that the bounded CEs are neither processor cores nor logic elements as in FPGAs. Hence, REDEFINE offers the power and performance advantage of an ASIC and the hardware reconfigurability and programmability of an FPGA/instruction-set processor. In addition, the hardware supports custom instruction pipelining. Existing instruction-set extensible processors determine a sequence of instructions that repeatedly occurs within the application to create custom instructions at design time to speed up the execution of this sequence. We extend this scheme further: a kernel is compiled into custom instructions that bear a strong producer-consumer relationship (and are not limited to frequently occurring sequences of instructions). Custom instructions, realized as hardware compositions effected at runtime, allow several instances of the same custom instruction to be active in parallel. A key distinguishing factor in the majority of emerging embedded applications is stream processing. To reduce the overheads of data transfer between custom instructions, direct communication paths are employed among them. In this article, we present an overview of the hardware-aware compiler framework, which determines a NoC-aware schedule of the transports of data exchanged between custom instructions on the interconnect. The results for the FFT kernel indicate a 25% reduction in the number of loads/stores, and throughput improves by log(n) for an n-point FFT when compared to a sequential implementation. Overall, REDEFINE offers flexibility and runtime reconfigurability at the expense of 1.16× the power and 8× the area of an ASIC. The REDEFINE implementation consumes 0.1× the power of an FPGA implementation. In addition, the configuration overhead of the FPGA implementation is 1,000× more than that of REDEFINE.
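The abstract only names the NoC-aware transport scheduling; as a rough illustration of that idea, the Python sketch below greedily assigns start cycles to data transports between custom instructions so that no NoC link carries two flits in the same cycle. The mesh placement, dimension-ordered (XY) routing and one-flit-per-transfer assumptions are illustrative and are not taken from the REDEFINE paper.

```python
# Minimal sketch of a NoC-aware schedule of inter-custom-instruction transports.
# Placement, XY routing and the one-flit-per-transfer assumption are illustrative
# only; they are not taken from the REDEFINE paper.

def xy_route(src, dst):
    """Return the list of directed links (hop by hop) of an XY route."""
    (x, y), links = src, []
    while x != dst[0]:                      # move along X first
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:                      # then along Y
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def schedule_transports(transfers, placement):
    """Greedily pick, per transfer, the earliest start cycle at which every
    link on its route is free (one cycle per hop, one flit per transfer)."""
    busy = {}                               # link -> set of cycles it is occupied
    schedule = []
    for src_op, dst_op in transfers:
        route = xy_route(placement[src_op], placement[dst_op])
        start = 0
        while any(start + i in busy.get(l, set()) for i, l in enumerate(route)):
            start += 1
        for i, l in enumerate(route):
            busy.setdefault(l, set()).add(start + i)
        schedule.append((src_op, dst_op, start, start + len(route)))
    return schedule

if __name__ == "__main__":
    placement = {"ci0": (0, 0), "ci1": (1, 0), "ci2": (1, 1)}
    transfers = [("ci0", "ci1"), ("ci0", "ci2"), ("ci1", "ci2")]
    for s, d, t0, t1 in schedule_transports(transfers, placement):
        print(f"{s} -> {d}: inject at cycle {t0}, arrives by cycle {t1}")
```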
applied reconfigurable computing | 2009
Mythri Alle; Keshavan Varadarajan; Alexander Fell; S. K. Nandy; Ranjani Narayan
In this paper we develop compilation techniques for the realization of applications described in a High Level Language (HLL) onto a Runtime Reconfigurable Architecture. The compiler determines Hyper Operations (HyperOps) that are subgraphs of the data flow graph of an application and comprise elementary operations that have a strong producer-consumer relationship. These HyperOps are hosted on computation structures that are provisioned on demand at runtime. We also report compiler optimizations that collectively reduce the overheads of data-driven computations in runtime reconfigurable architectures. On average, HyperOps offer a 44% reduction in total execution time and an 18% reduction in management overheads as compared to using basic blocks as coarse-grained operations. We show that HyperOps formed using our compiler are suitable to support data flow software pipelining.
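The paper's HyperOp formation heuristics are not spelled out in this abstract; the sketch below only illustrates the general idea of growing bounded-size clusters of operations along producer-consumer edges of a dataflow graph. The size limit and greedy traversal order are assumptions made purely for illustration.

```python
# Illustrative grouping of a dataflow graph into HyperOp-like clusters by
# following producer-consumer edges; the size bound and traversal order are
# assumptions, not the paper's actual heuristics.

def form_hyperops(edges, max_size=4):
    """edges: list of (producer, consumer) pairs of elementary operations."""
    succs = {}
    for p, c in edges:
        succs.setdefault(p, []).append(c)

    assigned, hyperops = {}, []
    for op in sorted({o for e in edges for o in e}):
        if op in assigned:
            continue
        group = [op]                     # start a new HyperOp at this op
        frontier = list(succs.get(op, []))
        while frontier and len(group) < max_size:
            nxt = frontier.pop(0)
            if nxt in assigned or nxt in group:
                continue
            group.append(nxt)            # pull in a direct consumer
            frontier.extend(succs.get(nxt, []))
        hid = len(hyperops)
        hyperops.append(group)
        for o in group:
            assigned[o] = hid
    return hyperops

if __name__ == "__main__":
    dfg = [("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"), ("e", "f")]
    for i, h in enumerate(form_hyperops(dfg)):
        print(f"HyperOp {i}: {h}")
```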
application specific systems architectures and processors | 2008
Joseph Nimmy; C. Ramesh Reddy; Keshavan Varadarajan; Mythri Alle; Alexander Fell; S. K. Nandy; Ranjani Narayan
A polymorphic ASIC is a runtime reconfigurable hardware substrate comprising compute and communication elements. It is a “future-proof” custom hardware solution for multiple applications and their derivatives in a domain. Interoperability between application derivatives at runtime is achieved through hardware reconfiguration. In this paper we present the design of a single-cycle Network on Chip (NoC) router that is responsible for effecting runtime reconfiguration of the hardware substrate. The router design is optimized to avoid FIFO buffers at the input port and loop-back at the output crossbar. It provides virtual channels to emulate a non-blocking network and supports a simple X-Y relative addressing scheme to limit the control overhead to 9 bits per packet. The 8×8 honeycomb NoC (RECONNECT), implemented in a 130 nm UMC CMOS standard cell library, operates at 500 MHz and has a bisection bandwidth of 28.5 GBps. The network is characterized for random, self-similar and application-specific traffic patterns that model the execution of multimedia and DSP kernels with varying network loads and virtual channels. Our implementation with 4 virtual channels has an average network latency of 24 clock cycles and a throughput of 62.5% of the network capacity for random traffic. For application-specific traffic the latency is 6 clock cycles and the throughput is 87% of the network capacity.
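The router's exact packet format is not given here; the following sketch merely shows how a relative X-Y offset can be packed into a 9-bit control field, assuming a sign-magnitude 4+4+1 bit split for an 8×8 fabric. That layout is a guess for illustration, not the documented RECONNECT header.

```python
# Illustrative packing of a relative X-Y address into a small control field.
# The 4+4+1 bit layout below is a guess for an 8x8 fabric and is NOT the
# documented RECONNECT packet format.

def pack_header(dx, dy, vc_flag=0):
    """Sign-magnitude encode dx, dy in [-7, 7] (4 bits each) plus one flag bit."""
    assert -7 <= dx <= 7 and -7 <= dy <= 7 and vc_flag in (0, 1)
    fx = ((1 if dx < 0 else 0) << 3) | abs(dx)
    fy = ((1 if dy < 0 else 0) << 3) | abs(dy)
    return (vc_flag << 8) | (fx << 4) | fy       # 9 bits total

def unpack_header(hdr):
    fy, fx, vc_flag = hdr & 0xF, (hdr >> 4) & 0xF, (hdr >> 8) & 1
    dx = (fx & 7) * (-1 if fx & 8 else 1)
    dy = (fy & 7) * (-1 if fy & 8 else 1)
    return dx, dy, vc_flag

if __name__ == "__main__":
    hdr = pack_header(-3, 5, vc_flag=1)
    print(f"header = {hdr:09b}, decoded = {unpack_header(hdr)}")
```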
application specific systems architectures and processors | 2008
Mythri Alle; Keshavan Varadarajan; C. Ramesh Reddy; Joseph Nimmy; Alexander Fell; Adarsha Rao; S. K. Nandy; Ranjani Narayan
Application accelerators are predominantly ASICs. The cost of ASIC solutions is orders of magnitude higher than that of programmable processing cores. Despite this, ASIC solutions are preferred when both high performance and low power are the targets. ASICs offer no flexibility to cater to application derivatives, unless this has been provisioned for at design time. In this paper we define the architecture of Runtime Reconfigurable Hardware (RRH) as the platform for application acceleration. The proposed RRH is a homogeneous fabric comprising computing, storage and communication resources. We also propose a synthesis methodology to realize applications written in a high-level language (HLL) on the RRH. Applications described in the HLL are compiled into application substructures. For each application substructure, a set of Compute Elements (CEs), interconnected in a manner that closely matches the communication pattern within it, is allocated. CEs in such a configuration are called a hardware affine. Hardware affines are carved out on the RRH at runtime: they are defined at compile time and provisioned at runtime on the fabric. Because these hardware affines are not instruction-set processor cores or logic elements as in FPGAs, we retain the performance and power advantage of an ASIC, and the hardware reconfigurability/programmability of an FPGA/instruction-set processor.
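As a loose illustration of carving a hardware affine, the sketch below greedily places the operations of one application substructure onto free compute elements of a grid so that communicating operations land close together. Greedy ordering and Manhattan distance are assumptions; the paper's actual allocation method may differ.

```python
# Sketch of carving a "hardware affine": placing the operations of one
# application substructure onto free CEs of a grid fabric so that
# communicating operations land close together. Greedy placement and
# Manhattan distance are illustrative choices, not the paper's algorithm.

def carve_affine(edges, free_cells):
    """edges: (producer, consumer) pairs; free_cells: list of (x, y) CEs."""
    free = list(free_cells)
    place = {}
    ops = sorted({o for e in edges for o in e})
    for op in ops:
        # prefer the free CE closest to already-placed communication partners
        partners = [place[p] for p, c in edges if c == op and p in place] + \
                   [place[c] for p, c in edges if p == op and c in place]
        def cost(cell):
            return sum(abs(cell[0] - px) + abs(cell[1] - py) for px, py in partners)
        best = min(free, key=cost)
        place[op] = best
        free.remove(best)
    return place

if __name__ == "__main__":
    dfg = [("a", "c"), ("b", "c"), ("c", "d")]
    fabric = [(x, y) for x in range(3) for y in range(3)]
    print(carve_affine(dfg, fabric))
```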
reconfigurable computing and fpgas | 2014
Alexander Fell; Zoltán Endre Rákossy; Anupam Chattopadhyay
In terms of energy and flexibility, Coarse-Grained Reconfigurable Architectures (CGRAs) have proven to be advantageous over fine-grained architectures, massively parallel GPUs and generic CPUs. However, the key challenge of programmability is preventing their widespread adoption. To exploit the instruction-level parallelism inherent to such architectures, optimal scheduling and mapping of algorithmic kernels is essential. Transforming an input algorithm in the form of a Data Flow Graph (DFG) into a CGRA schedule and mapping configuration is very challenging, due to the necessity to consider architectural details such as memory bandwidth requirements, communication patterns, pipelining and heterogeneity in order to extract maximum performance. In this paper, an algorithm is proposed that employs Force-Directed Scheduling concepts to solve such scheduling and resource minimization problems. Our heuristic extensions are flexible enough for generic heterogeneous CGRAs, allowing the execution time of an algorithm to be estimated for different configurations while maximizing the utilization of the available hardware. Besides our own experiments, we also compare against CGRA configurations introduced by state-of-the-art mapping algorithms such as EPIMap; our schedule achieves optimal resource utilization while reducing the overall DFG execution time by 39% on average.
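For readers unfamiliar with Force-Directed Scheduling, the sketch below reduces it to its textbook core (time frames from ASAP/ALAP, a distribution graph, and self-forces) for unit-latency operations and a single resource type; the heuristic extensions for heterogeneous CGRAs described in the paper are not reproduced here.

```python
# Minimal Force-Directed Scheduling sketch for a DFG with unit-latency
# operations and one resource type. It captures only the core FDS idea
# (time frames, distribution graph, self-forces); it is not the paper's
# extended heuristic.

def fds_schedule(edges, ops, latency):
    preds = {o: [p for p, c in edges if c == o] for o in ops}
    succs = {o: [c for p, c in edges if p == o] for o in ops}
    fixed = {}

    def frames():
        asap, alap = {}, {}
        def f_asap(o):
            if o not in asap:
                asap[o] = fixed.get(o, max([f_asap(p) + 1 for p in preds[o]] or [0]))
            return asap[o]
        def f_alap(o):
            if o not in alap:
                alap[o] = fixed.get(o, min([f_alap(c) - 1 for c in succs[o]] or [latency - 1]))
            return alap[o]
        return {o: (f_asap(o), f_alap(o)) for o in ops}

    while len(fixed) < len(ops):
        fr = frames()
        # distribution graph: expected resource demand per cycle
        dg = [sum(1.0 / (hi - lo + 1) for (lo, hi) in fr.values() if lo <= t <= hi)
              for t in range(latency)]
        best = None
        for o in ops:
            if o in fixed:
                continue
            lo, hi = fr[o]
            p = 1.0 / (hi - lo + 1)
            for t in range(lo, hi + 1):
                force = dg[t] - p * sum(dg[lo:hi + 1])   # self-force of fixing o at t
                if best is None or force < best[0]:
                    best = (force, o, t)
        fixed[best[1]] = best[2]                         # fix the lowest-force choice
    return fixed

if __name__ == "__main__":
    ops = ["a", "b", "c", "d"]
    dfg = [("a", "c"), ("b", "c"), ("c", "d")]
    print(fds_schedule(dfg, ops, latency=4))
```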
symposium on cloud computing | 2009
Alexander Fell; Prasenjit Biswas; Jugantor Chetia; S. K. Nandy; Ranjani Narayan
RECONNECT is a Network-on-Chip using a honeycomb topology. In this paper we focus on the properties of general rules, applicable to a variety of routing algorithms for the NoC, that take into account the links missing from the honeycomb topology when compared to a mesh. We also extend the original proposal [5] and show a method to insert and extract data to and from the network. Access Routers at the boundary of the execution fabric establish connections to multiple periphery modules and create a torus to decrease the node distances. Our approach is scalable and ensures homogeneity among the compute elements in the NoC. We synthesized and evaluated the proposed enhancement in terms of power dissipation and area. Our results indicate that the impact of the necessary alterations to the fabric is negligible and affects the data transfer between the fabric and the periphery only marginally.
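The routing rules themselves are in the paper; purely as an illustration of routing around the missing vertical links of a honeycomb, the sketch below embeds the honeycomb "brick-wall" style in a grid (each node keeps its E/W links and has one vertical link whose direction alternates with coordinate parity) and detours horizontally when the needed vertical link is absent. Both the embedding and the detour rule are assumptions, not the rules used by RECONNECT.

```python
# Illustrative next-hop rule for a honeycomb NoC embedded "brick-wall" style in
# a grid: every node keeps its E/W links, but has only one vertical link whose
# direction alternates with (x + y) parity. Embedding and detour rule are
# assumptions for illustration only.

def has_vertical(node, direction):
    x, y = node
    up = (x + y) % 2 == 0            # assumed parity rule for the single vertical link
    return direction == ("N" if up else "S")

def next_hop(cur, dst):
    """Prefer a productive direction; detour horizontally if the needed
    vertical link is missing at this node."""
    (x, y), (dx, dy) = cur, dst
    if y != dy:
        want = "N" if dy > y else "S"
        if has_vertical(cur, want):               # take the vertical link if present
            return (x, y + (1 if dy > y else -1))
        if x != dx:                               # productive horizontal move instead
            return (x + (1 if dx > x else -1), y)
        return (x + 1, y)                         # sidestep east; parity flips there
    return (x + (1 if dx > x else -1), y)         # vertical done, correct X

def route(src, dst, limit=32):
    path = [src]
    while path[-1] != dst and len(path) < limit:
        path.append(next_hop(path[-1], dst))
    return path

if __name__ == "__main__":
    print(route((0, 0), (0, 3)))
```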
compilers, architecture, and synthesis for embedded systems | 2009
Alexander Fell; Mythri Alle; Keshavan Varadarajan; Prasenjit Biswas; Saptarsi Das; Jugantor Chetia; S. K. Nandy; Ranjani Narayan
In this paper we explore an implementation of a high-throughput, streaming application on REDEFINE-v2, an enhancement of REDEFINE. REDEFINE is a polymorphic ASIC combining the flexibility of a programmable solution with the execution speed of an ASIC. In REDEFINE, Compute Elements are arranged in an 8×8 grid connected via a Network on Chip (NoC) called RECONNECT, to realize the various macrofunctional blocks of an equivalent ASIC. For a 1024-point FFT we carry out an application-architecture design space exploration by examining various characterizations of the Compute Elements in terms of the size of the instruction store. We further study the impact of using application-specific, vectorized FUs. By setting up different partitions of the FFT algorithm for persistent execution on REDEFINE-v2, we derive the benefits of setting up pipelined execution for higher performance. The impact of the REDEFINE-v2 micro-architecture for an arbitrary N-point FFT (N > 4096) is also analyzed. We report the various algorithm-architecture tradeoffs in terms of area and execution speed compared with those of an ASIC implementation. In addition, we compare the performance gain with respect to a GPP.
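As a small illustration of the kind of partitioning explored, the sketch below divides the log2(N) butterfly stages of a radix-2 N-point FFT among a given number of pipeline partitions and reports the butterfly count per partition; the partition counts and design points are examples, not those evaluated in the paper.

```python
# Illustrative bookkeeping for splitting the log2(N) stages of a radix-2 FFT
# into pipeline partitions that could execute persistently, one after another.
# The partition sizes and N below are examples, not the REDEFINE-v2 design points.

import math

def partition_fft_stages(n_points, n_partitions):
    stages = int(math.log2(n_points))          # radix-2: log2(N) butterfly stages
    butterflies_per_stage = n_points // 2
    base, extra = divmod(stages, n_partitions)
    parts, start = [], 0
    for i in range(n_partitions):
        count = base + (1 if i < extra else 0) # spread leftover stages evenly
        parts.append({
            "stages": list(range(start, start + count)),
            "butterflies": count * butterflies_per_stage,
        })
        start += count
    return parts

if __name__ == "__main__":
    for i, p in enumerate(partition_fft_stages(1024, 4)):
        print(f"partition {i}: stages {p['stages']}, {p['butterflies']} butterflies")
```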
system on chip conference | 2015
Ramandeep Kaur; Alexander Fell; Harsh Rawat
Multi-port Static Random Access Memories (SRAMs) are essential for shared data structures, especially in distributed, multi-core and multi-processing computing systems. This paper introduces an elementary multi-port memory design which can perform either a dual-read or a single-write operation (2R/1W) by efficiently combining 6-transistor (6T) single-port SRAMs (SP-SRAM). This new architecture offers a solution to the problems of the existing 8T dual-port (DP) cell, including read-write stability issues. The design has been evaluated by comparison with conventional solutions in a 28 nm Ultra Thin Body and Box Fully Depleted Silicon on Insulator (UTBB-FDSOI) technology. A 2048-word, 64-bit memory shows a 31% improvement in performance, 31% reduced area and 19% lower power consumption than the conventional 8T dual-port SRAM (DP-SRAM). In addition, the proposed design is scalable to large memory capacities which cannot be generated directly using the available dual-port memory compilers.
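The abstract does not detail the cell-level organization; one generic way to obtain 2R/1W behavior from single-port arrays is replication, and the behavioral model below illustrates only that general idea (write both copies, serve each read port from its own copy). It is not a transistor-level description of the design proposed in the paper.

```python
# Behavioral model of a 2R/1W memory built from two single-port banks by
# replication: a write updates both copies, and each read port is served by
# its own copy so two reads can proceed in the same cycle. This is a generic
# illustration, not the circuit proposed in the paper.

class DualReadSinglePortPair:
    def __init__(self, words, width_bits=64):
        self.mask = (1 << width_bits) - 1
        self.bank = [[0] * words, [0] * words]   # two identical single-port copies

    def cycle(self, write=None, read_a=None, read_b=None):
        """One access cycle: either a single write OR up to two reads."""
        if write is not None:
            assert read_a is None and read_b is None, "write excludes reads this cycle"
            addr, data = write
            for b in self.bank:                  # keep both copies coherent
                b[addr] = data & self.mask
            return None, None
        out_a = self.bank[0][read_a] if read_a is not None else None
        out_b = self.bank[1][read_b] if read_b is not None else None
        return out_a, out_b

if __name__ == "__main__":
    mem = DualReadSinglePortPair(words=2048, width_bits=64)
    mem.cycle(write=(5, 0xDEADBEEF))
    print(mem.cycle(read_a=5, read_b=5))        # both read ports see the write
```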
system on chip conference | 2015
Gaurav Narang; Alexander Fell; Prakhar Raj Gupta; Anuj Grover
Embedded memories are a key contributor to chip area and dynamic power dissipation, and also form a significant part of the critical path in high-performance advanced SoCs. Therefore, optimal selection of memory instances becomes imperative for SoC designers. While EDA tools have evolved over the past years to optimally select standard logic cells depending on timing and power constraints, optimal memory selection is largely a manual process. We propose a framework to optimize the power, performance, and area (PPA) of a memory subsystem (MSS) by including floorplan-dependent delays and the power consumption of interconnects and glue logic of the MSS in the pre-RTL stage. Through this framework, we demonstrate that for a 4 Mb assembly of SRAM instances, dynamic power is reduced by 44%, area by 49%, and leakage by 71% with floorplan-aware selection. The framework has the capability to use different estimates when routing congestion is important (for example, in low-cost processes with fewer metal layers). We also show that interconnect delays are reduced by about 68% and dynamic power by 58% if additional metal layers are available for routing, compared to a low-cost 6-metal process.
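As a toy illustration of floorplan-aware selection, the sketch below scores functionally equivalent SRAM instance options by adding a distance-based interconnect delay and power estimate to each candidate's intrinsic numbers. The candidate figures and per-mm coefficients are placeholders, not data from the paper or from any memory compiler.

```python
# Sketch of floorplan-aware selection among functionally equivalent SRAM
# instance options: each candidate's intrinsic delay/power is combined with a
# wire delay/power estimate derived from its distance to the accessing logic.
# All numbers below are placeholders for illustration.

def manhattan_mm(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def select_instance(candidates, logic_xy, wire_ns_per_mm=0.15, wire_mw_per_mm=0.02,
                    w_delay=1.0, w_power=1.0, w_area=0.5):
    """candidates: dicts with name, delay_ns, power_mw, area_mm2, location (x, y) in mm."""
    def cost(c):
        dist = manhattan_mm(c["location"], logic_xy)
        delay = c["delay_ns"] + wire_ns_per_mm * dist     # add interconnect delay
        power = c["power_mw"] + wire_mw_per_mm * dist     # add interconnect power
        return w_delay * delay + w_power * power + w_area * c["area_mm2"]
    return min(candidates, key=cost)

if __name__ == "__main__":
    options = [
        {"name": "hs_1p", "delay_ns": 0.55, "power_mw": 3.2, "area_mm2": 0.040, "location": (2.0, 1.0)},
        {"name": "hd_1p", "delay_ns": 0.80, "power_mw": 2.1, "area_mm2": 0.028, "location": (0.4, 0.3)},
    ]
    print(select_instance(options, logic_xy=(0.0, 0.0))["name"])
```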
international conference on vlsi design | 2017
Manideepa Mukherjee; Alexander Fell; Apala Guha
In this paper, DFGenTool, a dataflow graph (DFG) generation tool, is presented, which converts loops in a sequential program, given in a high-level language such as C, into a DFG. DFGenTool adapts DFGs for mapping to Coarse-Grained Reconfigurable Architectures (CGRAs), enabling a variety of CGRA implementations and compilers to be benchmarked against a standard set of DFGs. Several kernels have been converted and are presented in this paper as case studies. The output of DFGenTool is in DOT, a popular graph description standard which can be used with a variety of CGRA compilers. Furthermore, DFGenTool has been released as open source.
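The DOT output schema of DFGenTool is not reproduced here; the snippet below only shows what emitting a small DFG in the DOT format looks like, using illustrative node and edge naming conventions.

```python
# Minimal example of emitting a DFG in the DOT graph description format, the
# output format DFGenTool targets. The naming conventions are illustrative and
# do not reproduce DFGenTool's actual output schema.

def dfg_to_dot(name, nodes, edges):
    """nodes: {node_id: operation label}; edges: (src, dst) pairs."""
    lines = [f"digraph {name} {{"]
    for nid, op in nodes.items():
        lines.append(f'  {nid} [label="{op}"];')
    for src, dst in edges:
        lines.append(f"  {src} -> {dst};")
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    # DFG of one iteration of: acc = acc + a[i] * b[i]
    nodes = {"n0": "load a[i]", "n1": "load b[i]", "n2": "mul", "n3": "add"}
    edges = [("n0", "n2"), ("n1", "n2"), ("n2", "n3")]
    print(dfg_to_dot("dotprod", nodes, edges))
```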