Saptarsi Das
Indian Institute of Science
Publications
Featured research published by Saptarsi Das.
ACM Transactions on Embedded Computing Systems | 2009
Mythri Alle; Keshavan Varadarajan; Alexander Fell; C. Ramesh Reddy; Nimmy Joseph; Saptarsi Das; Prasenjit Biswas; Jugantor Chetia; Adarsh Rao; S. K. Nandy; Ranjani Narayan
Emerging embedded applications are based on evolving standards (e.g., MPEG2/4, H.264/265, IEEE802.11a/b/g/n). Since most of these applications run on handheld devices, there is an increasing need for a single-chip solution that can dynamically interoperate between different standards and their derivatives. In order to achieve high resource utilization and low power dissipation, we propose REDEFINE, a polymorphic ASIC in which specialized hardware units are replaced with basic hardware units that can create the same functionality by runtime re-composition. It is a “future-proof” custom hardware solution for multiple applications and their derivatives in a domain. In this article, we describe a compiler framework and supporting hardware comprising compute, storage, and communication resources. Applications described in a high-level language (e.g., C) are compiled into application substructures. For each application substructure, a set of compute elements (CEs) on the hardware is interconnected at runtime to form a pattern that closely matches the communication pattern of that particular application. The advantage is that the composed CEs are neither processor cores nor logic elements as in FPGAs. Hence, REDEFINE offers the power and performance advantage of an ASIC and the hardware reconfigurability and programmability of an FPGA or instruction-set processor. In addition, the hardware supports custom instruction pipelining. Existing instruction-set extensible processors determine a sequence of instructions that repeatedly occurs within the application to create custom instructions at design time, to speed up the execution of this sequence. We extend this scheme further: a kernel is compiled into custom instructions that bear a strong producer-consumer relationship (and are not limited to frequently occurring sequences of instructions).
Custom instructions, realized as hardware compositions effected at runtime, allow several instances of the same custom instruction to be active in parallel. A key distinguishing factor in the majority of emerging embedded applications is stream processing. To reduce the overheads of data transfer between custom instructions, direct communication paths are employed among them. In this article, we present an overview of the hardware-aware compiler framework, which determines the NoC-aware schedule of transports of the data exchanged between the custom instructions on the interconnect. The results for the FFT kernel indicate a 25% reduction in the number of loads/stores, and throughput improves by log(n) for an n-point FFT when compared to a sequential implementation. Overall, REDEFINE offers flexibility and runtime reconfigurability at the expense of 1.16× in power and 8× in area when compared to an ASIC. The REDEFINE implementation consumes 0.1× the power of an FPGA implementation. In addition, the configuration overhead of the FPGA implementation is 1,000× more than that of REDEFINE.
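The producer-consumer grouping of instructions described above can be pictured as a small compiler pass over a dataflow graph. The sketch below is purely illustrative: the op names and the single-consumer fusion heuristic are assumptions for this example, not the actual REDEFINE compiler algorithm. Ops whose result has exactly one consumer are fused with that consumer into one candidate custom instruction.

```python
def fuse_chains(ops, edges):
    """ops: list of op names; edges: list of (producer, consumer) pairs.
    Fuses every op that has exactly one consumer into that consumer's
    group, yielding candidate custom instructions (illustrative heuristic)."""
    consumers = {op: [] for op in ops}
    for p, c in edges:
        consumers[p].append(c)

    # Union-find over ops, with path halving.
    parent = {op: op for op in ops}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for p in ops:
        if len(consumers[p]) == 1:      # sole consumer: strong
            union(p, consumers[p][0])   # producer-consumer bond, so fuse

    groups = {}
    for op in ops:
        groups.setdefault(find(op), []).append(op)
    return list(groups.values())

# Tiny kernel: a load feeding two multiplies, then add, then shift.
# The load has two consumers, so it stays outside the fused group.
ops = ["ld", "mul1", "mul2", "add", "shift"]
edges = [("ld", "mul1"), ("ld", "mul2"), ("mul1", "add"),
         ("mul2", "add"), ("add", "shift")]
print(fuse_chains(ops, edges))
```

The fused group (mul1, mul2, add, shift) is the kind of producer-consumer cluster that would become one runtime-composed custom instruction, while the shared load remains a separate op.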
compilers, architecture, and synthesis for embedded systems | 2009
Alexander Fell; Mythri Alle; Keshavan Varadarajan; Prasenjit Biswas; Saptarsi Das; Jugantor Chetia; S. K. Nandy; Ranjani Narayan
In this paper we explore an implementation of a high-throughput streaming application on REDEFINE-v2, an enhancement of REDEFINE. REDEFINE is a polymorphic ASIC combining the flexibility of a programmable solution with the execution speed of an ASIC. In REDEFINE, Compute Elements arranged in an 8×8 grid and connected via a Network on Chip (NoC) called RECONNECT realize the various macrofunctional blocks of an equivalent ASIC. For a 1024-point FFT we carry out an application-architecture design space exploration by examining various characterizations of the Compute Elements in terms of the size of the instruction store. We further study the impact of using application-specific, vectorized FUs. By setting up different partitions of the FFT algorithm for persistent execution on REDEFINE-v2, we derive the benefits of pipelined execution for higher performance. The impact of the REDEFINE-v2 micro-architecture for an arbitrary N-point FFT (N > 4096) is also analyzed. We report the various algorithm-architecture tradeoffs in terms of area and execution speed against those of an ASIC implementation. In addition we compare the performance gain with respect to a GPP.
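The benefit of partitioning the FFT for persistent, pipelined execution is easiest to see in a plain iterative radix-2 FFT, whose outer loop runs exactly log2(n) butterfly stages; in a persistent mapping, each stage could occupy its own partition and stream successive input blocks through in pipeline. The code below is a generic textbook FFT with the stage loop made explicit, not the REDEFINE-v2 implementation:

```python
import cmath

def fft_stages(x):
    """Iterative radix-2 DIT FFT. Each pass of the outer `while` loop is
    one of the log2(n) butterfly stages; in a pipelined mapping, each
    stage would persist on its own partition (an analogy only)."""
    n = len(x)
    x = list(x)
    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    size = 2
    while size <= n:                       # one iteration = one stage
        w = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            wk = 1.0
            for k in range(size // 2):     # butterflies within a block
                a = x[start + k]
                b = x[start + k + size // 2] * wk
                x[start + k] = a + b
                x[start + k + size // 2] = a - b
                wk *= w
        size *= 2
    return x

# Sanity check against a naive DFT for n = 8.
data = [complex(i % 3, 0) for i in range(8)]
ref = [sum(data[j] * cmath.exp(-2j * cmath.pi * j * k / 8) for j in range(8))
       for k in range(8)]
out = fft_stages(data)
print(max(abs(a - b) for a, b in zip(out, ref)))
```

Because the stages are independent once their inputs arrive, feeding block i+1 into stage 1 while block i occupies stage 2 gives the pipeline-parallel execution the abstract describes.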
international conference on embedded computer systems architectures modeling and simulation | 2014
Kavitha T. Madhu; Saptarsi Das; C. Madhava Krishna; S. Nalesh; S. K. Nandy; Ranjani Narayan
In this paper we present HyperCell, a reconfigurable datapath for Instruction Extensions (IEs). HyperCell comprises an array of compute units laid over a switch network. We present an IE synthesis methodology that enables post-silicon realization of IE datapaths on HyperCell. The methodology optimally exploits the hardware resources in HyperCell to enable software-pipelined execution of IEs, and its exploitation of temporal reuse of data results in a significant reduction in HyperCell's input/output bandwidth requirements.
Journal of Systems Architecture | 2014
Saptarsi Das; Kavitha T. Madhu; Madhav Krishna; Nalesh Sivanandan; Farhad Merchant; Santhi Natarajan; Ipsita Biswas; Adithya Pulli; S. K. Nandy; Ranjani Narayan
In this paper we present a framework for realizing arbitrary instruction set extensions (IEs) that are identified post-silicon. The proposed framework has two components, viz., an IE synthesis methodology and the architecture of a reconfigurable data-path for realization of such IEs. The IE synthesis methodology ensures maximal utilization of resources on the reconfigurable data-path. In this context we present the techniques used to realize IEs for applications that demand high throughput or that must process data streams. The reconfigurable hardware, called HyperCell, comprises a reconfigurable execution fabric: a collection of interconnected compute units. A typical use case of HyperCell is as a co-processor that accelerates execution of IEs defined post-silicon on behalf of a host. We demonstrate the effectiveness of our approach by evaluating the performance of some well-known integer kernels realized as IEs on HyperCell. Our methodology for realizing IEs on HyperCells permits overlapping of potentially all memory transactions with computations. By fully pipelining the data-path, we show significant improvement in performance for streaming applications over general-purpose-processor-based solutions.
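Overlapping memory transactions with computation is, in effect, double buffering: while tile i is being computed, tile i+1 is fetched. A back-of-the-envelope timing model makes the steady-state benefit concrete. The cycle counts below are invented for illustration and are not HyperCell measurements:

```python
def makespan(fetch, compute, tiles, overlapped):
    """Total cycles to process `tiles` tiles, each needing `fetch`
    cycles of memory traffic and `compute` cycles of work.
    With double buffering, fetch of tile i+1 hides behind compute
    of tile i, so steady-state cost is max(fetch, compute) per tile."""
    if not overlapped:
        return tiles * (fetch + compute)
    # First fetch cannot be hidden; last compute drains the pipeline.
    return fetch + (tiles - 1) * max(fetch, compute) + compute

print(makespan(4, 6, 10, overlapped=False))  # 10 * (4 + 6) = 100
print(makespan(4, 6, 10, overlapped=True))   # 4 + 9 * 6 + 6 = 64
```

When compute time dominates fetch time, as here, the fetch cost all but disappears, which is why a fully pipelined data-path with overlapped transfers approaches compute-bound throughput.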
international conference on vlsi design | 2016
Saptarsi Das; Nalesh Sivanandan; Kavitha T. Madhu; S. K. Nandy; Ranjani Narayan
In this paper, we present an architecture named REDEFINE HyperCell Multicore (RHyMe) designed to efficiently realize HPC application kernels such as loops. RHyMe relies on compiler-generated meta-data for its functioning: most of the orchestration activity for executing kernels is governed by this meta-data at runtime. In RHyMe, macro operations are realized as hardware overlays of MIMO (multiple-input, multiple-output) operations on hardware structures called HyperCells. While a HyperCell enables exploiting fine-grain instruction-level and pipeline parallelism, coarse-grain parallelism is exploited across multiple HyperCells. The regularity exhibited by computations such as loops results in efficient usage of simple compute hardware such as HyperCells, as well as of memory structures that can be managed explicitly.
high performance computing and communications | 2015
Kavitha T. Madhu; Saptarsi Das; S. Nalesh; S. K. Nandy; Ranjani Narayan
In this paper, we present a compilation flow for HPC kernels on the REDEFINE coarse-grain reconfigurable architecture (CGRA). REDEFINE is a scalable macro-dataflow machine in which the compute elements (CEs) communicate through messages. REDEFINE offers the ability to exploit a high degree of coarse-grain and pipeline parallelism. The CEs in REDEFINE are enhanced with reconfigurable macro data-paths called HyperCells that enable exploitation of fine-grain and pipeline parallelism at the level of basic instructions in static dataflow order. Application kernels that exhibit regularity in computations and memory accesses, such as affine loop nests, benefit from the architecture of HyperCell [1], [2]. The proposed compilation flow exposes a high degree of parallelism in the loop nests of HPC application kernels using polyhedral analysis and generates meta-data to effectively utilize the computational resources in HyperCells. Memory is explicitly managed with compiler assistance. We address compilation challenges such as partitioning with load balancing, mapping and scheduling computations, and managing operand data while targeting multiple HyperCells in the REDEFINE architecture. The proposed solution scales well, meeting the performance objectives of HPC computing.
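The load-balancing side of the partitioning step can be illustrated with the simplest possible case: splitting one loop's iteration space into contiguous chunks whose sizes differ by at most one. This is a stand-in for intuition only; the actual flow partitions polyhedral iteration domains, which this sketch does not attempt:

```python
def balanced_partition(n_iters, n_cells):
    """Split an n_iters-long loop into n_cells contiguous chunks whose
    sizes differ by at most one, so no HyperCell is left idle while
    another still has a large remainder to execute."""
    base, extra = divmod(n_iters, n_cells)
    chunks, start = [], 0
    for c in range(n_cells):
        size = base + (1 if c < extra else 0)  # spread the remainder
        chunks.append(range(start, start + size))
        start += size
    return chunks

# 10 iterations over 4 cells: chunk sizes 3, 3, 2, 2.
for r in balanced_partition(10, 4):
    print(list(r))
```

Contiguous chunks also preserve the spatial locality of affine accesses, which matters when operand data is staged into each HyperCell's explicitly managed memory.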
high performance embedded architectures and compilers | 2016
Kavitha T. Madhu; Anuj Rao; Saptarsi Das; Krishna C. Madhava; S. K. Nandy; Ranjani Narayan
Performance of an application on a many-core machine primarily hinges on the ability of the architecture to exploit parallelism and to provide fast memory accesses. Exploiting parallelism in static application graphs on a multicore target is relatively easy, since compilers can map them onto an optimal set of processing elements and memory modules. Dynamic application graphs have computations and data dependencies that manifest at runtime and hence may not be schedulable statically. Load balancing of such graphs requires runtime support (such as support for work-stealing) but incurs overheads due to data and code movement. In this work, we use the ReNÉ MPSoC as an alternative to traditional many-core processing platforms for targeting application kernel graphs. ReNÉ is designed to be used as an accelerator to a host; it offers the ability to exploit massive parallelism at multiple granularities and supports work-stealing for dynamic load balancing. Further, it offers handles to enable and disable work-stealing selectively. ReNÉ employs an explicitly managed global memory with minimal hardware support for the address translation required to relocate application kernels. We present a resource management methodology for the ReNÉ MPSoC that encompasses a lightweight resource management hardware module and a compilation flow. Our methodology aims at identifying resource requirements at compile time and creating resource boundaries (per application kernel) to guarantee performance and maximize resource utilization. The approach offers flexibility in resource allocation similar to that of a dynamic scheduling runtime, but guarantees performance since locality of reference of data and code can be ensured.
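The work-stealing discipline referred to above follows a standard pattern: each worker pops tasks from the tail of its own deque and, when empty, steals from the head of a victim's deque. The toy, single-threaded simulation below shows the pattern and a flag mirroring ReNÉ's handle for selectively disabling stealing; the hardware mechanism itself is not modeled:

```python
from collections import deque

def run(queues, steal_enabled=True):
    """Cooperative round-robin simulation of work-stealing workers.
    queues: {worker_name: deque_of_tasks}. Returns (worker, task) log."""
    done = []
    workers = list(queues)
    while any(queues.values()):
        for w in workers:
            if queues[w]:
                done.append((w, queues[w].pop()))          # own tail
            elif steal_enabled:
                victims = [v for v in workers if v != w and queues[v]]
                if victims:
                    # Steal from the head: oldest task, least likely
                    # to be hot in the victim's local memory.
                    done.append((w, queues[victims[0]].popleft()))
    return done

queues = {"w0": deque(["t0", "t1", "t2", "t3"]), "w1": deque()}
print(run(queues))
```

With stealing enabled, the idle worker w1 picks up half the load; with `steal_enabled=False`, w0 runs everything, which is the behavior one would want when data locality for a kernel outweighs load balance.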
IPSJ Transactions on System LSI Design Methodology | 2011
Ratna Krishnamoorthy; Saptarsi Das; Keshavan Varadarajan; Mythri Alle; Masahiro Fujita; S. K. Nandy; Ranjani Narayan
Coarse Grain Reconfigurable Architectures (CGRAs) support spatial and temporal computation to speed up execution and reduce reconfiguration time. Compilation thus involves partitioning instructions spatially and scheduling them temporally. The task of partitioning is governed by the opposing forces of exposing as much parallelism as possible and reducing communication time. We extend the Edge-Betweenness Centrality scheme, originally used for detecting community structures in social and biological networks, to partition the instructions of a dataflow graph. We also implement several other partitioning algorithms from the literature and compare the execution time obtained by each of these partitioning algorithms on a CGRA called REDEFINE. The centrality-based partitioning scheme outperforms several other schemes, with 6-20% execution-time speedups for various cryptographic kernels. REDEFINE using centrality-based partitioning performs 9× better than a General Purpose Processor, as opposed to 7.76× better without it. Similarly, centrality improves the execution-time advantage for AES-128 decryption from 11× to 13.2×.
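The borrowed community-detection idea can be sketched end to end: compute edge betweenness with a Brandes-style BFS accumulation, then cut the highest-betweenness edge until the graph falls into the desired number of partitions (the Girvan-Newman procedure). The graph below, two triangles joined by a bridge, is a toy stand-in for a dataflow graph, not one of the paper's cryptographic kernels:

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected
    graph given as {node: set_of_neighbours}."""
    bet = defaultdict(float)
    for s in adj:
        dist, order = {s: 0}, []
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        q = deque([s])
        while q:                       # BFS counting shortest paths
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)     # dependency back-propagation
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return {e: b / 2.0 for e, b in bet.items()}  # undirected: halve

def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, q = set(), deque([s])
        while q:
            v = q.popleft()
            if v in comp:
                continue
            comp.add(v)
            seen.add(v)
            q.extend(adj[v] - comp)
        comps.append(comp)
    return comps

def partition(adj, parts=2):
    """Girvan-Newman style: cut the highest-betweenness edge until
    the graph splits into `parts` components."""
    adj = {v: set(ns) for v, ns in adj.items()}
    while len(components(adj)) < parts:
        eb = edge_betweenness(adj)
        u, v = max(eb, key=eb.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles bridged by edge (2, 3); the bridge carries every
# cross-cluster shortest path, so it is cut first.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(partition(g))
```

For partitioning a dataflow graph onto a CGRA, the cut edges are exactly the high-traffic boundaries one wants to minimize: nodes left in the same component communicate densely and are placed together.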
Integration | 2017
S. Nalesh; Kavitha T. Madhu; Saptarsi Das; S. K. Nandy; Ranjani Narayan
Transistor supply voltages no longer scale at the same rate as transistor density and frequency of operation. This has led to the Dark Silicon problem, wherein only a fraction of the transistors can operate at maximum frequency and nominal voltage if the chip is to function within its power and thermal budgets. Heterogeneous computing systems, which consist of General Purpose Processors (GPPs), Graphics Processing Units (GPUs) and application-specific accelerators, can provide improved performance while keeping power dissipation at a realistic level. For the accelerators to be effective, they have to be specialized for related classes of application kernels and have to be synthesized from high-level specifications. Coarse Grained Reconfigurable Arrays (CGRAs) have been proposed as accelerators for a variety of application kernels. For CGRAs to be used as accelerators in the Dark Silicon era, a synthesis framework that focuses on optimizing energy efficiency while achieving the target performance is essential. However, existing compilation techniques for CGRAs focus on optimizing only for performance, and any reduction in energy is just a side effect. In this paper we explore synthesizing application kernels, expressed as functions, on a coarse grained composable reconfigurable array (CGCRA). The proposed reconfigurable array comprises HyperCells, reconfigurable macro-cells that facilitate modeling power and performance in terms of easily measurable parameters. The proposed synthesis approach takes kernels expressed in a functional language, applies a sequence of well-known program transformations, explores trade-offs between throughput and energy using the power and performance models, and realizes the kernels on the CGCRA. When used to map a set of signal processing and linear algebra kernels, this approach achieves resource utilization varying from 50% to 80%.
international conference on e business | 2011
Saptarsi Das; Ranjani Narayan; S. K. Nandy
In this paper we present a hardware-software hybrid technique for modular multiplication over large binary fields. The technique involves application of the Karatsuba-Ofman algorithm for polynomial multiplication and a novel technique for reduction. The proposed reduction technique is based on the popular repeated-multiplication technique and Barrett reduction. We propose a new design of a parallel polynomial multiplier that serves as a hardware accelerator for large field multiplications. We show that the proposed reduction technique, accelerated using the modified polynomial multiplier, achieves significantly higher performance than a purely software technique and other hybrid techniques. We also show that the hybrid accelerated approach to modular field multiplication is significantly faster than the Montgomery-algorithm-based integrated multiplication approach.
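The multiplication half of the scheme can be sketched with binary polynomials encoded as Python integers, where GF(2) addition is XOR and therefore carry-free: one Karatsuba-Ofman step replaces four half-width multiplications with three. This is a generic textbook sketch under assumed parameters (the 64-bit base case and split widths are arbitrary choices, and the paper's Barrett-style reduction step is not shown):

```python
def clmul(a, b):
    """Naive carry-less (GF(2)[x]) polynomial multiplication of two
    integers whose bits are polynomial coefficients."""
    r = 0
    while b:
        if b & 1:
            r ^= a        # GF(2) addition is XOR
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, bits=256):
    """Karatsuba-Ofman over GF(2)[x]: split each operand at `bits`/2,
    compute three half-size products, and recombine with shifts and
    XORs. Recurses down to a 64-bit naive base case (arbitrary choice)."""
    if bits <= 64:
        return clmul(a, b)
    half = bits // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = karatsuba_gf2(a_lo, b_lo, half)
    hi = karatsuba_gf2(a_hi, b_hi, half)
    # (a_lo + a_hi)(b_lo + b_hi) - hi - lo, with +/- both XOR in GF(2).
    mid = karatsuba_gf2(a_lo ^ a_hi, b_lo ^ b_hi, half) ^ lo ^ hi
    return (hi << (2 * half)) ^ (mid << half) ^ lo

a = (1 << 100) ^ (1 << 37) ^ 1
b = (1 << 90) ^ (1 << 5) ^ 1
print(karatsuba_gf2(a, b, 128) == clmul(a, b))
```

The three-multiply recombination is what the parallel polynomial multiplier accelerates in hardware; the reduction of the double-width product back into the field is the separate Barrett-style step the abstract describes.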