Is this you? Create Your Porfile

Neel Gala

Indian Institute of Technology Madras

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Neel Gala is active.

Explore More

Publication

Featured researches published by Neel Gala.

international conference on vlsi design | 2014

ProCA: Progressive Configuration Aware Design Methodology for Low Power Stochastic ASICs

Neel Gala; V. R. Devanathan; Karthik Srinivasan; V. Visvanathan; V. Kamakoti

With increasing integration of capabilities into mobile application processors, a host of imaging operations that were earlier performed in software are now implemented in hardware [1]. Though imaging applications are inherently error resilient, the complexity of such designs has increased over time and thus identifying logic that can be leveraged for energy-quality trade-offs has become difficult. The paper proposes a Progressive Configuration Aware (ProCA) criticality analysis framework, that is 10X faster than the state-of-the-art, to identify logic which is functionally-critical to output quality. This accounts for the various modes of operation of the design. Through such a framework, we demonstrate how a low powered tunable stochastic design can be derived. The proposed methodology uses layered synthesis and voltage scaling mechanisms as primary tools for power reduction. We demonstrate the proposed methodology on a production quality imaging IP implemented in 28nm low leakage technology. For the tunable stochastic imaging IP, we gain up to 10.57% power reduction in exact mode and up to 32.53% power reduction in error tolerant mode (30dB PSNR), with negligible design overhead.

asian test symposium | 2015

SHAKTI-F: A Fault Tolerant Microprocessor Architecture

Sukrat Gupta; Neel Gala; G. S. Madhusudan; V. Kamakoti

Deeply scaled CMOS circuits are vulnerable to soft and hard errors. These errors pose reliability concerns, especially for systems used in radiation-prone environments like space and nuclear applications. This paper presents SHAKTI-F, a RISC-V based SEE-tolerant micro-processor architecture that provides a solution to the reliability issues mentioned above. The proposed architecture uses error correcting codes (ECC) to tolerate errors in registers and memories, while it employs a combination of space and time redundancy based techniques to tolerate errors in the ALU. Two novel re-computation techniques for detecting errors for the addition/subtraction and multiplication modules are proposed. The scheme also identifies parts of the circuitry that need to be radiation hardened thus providing a total protection to SEEs. The proposed scheme provides fine-grain error detection capability that help in localization of the error to a specific functional unit and isolating the same, rather than the entire processor or a large module within a processor. This provides a graceful degradation and/or fail-safe shutdown capability to the processor. The HDL model of the processor was validated by simulating it with randomly induced SEEs. The proposed scheme adds an extra penalty of only 20% on the core area and 25% penalty on the performance when compared with conventional systems. This is very less when compared to the penalty incurred by employing schemes including double modular and triple modular redundancy. Interestingly, there is a 45% reduction in power consumption due to introduction of fault tolerance. The resulting system runs at 330 MHz on a 55nm technology node, which is sufficient for the class of applications these cores are utilized for.

asia symposium on quality electronic design | 2013

Tunable stochastic computing using layered synthesis and temperature adaptive voltage scaling

Neel Gala; V. R. Devanathan; V. Visvanathan; Virat Gandhi; V. Kamakoti

With increasing computing power in mobile devices, conserving battery power (or extending battery life) has become crucial. This together with the fact that most applications running on these mobile devices are increasingly error tolerant, has created immense interest in stochastic (or inexact) computing. In this paper, we present a framework wherein, the devices can operate at varying error tolerant modes while significantly reducing the power dissipated. Further, in very deep sub-micron technologies, temperature has a crucial role in both performance and power. The proposed framework presents a novel layered synthesis optimization coupled with temperature aware supply and body bias voltage scaling to operate the design at various “tunable” error tolerant modes. We implement the proposed technique on a H.264 decoder block in industrial 28nm low leakage technology node, and demonstrate reductions in total power varying from 30% to 45%, while changing the operating mode from exact computing to inaccurate/error-tolerant computing.

international conference on vlsi design | 2016

SHAKTI Processors: An Open-Source Hardware Initiative

Neel Gala; Arjun Menon; Rahul Bodduna; G. S. Madhusudan; V. Kamakoti

Summary form only given. State-of-the-Art Computer Architecture research at Indian Academic Institutions is majorly restricted due to unavailability of processor design models that are close to current commercially available cores. The SHAKTI processor initiative aims at breaking this barrier between Academia and Industry by providing open-source Processor and SoC designs. With the advent of RISC-V ISA by UC Berkeley, we have a simple, clean and most importantly an open source ISA that can be used to design processors which have the potential to match the current day processors in the market. The initiative is also driven to provide substantial information about design decisions and promote more competitive learning environment in academia. The processors from SHAKTI will help in aiding research related to architecture, where one can run simulations on the actual hardware and obtain much accurate results, rather than settling with a lesser accurate software simulation. Since these processors and designs are targeted for real-world use, they can also be freely adopted by industries, thereby supporting the initiative further in terms of product-driven research.In this workshop we plan to talk about the various designs that we are working on and how they can be used. The proposed tutorial consists of four parts. The first part emphasizes on the need for open-source hardware design and the rationale behind our choices of ISA and HDL. The second part covers about our microcontroller class (C-class) core and the verification and debug environment. The third part presents our flagship processor which is the industrial class (I-class) core; the various design decisions involved and some performance metrics. The final part concludes on the note of future work and upcoming releases under SHAKTI.

Proceedings of the 8th International Workshop on Interconnection Network Architecture | 2014

CAERUS: an effective arbitration and ejection policy for routing in an unidirectional torus

Nihar Rathod; Shankar Balachandran; Neel Gala

Network on Chip(NoC) is an efficient communication fabric for multi-core systems. As on-chip core count increases, area and power consumed by NoCs have to be considered seriously. Enhancing performance of NoCs under strict area and power budgets is of practical concern. In this paper, we propose CAERUS, a low-cost router that incorporates a novel arbitration policy that balances the opportunities to inject flits and an optimized ejection policy. Experimental results show that CAERUS reduces traversal latency of the flits in the network and sustain the throughput at higher injection rates as compared to existing low-cost router designs. The area and power expenditures are comparable to existing low cost router architectures.

international symposium on low power electronics and design | 2017

A Programmable Event-driven Architecture for Evaluating Spiking Neural Networks

Arnab Roy; Swagath Venkataramani; Neel Gala; Sanchari Sen; Kamakoti Veezhinathan; Anand Raghunathan

Spiking neural networks (SNNs) represent the third generation of neural networks and are expected to enable new classes of machine learning applications. However, evaluating large-scale SNNs (e.g., of the scale of the visual cortex) on power-constrained systems requires significant improvements in computing efficiency. A unique attribute of SNNs is their event-driven nature—information is encoded as a series of spikes, and work is dynamically generated as spikes propagate through the network. Therefore, parallel implementations of SNNs on multi-cores and GPGPUs are severely limited by communication and synchronization overheads. Recent years have seen great interest in deep learning accelerators for non-spiking neural networks, however, these architectures are not well suited to the dynamic, irregular parallelism in SNNs. Prior efforts on specialized SNN hardware utilize spatial architectures, wherein each neuron is allocated a dedicated processing element, and large networks are realized by connecting multiple chips into a system. While suitable for large-scale systems, this approach is not a good match to size or cost constrained mobile devices. We propose PEASE, a Programmable Event-driven processor Architecture for SNN Evaluation. PEASE comprises of Spike Processing Units (SPUs) that are dynamically scheduled to execute computations triggered by a spike. Instructions to the SPUs are dynamically generated by Spike Schedulers (SSs) that utilize event queues to track unprocessed spikes and identify neurons that need to be evaluated. The memory hierarchy in PEASE is fully software managed, and the processing elements are interconnected using a two-tiered bus-ring topology matching the communication characteristics of SNNs. We propose a method to map any given SNN to PEASE such that the workload is balanced across SPUs and SPU clusters, while pipelining across layers of the network to improve performance. We implemented PEASE at the RTL level and synthesized it to IBM 45 technology. Across 6 SNN benchmarks, our 64-SPU configuration of PEASE achieves 7.1×−17.5× and 2.6×−5.8× speedups, respectively, over software implementations on an Intel Xeon E5-2680 CPU and NVIDIA Tesla K40C GPU. The energy reductions over the CPU and GPU are 71×−179× and 198×−467×, respectively.

hardware and architectural support for security and privacy | 2017

Shakti-T: A RISC-V Processor with Light Weight Security Extensions

Arjun Menon; Subadra Murugan; Chester Rebeiro; Neel Gala; Kamakoti Veezhinathan

With increased usage of compute cores for sensitive applications, including e-commerce, there is a need to provide additional hardware support for securing information from memory based attacks. This work presents a unified hardware framework for handling spatial and temporal memory attacks. The paper integrates the proposed hardware framework with a RISC-V based micro-architecture with an enhanced application binary interface that enables software layers to use these features to protect sensitive data. We demonstrate the effectiveness of the proposed scheme through practical case studies in addition to taking the design through a VLSI CAD design flow. The proposed processor reduces the metadata storage overhead up to 4 x in comparison with the existing solutions, while incurring an area overhead of just 1914 LUTs and 2197 flip flops on an FPGA, without affecting the critical path delay of the processor.

IEEE Transactions on Very Large Scale Integration Systems | 2017

Approximate Error Detection With Stochastic Checkers

Neel Gala; Swagath Venkataramani; Anand Raghunathan; V. Kamakoti

Designing reliable systems, while eschewing the high overheads of conventional fault tolerance techniques, is a critical challenge in the deeply scaled CMOS and post-CMOS era. To address this challenge, we leverage the intrinsic resilience of application domains such as multimedia, recognition, mining, search, and analytics where acceptable outputs are produced despite occasional approximate computations. We propose stochastic checkers (checkers designed using stochastic logic) as a new approach to performing error checking in an approximate manner at greatly reduced overheads. Stochastic checkers are inherently inaccurate and require long latencies for computation. To limit the loss in error coverage, as well as false positives (correct outputs flagged as erroneous), caused due to the approximate nature of stochastic checkers, we propose input permuted partial replicas of stochastic logic, which improves their accuracy with minimal increase in overheads. To address the challenge of long error detection latency, we propose progressive checking policies that provide an early decision based on a prefix of the checker’s output bitstream. This technique is further enhanced by employing progressively accurate binary-to-stochastic converters. Across a suite of error-resilient applications, we observe that stochastic checkers lead to greatly reduced overheads (29.5% area and 21.5% power, on average) compared with traditional fault tolerance techniques while maintaining high coverage and very low false positives.

ACM Journal on Emerging Technologies in Computing Systems | 2017

An Accuracy Tunable Non-Boolean Co-Processor Using Coupled Nano-Oscillators

Neel Gala; Sarada Krithivasan; Wei-Yu Tsai; Xueqing Li; Vijaykrishnan Narayanan; V. Kamakoti

As we enter an era witnessing the closer end of Dennard scaling, where further reduction in power supply-voltage to reduce power consumption becomes more challenging in conventional systems, a goal of developing a system capable of performing large computations with minimal area and power overheads needs more optimization aspects. A rigorous exploration of alternate computing techniques, which can mitigate the limitations of Complementary Metal-Oxide Semiconductor (CMOS) technology scaling and conventional Boolean systems, is imperative. Reflecting on these lines of thought, in this article we explore the potential of non-Boolean computing employing nano-oscillators for performing varied functions. We use a two coupled nano-oscillator as our basic computational model and propose an architecture for a non-Boolean coupled oscillator based co-processor capable of executing certain functions that are commonly used across a variety of approximate application domains. The proposed architecture includes an accuracy tunable knob, which can be tuned by the programmer at runtime. The functionality of the proposed co-processor is verified using a soft coupled oscillator model based on Kuramoto oscillators. The article also demonstrates how real-world applications such as Vector Quantization, Digit Recognition, Structural Health Monitoring, and the like, can be deployed on the proposed model. The proposed co-processor architecture is generic in nature and can be implemented using any of the existing modern day nano-oscillator technologies such as Resonant Body Transistors (RBTs), Spin-Torque Nano-Oscillators (STNOs), and Metal-Insulator Transition (MITs) . In this article, we perform a validation of the proposed architecture using the HyperField Effect Transistor (FET) technology-based coupled oscillators, which provide improvements of up to 3.5× increase in clock speed and up to 10.75× and 14.12× reduction in area and power consumption, respectively, as compared to a conventional Boolean CMOS accelerator executing the same functions.

international symposium on low power electronics and design | 2016

STOCK: Stochastic Checkers for Low-overhead Approximate Error Detection

Neel Gala; Swagath Venkataramani; Anand Raghunathan; V. Kamakotit

Designing reliable systems, while eschewing the high overheads of conventional fault tolerance techniques, is a critical challenge in the deeply scaled CMOS and post-CMOS era. To address this challenge, we leverage the intrinsic resilience of application domains such as multimedia, recognition, mining, search, and analytics where acceptable outputs are produced despite occasional approximate computations. We propose stochastic checkers, wherein a stochastic logic based realization of the circuit is used as an error checker, and the original circuits output is declared to be correct if it lies within a certain range of the checkers output. The key benefit of stochastic checkers is that the intrinsic compactness of stochastic logic leads to greatly reduced overheads. However, due to the approximate nature of stochastic circuits, errors that cause the output to be within a certain range of the correct value may not be detected (missed coverage). In addition, some correct outputs may be incorrectly flagged as erroneous (false positives). To limit the number of missed errors and false positives, we propose a technique that uses input permuted partial replicas of the stochastic logic to improve accuracy without greatly increasing the overheads. We also address the challenge of error detection latency (due to the bit-serial nature of stochastic logic) through progressive checking policies that produce an early decision based on a prefix of the checkers output bitstream. We evaluate stochastic checkers on hardware implementations of a suite of error-resilient applications, and demonstrate that they can lead to greatly reduced overheads (29.5% area and 21.5% power, on average) compared to traditional fault tolerance techniques, while achieving very high coverage (average of 99.5%) and very low false positives (average of 0.1%).

Explore More