Is this you? Create Your Porfile

Nikil Mehta

California Institute of Technology

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Nikil Mehta is active.

Explore More

Publication

Featured researches published by Nikil Mehta.

field-programmable custom computing machines | 2006

Packet Switched vs. Time Multiplexed FPGA Overlay Networks

Nachiket Kapre; Nikil Mehta; Michael deLorimier; Raphael Rubin; Henry Barnor; Michael J. Wilson; Michael G. Wrighton; André DeHon

Dedicated, spatially configured FPGA interconnect is efficient for applications that require high throughput connections between processing elements (PEs) but with a limited degree of PE interconnectivity (e.g. wiring up gates and datapaths). Applications which virtualize PEs may require a large number of distinct PE-to-PE connections (e.g. using one PE to simulate 100s of operators, each requiring input data from thousands of other operators), but with each connection having low throughput compared with the PEs operating cycle time. In these highly interconnected conditions, dedicating spatial interconnect resources for all possible connections is costly and inefficient. Alternatively, we can time share physical network resources by virtualizing interconnect links, either by statically scheduling the sharing of resources prior to runtime or by dynamically negotiating resources at runtime. We explore the tradeoffs (e.g. area, route latency, route quality) between time-multiplexed and packet-switched networks overlayed on top of commodity FPGAs. We demonstrate modular and scalable networks which operate on a Xilinx XC2V6000-4 at 166MHz. For our applications, time-multiplexed, offline scheduling offers up to a 63% performance increase over online, packet-switched scheduling for equivalent topologies. When applying designs to equivalent area, packet-switching is up to 2times faster for small area designs while time-multiplexing is up to 5times faster for larger area designs. When limited to the capacity of a XC2V6000, if all communication is known, time-multiplexed routing outperforms packet-switching; however when the active set of links drops below 40% of the potential links, packet-switched routing can outperform time-multiplexing

design, automation, and test in europe | 2010

A resilience roadmap

Sani R. Nassif; Nikil Mehta; Yu Cao

Technology scaling has an increasing impact on the resilience of CMOS circuits. This outcome is the result of (a) increasing sensitivity to various intrinsic and extrinsic noise sources as circuits shrink, and (b) a corresponding increase in parametric variability causing behavior similar to what would be expected with hard (topological) faults. This paper examines the issue of circuit resilience, then proposes and demonstrates a roadmap for evaluating fault rates starting at the 45nm and going down to the 12nm nodes. The complete infrastructure necessary to make these predictions is placed in the open source domain, with the hope that it will invigorate research in this area.

field programmable gate arrays | 2012

Limit study of energy & delay benefits of component-specific routing

Nikil Mehta; Raphael Rubin; André DeHon

As feature sizes scale toward atomic limits, parameter variation continues to increase, leading to increased margins in both delay and energy. The possibility of very slow devices on critical paths forces designers to increase transistor sizes, reduce clock speed and operate at higher voltages than desired in order to meet timing. With post-fabrication configurability, FPGAs have the opportunity to use slow devices on non-critical paths while selecting fast devices for critical paths. To understand the potential benefit we might gain from component-specific mapping, we quantify the margins associated with parameter variation in FPGAs over a wide range of predictive technologies (45nm-12nm) and gate sizes and show how these margins can be significantly reduced by delay-aware, component-specific routing. For the Toronto 20 benchmark set, we show that component-specific routing can eliminate delay margins induced by variation and reduce energy for energy minimal designs by 1.42-1.98×. We further show that these benefits increase as technology scales.

ACM Transactions on Autonomous and Adaptive Systems | 2011

Spatial hardware implementation for sparse graph algorithms in GraphStep

Michael deLorimier; Nachiket Kapre; Nikil Mehta; André DeHon

How do we develop programs that are easy to express, easy to reason about, and able to achieve high performance on massively parallel machines? To address this problem, we introduce GraphStep, a domain-specific compute model that captures algorithms that act on static, irregular, sparse graphs. In GraphStep, algorithms are expressed directly without requiring the programmer to explicitly manage parallel synchronization, operation ordering, placement, or scheduling details. Problems in the sparse graph domain are usually highly concurrent and communicate along graph edges. Exposing concurrency and communication structure allows scheduling of parallel operations and management of communication that is necessary for performance on a spatial computer. We study the performance of a semantic network application, a shortest-path application, and a max-flow/min-cut application. We introduce a language syntax for GraphStep applications. The total speedup over sequential versions of the applications studied ranges from a factor of 19 to a factor of 15,000. Spatially-aware graph optimizations (e.g., node decomposition, placement and route scheduling) delivered speedups from 3 to 30 times over a spatially-oblivious mapping.

ACM Transactions on Reconfigurable Technology and Systems | 2015

GROK-LAB: Generating Real On-chip Knowledge for Intra-cluster Delays Using Timing Extraction

Benjamin Gojman; Sirisha Nalmela; Nikil Mehta; Nicholas Howarth; André DeHon

Timing Extraction identifies the delay of fine-grained components within an FPGA. From these computed delays, the delay of any path can be calculated. Moreover, a comparison of the fine-grained delays allows a detailed understanding of the amount and type of process variation that exists in the FPGA. To obtain these delays, Timing Extraction measures, using only resources already available in the FPGA, the delay of a small subset of the total paths in the FPGA. We apply Timing Extraction to the Logic Array Block (LAB) on an Altera Cyclone III FPGA to obtain a view of the delay down to near-individual LUT SRAM cell granularity, characterizing components with delays on the order of tens to a few hundred picoseconds with a resolution of ±3.2ps, matching the expected error bounds. This information reveals that the 65nm process used has, on average, random variation of σ μ =4.0% with components having an average maximum spread of 83ps. Timing Extraction also shows that as VDD decreases from 1.2V to 0.9V in a Cyclone IV 60nm FPGA, paths slow down, and variation increases from σ μ =4.3% to σ μ =5.8%, a clear indication that lowering VDD magnifies the impact of random variation.

field-programmable technology | 2013

Exploiting partially defective LUTs: Why you don't need perfect fabrication

André DeHon; Nikil Mehta

Shrinking integrated circuit feature sizes lead to increased variation and higher defect rates. Prior work has shown how to tolerate the failure of entire LUTs and how to tolerate failures and high variation in interconnect. We show how to use LUTs even when they are partially defective - a form of fine-grained defect tolerance. We characterize the defect tolerance of a range of mapping strategies for defective LUTs, including LUT swapping in a cluster, input permutation, input polarity selection, defect-aware packing, and defect-aware placement. By tolerating partially defective LUTs, we show that, even without allocating dedicated spare LUTs, it is possible to achieve near perfect yield with cluster local remapping when roughly 1% of the LUT multiplexers fail to switch. With full, defect-aware placement, this can increase to 10-25% with just a few extra rows and columns. In contrast, substitution of perfect LUTs to dedicated spares only tolerates failure rates of 0.01-0.05%.

Low-Power Variation-Tolerant Design in Nanometer Silicon | 2011

Variation and Aging Tolerance in FPGAs

Nikil Mehta; André DeHon

Parameter variation and component aging are becoming a significant problems for all digital circuits including FPGAs. These effects degrade performance, increase power dissipation, and cause permanent faults at manufacturing time and during the lifetime of an FPGA . Several techniques have been developed to tolerate variation and aging in ASICs; FPGA designers have been quick to adopt and customize these strategies. While FPGAs can use many ASIC techniques verbatim, they have a distinct advantage to aid in the development of more innovate solutions: reconfigurability. Reconfigurability gives us the ability to spread wear effects over the chip which is not possible in ASICs. This chapter examines the impact of variation and wear on FPGAs and notes the benefit that can be gained from variation and aging tolerance techniques that operate open-loop.

Low-Power Variation-Tolerant Design in Nanometer Silicon | 2011

Component-Specific Mapping for Low-Power Operation in the Presence of Variation and Aging

Benjamin Gojman; Nikil Mehta; Raphael Rubin; André DeHon

Traditional solutions to variation and aging cost energy. Adding static margins to tolerate high device variance and potential device degradation prevent aggressive voltage scaling to reduce energy. Post-fabrication configuration, as we have in FPGAs, provides an opportunity to avoid the high costs of static margins. Rather than assuming worst-case device characteristics, we can deploy devices based on their fabricated or aged characteristics. This allows us to place the high-speed/leaky devices as needed on critical paths and slower/less-leaky devices on non-critical paths. As a result, it becomes possible to meet system timing requirements at lower voltages than conservative margins. To exploit this post-fabrication configurability, we must customize the assignment of logical functions to resources based on the resource characteristics of a particular component after it has been fabricated and the resource characteristics have been determined—that is, component-specific mapping. When we perform this component-specific mapping, we can accommodate extremely high defect rates (e.g., 10%), high variation (e.g., \(\sigma_{V_{t}}=38\)%), as well as lifetime aging effects with low overhead. As the magnitude of aging effects increase, the mapping of functions to resources becomes an adaptive process that is continually refined in-system, throughout the lifetime of the component.

Low-Power Variation-Tolerant Design in Nanometer Silicon | 2011

Low-Power Techniques for FPGAs

Nikil Mehta; André DeHon

Field-programmable gate arrays (FPGAs) are reconfigurable devices that can be programmed after fabrication to implement any digital logic. As such, they are flexible, easy to modify in-field, and cheaper to use than manufacturing a customized application-specific integrated circuit (ASIC). However, this programmability comes at a cost in terms of area, performance, and perhaps most importantly power. As currently manufactured, FPGAs are significantly less power efficient than ASICs. Fortunately, in the last decade concentrated attention to power consumption has identified many approaches to power reduction. This chapter surveys the techniques and progress made to improve FPGA power efficiency.

field-programmable custom computing machines | 2006