Sunil Shukla
IBM
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Sunil Shukla.
ACM Queue | 2013
David F. Bacon; Rodric M. Rabbah; Sunil Shukla
The programmability of FPGAs must improve if they are to be part of mainstream computing.
design automation conference | 2012
Joshua S. Auerbach; David F. Bacon; Ioana Monica Burcea; Perry Cheng; Stephen J. Fink; Rodric M. Rabbah; Sunil Shukla
Heterogeneous systems show a lot of promise for extracting highperformance by combining the benefits of conventional architectures with specialized accelerators in the form of graphics processors (GPUs) and reconfigurable hardware (FPGAs). Extracting this performance often entails programming in disparate languages and models, making it hard for a programmer to work equally well on all aspects of an application. Further, relatively little attention is paid to co-execution - the problem of orchestrating program execution using multiple distinct computational elements that work seamlessly together. We present Liquid Metal, a comprehensive compiler and runtime system for a new programming language called Lime. Our work enables the use of a single language for programming heterogeneous computing platforms, and the seamless co-execution of the resultant programs on CPUs and accelerators that include GPUs and FPGAs. We have developed a number of Lime applications, and successfully compiled some of these for co-execution on various GPU and FPGA enabled architectures. Our experience so far leads us to believe the Liquid Metal approach is promising and can make the computational power of heterogeneous architectures more easily accessible to mainstream programmers.
field-programmable technology | 2004
Sunil Shukla; Neil W. Bergmann
Framing protocols employ cyclic redundancy check (CRC) to detect errors incurred during transmission. Generally whole frame is protected using CRC and upon detection of error, retransmission is requested. But certain protocols demand for single bit error correction capabilities for the header part of the frame, which often plays an important role in receiver synchronization. At a speed of 10 Gbps, header error correction implementation in hardware can be a bottleneck. This work presents a hardware efficient way of implementing CRC-16 over 16 bits of data, multiple bit error detection and single bit error correction on FPGA device.
ieee computer society annual symposium on vlsi | 2006
Sunil Shukla; Neil W. Bergmann; Jürgen Becker
FPGAs have been used for prototyping of ASICs, for low-volume ASIC replacement and for systems requiring in-field hardware upgrades. However, the potential to use dynamic reconfiguration to adapt FPGA operation to changing application requirements has been hampered by slow reconfiguration times, and poor CAD tool support. In this paper, a new architecture, QUKU (pronounced cuckoo), is described which uses a coarse-grained reconfigurable PE array (CGRA) overlaid on an FPGA. The low-speed reconfigurability of the FPGA is used to optimize the CGRA for different applications, while the high-speed CGRA reconfiguration is used within an application for operator re-use. An FIR filter kernel has been implemented on QUKU and is shown to have performance which bridges the gap between softcore CPUs and custom FPGA filter circuits
international parallel and distributed processing symposium | 2007
Sunil Shukla; Neil W. Bergmann; Jürgen Becker
DSP applications can be suitably represented using process network models. This paper uses a modification of Kahn process network to solve the problem of finding an optimum architectural template for coarse grain array on per application basis. By applying the model at architectural level in QUKU, better hardware efficiency is achieved for a wide domain of applications. A few widely used DSP algorithms have been presented to demonstrate the application of process network models into architectural template generation in QUKU.
ACM Transactions in Embedded Computing Systems | 2013
Neil W. Bergmann; Sunil Shukla; Jürgen Becker
A new architecture, QUKU, is proposed for implementing stream-based algorithms on FPGAs, which combines the advantages of FPGA and Coarse Grain Reconfigurable Arrays (CGRAs). QUKU consists of a dynamically reconfigurable, coarse-grain Processing Element (PE) array with an associated softcore processor providing system support. At a coarse-grain, the PE array can be reconfigured on a cycle-by-cycle basis to change the PE functionality similarly to that in a conventional CGRA. At a fine-grain, the whole FPGA can be reconfigured statically to implement a completely different PE array that serves the target application in a better way. Advantages of the fine-grain reconfiguration include individually customized PEs, adaptable numeric format support and customizable interconnect network. A prototype CAD tool framework is also developed which facilitates programming the QUKU architecture. An example application consisting of two different image detectors is implemented to demonstrate the advantages of QUKU. QUKU provides up to 140 times speedup and 40 times improvement in area-time product compared to an implementation running on an FPGA-based softcore. The area-time product for QUKU is around 16% lower than that of a custom circuit based implementation on the same FPGA. The per-PE customization provides an area-time saving of approximately 31% compared to a homogeneous 4 × 4 array of PEs for the same application. The experimental results demonstrate that a dual layered reconfigurable architecture provides significant potential benefits in terms of flexibility, area and processing efficiency over existing reconfigurable computing architectures for DSP.
applied reconfigurable computing | 2006
Sunil Shukla; Neil W. Bergmann; Jürgen Becker
To fill the gap between increasing demand for reconfigurability and performance efficiency, CGRAs are seen to be an emerging platform. In this paper, a new architecture, QUKU, is described which uses a coarse-grained reconfigurable PE array (CGRA) overlaid on an FPGA. The low-speed reconfigurability of the FPGA is used to optimize the CGRA for different applications, whilst the high-speed CGRA reconfiguration is used within an application for operator re-use. We will demonstrate the dynamic reconfigurability of QUKU by porting Sobel and Laplacian kernel for edge detection in an image frame.
formal methods | 2010
Sunil Shukla; Rodric M. Rabbah; Martin Vorbach
This paper presents a working solution for the MEMOCODE 2010 design contest. The design presented in this paper is implemented in the Xilinx V5LX330 FPGA as a custom circuit. The solution implements pattern matching logic for all the mandatory and optional patterns while maintaining the required line rate of 500 Mbps.
2016 IEEE International Conference on Rebooting Computing (ICRC) | 2016
Ankur Agrawal; Jungwook Choi; Kailash Gopalakrishnan; Suyog Gupta; Ravi Nair; Jinwook Oh; Daniel A. Prener; Sunil Shukla; Vijayalakshmi Srinivasan; Zehra Sura
Approximate computing is gaining traction as a computing paradigm for data analytics and cognitive applications that aim to extract deep insight from vast quantities of data. In this paper, we demonstrate that multiple approximation techniques can be applied to applications in these domains and can be further combined together to compound their benefits. In assessing the potential of approximation in these applications, we took the liberty of changing multiple layers of the system stack: architecture, programming model, and algorithms. Across a set of applications spanning the domains of DSP, robotics, and machine learning, we show that hot loops in the applications can be perforated by an average of 50% with proportional reduction in execution time, while still producing acceptable quality of results. In addition, the width of the data used in the computation can be reduced to 10-16 bits from the currently common 32/64 bits with potential for significant performance and energy benefits. For parallel applications we reduced execution time by 50% using relaxed synchronization mechanisms. Finally, our results also demonstrate that benefits compounded when these techniques are applied concurrently. Our results across different applications demonstrate that approximate computing is a widely applicable paradigm with potential for compounded benefits from applying multiple techniques across the system stack. In order to exploit these benefits it is essential to re-think multiple layers of the system stack to embrace approximations ground-up and to design tightly integrated approximate accelerators. Doing so will enable moving the applications into a world in which the architecture, programming model, and even the algorithms used to implement the application are all fundamentally designed for approximate computing.
field-programmable custom computing machines | 2015
Sunil Shukla; David F. Bacon
Finding bugs in software that are timing dependent or caused by non-deterministic inputs is notoriously difficult. In FPGAs, the problem is much worse because the visibility into the running design tends to be very low, and existing tools either gather too little data for diagnosis, or are so intrusive that they perturb the timing and may mask the bug. This leads to long FPGA development cycles, and is exacerbated by the steady increase in complexity of FPGA designs. We present a tool - Panoptic on - that logs data and timing information at key design points, extracts it from the FPGA, and uses it for cycle-accurate replay of the entire execution in simulation. This allows interactive debugging with full visibility into the design.