Saumil G. Merchant | Researchain

Publication

Featured research published by Saumil G. Merchant.

IEEE International Conference on Evolutionary Computation | 2006

FPGA Implementation of Evolvable Block-based Neural Networks

Saumil G. Merchant; Gregory D. Peterson; Sang Ki Park; Seong G. Kong

This paper presents a hardware implementation approach for block-based neural networks (BbNNs) on a Programmable System-On-Chip. This is an intrinsic online evolution system that can be genetically evolved and adapted to changes in input data patterns dynamically without any need for multiple FPGA reconfigurations to accommodate various network structure/parameter changes. This removes a considerable bottleneck for performance. The research presented here is a first step towards an evolvable system that can be implemented as an embedded system.

Midwest Symposium on Circuits and Systems | 2010

Evolvable block-based neural network design for applications in dynamic environments

Saumil G. Merchant; Gregory D. Peterson

Dedicated hardware implementations of artificial neural networks promise faster, lower-power operation than software implementations executing on microprocessors, but these implementations rarely have the flexibility to adapt and train online under dynamic conditions. A typical design process for artificial neural networks involves offline training using software simulations, followed by offline synthesis and hardware implementation of the resulting network. This paper presents a design of block-based neural networks (BbNNs) on FPGAs capable of dynamic adaptation and online training. Specifically, the network structure and the internal parameters, the two pieces of the multiparametric evolution of the BbNNs, can be adapted intrinsically, in-field, under the control of the training algorithm. This ability enables deployment of the platform in dynamic environments, thereby significantly expanding the range of target applications, deployment lifetimes, and system reliability. The potential and functionality of the platform are demonstrated using several case studies.

National Aerospace and Electronics Conference | 2008

Strategic Challenges for Application Development Productivity in Reconfigurable Computing

Saumil G. Merchant; Brian Holland; Casey Reardon; Alan D. George; Herman Lam; Greg Stitt; Melissa C. Smith; Nahid Alam; Ivan Gonzalez; Esam El-Araby; Proshanta Saha; Tarek A. El-Ghazawi; Harald Simmler

Performance and versatility requirements arising from escalating fabrication costs and design complexities are making reconfigurable computing technologies increasingly advantageous on the roadmap towards many-core technologies. This reformation in device architectures is necessitating a critical reformation in application design methods to bridge the widening semantic gap between design productivity and execution efficiency. This paper explores the strategic challenges in FPGA design methodologies and evaluates potential solutions and their impact on future DoD applications and users. A new research initiative, strategic infrastructure for reconfigurable computing applications (SIRCA), has also been proposed as a potential new DARPA program to address the FPGA productivity problem.

IEEE Transactions on Parallel and Distributed Systems | 2011

A Framework for Evaluating High-Level Design Methodologies for High-Performance Reconfigurable Computers

Esam El-Araby; Saumil G. Merchant; Tarek A. El-Ghazawi

High-performance reconfigurable computers have the potential to provide substantial performance improvements over traditional supercomputers. Their acceptance, however, has been hindered by productivity challenges arising from increased design complexity, a wide array of custom design languages and tools, and often overblown sales literature. This paper presents a review and taxonomy of High-Level Languages (HLLs) and a framework for the comparative analysis of their features. It also introduces new metrics and a model based on computational effort. The proposed concepts are inspired by Newton's equations of motion and the notion of work and power in an abstract multidimensional space of design specifications. The metrics are devised to highlight two aspects of the design process: the total time-to-solution and the efficient utilization of user and computing resources at discrete time steps along the development path. The study involves analytical and experimental evaluations demonstrating the applicability of the proposed model.

Microelectronics Systems Education | 2005

Improving embedded systems education: laboratory enhancements using programmable systems on chip

Saumil G. Merchant; Gregory D. Peterson; Donald W. Bouldin

Programmable systems on chip provide powerful capabilities to designers, including reconfigurable logic as well as embedded processors. Such devices can enhance computer engineering education by exposing students to advanced technologies while streamlining the costs and time for laboratory preparation, maintenance, and pedagogy.

IEEE Aerospace Conference | 2011

Experiences with UPC on TILE-64 processor

Olivier Serres; Ahmad Anbar; Saumil G. Merchant; Tarek A. El-Ghazawi

The partitioned global address space (PGAS) programming model presents programmers with a globally shared address space with locality awareness and one-sided communication constructs. The shared address space and the one-sided communication constructs make PGAS-based languages easier to use, and the locality awareness enables programmers and runtime systems to achieve higher performance. Thus, the PGAS programming model may help address the escalating software complexity issues resulting from the proliferation of many-core processor architectures in aerospace and computing systems in general. This paper presents our experiences with Unified Parallel C (UPC), a PGAS language, on the Tile64™ processor, a 64-core processor from Tilera Corporation. We ported the Berkeley UPC compiler and runtime system to the Tilera architecture and evaluated two separate runtime implementation conduits of the underlying GASNet communication library, a pThreads-based conduit and an MPI-based conduit. Each conduit uses different on-chip, inter-core communication networks providing different latencies and bandwidths for inter-process communications. The paper presents the implementation details and empirical analyses of both approaches by comparing and evaluating results from the NAS Parallel Benchmark suite. The analyses reveal various optimization opportunities based on specific many-core architectural features, which are also discussed in the paper.

IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Address Translation Optimization for Unified Parallel C Multi-dimensional Arrays

Olivier Serres; Ahmad Anbar; Saumil G. Merchant; Abdullah Kayi; Tarek A. El-Ghazawi

Partitioned Global Address Space (PGAS) languages offer significant programmability advantages with their global memory view abstraction, one-sided communication constructs, and data locality awareness. These attributes place PGAS languages at the forefront of possible solutions to the exploding programming complexity of many-core architectures. To enable the shared address space abstraction, PGAS languages use an address translation mechanism while accessing shared memory to convert shared addresses to physical addresses. This mechanism is already expensive in terms of performance in distributed memory environments, but it becomes a major bottleneck in machines with shared memory support, where the access latencies are significantly lower. Multi- and many-core processors exhibit even lower latencies for shared data due to on-chip cache space utilization. Thus, efficient handling of address translation becomes even more crucial, as this overhead may easily become the dominant factor in the overall data access time for such architectures. To alleviate address translation overhead, this paper introduces a new mechanism targeting multi-dimensional arrays used in most scientific and image processing applications. Relative costs and the implementation details for UPC are evaluated with different workloads (matrix multiplication, Random Access benchmark, and Sobel edge detection) on two different platforms: a many-core system, the TILE64 (a 64-core processor), and a dual-socket, quad-core Intel Nehalem system (up to 16 threads). Our optimization provides substantial performance improvements, up to 40x. In addition, the proposed mechanism can easily be integrated into compilers, abstracting it from the programmers. Accordingly, this improves UPC productivity, as it will reduce the manual optimization effort required to minimize the address translation overhead.
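To make the translation cost concrete, here is a minimal sketch of the index arithmetic implied by UPC's block-cyclic layout for shared arrays. It is not the optimization proposed in the paper, whose details the abstract does not give. The names `SharedLocation`, `translate`, and `translate_2d` are hypothetical, and a row-major flattening of the two-dimensional array is assumed.

```python
from typing import NamedTuple


class SharedLocation(NamedTuple):
    """Hypothetical result type: where a shared element physically lives."""
    thread: int        # thread with affinity to the element
    local_offset: int  # element index inside that thread's local portion


def translate(index: int, block_size: int, threads: int) -> SharedLocation:
    """Map a global element index of a block-cyclic shared array
    (blocks of `block_size` elements dealt round-robin to `threads`)
    to the owning thread and the offset in its local storage."""
    if index < 0 or block_size <= 0 or threads <= 0:
        raise ValueError("index must be >= 0; block_size and threads must be > 0")
    block = index // block_size    # which block the element falls in
    phase = index % block_size     # position inside that block
    thread = block % threads       # blocks are distributed round-robin
    local_block = block // threads # how many earlier blocks this thread holds
    return SharedLocation(thread, local_block * block_size + phase)


def translate_2d(row: int, col: int, cols: int,
                 block_size: int, threads: int) -> SharedLocation:
    """Assumed row-major flattening of a 2-D array before translation."""
    return translate(row * cols + col, block_size, threads)
```

Each access in this sketch costs two integer divisions and two modulo operations on top of the memory access itself, which illustrates why the overhead can dominate when the underlying access latency is low.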

National Aerospace and Electronics Conference | 2008

Classification of Application Development for FPGA-Based Systems

Ivan Gonzalez; Esam El-Araby; Proshanta Saha; Tarek A. El-Ghazawi; Harald Simmler; Saumil G. Merchant; Brian Holland; Casey Reardon; Alan D. George; Herman Lam; Greg Stitt; Nahid Alam; Melissa C. Smith

Field-programmable gate arrays (FPGAs) have been used to accelerate DoD-related applications with promising performance. However, current development tools require significant hardware knowledge and are not amenable to the increasing complexity of FPGA-based systems. Application requirements are expected to change dramatically for future use cases and will require a well-defined development methodology. This paper presents the results of an extensive survey and study of current FPGA tools. A classification of DoD use cases and FPGA tools is provided. This classification captures the current status of the available tools and identifies their limitations for DoD use cases.

IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Exploiting Hierarchical Parallelism Using UPC

Lingyuan Wang; Saumil G. Merchant; Tarek A. El-Ghazawi

High-Performance Computing (HPC) systems are increasingly moving towards an architecture that is deeply hierarchical. However, the execution model with single-level parallelism embodied in legacy parallel programming models falls short in exploiting the multi-level parallelism opportunities in both hardware architectures and applications. This makes the use of richer execution models imperative in order to fully exploit hierarchical parallelism. Partitioned Global Address Space (PGAS) languages such as Unified Parallel C (UPC) are growing in popularity because of their ability to provide a globally shared address space with locality awareness. While UPC provides a welcome improvement over message passing libraries, users still program with a single level of parallelism in the context of SPMD. In this paper, we explore two explicit hierarchical programming approaches based on UPC to improve programmability and performance on hierarchical architectures. The first approach orchestrates computations on multiple sets of thread groups; the second approach extends UPC with nested, shared memory multi-threading. This paper presents a detailed description of the proposed approaches and demonstrates their effectiveness in the context of the NAS Parallel Benchmarks and the Unbalanced Tree Search (UTS). Experimental results indicate that the hierarchical model not only provides greater expressive power but also enhances performance: all three benchmarks exceed the performance of the standard UPC implementations after being incrementally enhanced with hierarchical parallelism.

Midwest Symposium on Circuits and Systems | 2008

An evolvable artificial neural network platform for dynamic environments

Saumil G. Merchant; Gregory D. Peterson

Dedicated hardware implementations of artificial neural networks promise faster, lower-power operation than software implementations executing on processors. Unfortunately, custom hardware implementations do not support intrinsic training of these networks on-chip. Training is typically done using software simulations, and the resulting network is synthesized and targeted to hardware offline. The ANN FPGA design presented here facilitates the dynamic network structure and parameter changes required for intrinsic training of artificial neural networks, without reliance on runtime FPGA reconfigurations. This is an important feature for an online trainable system, as typical reconfiguration cycle times on the order of a few milliseconds pose a major performance bottleneck for iterative training algorithms. The designed platform implements block-based neural networks and can be evolved and adapted in-field.
