Sameh W. Asaad | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Sameh W. Asaad is active.

Explore More

Publication

Featured researches published by Sameh W. Asaad.

field programmable gate arrays | 2012

A cycle-accurate, cycle-reproducible multi-FPGA system for accelerating multi-core processor simulation

Sameh W. Asaad; Ralph Bellofatto; Bernard Brezzo; Chuck Haymes; Mohit Kapur; Benjamin D. Parker; Thomas Roewer; Proshanta Saha; Todd E. Takken; Jose A. Tierno

Software based tools for simulation are not keeping up with the demands for increased chip and system design complexity. In this paper, we describe a cycle-accurate and cycle-reproducible large-scale FPGA platform that is designed from the ground up to accelerate logic verification of the Bluegene/Q compute node ASIC, a multi-processor SOC implemented in IBMs 45 nm SOI CMOS technology. This paper discusses the challenges for constructing such large-scale FPGA platforms, including design partitioning, clocking & synchronization, and debugging support, as well as our approach for addressing these challenges without sacrificing cycle accuracy and cycle reproducibility. The resulting fullchip simulation of the Bluegene/Q compute node ASIC runs at a simulated processor clock speed of 4 MHz, over 100,000 times faster than the logic level software simulation of the same design. The vast increase in simulation speed provides a new capability in the design cycle that proved to be instrumental in logic verification as well as early software development and performance validation for Bluegene/Q.

international conference on parallel architectures and compilation techniques | 2012

Database analytics acceleration using FPGAs

Bharat Sukhwani; Hong Min; Mathew S. Thoennes; Parijat Dube; Balakrishna R. Iyer; Bernard Brezzo; Donna N. Dillenberger; Sameh W. Asaad

Business growth and technology advancements have resulted in growing amounts of enterprise data. To gain valuable business insight and competitive advantage, businesses demand the capability of performing real-time analytics on such data. This, however, involves expensive query operations that are very time consuming on traditional CPUs. Additionally, in traditional database management systems (DBMS), the CPU resources are dedicated to mission-critical transactional workloads. Offloading expensive analytics query operations to a co-processor can allow efficient execution of analytics workloads in parallel with transactional workloads. In this paper, we present a Field Programmable Gate Array (FPGA) based acceleration engine for database operations in analytics queries. The proposed solution provides a mechanism for a DBMS to seamlessly harness the FPGA compute power without requiring any changes in the application or the existing data layout. Using a software-programmed query control block, the accelerator can be tailored to execute different queries without reconfiguration. Our prototype is implemented in a PCIe-attached FPGA system and is integrated into a commercial DBMS platform. The results demonstrate up to 94% CPU savings on real customer data compared to the baseline software cost with up to an order of magnitude speedup in the offloaded computations and up to 6.2× improvement in end-to-end performance.

Ibm Journal of Research and Development | 2003

An innovative low-power high-performance programmable signal processor for digital communications

Jaime H. Moreno; Victor Zyuban; Uzi Shvadron; Fredy D. Neeser; Jeff H. Derby; Malcolm Scott Ware; Krishnan K. Kailas; Ayal Zaks; Amir Geva; Shay Ben-David; Sameh W. Asaad; Thomas W. Fox; Daniel Littrell; Marina Biberstein; Dorit Naishlos; Hillery C. Hunter

We describe an innovative, low-power, high-performance, programmable signal processor (DSP) for digital communications. The architecture of this processor is characterized by its explicit design for low-power implementations, its innovative ability to jointly exploit instruction-level parallelism and data-level parallelism to achieve high performance, its suitability as a target for an optimizing high-level language compiler, and its explicit replacement of hardware resources by compile-time practices. We describe the methodology used in the development of the processor, highlighting the techniques deployed to enable application/architecture/compiler/implementation co-development, and the optimization approach and metric used for power-performance evaluation and tradeoff analysis. We summarize the salient features of the architecture, provide a brief description of the hardware organization, and discuss the compiler techniques used to exercise these features. We also summarize the simulation environment and associated software development tools. Coding examples from two representative kernels in the digital communications domain are also provided. The resulting methodology, architecture, and compiler represent an advance of the state of the art in the area of low-power, domain-specific microprocessors.

ieee international conference on high performance computing data and analytics | 2014

Real-time scalable cortical computing at 46 giga-synaptic OPS/watt with ~100× speedup in time-to-solution and ~100,000× reduction in energy-to-solution

Andrew S. Cassidy; Rodrigo Alvarez-Icaza; Filipp Akopyan; Jun Sawada; John V. Arthur; Paul A. Merolla; Pallab Datta; Marc Gonzalez Tallada; Brian Taba; Alexander Andreopoulos; Arnon Amir; Steven K. Esser; Jeff Kusnitz; Rathinakumar Appuswamy; Chuck Haymes; Bernard Brezzo; Roger Moussalli; Ralph Bellofatto; Christian W. Baks; Michael Mastro; Kai Schleupen; Charles Edwin Cox; Ken Inoue; Steven Edward Millman; Nabil Imam; Emmett McQuinn; Yutaka Nakamura; Ivan Vo; Chen Guok; Don Nguyen

Drawing on neuroscience, we have developed a parallel, event-driven kernel for neurosynaptic computation, that is efficient with respect to computation, memory, and communication. Building on the previously demonstrated highly optimized software expression of the kernel, here, we demonstrate True North, a co-designed silicon expression of the kernel. True North achieves five orders of magnitude reduction in energy to-solution and two orders of magnitude speedup in time-to solution, when running computer vision applications and complex recurrent neural network simulations. Breaking path with the von Neumann architecture, True North is a 4,096 core, 1 million neuron, and 256 million synapse brain-inspired neurosynaptic processor, that consumes 65mW of power running at real-time and delivers performance of 46 Giga-Synaptic OPS/Watt. We demonstrate seamless tiling of True North chips into arrays, forming a foundation for cortex-like scalability. True Norths unprecedented time-to-solution, energy-to-solution, size, scalability, and performance combined with the underlying flexibility of the kernel enable a broad range of cognitive applications.

field-programmable custom computing machines | 2013

Accelerating Join Operation for Relational Databases with FPGAs

Robert J. Halstead; Bharat Sukhwani; Hong Min; Mathew S. Thoennes; Parijat Dube; Sameh W. Asaad; Balakrishna R. Iyer

In this paper, we investigate the use of field programmable gate arrays (FPGAs) to accelerate relational joins. Relational join is one of the most CPU-intensive, yet commonly used, database operations. Hashing can be used to reduce the time complexity from quadratic (naïve) to linear time. However, doing so can introduce false positives to the results which must be resolved. We present a hash-join engine on FPGA that performs hashing, conflict resolution, and joining on a PCIe-attached system, achieving greater than 11x speedup over software.

ACM Transactions on Reconfigurable Technology and Systems | 2014

A Fully Pipelined FPGA Architecture of a Factored Restricted Boltzmann Machine Artificial Neural Network

Lok-Won Kim; Sameh W. Asaad; Ralph Linsker

Artificial neural networks (ANNs) are a natural target for hardware acceleration by FPGAs and GPGPUs because commercial-scale applications can require days to weeks to train using CPUs, and the algorithms are highly parallelizable. Previous work on FPGAs has shown how hardware parallelism can be used to accelerate a “Restricted Boltzmann Machine” (RBM) ANN algorithm, and how to distribute computation across multiple FPGAs. Here we describe a fully pipelined parallel architecture that exploits “mini-batch” training (combining many input cases to compute each set of weight updates) to further accelerate ANN training. We implement on an FPGA, for the first time to our knowledge, a more powerful variant of the basic RBM, the “Factored RBM” (fRBM). The fRBM has proved valuable in learning transformations and in discovering features that are present across multiple types of input. We obtain (in simulation) a 100-fold acceleration (vs. CPU software) for an fRBM having N = 256 units in each of its four groups (two input, one output, one intermediate group of units) running on a Virtex-6 LX760 FPGA. Many of the architectural features we implement are applicable not only to fRBMs, but to basic RBMs and other ANN algorithms more broadly.

International Journal of Parallel Programming | 2015

A Hardware/Software Approach for Database Query Acceleration with FPGAs

Bharat Sukhwani; Mathew S. Thoennes; Hong Min; Parijat Dube; Bernard Brezzo; Sameh W. Asaad; Donna N. Dillenberger

Complex analytics queries often involve expensive operations that may require large computational runtimes leading to slow query responsiveness and hampering real-time performance. Moreover, running these expensive analytics queries inside traditional online transaction processing (OLTP) systems for real-time analytics can affect the performance of mission-critical OLTP queries. On the other hand, support for real-time analytics is considered vital for important business insights and improved market responsiveness. In this paper, we try to address the needs of real-time analytics by enabling hardware acceleration of complex database query operations such as predicate evaluation, sort and projection. While projection helps reduce the amount of data being processed by subsequent query operations, sort is central to most database queries, even those not involving an explicit sort operation. Our system involves FPGA-based composable accelerator for offloading the analytics queries from the host CPU running the OLTP workload. The FPGA-accelerated database system contains accelerator kernels for various database operations and automatic transformation of query operations into calls to these hardware kernels for seamless integration of the accelerator into the database system. Based on the query semantics, each accelerator kernel can be tailored by software to execute specific database operations and different kernels can be fused together to compose a query accelerator. Our query transformation algorithm creates a query-specific control block to customize the accelerator without requiring FPGA-reconfiguration.

field-programmable custom computing machines | 2011

High-Throughput, Lossless Data Compresion on FPGAs

Bharat Sukhwani; Bulent Abali; Bernard Brezzo; Sameh W. Asaad

Loss less compression is often used before writing data to a storage medium or transmitting across a transmission medium. Compression aids by saving storage space or transmission bandwidth, a decompression operation is performed when the data is subsequently read. Though this scheme has clear benefits, the execution time of compression and decompression is critical to its application in real-time systems. Software compression utilities are often slow, leading to degraded system performance. Hardware-based solutions, on the other hand, often drive large resource requirements and are not amenable to supporting future algorithmic changes. In the current article, we present a high-throughput, streaming, loss less compression algorithm and its efficient implementation on FPGAs. The proposed solution provides a peak throughput of 1GB/sec per engine, with a sustained overall measured throughput of 2.66GB/sec on a PCIe-based FPGA board with two compression and two decompression engines. This result represents an overall speedup of 13.6x over reference software implementation. The proposed design is very lean, and, with multiple engines running in parallel, provides a path to potential speedups of up to two orders of magnitude. In the current implementation, the achievable overall throughput is limited only by the available PCIe bus bandwidth.

field-programmable custom computing machines | 2015

Fast and Flexible Conversion of Geohash Codes to and from Latitude/Longitude Coordinates

Roger Moussalli; Mudhakar Srivatsa; Sameh W. Asaad

Insights extracted from spatial queries in geodatabase systems introduce significant opportunities for business intelligence. However, geodatabases are unable to keep up with the required performance due to the massive (and sky-rocketing) amounts of data generated from embedded location-enabled devices. In this paper, we focus on geographic information systems that make use of geohash, specifically, we tackle the kernel of converting geohash codes to and from longitude/latitude pairs. We present the first hardware implementation of a geohash conversion engine operating at wire speed. The presented geohash converter is further enhanced with runtime flexibility with respect to characteristics of the data it can process, furthermore, the architecture allows the user to compromise on performance when limited by hardware resources (design time flexibility). Experimental results of the geohash conversion engine on a Xilinx XC7K325T FPGA show >13X (end-to-end) speedup compared to optimized industry-grade software running on 16 CPU hardware threads.

great lakes symposium on vlsi | 2004

Design methodology for semi custom processor cores

Victor Zyuban; Sameh W. Asaad; Thomas W. Fox; Anne-Marie Haen; Daniel Littrell; Jaime H. Moreno

We describe a semi-custom design methodology for embedded processor cores that was prototyped through the development of a low power high performance DSP core. When compared to the standard ASIC design flow, this methodology enables significant improvement in the speed and power; such benefits are obtained without compromising the generality and flexibility that characterizes the ASIC-based design techniques. Our methodology achieves fast turn-around time in the process from RTL description to post-PD timing results, and exhibits stable convergence on timing; these characteristics enable the application of optimizations spanning multiple levels of the design hierarchy. Such optimizations proved to be much more effective than those that focus only on a single design stage.

Explore More