Ahmed Al Maashri
Pennsylvania State University
Publication
Featured research published by Ahmed Al Maashri.
Design Automation Conference | 2012
Ahmed Al Maashri; Michael DeBole; Matthew Cotter; Nandhini Chandramoorthy; Yang Xiao; Vijaykrishnan Narayanan; Chaitali Chakrabarti
Video analytics introduce new levels of intelligence to automated scene understanding. Neuromorphic algorithms, such as HMAX, are proposed as robust and accurate algorithms that mimic the processing in the visual cortex of the brain. HMAX, for instance, is a versatile algorithm that can be repurposed to target several visual recognition applications. This paper presents the design and evaluation of hardware accelerators for extracting visual features for universal recognition. The recognition applications include object recognition, face identification, facial expression recognition, and action recognition. These accelerators were validated on a multi-FPGA platform and significant performance enhancement and power efficiencies were demonstrated when compared to CMP and GPU platforms. Results demonstrate as much as 7.6X speedup and 12.8X more power-efficient performance when compared to those platforms.
International Conference on Computer Design | 2009
Ahmed Al Maashri; Guangyu Sun; Xiangyu Dong; Vijay Narayanan; Yuan Xie
Graphics Processing Units (GPUs) offer tremendous computational and processing power. The architecture requires high communication bandwidth and lower latency between computation units and caches. 3D die-stacking technology is a promising approach to meet such requirements. To the best of our knowledge no other study has investigated the implementation of 3D technology in GPUs. In this paper, we study the impact of stacking caches using the 3D technology on GPU performance. We also investigate the benefits of using 3D stacked MRAM on GPUs. Our work includes cost, power, and thermal analysis of the proposed architectural designs. Our results show a 53% geometric mean performance speedup for iso-cycle time architectures and about 19% for iso-cost architectures.
IPSJ Transactions on System LSI Design Methodology | 2012
Sungho Park; Ahmed Al Maashri; Kevin M. Irick; Aarti Chandrashekhar; Matthew Cotter; Nandhini Chandramoorthy; Michael DeBole; Vijaykrishnan Narayanan
Neuromorphic vision algorithms are biologically inspired computational models of the primate visual pathway. They promise robustness, high accuracy, and high energy efficiency in advanced image processing applications. Despite these potential benefits, realizations of neuromorphic algorithms typically exhibit low performance even when executed on multi-core CPU and GPU platforms. This is due to the disparity between the computational modalities prominent in these algorithms and those most exploited in contemporary computer architectures. In essence, acceleration of neuromorphic algorithms requires adherence to specific computational and communication requirements. This paper discusses these requirements and proposes a framework for mapping neuromorphic vision applications onto a System-on-Chip (SoC). A neuromorphic object detection and recognition system on a multi-FPGA platform is presented, with performance and power efficiency comparisons to CMP and GPU implementations.
Signal Processing Systems | 2011
Ahmed Al Maashri; Michael DeBole; Chi-Li Yu; Vijaykrishnan Narayanan; Chaitali Chakrabarti
Neuromorphic vision algorithms are biologically inspired algorithms that follow the processing that takes place in the visual cortex. These algorithms have been shown to match classical computer vision algorithms in classification performance, and even to outperform them in some instances. However, neuromorphic algorithms suffer from high complexity, leading to poor execution times on general-purpose processors and making them less attractive for real-time applications. FPGAs, on the other hand, have become true signal processing platforms due to their light weight, low power consumption, and massively parallel computational resources. This paper describes an FPGA-based hardware architecture that accelerates an object classification cortical model, HMAX. Compared to a CPU implementation, this hardware accelerator offers 23X (89X) speedup when mapped to a single-FPGA (multi-FPGA) platform, while maintaining a classification accuracy of 92.5%.
IEEE Computer Society Annual Symposium on VLSI | 2010
Vikram Sampath Kumar; Kevin M. Irick; Ahmed Al Maashri; Narayanan Vijaykrishnan
Recent literature on fast realizations of Connected Component Labeling has proposed single-pass algorithms and architectures that are particularly suited to hardware implementation. These architectures, however, impose input constraints unsuitable for real-time systems that have diverse interface specifications and bandwidth considerations. In this paper we present a streaming Connected Component Labeling architecture that includes a scalable processor that can be tuned to match the I/O bandwidth available in modern embedded computing platforms.
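For readers unfamiliar with the underlying problem, the following is a minimal software sketch of Connected Component Labeling. It uses the classic two-pass raster scan with a union-find table, not the streaming architecture described in the abstract; the function name `label_components` and the choice of 4-connectivity are assumptions made for illustration.

```python
def label_components(image):
    """Label 4-connected foreground (nonzero) pixels in a 2D list of ints.

    Illustrative two-pass algorithm; returns (label_image, component_count).
    """
    parent = [0]  # union-find table; index 0 is reserved for background

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]

    # First pass: assign provisional labels and record equivalences.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))  # new label is its own root
                labels[r][c] = len(parent) - 1
            elif up and left:
                root_up, root_left = find(up), find(left)
                smallest = min(root_up, root_left)
                parent[max(root_up, root_left)] = smallest  # merge the two sets
                labels[r][c] = smallest
            else:
                labels[r][c] = up or left

    # Second pass: replace provisional labels with compact final labels.
    remap = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = remap.setdefault(root, len(remap) + 1)
    return labels, len(remap)
```

The first pass only needs the current pixel's upper and left neighbours, which is why raster-scan formulations of this kind are a common starting point for hardware designs.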
International Conference on Computer Aided Design | 2011
Michael DeBole; Ahmed Al Maashri; Matthew Cotter; Chi-Li Yu; Chaitali Chakrabarti; Vijaykrishnan Narayanan
Neuromorphic algorithms are traditionally implemented on platforms that consume significant power, falling short of their biological underpinnings. Recent improvements in FPGA technology have made FPGAs a platform on which these rapidly evolving algorithms can be implemented. Unfortunately, implementing designs on FPGAs still proves challenging for nonexperts, limiting their use in the neuroscience domain. In this paper, an FPGA framework is presented that enables neuroscientists to compose multi-FPGA systems for a cortical object classification model. This is demonstrated by mapping the algorithm onto two distinct platforms, providing speedups of up to ∼28X over a reference CPU implementation.
Design Automation Conference | 2011
Srinidhi Kestur; Kevin M. Irick; Sungho Park; Ahmed Al Maashri; Vijaykrishnan Narayanan; Chaitali Chakrabarti
Gridding is a method of interpolating irregularly sampled data onto a uniform grid and is a critical image reconstruction step in several applications that operate on non-Cartesian sampled data. In this paper, we present an algorithm-architecture co-design framework for accelerating gridding using FPGAs. We present a parameterized hardware library for accelerating gridding that supports both arbitrary and regular trajectories. We further describe our kernel automation framework, which supports several kernel functions through look-up-table (LUT) based Taylor polynomial evaluation. This framework is integrated using an in-house multi-FPGA development platform that provides hardware infrastructure for integrating custom accelerators. Design-space exploration is enabled by an automation flow that allows system generation from an algorithm specification. We further provide several case studies by realizing systems for the nonuniform fast Fourier transform (NuFFT) with different parameter sets and porting them onto the BEE3 platform. Results show speedups of more than 16X and 2X over existing CPU and FPGA implementations, respectively, and up to 5.5 times higher performance-per-watt than a comparable GPU implementation.
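The two ideas named in this abstract, spreading irregular samples onto a uniform grid and evaluating the kernel from a look-up table with a Taylor correction, can be sketched in software as follows. This is a minimal 1D illustration under stated assumptions, not the paper's hardware library: the Gaussian kernel, the value of `SIGMA`, the first-order expansion, the table size, and all function names are hypothetical choices.

```python
import math

def build_lut(f, df, x_max, n):
    """Tabulate f and its derivative at n + 1 evenly spaced points on [0, x_max]."""
    step = x_max / n
    return step, [(f(i * step), df(i * step)) for i in range(n + 1)]

def lut_eval(lut, x):
    """Approximate f(x) by a first-order Taylor expansion about the nearest table point."""
    step, table = lut
    i = min(int(round(x / step)), len(table) - 1)
    f0, d0 = table[i]
    return f0 + d0 * (x - i * step)

def grid_1d(positions, values, grid_size, width, lut):
    """Spread each irregular sample onto the uniform grid points within `width` of it."""
    grid = [0.0] * grid_size
    for pos, val in zip(positions, values):
        lo = max(0, math.ceil(pos - width))
        hi = min(grid_size - 1, math.floor(pos + width))
        for g in range(lo, hi + 1):
            grid[g] += val * lut_eval(lut, abs(g - pos))
    return grid

SIGMA = 0.5  # assumed kernel width, chosen only for this example

def gauss(d):
    return math.exp(-d * d / (2 * SIGMA ** 2))

def dgauss(d):
    return -d / SIGMA ** 2 * gauss(d)

# Table covers distances 0..2 grid units, matching width=2.0 below.
LUT = build_lut(gauss, dgauss, x_max=2.0, n=256)
grid = grid_1d([2.0, 3.4], [1.0, 0.5], grid_size=6, width=2.0, lut=LUT)
```

Storing the derivative alongside each table entry trades a little memory for accuracy, so a much smaller table can reach the same error as plain nearest-entry lookup.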
Proceedings of SPIE | 2009
Kevin M. Irick; Michael DeBole; Sungho Park; Ahmed Al Maashri; Srinidhi Kestur; Chi Li Yu; Narayanan Vijaykrishnan
FPGAs have emerged as the preferred platform for implementing real-time signal processing applications. In sub-45 nm technologies, FPGAs offer significant cost and design-time advantages over application-specific custom chips and consume significantly less power than general-purpose processors while maintaining or improving performance. Moreover, FPGAs are more advantageous than GPUs in their support for control-intensive applications, custom bit-precision operations, and diverse system interface protocols. Nonetheless, a significant inhibitor to the widespread adoption of FPGAs has been the expertise required to effectively realize functional designs that maximize application performance. While there have been several academic and commercial efforts to improve the usability of FPGAs, they have primarily focused on easing the tasks of an expert FPGA designer rather than increasing the usability offered to an application developer. In this work, the design of a scalable algorithmic-level design framework for FPGAs, AlgoFLEX, is described. AlgoFLEX offers rapid algorithmic-level composition and exploration while maintaining the performance realizable from a fully custom, albeit difficult and laborious, design effort. The framework masks aspects of accelerator implementation, mapping, and communication while exposing appropriate algorithm tuning facilities to developers and system integrators. The effectiveness of the AlgoFLEX framework is demonstrated by rapidly mapping a class of image and signal processing applications to a multi-FPGA platform.
Signal Processing Systems | 2013
Ahmed Al Maashri; Matthew Cotter; Nandhini Chandramoorthy; Michael DeBole; Chi Li Yu; Vijaykrishnan Narayanan; Chaitali Chakrabarti
Neuromorphic vision algorithms are biologically inspired models that follow the processing that takes place in the primate visual cortex. Despite their efficiency and robustness, the complexity of these algorithms results in reduced performance when executed on general-purpose processors. This paper proposes an application-specific system for accelerating a neuromorphic vision system for object recognition. The system is based on HMAX, a biologically inspired model of the visual cortex. The neuromorphic accelerators are validated on a multi-FPGA system. Results show that the neuromorphic accelerators are 13.8× (2.6×) more power efficient than a CPU (GPU) implementation.
International Conference on Computer Communications | 2014
Afaq Ahmad; Samarth Arora; Ahmed Al Maashri; Sayyid Samir Al-Busaidi; Ali Al Shidhani
This paper presents an algorithmic procedure for determining cryptographic key properties and matching them to the required complexity and strength, in order to assure more reliable and secure designs of cryptographic systems. The designed algorithm is capable of providing the cryptographic key structure based on an optimum-solution approach. Using the hardware description language (HDL) Verilog, the key can be realized on a Field Programmable Gate Array (FPGA) platform and then translated into a Printed Circuit Board (PCB).