Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Muhsen Owaida is active.

Publication


Featured research published by Muhsen Owaida.


field-programmable custom computing machines | 2011

Synthesis of Platform Architectures from OpenCL Programs

Muhsen Owaida; Nikolaos Bellas; Konstantis Daloukas; Christos D. Antonopoulos

The problem of automatically generating hardware modules from a high-level representation of an application has been at the research forefront in the last few years. In this paper, we use OpenCL, an industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our architectural synthesis tool, SOpenCL (Silicon-OpenCL), adapts OpenCL into a novel hardware design flow which efficiently maps the coarse- and fine-grained parallelism of an application onto an FPGA reconfigurable fabric. SOpenCL is based on a source-to-source code transformation step that coarsens the OpenCL fine-grained parallelism into a series of nested loops, and on a template-based hardware generation back-end that configures the accelerator based on the functionality and the application's performance and area requirements. Our experimentation with a variety of OpenCL and C kernel benchmarks reveals that area-, throughput- and frequency-optimized hardware implementations are attainable using SOpenCL.
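The coarsening step described above can be illustrated in plain C: the implicit per-work-item parallelism of an OpenCL kernel becomes an explicit loop over the NDRange, which a hardware back-end can then pipeline. This is a hedged sketch of the idea only; `vadd_coarsened` is a hypothetical name, not actual SOpenCL output.

```c
#include <assert.h>
#include <stddef.h>

/* Conceptually, an OpenCL kernel such as
 *   __kernel void vadd(...) { int i = get_global_id(0); c[i] = a[i] + b[i]; }
 * is coarsened into a nested-loop form: the work-item index that was
 * implicit (get_global_id) becomes an explicit loop variable. */
static void vadd_coarsened(const float *a, const float *b, float *c,
                           size_t global_size)
{
    for (size_t gid = 0; gid < global_size; ++gid) { /* was get_global_id(0) */
        c[gid] = a[gid] + b[gid];
    }
}
```

The loop nest exposes the same parallelism explicitly, so a template-based back-end can decide how much of it to unroll or pipeline.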


field-programmable custom computing machines | 2012

Shortening Design Time through Multiplatform Simulations with a Portable OpenCL Golden-model: The LDPC Decoder Case

Gabriel Falcao; Muhsen Owaida; David Novo; Madhura Purnaprajna; Nikolaos Bellas; Christos D. Antonopoulos; Georgios Karakonstantis; Andreas Burg; Paolo Ienne

Hardware designers and engineers typically need to explore a multi-parametric design space in order to find the best configuration for their designs, using simulations that can take weeks to months to complete. For example, designers of special-purpose chips need to explore parameters such as the optimal bit width and data representation. This is the case for the development of complex algorithms such as Low-Density Parity-Check (LDPC) decoders used in modern communication systems. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to graphics processing units (GPUs) and FPGAs. Depending on the simulation requirements, the ideal architecture to use can vary. In this paper we propose a new design flow based on OpenCL, a unified multiplatform programming model, which accelerates LDPC decoding simulations, thereby significantly reducing architectural exploration and design time. OpenCL-based parallel kernels are used without modifications or code tuning on multicore CPUs, GPUs and FPGAs. We use SOpenCL (Silicon-OpenCL), a tool that automatically converts OpenCL kernels to RTL, for mapping the simulations onto FPGAs. To the best of our knowledge, this is the first time that a single, unmodified OpenCL code is used to target those three different platforms. We show that, depending on the design parameters to be explored in the simulation, and on the dimension and phase of the design, the GPU or the FPGA may suit different purposes more conveniently, providing different acceleration factors. For example, although simulations can typically execute more than 3× faster on FPGAs than on GPUs, the overhead of circuit synthesis often outweighs the benefits of FPGA-accelerated execution.
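One axis of the design space mentioned above, the fixed-point bit width, is easy to picture in software. The following C sketch is illustrative only (the `quantize` helper is an assumption, not code from the paper): it quantizes a value, such as an LDPC log-likelihood ratio, to a given signed integer/fraction split so the precision/cost trade-off can be simulated.

```c
#include <assert.h>

/* Quantize x to a signed fixed-point format with int_bits integer bits
 * (including sign) and frac_bits fractional bits: round to the nearest
 * representable step, then saturate to the representable range.
 * Purely illustrative of bit-width exploration. */
static double quantize(double x, int int_bits, int frac_bits)
{
    double scale = (double)(1 << frac_bits);
    double max = (double)(1 << (int_bits - 1)) - 1.0 / scale;
    double min = -(double)(1 << (int_bits - 1));
    long long q = (long long)(x * scale + (x >= 0 ? 0.5 : -0.5)); /* round */
    double y = (double)q / scale;
    if (y > max) y = max;  /* saturate high */
    if (y < min) y = min;  /* saturate low  */
    return y;
}
```

Sweeping `int_bits`/`frac_bits` over a decoder's message values is exactly the kind of simulation loop whose runtime motivates the multiplatform acceleration in the paper.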


international conference on management of data | 2017

Accelerating Pattern Matching Queries in Hybrid CPU-FPGA Architectures

David Sidler; Zsolt István; Muhsen Owaida; Gustavo Alonso

Taking advantage of recently released hybrid multicore architectures, such as Intel's Xeon+FPGA machine, where the FPGA has coherent access to the main memory through the QPI bus, we explore the benefits of specializing operators to hardware. We focus on two commonly used SQL operators for strings, LIKE and REGEXP_LIKE, and provide a novel and efficient implementation of these operators in reconfigurable hardware. We integrate the hardware accelerator into MonetDB, a main-memory column store, and demonstrate a significant improvement in response time and throughput. Our Hardware User Defined Function (HUDF) can speed up complex pattern matching by an order of magnitude in comparison to the database running on a 10-core CPU. The insights gained from integrating hardware-based string operators into MonetDB should also be useful for future designs combining hardware specialization and databases.
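For reference, the semantics the LIKE hardware operator must implement can be stated in a few lines of C: `%` matches any run of characters and `_` matches exactly one. This recursive matcher is purely illustrative and unrelated to the pipelined circuit in the paper.

```c
#include <assert.h>

/* Minimal SQL LIKE matcher: returns 1 if string s matches pattern p,
 * where '%' matches zero or more characters and '_' matches one. */
static int sql_like(const char *s, const char *p)
{
    if (*p == '\0') return *s == '\0';
    if (*p == '%')                        /* '%' : skip it, or consume one char */
        return sql_like(s, p + 1) || (*s != '\0' && sql_like(s + 1, p));
    if (*s == '\0') return 0;
    if (*p == '_' || *p == *s)            /* '_' or literal match */
        return sql_like(s + 1, p + 1);
    return 0;
}
```

The FPGA version in the paper evaluates patterns like this in a streaming, pipelined fashion rather than by recursion, which is where the order-of-magnitude speedup comes from.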


international conference on multimedia and expo | 2009

A high performance and low power hardware architecture for the transform & quantization stages in H.264

Muhsen Owaida; Maria G. Koziri; Ioannis Katsavounidis; Georgios I. Stamoulis

In this work, we present a hardware architecture prototype for the various types of transforms, and the accompanying quantization, supported in the H.264 baseline profile video encoding standard. The proposed architecture achieves high performance and can satisfy Quad Full High Definition (QFHD, 3840×2160 @ 150 Hz) coding. The transforms are implemented using only add and shift operations, which reduces the computation overhead. A modification of the quantization equations' representation is suggested to remove the overhead of the absolute-value and re-sign operation stages. Additionally, a post-scale Hadamard transform computation is presented. The architecture can achieve a reduction of about 20% in power consumption compared to existing implementations.
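The add-and-shift property mentioned above is concrete in the well-known H.264 4×4 forward integer transform, shown below in its standard reference form (this is the textbook butterfly, not the paper's RTL): every multiplication by 2 is a left shift.

```c
#include <assert.h>

/* H.264 4x4 forward integer transform: apply the 1-D butterfly to the
 * rows, then to the columns. Only additions, subtractions and shifts. */
static void h264_transform4x4(const int in[4][4], int out[4][4])
{
    int tmp[4][4];
    for (int i = 0; i < 4; ++i) {                 /* row transform */
        int a = in[i][0] + in[i][3], b = in[i][1] + in[i][2];
        int c = in[i][1] - in[i][2], d = in[i][0] - in[i][3];
        tmp[i][0] = a + b;
        tmp[i][2] = a - b;
        tmp[i][1] = (d << 1) + c;                 /* 2*d via shift */
        tmp[i][3] = d - (c << 1);                 /* 2*c via shift */
    }
    for (int j = 0; j < 4; ++j) {                 /* column transform */
        int a = tmp[0][j] + tmp[3][j], b = tmp[1][j] + tmp[2][j];
        int c = tmp[1][j] - tmp[2][j], d = tmp[0][j] - tmp[3][j];
        out[0][j] = a + b;
        out[2][j] = a - b;
        out[1][j] = (d << 1) + c;
        out[3][j] = d - (c << 1);
    }
}
```

A constant 4×4 block produces only a DC coefficient (16× the input value at position [0][0]), which is a handy sanity check for any hardware implementation.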


international conference on computer aided design | 2011

Massively parallel programming models used as hardware description languages: the OpenCL case

Muhsen Owaida; Nikolaos Bellas; Christos D. Antonopoulos; Konstantis Daloukas; Charalambos Antoniadis

The problem of automatically generating hardware modules from high-level application representations has been at the forefront of EDA research during the last few years. In this paper, we introduce a methodology to automatically synthesize hardware accelerators from OpenCL applications. OpenCL is a recent industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our methodology maps OpenCL kernels onto hardware accelerators, based on architectural templates that explicitly decouple computation from memory communication whenever this is possible. The templates can be tuned to provide a wide repertoire of accelerators that meet user performance requirements and FPGA device characteristics. Furthermore, a set of high- and low-level compiler optimizations is applied to generate optimized accelerators. Our experimental evaluation shows that the generated accelerators are tuned efficiently to match the application's memory access pattern and computational complexity, and to achieve user performance requirements. An important objective of our tool is to expand the FPGA development user base to software engineers, thereby expanding the scope of FPGAs beyond the realm of hardware design.
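The decoupling of computation from memory communication can be pictured as two cooperating stages. In this C sketch (all names illustrative, both "stages" simulated in software) a fetch stage handles address generation and streams operands into a buffer, while the compute stage consumes the stream without issuing any memory addresses of its own.

```c
#include <assert.h>
#include <stddef.h>

/* Fetch stage: address generation plus loads, filling a stream buffer.
 * In the hardware templates this would be a dedicated memory engine. */
static void fetch_stage(const int *mem, const size_t *addrs, size_t n,
                        int *stream)
{
    for (size_t i = 0; i < n; ++i)
        stream[i] = mem[addrs[i]];
}

/* Compute stage: pure arithmetic over the stream; it never touches
 * memory addresses, so it can be pipelined independently. */
static int compute_stage(const int *stream, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += stream[i];
    return acc;
}
```

Separating the two concerns like this is what lets each template be tuned independently to the application's access pattern and arithmetic intensity.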


field programmable custom computing machines | 2017

Centaur: A Framework for Hybrid CPU-FPGA Databases

Muhsen Owaida; David Sidler; Kaan Kara; Gustavo Alonso

Accelerating relational databases in general, and SQL in particular, has become an important topic given the challenges arising from large data collections and increasingly complex workloads. Most existing work, however, has focused on either accelerating a single operator (e.g., a join) or on data reduction along the data path (e.g., from disk to CPU). In this paper we focus instead on the system aspects of accelerating a relational engine in hybrid CPU-FPGA architectures. In particular, we present Centaur, a framework running on the FPGA that allows the dynamic allocation of FPGA operators to query plans, the pipelining of these operators among themselves when needed, and the hybrid execution of operator pipelines running on the CPU and the FPGA. Centaur is fully compatible with relational engines, as we demonstrate through its seamless integration with MonetDB, a popular column-store database. In the paper, we describe how this integration is achieved and empirically demonstrate the advantages of such an approach. The main contribution of the paper is to provide a realistic solution for accelerating SQL that is compatible with existing database architectures, thereby opening up the possibilities for further exploration of FPGA-based data processing.
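The idea of composing operators into pipelines where each stage may run on either device can be sketched in plain C. Here both "devices" are ordinary functions and every name is hypothetical, standing in for operators that a framework like Centaur could place on the CPU or the FPGA; this is not the framework's actual API.

```c
#include <assert.h>
#include <stddef.h>

/* An operator consumes n input tuples and returns how many it produced. */
typedef size_t (*operator_fn)(const int *in, size_t n, int *out);

static size_t op_filter_positive(const int *in, size_t n, int *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; ++i)
        if (in[i] > 0) out[m++] = in[i];   /* selection-style operator */
    return m;
}

static size_t op_double(const int *in, size_t n, int *out)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * 2;                /* projection-style operator */
    return n;
}

/* Run a two-stage pipeline; in a hybrid system the intermediate buffer
 * could be shared memory visible to both the CPU and the FPGA. */
static size_t run_pipeline(operator_fn s1, operator_fn s2,
                           const int *in, size_t n, int *tmp, int *out)
{
    size_t m = s1(in, n, tmp);
    return s2(tmp, m, out);
}
```

Because each stage only sees a tuple stream, the scheduler is free to swap a software stage for a hardware one without changing the plan's shape.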


field programmable logic and applications | 2015

Automatic support for multi-module parallelism from computational patterns

Nithin George; HyoukJoong Lee; David Novo; Muhsen Owaida; David L. Andrews; Kunle Olukotun; Paolo Ienne

Field Programmable Gate Arrays (FPGAs) can be customized into application-specific architectures to achieve high performance and energy-efficiency. Unfortunately, they are yet to gain significant adoption by application developers due to their low-level programming model. Moreover, to obtain good performance in an FPGA design, one often needs to correctly parallelize computation and balance the computational throughput with the available data access bandwidth. To address the programming model problem, recent efforts have focused on composing applications out of parallel computational patterns, such as map, reduce, zipWith and foreach, and leveraging the properties of these patterns to generate highly parallel hardware modules capable of high performance. In this work, we focus on the problem of further improving the performance and show that we can utilize the knowledge of how data is consumed and produced by these computational patterns in conjunction with the information of the system architecture to automatically parallelize computations across multiple hardware modules. To achieve this, we automatically infer synchronization needs arising due to parallelization and generate a complete system that can obtain high performance for a given application. We evaluate our approach using seven applications from different domains and show that our automatically generated designs achieve performance improvements ranging from 1.8 to 9.4 times.
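The tile-and-merge parallelization described above can be sketched as follows, with each hardware "module" simulated by a function call. The names and structure are illustrative of the idea, not the tool's generated designs; the merge step is where the automatically inferred synchronization would sit in a real multi-module system.

```c
#include <assert.h>
#include <stddef.h>

/* One "module": a fused map (square) + reduce (sum) over its tile. */
static long module_map_reduce(const int *tile, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (long)tile[i] * tile[i];
    return acc;
}

/* Split the input across `modules` instances and merge their partial
 * results; in hardware the instances would run concurrently. */
static long multi_module_sum_squares(const int *data, size_t n, size_t modules)
{
    size_t tile = (n + modules - 1) / modules;      /* ceil(n / modules) */
    long total = 0;
    for (size_t m = 0; m < modules; ++m) {
        size_t begin = m * tile;
        size_t len = begin >= n ? 0
                   : (n - begin < tile ? n - begin : tile);
        total += module_map_reduce(data + begin, len);
    }
    return total;
}
```

The result is independent of the module count, which is exactly what makes the degree of parallelism a free tuning knob to balance against the available memory bandwidth.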


field programmable gate arrays | 2016

FPRESSO: Enabling Express Transistor-Level Exploration of FPGA Architectures

Grace Zgheib; Manana Lortkipanidze; Muhsen Owaida; David Novo; Paolo Ienne

In theory, tools like VTR (a retargetable toolchain mapping circuits onto easily described hypothetical FPGA architectures) could play a key role in the development of wildly innovative FPGA architectures. In practice, however, the experiments that one can conduct with these tools are severely limited by the ability of FPGA architects to produce reliable delay and area models; these depend on transistor-level design techniques which require a different set of skills. In this paper, we introduce a novel approach, which we call Fpresso, to model the delay and area of a wide range of largely different FPGA architectures quickly and with reasonable accuracy. We take inspiration from the way a standard-cell flow performs large-scale transistor-size optimization and apply the same concepts to FPGAs, only at a coarser granularity. Skilled users prepare for Fpresso locally optimized libraries of basic components with a variety of driving strengths. Then, ordinary users specify arbitrary FPGA architectures as interconnects of basic components. This is globally optimized within minutes through an ordinary logic-synthesis tool which chooses the most fitting version of each cell and adds buffers wherever appropriate. The resulting delay and area characteristics can be used automatically by VTR. Our results show that Fpresso provides models that are on average within some 10-20% of those produced by a state-of-the-art FPGA optimization tool, and is orders of magnitude faster. Although the modeling error may appear relatively high, we show that it seldom results in misranking a set of architectures, thus indicating reasonable modeling faithfulness.


international conference on management of data | 2017

doppioDB: A Hardware Accelerated Database

David Sidler; Zsolt István; Muhsen Owaida; Kaan Kara; Gustavo Alonso

Relational databases provide a wealth of functionality to a wide range of applications. Yet, there are tasks for which they are less than optimal, for instance when processing becomes more complex (e.g., matching regular expressions) or the data is less structured (e.g., text or long strings). In this demonstration we show the benefit of using specialized hardware for such tasks and highlight the importance of a flexible, reusable mechanism for extending database engines with hardware-based operators. We present doppioDB, which consists of MonetDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs). In our demonstration the HUDFs are used to provide seamless acceleration of two string operators, LIKE and REGEXP_LIKE, and two analytics operators, SKYLINE and SGD (stochastic gradient descent). We evaluate doppioDB on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables. For integration we rely on HUDFs as a unit of scheduling and management on the FPGA. In the demonstration we show the acceleration benefits of hardware operators, as well as their flexibility in accommodating changing workloads.


field-programmable custom computing machines | 2013

On the Portability of the OpenCL Dwarfs on Fixed and Reconfigurable Parallel Platforms

Konstantinos Krommydas; Muhsen Owaida; Christos D. Antonopoulos; Nikolaos Bellas; Wu-chun Feng

We present a hardware architecture for a heapsort algorithm; the sorting is employed in the subband-coding block of a wavelet-based image coder termed the Oktem image coder. Although this coder provides good image quality, the sorting is time consuming and application specific: it is used repeatedly on different volumes of data in the subband coding, so a simple hardware implementation with a fixed sorting capacity would be difficult to scale at runtime. To tackle this problem, time/power efficiency and sorting-size flexibility have to be taken into account. We propose an improved FPGA heapsort architecture, based on Zabolotny's work, as an IP accelerator for the image coder. We present a configurable architecture that uses adaptive layer-enable elements, so the sorting capacity can be adjusted at runtime to efficiently sort different amounts of data. With adaptive memory shutdown, our improved architecture provides up to 20.9% power reduction on the memories compared to the baseline implementation. Moreover, our architecture provides a 13x speedup compared to an ARM Cortex-A9.
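For reference, the heapsort algorithm that the architecture accelerates looks like this in software: build a max-heap, then repeatedly move the root to the end and sift down. This is a plain C sketch of the classic algorithm, unrelated to the RTL or its layer-enable mechanism.

```c
#include <assert.h>
#include <stddef.h>

/* Restore the max-heap property for the subtree rooted at `start`,
 * considering only elements with index < end. */
static void sift_down(int *a, size_t start, size_t end)
{
    size_t root = start;
    while (2 * root + 1 < end) {
        size_t child = 2 * root + 1;
        if (child + 1 < end && a[child] < a[child + 1]) child++;
        if (a[root] >= a[child]) return;
        int t = a[root]; a[root] = a[child]; a[child] = t;
        root = child;
    }
}

static void heapsort_int(int *a, size_t n)
{
    if (n < 2) return;
    for (size_t i = n / 2; i-- > 0; )       /* heapify: last parent to root */
        sift_down(a, i, n);
    for (size_t end = n - 1; end > 0; --end) {
        int t = a[0]; a[0] = a[end]; a[end] = t;  /* extract current max */
        sift_down(a, 0, end);
    }
}
```

The hardware version maps the heap's layers onto memory banks, which is what the layer-enable elements can switch off when a smaller sorting capacity is configured.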

Collaboration


Dive into Muhsen Owaida's collaborations.

Top Co-Authors

Paolo Ienne

École Polytechnique Fédérale de Lausanne

David Novo

École Polytechnique Fédérale de Lausanne