Ahmed Elhossini | Researchain

Publication

Featured research published by Ahmed Elhossini.

international parallel and distributed processing symposium | 2015

Nexus#: A Distributed Hardware Task Manager for Task-Based Programming Models

Tamer Dallou; Nina Engelhardt; Ahmed Elhossini; Ben H. H. Juurlink

In the era of multicore systems, the number of cores that can be integrated on a single chip is expected to reach three digits. The key to utilizing such huge computational power is to extract very fine-grained parallelism from the user program. This is non-trivial for the average programmer, and becomes very hard as the number of potentially parallel instances increases. Task-based programming models such as OmpSs are promising, since they handle the detection of dependencies and synchronization for the programmer. However, state-of-the-art research shows that task management is not cheap and introduces a significant overhead that limits the scalability of OmpSs. Nexus# is a hardware accelerator for the OmpSs runtime system, which dynamically monitors dependencies between tasks. It is fully synthesizable in VHDL and has a distributed task graph model to achieve the best scalability. Supporting tasks with an arbitrary number of parameters and any dependency pattern, Nexus# achieves better performance than Nanos, the official OmpSs runtime system, and scales well for the H264dec benchmark with very fine-grained tasks, among other benchmarks from the Starbench suite.
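As a rough software illustration of what "monitoring dependencies between tasks" involves, the sketch below tracks read and write sets per task and derives the tasks each new task must wait for. It is a generic toy model, not the Nexus# design: the class `TaskGraph`, the method `submit`, and the use of plain strings as addresses are all assumptions made for this example.

```python
from collections import defaultdict


class TaskGraph:
    """Toy dependency tracker (hypothetical): tasks declare the addresses they read and write."""

    def __init__(self):
        self.last_writer = {}             # address -> id of the most recent task writing it
        self.readers = defaultdict(set)   # address -> tasks that read it since that write
        self.deps = {}                    # task id -> set of task ids it must wait for

    def submit(self, task_id, reads=(), writes=()):
        """Register a task and return the set of earlier tasks it depends on."""
        deps = set()
        for addr in reads:
            # Read-after-write: wait for the last producer of this address.
            if addr in self.last_writer:
                deps.add(self.last_writer[addr])
        for addr in writes:
            # Write-after-write: wait for the previous writer.
            if addr in self.last_writer:
                deps.add(self.last_writer[addr])
            # Write-after-read: wait for everyone still reading the old value.
            deps |= self.readers[addr]

        # Update the tracking tables only after all dependencies are collected.
        for addr in reads:
            self.readers[addr].add(task_id)
        for addr in writes:
            self.last_writer[addr] = task_id
            self.readers[addr] = set()

        self.deps[task_id] = deps
        return deps


graph = TaskGraph()
graph.submit("A", writes=["x"])          # no dependencies
graph.submit("B", reads=["x"])           # depends on A
graph.submit("C", reads=["x"])           # depends on A, independent of B
graph.submit("D", writes=["x"])          # depends on A, B and C
```

In a software runtime, every `submit` call costs table lookups and updates like these, which is the kind of per-task overhead the abstract argues becomes significant for very fine-grained tasks.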

high performance computing and communications | 2014

An Integrated Hardware-Software Approach to Task Graph Management

Nina Engelhardt; Tamer Dallou; Ahmed Elhossini; Ben H. H. Juurlink

Task-based parallel programming models with explicit data dependencies, such as OmpSs, are gaining popularity, due to the ease of describing parallel algorithms with complex and irregular dependency patterns. These advantages, however, come at a steep cost of runtime overhead incurred by dynamic dependency resolution. Hardware support for task management has been proposed in previous work as a possible solution. We present VSs, a runtime library for the OmpSs programming model that integrates the Nexus++ hardware task manager, and evaluate the performance of the VSs-Nexus++ system. Experimental results show that applications with fine-grain tasks can achieve speedups of up to 3.4×, while applications optimized for current runtimes attain 1.3×. Providing support for hardware task managers in runtime libraries is therefore a viable approach to improve the performance of OmpSs applications.

applied reconfigurable computing | 2015

An Efficient and Flexible FPGA Implementation of a Face Detection System

Hichem Ben Fekih; Ahmed Elhossini; Ben H. H. Juurlink

This paper proposes a hardware architecture based on the object detection system of Viola and Jones using Haar-like features. The proposed design is able to detect faces in real time with high accuracy. Speed-up is achieved by exploiting the parallelism in the design, where multiple classifier cores can be added. To maintain a flexible design, classifier cores can be assigned to different images. Moreover, using different training data, every core is able to detect a different object type. As the development platform, the Zynq-7000 SoC from Xilinx is used, which features an ARM Cortex-A9 dual-core CPU and programmable logic (FPGA). The current implementation focuses on face detection and achieves real-time detection at a rate of 16.53 FPS at an image resolution of 640 × 480 pixels, which represents a speed-up of 6.46 times compared to the equivalent OpenCV software solution.
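For readers unfamiliar with Haar-like features, the following minimal sketch shows the standard integral-image trick from the Viola and Jones framework, which lets any rectangle sum be computed with four lookups. It is a plain software illustration, not the paper's hardware architecture; the function names and the tiny test image are assumptions for this example.

```python
def integral_image(img):
    """Return ii where ii[y][x] is the sum of img rows 0..y-1 and columns 0..x-1.

    The result has one extra row and column of zeros so rectangle sums need no edge cases.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Two-rectangle Haar-like feature: left half minus right half (w must be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
ii = integral_image(image)
feature = two_rect_feature(ii, 0, 0, 4, 2)   # (1+2+5+6) - (3+4+7+8) = -8
```

Because each feature costs a fixed number of lookups regardless of its size, many features can be evaluated independently, which is what makes the approach amenable to parallel classifier cores.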

international conference on embedded computer systems architectures modeling and simulation | 2014

GPGPU workload characteristics and performance analysis

Sohan Lal; Jan Lucas; Michael Andersch; Mauricio Alvarez-Mesa; Ahmed Elhossini; Ben H. H. Juurlink

GPUs are much more power-efficient devices than CPUs, but due to several performance bottlenecks, the performance per watt of GPUs is often much lower than what could be achieved theoretically. To sustain and continue high-performance computing growth, new architectural and application techniques are required to create power-efficient computing systems. To find such techniques, however, it is necessary to study the power consumption at a detailed level and understand the bottlenecks which cause low performance. Therefore, in this paper, we study GPU power consumption at the component level and investigate the bottlenecks that cause low performance and low energy efficiency. We divide the low-performance kernels into low-occupancy and full-occupancy categories. For the low-occupancy category, we study whether increasing the occupancy helps to increase performance and energy efficiency. For the full-occupancy category, we investigate whether these kernels are limited by memory bandwidth, coalescing efficiency, or SIMD utilization.

applied reconfigurable computing | 2017

A Quantitative Analysis of the Memory Architecture of FPGA-SoCs

Matthias Göbel; Ahmed Elhossini; Chi Ching Chi; Mauricio Alvarez-Mesa; Ben H. H. Juurlink

In recent years, so-called FPGA-SoCs have been introduced by Intel (formerly Altera) and Xilinx. These devices combine multi-core processors with programmable logic. This paper analyzes the various memory and communication interconnects found in actual devices, particularly the Zynq-7020 and Zynq-7045 from Xilinx and the Cyclone V SE SoC from Intel. Issues such as different access patterns, cache coherence and full-duplex communication are analyzed, both for generic accesses and for a real workload from the field of video coding. Furthermore, the paper shows that by carefully choosing the memory interconnect networks as well as the software interface, high-speed memory access can be achieved for various scenarios.

international conference on microelectronics | 2016

A data access prediction unit for multimedia applications

Tareq Alawneh; Ahmed Elhossini

A large number of algorithms that work on multimedia data, such as images and videos, process rectangular regions of pixels. If this distinctive access pattern and other data accesses are exploited properly, significant application performance improvements can result. To exploit these data accesses efficiently, information about them must be monitored and detected at runtime. For this purpose, we propose a data access prediction unit for multimedia applications. Our simulation results reveal that the proposed unit predicts the data access information with an average accuracy of 87.4% for the evaluated workloads when it uses a history table with 64 entries.
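The abstract does not describe the internals of the proposed unit. Purely to illustrate what a history-table predictor can look like, the sketch below implements a generic stride predictor with a bounded table; this is a swapped-in textbook technique, not the authors' design, and the class name, the notion of a `stream` key, and the oldest-entry eviction policy are all assumptions.

```python
class StridePredictor:
    """Generic stride predictor with a bounded history table (hypothetical illustration)."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}   # stream key -> last address seen; dicts keep insertion order

    def access(self, stream, address):
        """Record an access and return the predicted next address, or None without history."""
        if stream in self.table:
            # Known stream: assume the next access repeats the most recent stride.
            last_address = self.table.pop(stream)
            prediction = address + (address - last_address)
        else:
            prediction = None
            if len(self.table) >= self.entries:
                # Table full: evict the least recently updated entry.
                self.table.pop(next(iter(self.table)))
        # Re-insert so this stream becomes the most recently updated entry.
        self.table[stream] = address
        return prediction


predictor = StridePredictor(entries=64)
predictor.access("row0", 100)            # no history yet, returns None
next_addr = predictor.access("row0", 104)  # stride of 4, predicts 108
```

A predictor tuned for rectangular pixel regions would need to capture more than a single stride (for example, the jump from the end of one row to the start of the next), which a simple table like this one does not model.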

great lakes symposium on vlsi | 2014

A generic implementation of a quantified predictor on FPGAs

Gervin Thomas; Ahmed Elhossini; Ben H. H. Juurlink

Predictors are used in many fields of computer architecture to enhance performance. With good estimations of future system behaviour, policies can be developed to improve system performance or reduce power consumption. These policies become more effective if the predictors are implemented in hardware and can provide quantified forecasts and not only binary ones. In this paper, we present and evaluate a generic predictor implemented in VHDL running on an FPGA which produces quantified forecasts. Moreover, a complete scalability analysis is presented which shows that our implementation has a maximum device utilization of less than 5%. Furthermore, we analyse the power consumption of the predictor running on an FPGA. Additionally, we show that this implementation can be clocked at over 210 MHz. Finally, we evaluate a power-saving policy based on our hardware predictor. Based on predicted idle periods, this power-saving policy uses power-saving modes and is able to reduce memory power consumption by 14.3%.

field-programmable technology | 2016

FPGA based hardware accelerator for KAZE feature extraction algorithm

Lester Kalms; Ahmed Elhossini; Ben H. H. Juurlink

Processing and understanding visual data is of significant importance in many applications, such as robotics and vision aid devices. Extracting image features is one of the important tasks in computer vision. This paper focuses on the KAZE features algorithm, due to its good performance. KAZE features is a multi-scale 2D feature detection and description algorithm. It describes 2D features in a non-linear scale space by means of non-linear diffusion filtering. In this paper, the algorithm was optimized for speed, memory usage and portability. The paper presents a hardware accelerator for the scale-space analysis part of the algorithm on an FPGA. A high speed-up has been achieved by this accelerator by parallelizing several parts of the algorithm and reducing the memory bandwidth.

field programmable gate arrays | 2015

An Efficient and Flexible FPGA Implementation of a Face Detection System (Abstract Only)

Hichem Ben Fakih; Ahmed Elhossini; Ben H. H. Juurlink

Robust and rapid face detection systems are constantly gaining more interest, since they represent the first step for many challenging tasks in the field of computer vision. In this paper, a software-hardware co-design approach is presented that enables the detection of frontal faces in real time. A complete hardware implementation of all components taking part in face detection is introduced. This work is based on the object detection framework of Viola and Jones, which makes use of a cascade of classifiers to reduce the computation time. The proposed architecture is flexible, as it allows the use of multiple instances of the face detector. This leaves developers free to choose the speed range and the resources reserved for this task. The current implementation runs on the Zynq SoC and receives images over an IP network, which allows exposing the face detection task as a remote service that can be consumed from any device connected to the network. We performed several measurements for the final detector and the software equivalent. Using three Evaluator cores, the ZedBoard system achieves a maximal average frame rate of 13.4 FPS when analysing an image of 640 × 480 pixels. This corresponds to an improvement of 5.25 times compared to the software solution and represents acceptable results for most real-time systems. On the ZC706 system, a higher frame rate of 16.58 FPS is achieved. The proposed hardware solution achieved 92% accuracy, which is lower than that of the software solution (97%) due to a different scaling algorithm. The proposed solution achieved a higher frame rate than other solutions found in the literature.

digital systems design | 2017