Fatma Ezahra Sayadi | Researchain

Publication

Featured research published by Fatma Ezahra Sayadi.

Microprocessors and Microsystems | 2015

Optimized parallel implementation of face detection based on GPU component

Marwa Chouchene; Fatma Ezahra Sayadi; Haythem Bahri; Julien Dubois; Johel Miteran; Mohamed Atri

An algorithm for face detection is implemented on a CPU. The algorithm is accelerated by migrating it to a GPU. The performance of the GPU implementation demonstrates its effectiveness. Further optimization methods are applied on the GPU. Face detection is important in domains such as biometrics, video surveillance, and human-computer interaction. Generally, a generic face processing system includes a face detection or recognition step, as well as tracking and rendering phases. In this paper, we develop a real-time and robust face detection implementation based on a GPU. Face detection is performed by adapting the Viola and Jones algorithm. We have designed several optimized parallel implementations of this algorithm on graphics processors (GPUs) using CUDA (Compute Unified Device Architecture). First, we implemented the Viola and Jones algorithm in a basic CPU version. The basic application is then extended to a GPU version using CUDA, freeing the CPU to perform other tasks. Next, the face detection algorithm is optimized for the GPU using a grid topology and shared memory. These programs are compared and the results are presented. Finally, to improve the quality of face detection, a second proposal is made by implementing the WaldBoost algorithm.

Signal Processing Systems | 2006

G729 Voice Decoder Design

Fatma Ezahra Sayadi; Emmanuel Casseau; Mohamed Atri; Mehrez Marzougui; Rached Tourki; Eric Martin

Embedded digital signal processing (DSP) systems are usually associated with real-time constraints and/or high data rates, such that fully software implementations are often not satisfactory. In that case, mixed hardware/software implementations are to be investigated. This paper presents the design of a HW/SW G.729 voice decoder dedicated to embedded systems. The decoder has been built around, on the one hand, a reconfigurable digital circuit (FPGA) that implements the so-called IP hardware part (the autocorrelation computation) using a linear systolic array, and, on the other hand, a digital signal processor (DSP) for the remainder of the algorithm. Besides being driven by the use of reusable components (IP), such an implementation is of great interest for new G729-based applications such as Voice over IP (VoIP). It results in an overall reduction of the execution time per frame. Another interesting point is the design of a parameterizable autocorrelation block, which can be useful for a wide range of applications such as GSM 13 Kbit/s, APC 9.6 Kbit/s, and G723 at 6.3 Kbit/s and 5.3 Kbit/s. In the G729 context, using a V50 Virtex FPGA, this function executes 10 times faster than a TMS320C6201 DSP implementation.
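As a rough illustration of the kernel that the abstract assigns to the hardware part, the sketch below computes an autocorrelation sequence for a configurable number of lags in plain Python. It does not model the linear systolic array or any G.729 specifics; the function name `autocorrelation` and the parameter `max_lag` are hypothetical names chosen for this example.

```python
def autocorrelation(samples, max_lag):
    """Return [r[0], ..., r[max_lag]] with r[k] = sum_n samples[n] * samples[n + k].

    max_lag plays the role of the block's parameter: different codecs
    would request a different number of lags (an assumption for this sketch).
    """
    n = len(samples)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(samples)")
    return [
        sum(samples[i] * samples[i + k] for i in range(n - k))
        for k in range(max_lag + 1)
    ]


if __name__ == "__main__":
    frame = [1.0, 2.0, 3.0, 4.0]  # toy frame, not real speech data
    print(autocorrelation(frame, max_lag=2))
```

Each lag is an independent multiply-accumulate chain, which is the kind of regular structure that maps naturally onto an array of identical processing elements.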

2016 International Image Processing, Applications and Systems (IPAS) | 2016

Fast motion estimation for HEVC video coding

Randa Khemiri; Nejmeddine Bahri; Fatma Belghith; Fatma Ezahra Sayadi; Mohamed Atri; Nouri Masmoudi

In this paper, a fast configuration for Motion Estimation (ME) is described in order to reduce the computational time of the new High Efficiency Video Coding (HEVC) standard. This configuration uses the Coded Block Flag (CBF) Fast Method (CFM), the Early Coding Unit (CU) termination (ECU), and the Early Skip Detection (ESD) modes. The Diamond Pattern is used as the search algorithm for ME in the encoding process. Compared to the original HEVC reference software test model (HM) 16.2, experimental results show that the complexity is reduced, on average, by 56.75% with a small bit-rate and PSNR degradation.

Future Generation Computer Systems | 2018

Harris corner detection on a NUMA manycore

Olfa Haggui; Claude Tadonki; Lionel Lacassagne; Fatma Ezahra Sayadi; Bouraoui Ouni

Corner detection is a key kernel for many image processing procedures including pattern recognition and motion detection. The latter, for instance, mainly relies on the corner points for which spatial analyses are performed, typically on (probably live) videos or temporal flows of images. Thus, highly efficient corner detection is essential to meet the real-time requirement of associated applications. In this paper, we consider the corner detection algorithm proposed by Harris, whose main workflow is a composition of basic operators represented by their approximations using 3 × 3 matrices. The corresponding data access patterns follow a stencil model, which is known to require careful memory organization and management. Cache misses and other hindering factors on NUMA architectures need to be skillfully addressed in order to reach an efficient scalable implementation. In addition, with increasingly wide vector registers, an efficient SIMD version should be designed and explicitly implemented. In this paper, we study a direct and explicit implementation of common and novel optimization strategies, and provide a NUMA-aware parallelization. Experimental results on a dual-socket Intel Broadwell-E/EP show noticeably good scalability.

International Journal of Advanced Media and Communication | 2014

Efficient implementation of Sobel edge detection algorithm on CPU, GPU and FPGA

Marwa Chouchene; Fatma Ezahra Sayadi; Yahia Said; Mohamed Atri; Rached Tourki

Many applications in image processing have high degrees of inherent parallelism and are thus good candidates for parallel implementation. Programming tools have been developed for field programmable gate arrays (FPGAs), for SIMD instructions on CPUs, and for the large number of cores on graphics processing units (GPUs), but it is still difficult to achieve high performance on these platforms. This paper analyses the distinct features of the compute unified device architecture (CUDA) GPU and summarises the general CUDA programming model. Furthermore, we present three different implementations of Sobel edge detection on CPU, FPGA and GPU. Test image data are run on these hardware platforms to compare the computational efficiency of CPU, GPU and FPGA.
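For readers unfamiliar with the operator, here is a minimal sequential sketch of Sobel edge detection in plain Python. It is not the authors' CPU, GPU or FPGA code. It assumes a grayscale image stored as a list of rows, leaves border pixels at zero, and uses the common |Gx| + |Gy| approximation of the gradient magnitude; the name `sobel_magnitude` is hypothetical.

```python
# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return |Gx| + |Gy| for every interior pixel; border pixels stay 0."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    p = image[y + dy - 1][x + dx - 1]
                    gx += KX[dy][dx] * p
                    gy += KY[dy][dx] * p
            out[y][x] = abs(gx) + abs(gy)
    return out


if __name__ == "__main__":
    # Toy image with a vertical edge between columns 1 and 2.
    img = [[0, 0, 10, 10] for _ in range(4)]
    for row in sobel_magnitude(img):
        print(row)
```

Every output pixel depends only on its own 3 × 3 neighbourhood of the input, so the two outer loops can be distributed across threads or hardware pipelines without synchronisation, which is the inherent parallelism the abstract refers to.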

IET Image Processing | 2018

Optimisation of HEVC motion estimation exploiting SAD and SSD GPU-based implementation

Randa Khemiri; Hassan Kibeya; Fatma Ezahra Sayadi; Nejmeddine Bahri; Mohamed Atri; Nouri Masmoudi

The new High-Efficiency Video Coding (HEVC) standard doubles the video compression ratio compared to the previous H.264/AVC at the same video quality and without any degradation. However, this important performance is achieved by increasing the encoder computational complexity. That is why HEVC complexity is a crucial subject. The most time-consuming and computationally intensive part of HEVC is motion estimation, which is based principally on the sum of absolute differences (SAD) or the sum of square differences (SSD) algorithms. For these reasons, the authors propose an implementation of these algorithms on a low-cost NVIDIA GPU (graphics processing unit) using the Fermi architecture, developed with the Compute Unified Device Architecture (CUDA) language. The proposed algorithm is based on a parallel-difference and a parallel-reduction process. The experimental results show a significant speed-up in terms of execution time for most 64 × 64 pixel blocks. In fact, the proposed parallel algorithm permits a significant reduction in the execution time that reaches up to 56.17 and 30.4%, compared to the CPU, for SAD and SSD algorithms, respectively. This improvement shows that parallelising the algorithm with the new proposed reduction process for the Fermi-GPU generation leads to better results. These findings are based on a static study that determines the PU percentage utilisation for each dimension in the HEVC. This study shows that the larger PUs are the most utilised in temporal levels 3 and 4, which attain 84.56% for class E. This improvement is accompanied by an average peak signal-to-noise ratio loss of 0.095 dB and a decrease of 0.64% in terms of BitRate.
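The two-stage structure named in the abstract (an element-wise difference followed by a reduction) can be sketched sequentially in plain Python as follows. This is only an illustration of the general idea, not the authors' CUDA kernel: the pairwise-halving loop imitates the access pattern of a typical GPU tree reduction, and the names `tree_reduce`, `sad` and `ssd` are hypothetical.

```python
def tree_reduce(values):
    """Sum values by repeated pairwise halving, as a GPU tree reduction would."""
    data = list(values)
    n = len(data)
    while n > 1:
        half = (n + 1) // 2
        # On a GPU, each iteration of this inner loop would be one thread.
        for i in range(n // 2):
            data[i] += data[i + half]
        n = half
    return data[0] if data else 0


def sad(cur, ref):
    """Sum of absolute differences between two equally sized blocks."""
    diffs = [abs(c - r) for rc, rr in zip(cur, ref) for c, r in zip(rc, rr)]
    return tree_reduce(diffs)


def ssd(cur, ref):
    """Sum of squared differences between two equally sized blocks."""
    diffs = [(c - r) ** 2 for rc, rr in zip(cur, ref) for c, r in zip(rc, rr)]
    return tree_reduce(diffs)


if __name__ == "__main__":
    current = [[1, 2], [3, 4]]    # toy 2x2 blocks, not real video data
    reference = [[0, 2], [5, 1]]
    print(sad(current, reference), ssd(current, reference))
```

The difference stage is fully independent per pixel, while the reduction needs about log2(N) synchronised steps for N pixels, so larger blocks such as 64 × 64 offer more work to amortise the fixed overhead of running on the GPU.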

IET Computers and Digital Techniques | 2017

Image feature extraction algorithm based on CUDA architecture: case study GFD and GCFD

Haythem Bahri; Fatma Ezahra Sayadi; Randa Khemiri; Marwa Chouchene; Mohamed Atri

Optimising the computing time of applications is an increasingly important task in many areas, such as scientific and industrial applications. The graphics processing unit (GPU) is considered one of the most powerful engines for computationally demanding applications since it offers a highly parallel architecture. In this context, the authors introduce an algorithm to optimise the computing time of feature extraction methods for colour images. They choose the generalised Fourier descriptor (GFD) and generalised colour Fourier descriptor (GCFD) models as methods to extract image features for various applications such as real-time colour object recognition or image retrieval. They compare the experimental computing times on the central processing unit and the GPU. They also present a case study of the experimental results for these descriptors on two platforms: an NVIDIA GeForce GT525M and an NVIDIA GeForce GTX480. Their experimental results demonstrate that the execution time can be reduced considerably, by up to 34× for GFD and 56× for GCFD.

Journal of Algorithms & Computational Technology | 2017

Optimization and performance evaluation of graphic processing units for voice processing

Fatma Ezahra Sayadi; Haythem Bahri; Marwa Chouchene; Mohamed Atri

With advances in device technology and parallel architectures, field-programmable gate arrays (FPGAs) can perform speech processing operations well. FPGAs achieve very impressive results, despite their low operating frequency, by fully exploiting parallelism. Nevertheless, recent central processing units and graphics processing units (GPUs) also have inherent features for high performance. In fact, recent GPUs enable dramatic increases in computing performance by harnessing a great number of cores. In this context, we seek to analyze the performance of the linear prediction coding algorithm implemented on two different platforms: one based on the NVIDIA GeForce GTX 480 GPU and another on the Spartan-6 FPGA. Subsequently, we apply several optimization strategies on those platforms. The experimental results highlight the relative strengths and weaknesses of both platforms. The tests show that, for several samples, the GPU achieves speedups of up to 4× compared to the FPGA and around 48× compared to a sequential execution.

Computer and Information Technology | 2014

MatLab acceleration for DWT “Daubechies 9/7” for JPEG2000 standard on GPU

Randa Khemiri; Fatma Ezahra Sayadi; Mohamed Atri; Rached Tourki

The discrete wavelet transform (DWT) has diverse applications in signal and image processing. In this paper, we have implemented the lifting “Cohen-Daubechies-Feauveau 9/7” algorithm on a low-cost NVIDIA GPU (Graphics Processing Unit) with MatLab to achieve a speedup in computation. The efficiency of our GPU-based implementation is measured and compared with CPU-based algorithms. Our experimental results with the GPU show a performance improvement by a factor of over 1.82 compared with the CPU for an image of size 4096×4096 pixels.

International Multi-Conference on Systems, Signals and Devices | 2013

Integral image computation on GPU

Marwa Chouchene; Fatma Ezahra Sayadi; Mohamed Atri; Rached Tourki

In this paper we present an integral image algorithm that can run in real-time on a Graphics Processing Unit (GPU). Our system exploits the parallelism in the computation via the NVIDIA CUDA programming model, which is a software platform for solving non-graphics problems in a massively parallel, high-performance fashion. We compare the performance of the parallel approach running on the GPU with the sequential CPU implementation across a range of image sizes.
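For context, a sequential integral image (summed-area table) can be sketched in a few lines of plain Python. This corresponds to the kind of CPU baseline the abstract mentions, not to the authors' CUDA implementation. The one-element zero padding and the names `integral_image` and `box_sum` are choices made for this example.

```python
def integral_image(image):
    """Return a (h+1) x (w+1) table where ii[y][x] = sum of image[:y][:x]."""
    h, w = len(image), len(image[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # running sum of the current row
        for x in range(w):
            row_sum += image[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of the pixels in rows top..bottom-1 and columns left..right-1."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # toy image
    table = integral_image(img)
    print(box_sum(table, 1, 1, 3, 3))  # sum of the lower-right 2x2 block
```

Once the table is built, any rectangular sum costs four lookups regardless of the rectangle size. The construction itself is a row-wise prefix sum followed by a column-wise accumulation, and both passes are candidates for parallel scan on a GPU.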
