Said Belkouch | Researchain

Publication

Featured research published by Said Belkouch.

2011 Faible Tension Faible Consommation (FTFC) | 2011

Low power and fast DCT architecture using multiplier-less method

M. El Aakif; Said Belkouch; Noureddine Chabini; Moha M'rabet Hassani

In this paper, a low-power and fast DCT (Discrete Cosine Transform) using a multiplier-less method is presented with a new modified FGA (Flow-Graph Algorithm), which is derived from our previously presented FGA of the DCT based on the Loeffler algorithm. The multiplier-less method is based on replacing multiplications with a minimum number of additions and shifts. The proposed FGA is implemented and compared with the previous one. The results of FPGA implementations on an Altera Cyclone II show an increase in the maximum frequency, a decrease in resource usage, and a 7.2% reduction in dynamic power at a clock frequency of 120 MHz with the newly proposed FGA. A further comparison with recently published results confirms the efficiency of the proposed FGA.
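The idea of replacing a constant multiplication with shifts and additions can be sketched as follows. This is a minimal illustration, not the paper's flow graph: it assumes a canonical signed-digit (CSD) recoding of the constant, and the function names `csd_digits` and `shift_add_multiply` are hypothetical.

```python
def csd_digits(n):
    """Return the canonical signed-digit form of a non-negative integer.

    Digits are listed least significant first, each is -1, 0 or +1,
    and no two adjacent digits are both non-zero.
    """
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d           # n is now divisible by 4
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def shift_add_multiply(x, constant):
    """Compute x * constant using only shifts, additions and subtractions."""
    acc = 0
    for shift, d in enumerate(csd_digits(constant)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc
```

In a fixed-point DCT, a coefficient such as cos(pi/4) could be approximated as 181/256, so that a product becomes `shift_add_multiply(x, 181) >> 8`. The 8-bit scaling is an assumption chosen for illustration only.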

international conference on multimedia computing and systems | 2011

VHDL implementation of an optimized 8-point FFT/IFFT processor in pipeline architecture for OFDM systems

Mounir Arioua; Said Belkouch; Mohamed Agdad; Moha M'rabet Hassani

The Fast Fourier Transform (FFT) and its inverse (IFFT) are key components in many communication systems. An optimized implementation of the 8-point FFT processor with the radix-2 algorithm in R2MDC architecture is presented in this paper. The butterfly processing element (PE) used in the 8-point FFT processor reduces the multiplicative complexity by using a real constant multiplication in one method, and eliminates it by using add and shift operations in the other proposed method. The R2MDC pipeline architecture has been implemented with the 8-point module, and simulation results show that this module achieves significantly better performance with lower resource usage.
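To illustrate why an 8-point radix-2 FFT needs only one real constant, the sketch below implements a software decimation-in-time FFT whose twiddle multiplications use only swaps, sign changes and multiplications by 1/sqrt(2). It is a behavioural model under that assumption, not the R2MDC hardware pipeline from the paper; `twiddle8`, `fft8` and `dft_reference` are hypothetical names.

```python
import cmath
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)


def twiddle8(z, k):
    """Multiply z by W8^k = exp(-2j*pi*k/8) for k in 0..3.

    Even k needs no multiplication; odd k needs two real
    multiplications by the single constant 1/sqrt(2).
    """
    a, b = z.real, z.imag
    if k == 0:
        return complex(a, b)
    if k == 1:
        return complex((a + b) * INV_SQRT2, (b - a) * INV_SQRT2)
    if k == 2:
        return complex(b, -a)  # multiplication by -j
    if k == 3:
        return complex((b - a) * INV_SQRT2, -(a + b) * INV_SQRT2)
    raise ValueError("k must be in 0..3")


def fft8(x):
    """Radix-2 decimation-in-time FFT for inputs of length 1, 2, 4 or 8."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft8(x[0::2])
    odd = fft8(x[1::2])
    step = 8 // n  # maps W_n^k onto the W8 twiddle set
    out = [0j] * n
    for k in range(n // 2):
        t = twiddle8(odd[k], k * step)
        out[k] = even[k] + t           # butterfly, upper output
        out[k + n // 2] = even[k] - t  # butterfly, lower output
    return out


def dft_reference(x):
    """Direct O(n^2) DFT used only to check the FFT result."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]
```

In a multiplier-less variant, the two products with `INV_SQRT2` would themselves be replaced by a fixed-point shift-and-add approximation.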

international symposium on visual computing | 2010

Improved implementation of a modified Discrete Cosine Transform on low-cost FPGA

Said Belkouch; M. El Aakif; A. Ait Ouahman; Moha M'rabet Hassani

In this paper, Discrete Cosine Transform hardware implementations are performed using two different modified Loeffler algorithms and are compared with the original one. The arithmetic modifications are presented, and the corresponding algorithms are synthesized and implemented on a low-cost FPGA. The results show a significant increase in the maximum operating frequency with the newly proposed modified Loeffler algorithm.

international conference on multimedia computing and systems | 2011

FPGA implementation of a pipelined 2D-DCT and simplified quantization for real-time applications

Hatim Anas; Said Belkouch; M. El Aakif; Noureddine Chabini

The Discrete Cosine Transform (DCT) is one of the most widely used techniques for image compression. Several algorithms have been proposed to implement the 2D DCT. The scaled SDCT algorithm is an optimization of the 1D DCT that consists of gathering all the multiplications at the end. In this paper, in addition to the hardware implementation on an FPGA, an extended optimization has been performed by merging the multiplications into the quantization block without affecting image quality. Tests in the MATLAB environment have shown that our proposed approach produces images with quality comparable to those obtained using the JPEG standard. FPGA-based implementations of the proposed approach and of Loeffler's algorithm are presented and compared in this paper using an Altera Stratix FPGA family with the Quartus II synthesis and implementation tool. Results show that our approach outperforms the well-known Loeffler's algorithm in terms of processing speed and resource usage.

Multimedia Tools and Applications | 2013

Design optimization of the quantization and a pipelined 2D-DCT for real-time applications

Anas Hatim; Said Belkouch; Mohamed El Aakif; Moha M'rabet Hassani; Noureddine Chabini

The Discrete Cosine Transform (DCT) is one of the most widely used techniques for image compression. Several algorithms have been proposed to implement the 2D DCT. The scaled SDCT algorithm is an optimization of the 1D DCT that consists of gathering all the multiplications at the end. In this paper, in addition to the hardware implementation on an FPGA, an extended optimization has been performed by merging the multiplications into the quantization block without affecting image quality. A simplified quantization has also been performed to keep the performance of the whole chain high. Tests in the MATLAB environment have shown that our proposed approach produces images of nearly the same quality as those obtained using the JPEG standard. An FPGA-based implementation of the proposed approach is presented and compared with other state-of-the-art techniques. The target is an Altera Cyclone II FPGA using the Quartus synthesis tool. Results show that our approach outperforms the others in terms of processing speed, resource usage, and power consumption. A comparison has also been made between this architecture and a distributed-arithmetic-based architecture.
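The merging of the final DCT scaling multiplications into the quantization step can be sketched as follows. This is an illustration under assumptions, not the paper's architecture: an unnormalised DCT-II stands in for the scaled DCT core, its normalisation factors play the role of the multiplications gathered at the end, and all names are hypothetical.

```python
import math

N = 8

# Normalisation factors that turn the unnormalised DCT-II into the
# orthonormal one; they stand in for the "gathered" multiplications.
SCALE = [math.sqrt(1.0 / N)] + [math.sqrt(2.0 / N)] * (N - 1)


def dct_unscaled(v):
    """1D DCT-II without normalisation (stand-in for a scaled DCT core)."""
    return [
        sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def dct2d_unscaled(block):
    """Row-column 2D transform built from the unscaled 1D DCT."""
    rows = [dct_unscaled(r) for r in block]
    cols = [dct_unscaled([rows[i][j] for i in range(N)]) for j in range(N)]
    return [[cols[j][i] for j in range(N)] for i in range(N)]


def quantize_separate(block, q):
    """Scale each coefficient, then divide by the quantization step."""
    y = dct2d_unscaled(block)
    return [
        [round(y[i][j] * SCALE[i] * SCALE[j] / q[i][j]) for j in range(N)]
        for i in range(N)
    ]


def merged_table(q):
    """Fold scaling and division into one precomputed multiplier per coefficient."""
    return [[SCALE[i] * SCALE[j] / q[i][j] for j in range(N)] for i in range(N)]


def quantize_merged(block, m):
    """One multiplication per coefficient replaces scaling plus division."""
    y = dct2d_unscaled(block)
    return [[round(y[i][j] * m[i][j]) for j in range(N)] for i in range(N)]
```

Because the merged table is computed once, the per-block cost drops to a single multiplication per coefficient. The two paths agree up to floating-point rounding at exact half-way values.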

acs/ieee international conference on computer systems and applications | 2015

Area and delay aware approaches for realizing multi-operand addition on FPGAs using two-operand adders

Noureddine Chabini; Said Belkouch

Multi-operand addition is found in many real-life applications. In this paper, we propose two approaches for realizing multi-operand addition using two-operand adders on Field-Programmable Gate Arrays (FPGAs). The proposed approaches reduce the area of the final implementation while reducing its propagation delay. We focus on the case where the operands are of different sizes.
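Why operand order matters when the operands have different sizes can be seen with a small cost model. The sketch below is not one of the paper's two approaches: it assumes unsigned operands, counts area as the total bit-width of the two-operand adders, and compares a left-to-right chain with a Huffman-style narrowest-first pairing. All function names are hypothetical.

```python
import heapq


def chain_cost(widths):
    """Total adder bits when operands are accumulated left to right."""
    acc = widths[0]
    total = 0
    for w in widths[1:]:
        total += max(acc, w)   # adder as wide as its wider input
        acc = max(acc, w) + 1  # the sum may need one extra bit
    return total


def narrowest_first_cost(widths):
    """Total adder bits when the two narrowest values are always added first."""
    heap = list(widths)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += max(a, b)
        heapq.heappush(heap, max(a, b) + 1)
    return total


widths = [16, 4, 4, 8]  # operand sizes in bits, chosen for illustration
print(chain_cost(widths), narrowest_first_cost(widths))  # prints "51 28"
```

Under this simple model, adding the narrow operands first keeps the wide adder out of the early stages. A real FPGA mapping would also account for carry-chain delay and LUT packing, which this sketch ignores.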

international radar conference | 2014

Efficient implementation of a complete multi-beam radar coherent-processing on a telecom SoC

Mounir Bahtat; Said Belkouch; Phillipe Elleaume; Phillipe Le Gall

The processing bottleneck of modern multi-beam radar coherent processing consists of beamforming, Doppler filtering, and the pulse compression stage. Pulse compression is a popular and important radar technique that is known to be computationally expensive; it has therefore mainly been implemented on ASICs or FPGAs because of the real-time and power constraints of many radar applications. Recent advances in multicore DSP architectures allow more flexible processing and higher computational capability while keeping power consumption low. In this paper, we present an efficient implementation of a complete radar coherent-processing chain on a single TI SoC with 10 W power consumption. The main optimization focus was on the pulse compression stage, where we proposed a different implementation approach that optimizes memory usage and parallelizes the processing across cores, resulting in dramatic efficiency gains over conventional implementations. Experiments were done using the TI 6678 EVM and the TI 66AK2H EVM. We were able to implement the whole radar coherent processing of “16 beams, 24 Doppler filters, 16 phased-array sensors and 1024 range cases sampled at 5 MHz” in only 3.2 C66 cores, easily fitting a single TI SoC with 10 W power consumption, making a breakthrough in radar digital designs.

Archive | 2016

Efficient Implementation of Givens QR Decomposition on VLIW DSP Architecture for Orthogonal Matching Pursuit Image Reconstruction

Mohamed Najoui; Anas Hatim; Mounir Bahtat; Said Belkouch

Orthogonal Matching Pursuit (OMP) is one of the most widely used image reconstruction algorithms in compressed sensing (CS). This algorithm can be divided into two main stages: an optimization problem and a least squares problem (LSP). The most complex and time-consuming step of OMP is solving the LSP. QR decomposition is one of the most widely used techniques for solving the LSP in a reduced processing time. In this paper, an efficient and optimized implementation of QR decomposition on the TMS320C6678 floating-point DSP is introduced. A parallel Givens algorithm is designed to make better use of the 2-way set-associative cache. A special data arrangement was adopted to avoid cache misses and allow the use of some intrinsic functions. Our implementation significantly reduces the processing time; it is 6.7 times faster than state-of-the-art implementations. We have achieved a single-core performance of 1.51 GFLOPS with speedups of up to 20x compared with the standard Givens Rotations (GR) algorithm.
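How Givens rotations solve a least squares problem can be sketched in a few lines. This is the plain sequential algorithm, assuming a full-column-rank matrix with at least as many rows as columns. It does not reproduce the parallel ordering or cache-aware data layout described in the paper, and `givens_qr_solve` is a hypothetical name.

```python
import math


def givens_qr_solve(a, b):
    """Solve min ||A x - b||_2 with Givens rotations and back substitution.

    a is an m x n list of rows (m >= n, full column rank), b has length m.
    """
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]  # work on copies
    y = b[:]
    for j in range(n):
        # Zero column j below the diagonal, from the bottom row upwards.
        for i in range(m - 1, j, -1):
            x1, x2 = r[i - 1][j], r[i][j]
            h = math.hypot(x1, x2)
            if h == 0.0:
                continue  # nothing to eliminate
            c, s = x1 / h, x2 / h
            # Apply the rotation [[c, s], [-s, c]] to rows i-1 and i.
            for k in range(j, n):
                u, v = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * u + s * v
                r[i][k] = -s * u + c * v
            y[i - 1], y[i] = c * y[i - 1] + s * y[i], -s * y[i - 1] + c * y[i]
    # The top n x n part of r is now upper triangular: back substitute.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = y[i] - sum(r[i][k] * x[k] for k in range(i + 1, n))
        x[i] = acc / r[i][i]
    return x
```

For example, fitting a line through the points (0, 1), (1, 2), (2, 3) with `givens_qr_solve([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 2.0, 3.0])` recovers an intercept and slope of approximately 1 each.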

international conference on microelectronics | 2014

Fast enumeration-based modulo scheduling heuristic for VLIW architectures

Mounir Bahtat; Said Belkouch; Phillipe Elleaume; Phillipe Le Gall

Modulo scheduling is a software pipelining technique that exploits the instruction-level parallelism (ILP) of VLIW architectures to implement loops efficiently. This paper presents a novel enumeration-based resource-constrained heuristic for modulo scheduling. It takes into consideration the criticality of the nodes, generating near-optimal schedules in terms of initiation intervals and register requirements. The scheduling algorithm outperformed better-known heuristics in terms of schedule quality, with a compilation time short enough for use in a production environment. Experimental results on the VLIW TMS320C6678 DSP processor showed improved performance on a set of signal processing algorithms.

international conference on multimedia computing and systems | 2011

An algorithm for reducing leakage power dissipation in combinational digital designs using dual threshold voltages

Noureddine Chabini; Said Belkouch

For CMOS-based nanometer technology, leakage power dissipation has become an important issue in low-power design. One approach to this problem for timing-constrained digital designs is to use dual threshold voltages. A low threshold voltage is used for computational elements on critical paths to satisfy timing, while a high threshold voltage can be used for the other elements off critical paths to reduce leakage power. The problem of assigning high threshold voltages to reduce leakage power under timing constraints is NP-hard. In this paper, we present an approximate polynomial-time algorithm to address this problem. We also provide a Mixed Integer Linear Program (MILP) that optimally solves the problem for small designs. The proposed approach is compared with existing ones, and experimental results are provided.
