Said Belkouch
Cadi Ayyad University
Publication
Featured research published by Said Belkouch.
2011 Faible Tension Faible Consommation (FTFC) | 2011
M. El Aakif; Said Belkouch; Noureddine Chabini; Moha M'rabet Hassani
In this paper, a low-power, fast DCT (Discrete Cosine Transform) using a multiplier-less method is presented with a new modified FGA (Flow-Graph Algorithm), derived from our previously presented FGA of the DCT based on the Loeffler algorithm. The multiplier-less method replaces multiplications with a minimum number of additions and shifts. The proposed FGA is implemented and compared with the previous one. FPGA implementation results on an Altera Cyclone II show that the new FGA increases the maximum frequency, decreases resource usage, and reduces dynamic power by 7.2% at a clock frequency of 120 MHz. A further comparison with recently published results confirms the efficiency of the proposed FGA.
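As a minimal sketch of the multiplier-less idea described above (not the authors' FGA), the snippet below replaces a multiplication by a fixed-point constant with shifts and additions. The constant 181/256, an 8-bit approximation of cos(π/4), and both function names are assumptions chosen for illustration.

```python
def times_181(x: int) -> int:
    """Multiply by 181 using only shifts and adds (181 = 128 + 32 + 16 + 4 + 1)."""
    return (x << 7) + (x << 5) + (x << 4) + (x << 2) + x


def scale_cos_pi_4(x: int) -> int:
    """Approximate x * cos(pi/4) in fixed point: floor((x * 181) / 256)."""
    return times_181(x) >> 8
```

In hardware, each shift is only wiring, so the cost of this constant multiplication is four adders rather than a general multiplier.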
international conference on multimedia computing and systems | 2011
Mounir Arioua; Said Belkouch; Mohamed Agdad; Moha M'rabet Hassani
The Fast Fourier Transform (FFT) processor and its inverse (IFFT) are key components in many communication systems. An optimized implementation of the 8-point FFT processor with the radix-2 algorithm in the R2MDC architecture is presented in this paper. The butterfly Processing Element (PE) used in the 8-point FFT processor reduces the multiplicative complexity by using a real constant multiplication in one method, and eliminates it by using add and shift operations in the other proposed method. The pipelined R2MDC architecture has been implemented with the 8-point module, and simulation results show that this module achieves significantly better performance with lower resource usage.
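The arithmetic behind an 8-point radix-2 FFT can be sketched in software as follows. This is a plain recursive decimation-in-time version, not the R2MDC pipeline from the paper; the helper `mul_w8_1` is a hypothetical name that illustrates why the twiddle factor W8^1 = (1 - j)/√2 needs only one real constant, 1/√2.

```python
import cmath
import math

C = 1.0 / math.sqrt(2.0)  # the only non-trivial real constant in an 8-point FFT


def mul_w8_1(z: complex) -> complex:
    """Compute z * W8^1 with W8^1 = (1 - j)/sqrt(2), using real multiplications by C only."""
    return complex((z.real + z.imag) * C, (z.imag - z.real) * C)


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor W_n^k
        out[k] = even[k] + t            # butterfly: sum output
        out[k + n // 2] = even[k] - t   # butterfly: difference output
    return out
```

For n = 8 the remaining twiddles are 1 and -j (trivial) and W8^3, which uses the same constant C with different signs.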
international symposium on visual computing | 2010
Said Belkouch; M. El Aakif; A. Ait Ouahman; Moha M'rabet Hassani
In this paper, hardware implementations of the Discrete Cosine Transform are carried out using two different modified Loeffler algorithms and are compared to the original one. The arithmetic modifications are presented, and the corresponding algorithms are synthesized and implemented on a low-cost FPGA. The results show a significant increase in the maximum operating frequency with the newly proposed modified Loeffler algorithm.
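For reference, the transform that any fast algorithm of this kind must reproduce can be written directly from its definition. The sketch below is the textbook orthonormal DCT-II in O(n²) form, not the Loeffler flow graph; the function name is an assumption.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II computed directly from the definition (O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n))
        out.append(scale * acc)
    return out
```

A direct version like this is useful mainly as a golden model when checking the outputs of an optimized 8-point implementation.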
international conference on multimedia computing and systems | 2011
Hatim Anas; Said Belkouch; M. El Aakif; Noureddine Chabini
The Discrete Cosine Transform (DCT) is one of the most widely used techniques for image compression. Several algorithms have been proposed to implement the DCT-2D. The scaled SDCT algorithm is an optimization of the DCT-1D that gathers all the multiplications at the end. In this paper, in addition to the hardware implementation on an FPGA, an extended optimization has been performed by merging the multiplications into the quantization block without affecting image quality. Tests in the MATLAB environment have shown that our proposed approach produces images with quality comparable to those obtained using the JPEG standard. FPGA-based implementations of the proposed approach and of Loeffler's algorithm are presented and compared in this paper, using the Altera Stratix FPGA family with the Quartus II synthesis and implementation tool. Results show that our approach outperforms the well-known Loeffler's algorithm in terms of processing speed and resources used.
Multimedia Tools and Applications | 2013
Anas Hatim; Said Belkouch; Mohamed El Aakif; Moha M'rabet Hassani; Noureddine Chabini
The Discrete Cosine Transform (DCT) is one of the most widely used techniques for image compression. Several algorithms have been proposed to implement the DCT-2D. The scaled SDCT algorithm is an optimization of the DCT-1D that gathers all the multiplications at the end. In this paper, in addition to the hardware implementation on an FPGA, an extended optimization has been performed by merging the multiplications into the quantization block without affecting image quality. A simplified quantization has also been performed to keep the performance of the whole chain high. Tests in the MATLAB environment have shown that our proposed approach produces images with nearly the same quality as those obtained using the JPEG standard. FPGA-based implementations of the proposed approach are presented and compared to other state-of-the-art techniques. The target is an Altera Cyclone II FPGA using the Quartus synthesis tool. Results show that our approach outperforms the others in terms of processing speed, resources used, and power consumption. A comparison has also been made between this architecture and a distributed-arithmetic-based architecture.
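The idea of merging the final DCT multiplications into the quantization block can be illustrated with a small sketch. It assumes the scaled DCT leaves one pending scale factor per coefficient and that quantization divides by a step size; all numeric values and function names below are hypothetical, and exact rational arithmetic is used so that the two paths can be compared without rounding noise.

```python
from fractions import Fraction


def quantize_two_step(y, scale, q):
    """Reference path: apply the final DCT scaling, then divide by the quantizer step."""
    return [round((yi * si) / qi) for yi, si, qi in zip(y, scale, q)]


def merged_constants(scale, q):
    """Precompute one combined constant per coefficient: scale / q."""
    return [si / qi for si, qi in zip(scale, q)]


def quantize_merged(y, merged):
    """Merged path: a single multiplication per coefficient."""
    return [round(yi * mi) for yi, mi in zip(y, merged)]


# Hypothetical unscaled DCT outputs, scale factors, and quantizer steps.
y = [Fraction(520), Fraction(-37), Fraction(12), Fraction(5)]
scale = [Fraction(1, 8), Fraction(3, 16), Fraction(1, 4), Fraction(5, 16)]
q = [Fraction(16), Fraction(11), Fraction(10), Fraction(16)]
```

Because the combined constants can be computed offline, the merged path removes the separate scaling stage from the datapath. In a fixed-point implementation the constants would be rounded, which is where any small quality difference would come from.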
acs/ieee international conference on computer systems and applications | 2015
Noureddine Chabini; Said Belkouch
Multi-operand addition is found in many real-life applications. In this paper, we propose two approaches for realizing multi-operand addition using two-operand adders on Field-Programmable Gate Arrays (FPGAs). The proposed approaches reduce the area of the final implementation while also reducing its propagation delay. We focus on the case where the operands are of different sizes.
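To see why operand sizes matter when building a multi-operand sum from two-operand adders, consider the rough cost model below. It is an illustrative assumption, not one of the paper's two approaches: each adder costs as many bits as its wider input, and its output is one bit wider. The sketch compares a simple left-to-right chain with a strategy that always adds the two narrowest operands first.

```python
import heapq


def chain_cost(widths):
    """Add operands left to right; return (total adder bits, final result width)."""
    cost = 0
    acc = widths[0]
    for w in widths[1:]:
        cost += max(acc, w)      # adder is as wide as its wider input
        acc = max(acc, w) + 1    # one extra bit for the carry out
    return cost, acc


def narrowest_first_cost(widths):
    """Always add the two narrowest operands first; return (adder bits, result width)."""
    heap = list(widths)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        cost += max(a, b)
        heapq.heappush(heap, max(a, b) + 1)
    return cost, heap[0]
```

For operand widths `[16, 4, 4, 8]`, the chain uses 51 adder bits under this model, while the narrowest-first ordering uses 28.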
international radar conference | 2014
Mounir Bahtat; Said Belkouch; Phillipe Elleaume; Phillipe Le Gall
The processing bottleneck of modern multi-beam radar coherent processing consists of the beamforming, the Doppler filtering, and the pulse compression stages. Pulse compression is a popular and important radar technique that is known to be computationally expensive; it has therefore mainly been implemented on ASICs or FPGAs because of the real-time and power constraints of many radar applications. Recent advances in multicore DSP architectures allow more flexible processing and higher computational capability while keeping power consumption low. In this paper, we present an efficient implementation of a complete radar coherent-processing chain on a single TI SoC with 10 W power consumption. The main optimization focus was on the pulse compression stage, where we propose a different implementation approach that optimizes memory usage and parallelizes the processing across multiple cores, resulting in dramatic efficiency gains over conventional implementations. Experiments were carried out using the TI 6678 EVM and the TI 66AK2H EVM. We were able to implement the whole radar coherent processing of “16 beams, 24 Doppler filters, 16 phased-array sensors and 1024 range cases sampled at 5 MHz” in only 3.2 C66 cores, easily fitting a single TI SoC with 10 W power consumption, which is a breakthrough in radar digital design.
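As a reminder of what the pulse compression stage computes, the sketch below correlates a received signal with the transmitted code (a time-domain matched filter). It is a minimal illustration, not the paper's DSP implementation; the Barker-13 code, the noise-free echo, and the function names are assumptions.

```python
BARKER_13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]


def pulse_compress(rx, code):
    """Slide the code over rx and correlate (time-domain matched filter)."""
    n_out = len(rx) - len(code) + 1
    return [sum(rx[n + k] * code[k] for k in range(len(code))) for n in range(n_out)]


def simulate_echo(delay, length=64):
    """Build a noise-free received signal containing one echo at the given delay."""
    rx = [0] * length
    for k, c in enumerate(BARKER_13):
        rx[delay + k] = c
    return rx
```

Calling `pulse_compress(simulate_echo(20), BARKER_13)` yields a sharp peak of height 13 at index 20. The direct form costs one multiply-accumulate per code sample per output, which is why long pulses and many range cells make this stage expensive.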
Archive | 2016
Mohamed Najoui; Anas Hatim; Mounir Bahtat; Said Belkouch
Orthogonal Matching Pursuit (OMP) is one of the most widely used image reconstruction algorithms in compressed sensing (CS). The algorithm can be divided into two main stages: an optimization problem and a least squares problem (LSP). The most complex and time-consuming step of OMP is solving the LSP. QR decomposition is one of the most widely used techniques for solving the LSP in a reduced processing time. In this paper, an efficient and optimized implementation of QR decomposition on the TMS320C6678 floating-point DSP is introduced. A parallel Givens algorithm is designed to make better use of the 2-way set-associative cache. A special data arrangement was adopted to avoid cache misses and to allow the use of some intrinsic functions. Our implementation significantly reduces the processing time; it is 6.7 times faster than state-of-the-art implementations. We achieved a single-core performance of 1.51 GFLOPS, with speedups of up to 20x compared to the standard Givens Rotations (GR) algorithm.
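The least squares step can be sketched with plain Givens rotations as follows. This is a sequential textbook version that assumes a full-column-rank matrix; it does not reflect the parallel, cache-aware arrangement described in the abstract, and the function name is hypothetical.

```python
import math


def givens_lstsq(A, b):
    """Solve min ||Ax - b|| by Givens-rotation QR on the augmented matrix [A | b]."""
    m, n = len(A), len(A[0])
    R = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for j in range(n):
        # Zero column j below the diagonal, working from the bottom row upward.
        for i in range(m - 1, j, -1):
            upper, lower = R[i - 1][j], R[i][j]
            if lower == 0.0:
                continue
            r = math.hypot(upper, lower)
            c, s = upper / r, lower / r
            for k in range(j, n + 1):
                u, v = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * u + s * v
                R[i][k] = -s * u + c * v
    # Back substitution on the upper-triangular n x n block.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = R[i][n] - sum(R[i][k] * x[k] for k in range(i + 1, n))
        x[i] = acc / R[i][i]
    return x
```

Each rotation touches only two rows, which is the property that parallel and cache-friendly variants typically exploit.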
international conference on microelectronics | 2014
Mounir Bahtat; Said Belkouch; Phillipe Elleaume; Phillipe Le Gall
Modulo scheduling is a software pipelining technique that exploits the instruction-level parallelism (ILP) of VLIW architectures to implement loops efficiently. This paper presents a novel enumeration-based, resource-constrained heuristic for modulo scheduling. It takes into consideration the criticality of the nodes, generating near-optimal schedules in terms of initiation intervals and register requirements. The scheduling algorithm outperformed better-known heuristics in terms of schedule quality, while its short compilation time allows it to be used in a production environment. Experimental results on the VLIW TMS320C6678 DSP processor showed improved performance on a set of signal processing algorithms.
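Modulo schedulers usually start from a lower bound on the initiation interval. The sketch below computes that standard bound, the maximum of the resource bound (ResMII) and the recurrence bound (RecMII); it is not the enumeration-based heuristic from the paper. The operation counts and the recurrence are hypothetical inputs, and the unit counts assume two units of each type (L, S, M, D).

```python
import math


def res_mii(op_counts, unit_counts):
    """Resource bound: the busiest unit type limits how often iterations can start."""
    return max(math.ceil(op_counts[u] / unit_counts[u]) for u in op_counts)


def rec_mii(recurrences):
    """Recurrence bound from (total latency, total iteration distance) per dependence cycle."""
    return max((math.ceil(lat / dist) for lat, dist in recurrences), default=1)


def min_initiation_interval(op_counts, unit_counts, recurrences):
    """Lower bound on the initiation interval of any valid modulo schedule."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))


# Hypothetical loop body: operations per unit type, two units of each type.
OPS = {"L": 3, "S": 2, "M": 5, "D": 4}
UNITS = {"L": 2, "S": 2, "M": 2, "D": 2}
```

A heuristic scheduler typically tries this bound first and increases the initiation interval only if no valid schedule is found.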
international conference on multimedia computing and systems | 2011
Noureddine Chabini; Said Belkouch
For CMOS-based nanometer technology, leakage power dissipation has become an important issue in low-power design. One approach to this problem for timing-constrained digital designs is to use dual threshold voltages. A low threshold voltage is used for computational elements on critical paths to satisfy timing, while a high threshold voltage can be used for the elements off the critical paths to reduce leakage power. Assigning high threshold voltages to reduce leakage power under timing constraints is an NP-hard problem. In this paper, we present an approximate polynomial-time algorithm to address this problem. We also provide a Mixed Integer Linear Program (MILP) that solves the problem optimally for small designs. The proposed approach is compared with existing ones, and experimental results are provided.
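The trade-off described above can be illustrated with a simple greedy heuristic. This is a generic sketch, not the paper's approximation algorithm or its MILP: each gate is tentatively switched to the high threshold voltage, largest leakage saving first, and the switch is kept only if the longest path still meets the timing constraint. The circuit, delays, and savings are hypothetical.

```python
def critical_path(order, preds, delay):
    """Longest path delay through a DAG whose gates are listed in topological order."""
    arrival = {}
    for g in order:
        arrival[g] = delay[g] + max((arrival[p] for p in preds[g]), default=0)
    return max(arrival.values())


def greedy_high_vt(order, preds, low_delay, high_delay, leak_saving, t_max):
    """Return the set of gates moved to high Vt without violating t_max."""
    delay = dict(low_delay)
    high = set()
    for g in sorted(order, key=lambda gate: leak_saving[gate], reverse=True):
        delay[g] = high_delay[g]            # tentatively slow the gate down
        if critical_path(order, preds, delay) <= t_max:
            high.add(g)                     # timing still met: keep high Vt
        else:
            delay[g] = low_delay[g]         # timing violated: revert to low Vt
    return high


# Hypothetical four-gate circuit: a -> c -> d and b -> d.
ORDER = ["a", "b", "c", "d"]
PREDS = {"a": [], "b": [], "c": ["a"], "d": ["b", "c"]}
LOW = {"a": 2, "b": 1, "c": 2, "d": 1}
HIGH = {g: LOW[g] + 1 for g in ORDER}
SAVING = {"a": 1, "b": 3, "c": 1, "d": 1}
```

With a timing constraint of 5, only gate `b`, which lies off the critical path a, c, d, can be moved to the high threshold voltage in this example.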