Yousef Baroud
University of Stuttgart
Publications
Featured research published by Yousef Baroud.
Field-Programmable Technology | 2012
Michael Klaiber; Lars Rockstroh; Zhe Wang; Yousef Baroud; Sven Simon
In classical connected component labeling algorithms, the image has to be scanned twice, and the amount of memory required is at least as large as that needed to store a full image. By using single-pass connected component labeling algorithms, the memory requirement can be reduced by one order of magnitude to only a single image row. This memory reduction, which avoids the need for high-bandwidth external memory, is essential for a hardware-efficient implementation on FPGAs. Mapped one-to-one to hardware resources on FPGAs, these single-pass algorithms can process only one pixel per clock cycle in the best case. To enhance performance, a scalable, parallel, memory-efficient single-pass algorithm for connected component labeling is proposed. For typical image sizes, the algorithm reduces the amount of memory required by the hardware architecture by a factor of 100 or more compared to a recently proposed parallel connected component labeling algorithm. The architecture is also able to process an image stream with high throughput without buffering a full image.
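For readers unfamiliar with the single-pass idea, the following minimal Python sketch shows how a buffer holding only the previous image row, plus a label equivalence table, suffices to count components in one scan. This is an editor's illustration of the general principle, not the paper's FPGA architecture; the union-find structure and 4-connectivity are common choices assumed here.

```python
def label_components(image):
    """Count connected foreground components in a binary image (rows of 0/1)."""
    parent = {}                                # union-find over label ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    next_label = 0
    prev_row = []                              # only the previous row is kept
    for row in image:
        cur_row = []
        for x, pixel in enumerate(row):
            if not pixel:
                cur_row.append(None)
                continue
            left = cur_row[x - 1] if x > 0 else None
            above = prev_row[x] if x < len(prev_row) else None
            if left is None and above is None:
                parent[next_label] = next_label    # start a new component
                cur_row.append(next_label)
                next_label += 1
            else:
                lbl = left if left is not None else above
                cur_row.append(lbl)
                if left is not None and above is not None:
                    ra, rb = find(left), find(above)
                    if ra != rb:
                        parent[rb] = ra        # two components meet: merge
        prev_row = cur_row
    return len({find(l) for l in parent})

img = [[1, 1, 0, 1],
       [0, 1, 0, 1],
       [0, 0, 0, 1]]
print(label_components(img))                   # -> 2
```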
Field-Programmable Technology | 2013
Michael Klaiber; Donald G. Bailey; Silvia Ahmed; Yousef Baroud; Sven Simon
A memory-efficient architecture for single-pass connected components analysis suited for high-throughput embedded image processing systems is proposed; it achieves high throughput by partitioning the image into several vertical slices processed in parallel. The low latency of the architecture allows labels associated with image objects to be reused, which reduces the amount of memory by a factor of more than 5 compared to previous work. This is significant, since memory is a critical resource in embedded image processing on FPGAs.
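A rough software illustration of the vertical-slice partitioning: each slice is labeled independently (the step that runs in parallel in hardware), and components touching a slice boundary are merged afterwards. The flood-fill labeler and the boundary merge below are illustrative stand-ins, not the paper's architecture.

```python
from collections import deque

def label_slice(rows):
    """Flood-fill labeling of one binary slice; returns label matrix, count."""
    h, w = len(rows), len(rows[0])
    lab = [[0] * w for _ in range(h)]
    n = 0
    for y in range(h):
        for x in range(w):
            if rows[y][x] and not lab[y][x]:
                n += 1
                lab[y][x] = n
                q = deque([(y, x)])
                while q:
                    cy, cx = q.popleft()
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and rows[ny][nx] and not lab[ny][nx]:
                            lab[ny][nx] = n
                            q.append((ny, nx))
    return lab, n

def sliced_component_count(img, nslices):
    h, w = len(img), len(img[0])
    bounds = [i * w // nslices for i in range(nslices + 1)]
    slices, offset = [], 0
    for i in range(nslices):                   # independent per-slice labeling
        lab, n = label_slice([row[bounds[i]:bounds[i+1]] for row in img])
        slices.append((lab, offset))           # offset makes labels globally unique
        offset += n
    parent = list(range(offset + 1))           # union-find over global labels

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(nslices - 1):               # merge across slice boundaries
        llab, loff = slices[i]
        rlab, roff = slices[i + 1]
        lw = bounds[i + 1] - bounds[i]
        for y in range(h):
            l, r = llab[y][lw - 1], rlab[y][0]
            if l and r:
                parent[find(r + roff)] = find(l + loff)
    return len({find(l + off) for lab, off in slices
                for row in lab for l in row if l})

img = [[1, 0, 1, 1],
       [1, 0, 0, 1],
       [1, 1, 1, 1]]
print(sliced_component_count(img, 2))          # -> 1
```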
International Symposium on Parallel and Distributed Processing and Applications | 2015
Seyyed Mahdi Najmabadi; Zhe Wang; Yousef Baroud; Sven Simon
In this paper, two new hardware-based entropy coding architectures for asymmetric numeral systems are introduced; entropy encoding is one of the major phases of a compression algorithm. The proposed architectures are based on tabled asymmetric numeral systems (tANS), which combines the speed advantage of table-based approaches (e.g., Huffman encoding) with the higher compression rate of arithmetic encoding. Both architectures have been synthesized for a state-of-the-art FPGA, and the synthesis results show high encoding throughput. The architectures are capable of encoding one symbol per clock cycle; their performance depends on the number of symbols in the alphabet and varies from 146 to 290 mega symbols per second (Msps).
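To make the tANS mechanism concrete, here is a minimal software encoder/decoder. Coding reduces to one table lookup plus a bit-shift renormalization per symbol, which is what makes a one-symbol-per-clock-cycle hardware pipeline plausible. The table size L = 16, the symbol frequencies, and the spread rule are illustrative choices, not the paper's parameters.

```python
L = 16                                     # table size (a power of two)
freqs = {'a': 8, 'b': 5, 'c': 3}           # symbol counts, must sum to L

# Spread the symbols over the L table slots (step coprime to L, no collisions).
slots, pos = [None] * L, 0
step = (L * 5 // 8) + 3
for s, f in freqs.items():
    for _ in range(f):
        slots[pos] = s
        pos = (pos + step) % L

# Build the coding tables. Encoding maps (symbol s, x' in [f_s, 2*f_s)) to a
# state in [L, 2L); decoding inverts that mapping.
encode_tab = {s: {} for s in freqs}
decode_tab = [None] * L
seen = {s: 0 for s in freqs}
for p, s in enumerate(slots):
    xp = freqs[s] + seen[s]
    encode_tab[s][xp] = L + p
    decode_tab[p] = (s, xp)
    seen[s] += 1

def encode(msg):
    x, bits = L, []
    for s in msg:
        while x >= 2 * freqs[s]:           # renormalize: shift low bits out
            bits.append(x & 1)
            x >>= 1
        x = encode_tab[s][x]               # one table lookup per symbol
    return x, bits

def decode(x, bits, n):
    out = []
    for _ in range(n):
        s, x = decode_tab[x - L]           # one table lookup per symbol
        out.append(s)
        while x < L:                       # renormalize: shift bits back in
            x = (x << 1) | bits.pop()
    return ''.join(reversed(out))          # tANS decodes in reverse order

x, bits = encode('abacab')
print(decode(x, bits, 6))                  # -> 'abacab'
```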
IEEE Transactions on Circuits and Systems for Video Technology | 2016
Michael Klaiber; Donald G. Bailey; Yousef Baroud; Sven Simon
A resource-efficient hardware architecture for connected component analysis (CCA) of streamed video data is presented, which reduces the required hardware resources, especially for larger image widths. On-chip memory requirements increase with image width and dominate the resources of state-of-the-art single-pass CCA hardware architectures, so a reduction in on-chip memory is essential to meet the ever-increasing image sizes of high-definition (HD) and ultra-HD standards. The proposed architecture is resource-efficient due to several innovations. An improved label recycling scheme detects the last pixel of an image object in the video stream only a few clock cycles after its occurrence, allowing the label to be reused in the following image row. The coordinated application of these techniques leads to memory savings of more than two orders of magnitude compared with classical two-pass connected component labeling architectures. Compared with the most memory-efficient state-of-the-art single-pass CCA hardware architecture, 42% or more of the on-chip memory resources are saved, depending on the features extracted. These savings make it possible to process video streams of larger image sizes, to use a smaller and more energy-efficient field-programmable gate array device, or to increase the functionality of existing image processing pipelines in reconfigurable computing and embedded systems.
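The label recycling idea can be sketched in software: after each streamed row, any label whose component no longer touches that row is finished, so its id returns to a free pool and can be reused a row later. The number of labels needed then scales with the objects crossing one row, not the objects in the whole image. The hardware detects the last pixel within a few clock cycles; the per-row cleanup below is only a functional illustration of the principle.

```python
def stream_count(image):
    """Count objects in a streamed binary image while recycling label ids."""
    parent, free, finished = {}, [], 0     # union-find, free ids, done objects
    next_id = 0
    prev = []

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for row in image:
        cur = []
        for x, px in enumerate(row):
            if not px:
                cur.append(None)
                continue
            left = cur[x - 1] if x else None
            above = prev[x] if x < len(prev) else None
            if left is None and above is None:
                if free:
                    lbl = free.pop()       # reuse a finished object's label
                else:
                    lbl = next_id
                    next_id += 1
                parent[lbl] = lbl
                cur.append(lbl)
            else:
                lbl = left if left is not None else above
                cur.append(lbl)
                if left is not None and above is not None:
                    parent[find(above)] = find(left)
        cur = [find(l) if l is not None else None for l in cur]
        alive = {l for l in cur if l is not None}
        finished += len({find(l) for l in parent} - alive)  # objects that ended
        free.extend(l for l in parent if l not in alive)    # recycle their ids
        parent = {r: r for r in alive}
        prev = cur
    return finished + len(parent)

img = [[1, 0, 1],
       [0, 0, 0],
       [1, 0, 0]]
print(stream_count(img))       # -> 3 objects, yet only 2 label ids were used
```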
Computational Science and Engineering | 2013
Seyyed Mahdi Najmabadi; Michael Klaiber; Zhe Wang; Yousef Baroud; Sven Simon
High-performance image analytics is an important challenge in big data processing, as image and video data form a huge portion of big data, e.g., generated by the tremendous number of image sensors worldwide. This paper presents a case study in image analytics, namely parallel connected component labeling (CCL), which is one of the first steps of image analytics in general. It is shown that a high-performance CCL implementation can be obtained on a heterogeneous platform if parts of the algorithm are processed simultaneously on a fine-grained parallel field-programmable gate array (FPGA) and a multi-core processor. The proposed architecture and implementation are highly efficient, suitable for processing big image and video data in motion, and significantly reduce the amount of memory required by the hardware architecture for typical image sizes.
International Conference on Systems, Signals and Image Processing | 2015
Zhe Wang; Sven Simon; Yousef Baroud; Seyyed Mahdi Najmabadi
A visually lossless image encoding extension for JPEG is presented. Such an extension enables an efficient implementation of perceptual coding by reusing existing widespread software libraries and hardware IP cores for JPEG. For any pixel in a decoded image, the proposed algorithm guarantees a maximum distortion bounded by the just-noticeable distortion (JND) threshold measured on the input image. Perceptual coding is performed in three steps: (1) standard transform-domain coding, (2) spatial-domain distortion visibility analysis using a JND model, and (3) spatial-domain residual coding. This scheme has been implemented in this work as an extension for JPEG based on a low-complexity JND model. The encoder determines whether a pixel block in a standard JPEG output image contains distortions beyond the visibility threshold given by the JND model; if so, the locations and values of these distortions are encoded as side information. The quantization step size for the distortion values, i.e., the perceptual residuals, is chosen based on the visibility threshold. Experimental results show that, in terms of compression efficiency, the proposed perceptual encoding extension outperforms the standard JPEG encoder by 50% for visually lossless compression of images.
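A toy version of steps (2) and (3): compare the decoded block against the original, flag pixels whose error exceeds a JND threshold, and transmit coarsely quantized residuals for exactly those pixels. The flat threshold and quantizer step below are illustrative stand-ins for the paper's JND model; any lossy codec's output could play the role of the decoded block.

```python
import numpy as np

def residual_side_info(orig, decoded, jnd=4, q_step=4):
    """Return (positions, quantized residuals) for visible distortions."""
    err = orig.astype(int) - decoded.astype(int)
    visible = np.abs(err) > jnd                        # distortion beyond JND?
    pos = np.argwhere(visible)                         # locations to signal
    res = np.round(err[visible] / q_step).astype(int)  # coarse residual values
    return pos, res

def apply_residuals(decoded, pos, res, q_step=4):
    """Decoder side: add the signaled residuals back onto the decoded block."""
    out = decoded.astype(int).copy()
    out[tuple(pos.T)] += res * q_step
    return out

rng = np.random.default_rng(0)
orig = rng.integers(0, 256, (8, 8), dtype=np.uint8)    # toy 8x8 block
decoded = np.clip(orig.astype(int) + rng.integers(-9, 10, (8, 8)), 0, 255)

pos, res = residual_side_info(orig, decoded)
fixed = apply_residuals(decoded, pos, res)
print(np.abs(orig.astype(int) - fixed).max())  # error now within the toy JND bound
```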
Signal Processing: Algorithms, Architectures, Arrangements, and Applications | 2016
Yousef Baroud; José Manuel Mariños Velarde; Sven Simon
Parallel decoding of encoded data streams is an important scheme that is widely applied in image and video applications. Especially with the increasing demand for higher resolutions and frame rates, parallel decoding becomes very useful for meeting throughput requirements. It is conventionally enabled by inserting markers into the variable-length code (VLC) stream, which allow the sub-streams to be separated easily and processed in parallel. The use of markers, however, adversely affects the compression, making it unfavorable especially when the bandwidth of the channel is almost fully utilized. In previous work, we proposed a proof-of-concept marker-free architecture that enables parallel decoding, demonstrating the feasibility of the approach. In this paper, an efficient marker-free architecture is proposed, and its memory requirements and performance are thoroughly analyzed. In contrast to marker-based approaches, the architecture preserves the original compression factor of the system. Experimental results show that the architecture can readily be configured to drive up to 20 parallel decoders on a mid-range FPGA at high clock rates. In addition, the hardware resources in terms of LUTs, registers, and block RAMs are reported, showing that the resources tend to scale linearly with the number of decoders.
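For context, here is a sketch of the conventional marker-based scheme that the marker-free architecture avoids: byte-aligned markers are inserted every n symbols so that a splitter can hand independent sub-streams to parallel decoders, at the cost of the marker and padding bytes. The toy prefix code and marker value are assumptions for illustration (a real scheme must also guarantee the marker cannot occur in the payload and signal each chunk's symbol count).

```python
CODE = {'a': '0', 'b': '10', 'c': '11'}        # toy prefix (VLC) code
MARKER = b'\xff\x00'                           # assumed not to occur in payload

def encode_with_markers(msg, n):
    """Encode msg, inserting a byte-aligned marker every n symbols."""
    parts = []
    for i in range(0, len(msg), n):
        bits = ''.join(CODE[s] for s in msg[i:i + n])
        bits += '0' * (-len(bits) % 8)         # pad each chunk to a byte boundary
        parts.append(bytes(int(bits[j:j + 8], 2)
                           for j in range(0, len(bits), 8)))
    return MARKER.join(parts)

stream = encode_with_markers('abcabcabcabc', 4)
substreams = stream.split(MARKER)              # independently decodable pieces
print(len(substreams), 'sub-streams,', len(stream), 'bytes incl. marker overhead')
```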
Field-Programmable Custom Computing Machines | 2016
Seyyed Mahdi Najmabadi; Zhe Wang; Yousef Baroud; Sven Simon
Online compression of I/O data streams in custom computing machines enhances the effective network bandwidth of computing systems as well as storage bandwidth and capacity. In this paper, a self-adaptive, dynamically and partially reconfigurable architecture for online compression is proposed. The architecture opens new possibilities in online compression due to its adaptivity to dynamic environments; in this study, network traffic and input data distribution are considered as two such dynamic behaviors. The degree of improvement provided by the architecture depends on the data distribution, the bandwidth, and the available resources. Our analysis shows an improvement of up to 20% in compression ratio in comparison to non-adaptive approaches.
International Conference on FPGA Reconfiguration for General-Purpose Computing (FPGA4GPC) | 2016
Seyyed Mahdi Najmabadi; Zhe Wang; Yousef Baroud; Sven Simon
Online compression of I/O data streams in general-purpose computing enhances the effective I/O bandwidth of processors and the bandwidth of the computer network, as well as the storage capacity and the read/write performance of the storage. In this paper, a self-adaptive, dynamically and partially reconfigurable architecture for the online compression of data streams is introduced. The architecture opens new possibilities in online compression due to its adaptivity to factors such as the current data bandwidth, the data statistics, and the level of available resources. It consists of multiple partially reconfigurable regions that are reconfigured dynamically at run time with suitable compression or decompression IP cores based on these factors. The main goal of adaptive online compression of the data stream is to provide maximum decompression throughput; the degree of improvement depends on the network throughput and the available resources. Our analysis shows an improvement of up to 40% in decompression throughput in comparison to non-adaptive approaches.
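The control idea behind both papers can be sketched as a codec selector: monitor link load and data statistics, then swap in whichever codec maximizes effective throughput (on the FPGA, by loading a partial bitstream into a reconfigurable region; here, by swapping a function). The codecs and thresholds below are placeholders, not the papers' IP cores or decision rules.

```python
import zlib, lzma

def pick_codec(link_mbps, entropy_bits_per_byte):
    """Select a codec for the current link load and data statistics."""
    # ample bandwidth or near-incompressible data: bypass compression
    if link_mbps > 900 or entropy_bits_per_byte > 7.5:
        return 'raw', lambda d: d
    if link_mbps > 100:                        # moderate pressure: fast codec
        return 'zlib-1', lambda d: zlib.compress(d, 1)
    return 'lzma', lzma.compress               # starved link: strong codec

name, codec = pick_codec(link_mbps=80, entropy_bits_per_byte=5.0)
payload = b'abcd' * 10_000
print(name, len(payload), '->', len(codec(payload)))
```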
Laser Beam Shaping XVIII | 2018
Andreas Faulhaber; Tobias Haist; Wolfgang Osten; Stefan Haberl; Marc Gronle; Yousef Baroud; Sven Simon
Measurement systems for metrology based on laser triangulation suffer from speckle noise, an effect of the coherence of the laser light in combination with projection onto rough object surfaces. In this contribution, we show results of using spatial light modulators in a simple and effective method to reduce speckle noise in laser-based triangulation. In recent decades, spatial light modulators have been used intensively for different applications in optical measurement systems; today, these elements offer resolutions high enough even for simple holographic applications. We generate dynamic holograms with a pixelated spatial light modulator by inscribing multiple holograms. The laser-illuminated holograms microscopically translate the measuring point in the object plane, and due to the minutely different spot positions, the speckle patterns also change. By averaging the intensity field in the camera plane, the speckle noise can be reduced and the accuracy of the spot position measurement is increased. Furthermore, experimental measurements show the capability of correcting spot deformations due to optical system aberrations such as defocus, astigmatism, and coma.
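A short numerical sketch of why the averaging works: each microscopic spot translation yields an approximately independent speckle pattern, so averaging M exposures reduces the speckle contrast by about 1/sqrt(M). Fully developed speckle with exponential intensity statistics is a textbook model, not the paper's optical setup, and each micro-shift is modeled simply as an independent realization.

```python
import numpy as np

rng = np.random.default_rng(1)
M, npix = 8, 10_000

def exposure():
    # one fully developed speckle realization over a uniformly lit patch;
    # each microscopic spot shift is modeled as an independent pattern
    return rng.exponential(1.0, npix)

single = exposure()
averaged = np.mean([exposure() for _ in range(M)], axis=0)

contrast = lambda I: I.std() / I.mean()          # speckle contrast C
print(f'C single   = {contrast(single):.2f}')    # ~1.0 for exponential stats
print(f'C averaged = {contrast(averaged):.2f}')  # ~1/sqrt(8), about 0.35
```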