
Publication


Featured research published by Michael Klaiber.


Field-Programmable Technology | 2012

A memory-efficient parallel single pass architecture for connected component labeling of streamed images

Michael Klaiber; Lars Rockstroh; Zhe Wang; Yousef Baroud; Sven Simon

In classical connected component labeling algorithms, the image has to be scanned twice. The amount of memory required for these algorithms is at least as high as for storing a full image. Single-pass connected component labeling algorithms reduce the memory requirement by an order of magnitude to only a single image row. This memory reduction, which avoids the need for high-bandwidth external memory, is essential for a hardware-efficient implementation on FPGAs. Mapped one-to-one to hardware resources on FPGAs, these single-pass algorithms can process at best one pixel per clock cycle. To enhance the performance, a scalable, parallel, memory-efficient single-pass algorithm for connected component labeling is proposed. For typical image sizes, the algorithm reduces the amount of memory required by the hardware architecture by a factor of 100 or more compared to a recently proposed parallel connected component labeling algorithm. The architecture can also process an image stream at high throughput without buffering a full image.
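The core idea of keeping only one image row can be illustrated in software. The following is a minimal sketch (assumed for illustration, not the paper's hardware architecture): a streamed binary image is labeled in a single pass, holding only the previous row of labels plus a small union-find table for label equivalences.

```python
# Hypothetical sketch of single-pass connected component labeling on a
# streamed binary image: only one row of labels is kept in memory.

def label_stream(rows):
    """Count 4-connected components, seeing each row only once."""
    parent = {}                      # union-find over provisional labels

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    next_label = 1
    prev = []                        # labels of the previous row only
    for row in rows:
        cur = []
        for x, px in enumerate(row):
            if not px:
                cur.append(0)
                continue
            left = cur[x - 1] if x > 0 else 0
            up = prev[x] if x < len(prev) else 0
            if left and up:
                union(left, up)      # two provisional labels meet
                cur.append(left)
            elif left or up:
                cur.append(left or up)
            else:
                parent[next_label] = next_label
                cur.append(next_label)
                next_label += 1
        prev = cur                   # discard all older rows
    return len({find(l) for l in parent})

image = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]
print(label_stream(image))  # 3 components
```

The hardware version additionally resolves the union-find lookups within fixed latency; the sketch only shows why a single row buffer suffices.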


Field-Programmable Technology | 2013

A high-throughput FPGA architecture for parallel connected components analysis based on label reuse

Michael Klaiber; Donald G. Bailey; Silvia Ahmed; Yousef Baroud; Sven Simon

A memory-efficient architecture for single-pass connected components analysis suited for high-throughput embedded image processing systems is proposed, which achieves a high throughput by partitioning the image into several vertical slices processed in parallel. The low latency of the architecture allows reuse of labels associated with the image objects. This reduces the amount of memory by a factor of more than 5 compared to previous work, which is significant since memory is a critical resource in embedded image processing on FPGAs.


Field Programmable Logic and Applications | 2014

Adaptive Dynamic On-Chip Memory Management for FPGA-Based Reconfigurable Architectures

Ghada Dessouky; Michael Klaiber; Donald G. Bailey; Sven Simon

In this paper, an adaptive architecture for dynamic management and allocation of on-chip FPGA Block Random Access Memory (BRAM) resources is presented. It facilitates the dynamic sharing of valuable and scarce on-chip memory among several processing elements (PEs) according to their run-time memory requirements. Real-time applications are becoming increasingly dynamic, which leads to unexpected and variable memory footprints; static allocation for the worst-case memory requirements would result in costly overheads and inefficient memory utilization. The proposed scalable BRAM memory management architecture adaptively manages these dynamic memory requirements and balances the buffer memory over several PEs to reduce the total memory required compared to the worst-case memory footprint of all PEs. The run-time adaptive system allocates BRAM to each PE sufficiently fast, as required and utilized. In a case study, a significant improvement in BRAM utilization with limited overhead was achieved with the adaptive memory management architecture. The proposed system supports different BRAM types and configurations as well as automated dynamic allocation and deallocation of BRAM resources, and is therefore well suited for the dynamic memory footprints of FPGA-based reconfigurable architectures.
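Why dynamic sharing beats static worst-case reservation can be shown with a toy pool model. The sketch below is an assumed illustration, not the paper's architecture: a pool of fixed-size blocks is shared among PEs, so two PEs whose worst cases never coincide can live in a pool smaller than the sum of their worst cases.

```python
# Illustrative model (assumed, not the proposed hardware): a shared pool
# of fixed-size memory blocks allocated to processing elements on demand,
# instead of statically reserving each PE's worst-case footprint.

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.owned = {}              # pe -> list of block ids it holds

    def alloc(self, pe, n):
        """Give PE `n` more blocks; fail if the pool is exhausted."""
        if len(self.free) < n:
            return False
        self.owned.setdefault(pe, []).extend(
            self.free.pop() for _ in range(n))
        return True

    def release(self, pe, n):
        """Return `n` of PE's blocks to the shared pool."""
        for _ in range(n):
            self.free.append(self.owned[pe].pop())

# Two PEs with worst cases of 8 blocks each, which never peak at the
# same time, can share a 10-block pool instead of 16 static blocks.
pool = BlockPool(10)
assert pool.alloc("pe0", 8)          # pe0 peaks first
pool.release("pe0", 6)               # pe0's demand drops
assert pool.alloc("pe1", 8)          # pe1 can now peak as well
```

A real BRAM manager must do this allocation within a bounded number of cycles; the model only captures the utilization argument.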


IEEE Transactions on Circuits and Systems for Video Technology | 2016

A Resource-Efficient Hardware Architecture for Connected Component Analysis

Michael Klaiber; Donald G. Bailey; Yousef Baroud; Sven Simon

A resource-efficient hardware architecture for connected component analysis (CCA) of streamed video data is presented, which reduces the required hardware resources, especially for larger image widths. On-chip memory requirements increase with image width and dominate the resources of state-of-the-art CCA single-pass hardware architectures. A reduction in on-chip memory resources is essential to meet the ever-increasing image sizes of high-definition (HD) and ultra-HD standards. The proposed architecture is resource efficient due to several innovations. An improved label recycling scheme detects the last pixel of an image object in the video stream only a few clock cycles after its occurrence, allowing the reuse of a label in the following image row. The coordinated application of these techniques leads to significant memory savings of more than two orders of magnitude compared with classical two-pass connected component labeling architectures. Compared with the most memory-efficient state-of-the-art single-pass CCA hardware architecture, 42% or more of on-chip memory resources are saved, depending on the features extracted. Based on these savings, it is possible to realize an architecture processing video streams of larger image sizes, to use a smaller and more energy-efficient field-programmable gate array device, or to increase the functionality of already existing image processing pipelines in reconfigurable computing and embedded systems.


Computational Science and Engineering | 2013

Stream Processing of Scientific Big Data on Heterogeneous Platforms -- Image Analytics on Big Data in Motion

Seyyed Mahdi Najmabadi; Michael Klaiber; Zhe Wang; Yousef Baroud; Sven Simon

High-performance image analytics is an important challenge for big data processing, as image and video data form a huge portion of big data, e.g., generated by the tremendous number of image sensors worldwide. This paper presents a case study for image analytics, namely parallel connected component labeling (CCL), which is one of the first steps of image analytics in general. It is shown that a high-performance CCL implementation can be obtained on a heterogeneous platform if parts of the algorithm are processed on a fine-grain parallel field-programmable gate array (FPGA) and a multi-core processor simultaneously. The proposed highly efficient architecture and implementation is suitable for the processing of big image and video data in motion and significantly reduces the amount of memory required by the hardware architecture for typical image sizes.


Journal of Real-time Image Processing | 2016

A single-cycle parallel multi-slice connected components analysis hardware architecture

Michael Klaiber; Donald G. Bailey; Sven Simon

In this paper, a memory-efficient architecture for single-pass connected components analysis suited for high-throughput embedded image processing systems is proposed, which achieves a speedup by partitioning the image into slices. Although global data dependencies of image segments spanning several image slices exist, a temporally and spatially local algorithm is proposed, together with a suitable FPGA hardware architecture processing pixel data at low latency. The low latency of the proposed architecture allows reuse of labels associated with the image objects. This reduces the amount of memory by a factor of more than 5 in the considered implementations, which is a significant contribution since memory is a critical resource in embedded image processing on FPGAs. Therefore, a significantly higher bandwidth of pixel data can be processed with this architecture compared to state-of-the-art architectures using the same amount of hardware resources.
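The cross-slice dependency the abstract mentions can be pictured with a small merge step. This is a hypothetical sketch, not the paper's design: each vertical slice is labeled independently, and labels that touch across a slice boundary are afterwards unified with a union-find table.

```python
# Illustrative merge step (assumed code): unify per-slice labels that are
# horizontally adjacent across a slice boundary.

def merge_boundary(left_col, right_col, parent):
    """left_col/right_col: label columns at the boundary (0 = background).
    Labels of touching foreground pixels are unioned in `parent`."""
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for l, r in zip(left_col, right_col):
        if l and r:                  # both foreground: same image object
            ra, rb = find(l), find(r)
            if ra != rb:
                parent[rb] = ra

# Rightmost column of slice 0 and leftmost column of slice 1.
parent = {1: 1, 2: 2, 3: 3, 4: 4}
merge_boundary([1, 0, 2], [3, 3, 4], parent)
# labels 1/3 and 2/4 now belong to the same objects
```

In the actual architecture this reconciliation happens at low latency while the slices stream in parallel; the sketch only shows the data dependency being resolved.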


Image and Vision Computing New Zealand | 2013

Efficient hardware calculation of running statistics

Donald G. Bailey; Michael Klaiber

Calculation of the mean, variance, and standard deviation is often required for segmentation or feature extraction. In image processing, an integer approximation is often adequate. Conventional methods require division and square root operations, which are expensive to realize in hardware in terms of both the amount of required resources and latency. A new class of iterative algorithms is developed based on integer arithmetic. An implementation of the algorithms as a hardware architecture for a field-programmable gate array (FPGA) is compared with architectures using the conventional approach, showing a significantly reduced latency while using fewer hardware resources.
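The flavor of a division-free running statistic can be sketched as follows. This is a minimal illustration in the spirit of the paper, not the authors' exact algorithm: the integer mean is maintained together with a remainder so that the invariant sum == count * mean + rem holds, using only additions and comparisons.

```python
# Assumed sketch of a division-free running integer mean:
# invariant after each update: sum == count * mean + rem, 0 <= rem < count.

class RunningMean:
    def __init__(self):
        self.count = 0
        self.mean = 0                # floor of the true mean
        self.rem = 0

    def add(self, x):
        self.count += 1
        self.rem += x - self.mean
        # Re-establish the invariant with repeated add/subtract only
        # (a hardware version would bound these correction steps).
        while self.rem >= self.count:
            self.mean += 1
            self.rem -= self.count
        while self.rem < 0:
            self.mean -= 1
            self.rem += self.count

rm = RunningMean()
for v in [4, 7, 13, 16]:
    rm.add(v)
print(rm.mean)  # 10, since (4 + 7 + 13 + 16) // 4 == 10
```

The variance can be tracked the same way over squared deviations; the point is that no divider or square-root unit is needed in the update path.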


Proceedings of SPIE | 2012

Parallel hardware architecture for JPEG-LS based on domain decomposition

Silvia Ahmed; Zhe Wang; Michael Klaiber; S. Wahl; Marek Wroblewski; Sven Simon

JPEG-LS has a large number of different and independent context sets that provide the opportunity for parallelism. Like JPEG-LS, many lossless image compression standards have "adaptive" error modeling as their core part. This, however, leads to data dependency loops in the compression scheme, such that a parallel compression of neighboring pixels is not possible. In this paper, a hardware architecture is proposed in order to achieve parallelism in JPEG-LS compression. In the adaptive part of the algorithm, the context update and error modeling of a pixel belonging to a context number depend on the previous pixel having the same context number. On the other hand, the probability for two successive pixels to be in different contexts is only 17%. Thus, storage is required for the intermediate pixels of the same context. In this architecture, a buffer mechanism is built to exploit the parallelism regardless of the adaptive characteristics. Despite the introduced architectural parallelism, the resulting JPEG-LS codec is fully compatible with the ISO/IEC 14495-1 JPEG-LS standard. A design for such a hardware system is provided and simulated on an FPGA, and compared with a sequential pipelined FPGA architecture of JPEG-LS. The final design can be applied with a streaming image sensor and does not require storing the entire image before compression. Thus, it is capable of lossless compression of input images in real-time embedded systems.
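The dependency structure being exploited can be modeled abstractly. The following is an assumed scheduling model for illustration only, not the proposed buffer hardware: pixels are queued per context, so pixels with different contexts can update their models in the same cycle, while pixels sharing a context remain strictly ordered.

```python
# Assumed scheduling model: per-context FIFOs expose the parallelism
# between pixels whose adaptive context models are independent.
from collections import deque

def schedule(pixels):
    """pixels: list of (pixel_value, context_id). Returns cycles, each a
    list of pixels processed together (at most one per context)."""
    queues = {}
    for p in pixels:
        queues.setdefault(p[1], deque()).append(p)
    cycles = []
    while any(queues.values()):
        # Pop at most one pending pixel from every non-empty context queue.
        cycles.append([q.popleft() for q in queues.values() if q])
    return cycles

stream = [(10, 0), (11, 1), (12, 0), (13, 2), (14, 1)]
for cycle in schedule(stream):
    print(cycle)
# 2 cycles instead of 5 sequential steps
```

Since successive pixels land in different contexts most of the time, such queues stay short, which is what makes the buffering approach practical.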


Archive | 2016

A Real-Time Process Analysis System for the Simultaneous Acquisition of Spray Characteristics

Michael Klaiber; Zhe Wang; Sven Simon

In this paper, a Real-Time Process Analysis System for the characterization and measurement of spray and atomization processes is presented. In contrast to indirect measurement methods such as phase Doppler interferometry (PDI) or laser diffraction, the proposed imaging system provides reliable measurement results for properties not only of (almost) spherical but also of arbitrarily shaped objects in spray and atomization processes. Compared to classical high-speed cameras, which are based on the acquisition and storage of high-speed image sequences, the proposed Real-Time Process Analysis System evaluates a spray or atomization process for an arbitrary time period by processing each image frame in real time and transmitting the extracted measurement data to a PC. This allows the simultaneous measurement of a variety of spray properties. One example is the simultaneous measurement of the droplet or particle size and size distribution as well as the detection of pulsation and the pulsation frequency. Another example is the detection of droplet collisions and the measurement of the sizes of the colliding and collided droplets. Furthermore, filament formation and filament networks can be detected with the proposed system. Beyond the simultaneous measurement of various spray characteristics, the process analysis system computes all spray characteristics in real time, so that they are immediately available in the experimental setup.


International Conference on Image Processing | 2012

SSPQ - spatial domain perceptual image codec based on subsampling and perceptual quantization

Zhe Wang; Sven Simon; Michael Klaiber; Silvia Ahmed; Thomas Richter

A spatial domain perceptual image codec based on subsampling and perceptual quantization (SSPQ) guided by the just-noticeable distortion (JND) profile is proposed. SSPQ integrates perceptual image coding and progressive transmission in one framework. The input image is first subsampled by a factor of two in both dimensions, and the subsampled image is compressed losslessly. The subsampled image provides a basis both for predicting the input pixels by interpolation and for estimating the JND values for each pixel. Residual quantization thresholds are set to the estimated JND values for a perceptually tuned compression. Quantized residuals are progressively encoded by a context-based Golomb coder with run-length coding capability. Experimental results show over 50% improvement in compression performance on average for the proposed SSPQ codec compared to lossless JPEG-LS.
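The pipeline of subsampling, interpolation prediction, and JND-thresholded residuals can be illustrated in one dimension. This is a simplified sketch with an assumed constant JND threshold instead of the estimated per-pixel JND profile:

```python
# Simplified 1-D illustration of the SSPQ idea (assumed, not the codec):
# subsample, predict the dropped samples by interpolation, and zero out
# residuals below the just-noticeable-distortion threshold.

def sspq_residuals(signal, jnd):
    base = signal[::2]               # subsampled layer, coded losslessly
    residuals = []
    for i in range(1, len(signal), 2):
        left = signal[i - 1]
        right = signal[i + 1] if i + 1 < len(signal) else left
        pred = (left + right) // 2   # interpolation predictor
        r = signal[i] - pred
        residuals.append(0 if abs(r) <= jnd else r)
    return base, residuals

sig = [10, 12, 14, 30, 14, 13, 12]
base, res = sspq_residuals(sig, jnd=2)
print(base)  # [10, 14, 14, 12]
print(res)   # [0, 16, 0] -- only the perceptible residual survives
```

Zeroed residuals compress to almost nothing under the Golomb/run-length back end, which is where the rate savings over pure lossless coding come from.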

Collaboration


Top co-authors of Michael Klaiber:

Sven Simon, University of Stuttgart

Zhe Wang, University of Stuttgart

Silvia Ahmed, University of Stuttgart