Cláudio Machado Diniz


Publication

Featured research published by Cláudio Machado Diniz.

International Journal of Parallel Programming | 2014

Parallelization of Full Search Motion Estimation Algorithm for Parallel and Distributed Platforms

Eduarda Monteiro; Bruno Boessio Vizzotto; Cláudio Machado Diniz; Marilena Maule; Bruno Zatt; Sergio Bampi

This work presents an efficient method to map the Full Search algorithm for Motion Estimation (ME) onto General-Purpose Graphics Processing Unit (GPGPU) architectures using the Compute Unified Device Architecture (CUDA) programming model. Our method jointly exploits the massive parallelism available in current GPGPU devices and the parallelism potential of the Full Search algorithm. Our main goal is to evaluate the feasibility of implementing video codecs on GPGPUs, and the advantages and drawbacks of doing so compared to other platforms. For comparison, three further solutions were developed using distinct programming paradigms for distinct underlying hardware architectures: (i) a sequential solution for a general-purpose processor (GPP); (ii) a parallel solution for multi-core GPPs using the OpenMP library; (iii) a distributed solution for cluster/grid machines using the Message Passing Interface (MPI) library. The CUDA-based solution for GPGPUs achieves a speed-up consistent with that indicated by the theoretical model for different search areas. Our GPGPU Full Search Motion Estimation provides 2×, 20× and 1664× speed-up when compared to the MPI, OpenMP and sequential implementations, respectively. Compared to the state of the art, our solution reaches up to 17× speed-up.
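As a rough illustration of why Full Search maps well onto parallel hardware, the sketch below is a plain sequential reference in Python with NumPy. It is not the authors' CUDA, OpenMP or MPI code. The function names, the 16×16 block size and the ±8 search range are assumptions chosen for the example. Every candidate displacement is evaluated independently of the others, so the two inner loops are the part a parallel implementation would distribute across threads or processes.

```python
import numpy as np

def sad(a, b):
    """Sum of Absolute Differences between two equally sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def full_search(cur, ref, bx, by, block=16, search_range=8):
    """Exhaustively test every displacement in the search window.

    (bx, by) is the top-left corner of the current block.
    Returns ((dx, dy), best_sad) for the lowest-cost candidate.
    """
    h, w = ref.shape
    cur_blk = cur[by:by + block, bx:bx + block]
    best_mv, best_sad = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if x < 0 or y < 0 or x + block > w or y + block > h:
                continue
            cost = sad(cur_blk, ref[y:y + block, x:x + block])
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad

# Usage: the current frame is the reference frame shifted by a known amount.
rng = np.random.default_rng(0)
ref_frame = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
cur_frame = np.roll(ref_frame, shift=(-2, 3), axis=(0, 1))
motion_vector, cost = full_search(cur_frame, ref_frame, bx=24, by=24)
```

With a ±8 window the search evaluates up to 289 candidates per block, which is why the cost grows quickly with the search area.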

International Conference on Image Processing | 2013

High-throughput interpolation hardware architecture with coarse-grained reconfigurable datapaths for HEVC

Cláudio Machado Diniz; Muhammad Shafique; Sergio Bampi; Jörg Henkel

Fractional-pel interpolation for motion estimation and motion compensation is one of the key computational hotspots in the new High Efficiency Video Coding (HEVC) standard. This work presents a high-throughput interpolation hardware architecture to improve the performance of HEVC encoding and decoding. It employs two acceleration engines for luma and chroma filtering, each with 12-pel-parallel coarse-grained reconfigurable interpolation datapaths. An adaptive scheduling scheme manages the operation of these interpolation datapaths in different ways depending upon the prediction unit (PU) size and the execution scenario (i.e. motion estimation or motion compensation). We have implemented our hardware architecture in 150 nm technology. Compared to state-of-the-art techniques [12], our architecture requires 49% less hardware area, while processing QFHD (3840×2160) resolution @ 30 fps.
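For readers unfamiliar with the operation being accelerated, the following is a minimal software sketch of one-dimensional HEVC luma interpolation. It is not a model of the architecture described above. The tap values are the 8-tap luma coefficients defined in the HEVC standard (they are not stated in the abstract), and the single-stage rounding, edge padding and function name are simplifying assumptions; the standard itself keeps higher intermediate precision when filtering in two dimensions.

```python
import numpy as np

# HEVC luma interpolation taps, indexed by quarter-sample position (1, 2, 3).
# Each set sums to 64, so results are normalised with a 6-bit right shift.
HEVC_LUMA_TAPS = {
    1: (-1, 4, -10, 58, 17, -5, 1, 0),
    2: (-1, 4, -11, 40, 40, -11, 4, -1),
    3: (0, 1, -5, 17, 58, -10, 4, -1),
}

def interpolate_row(row, frac):
    """Interpolate the sample at position i + frac/4 for every i in a row.

    Assumptions: 8-bit samples, replicated borders, one filtering stage.
    """
    taps = np.array(HEVC_LUMA_TAPS[frac], dtype=np.int32)
    samples = np.asarray(row, dtype=np.int32)
    # The filter reads samples i-3 .. i+4, so pad 3 on the left, 4 on the right.
    padded = np.pad(samples, (3, 4), mode="edge")
    out = np.empty(len(samples), dtype=np.int32)
    for i in range(len(samples)):
        window = padded[i:i + 8]
        out[i] = (int(np.dot(window, taps)) + 32) >> 6  # round, then divide by 64
    return np.clip(out, 0, 255)

# Usage: half-sample positions of a linear ramp land midway between neighbours.
ramp = list(range(0, 160, 10))
half_pel = interpolate_row(ramp, 2)
```

Each output sample needs eight multiply-accumulates per filtering direction, which gives some sense of why the abstract treats interpolation as a computational hotspot at QFHD resolution.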

International Symposium on Circuits and Systems | 2011

A high throughput H.264/AVC intra-frame encoding loop architecture for HD1080p

Cláudio Machado Diniz; Bruno Zatt; Cristiano Thiele; Altamiro Amadeu Susin; Sergio Bampi; Felipe Sampaio; Daniel Palomino; Luciano Volcan Agostini

In this work we present a high-throughput hardware architecture for the H.264/AVC intra-frame encoder that exploits the parallelism of intra prediction, forward and inverse transforms, and quantization. Since there is a strong data dependency between the intra prediction and the image reconstruction loop, the latency of this path is a key design issue for high-performance coding. Considering that 77% of the total intra-encoding computation is spent in these modules, our architecture employs a 4-pixel-wide intra prediction module and a 16-pixel-wide reconstruction loop. Compared to the state of the art, our approach reduces the number of cycles needed to process a macroblock by 47%. Running at 150 MHz, our architecture guarantees encoding of 61 HD1080p frames per second. It requires 73.4 MHz to encode HD1080p in real time, a 46% reduction in the frequency requirement compared to the state of the art.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2015

A Reconfigurable Hardware Architecture for Fractional Pixel Interpolation in High Efficiency Video Coding

Cláudio Machado Diniz; Muhammad Shafique; Sergio Bampi; Jörg Henkel

We present a novel reconfigurable hardware architecture for interpolation filtering in High Efficiency Video Coding (HEVC) that adapts to run-time changes in the number of interpolation filter calls and thereby offers high potential for energy efficiency. It employs a picture-based prediction scheme to estimate the number of interpolation filter calls at run-time by monitoring the group-of-pictures history based on knowledge of the video coding structure. Reconfigurable acceleration engines are developed that can adapt to different filter types. Dynamic composition of different instances of these engines enables different implementation versions with an area versus throughput tradeoff. A run-time selection scheme determines the best implementation version for each picture based on the throughput requirements. Compared to the state of the art, our architecture reduces resource usage by 57% while supporting various throughputs and video resolutions.

International Conference on Electronics, Circuits, and Systems | 2010

Homogeneity and distortion-based intra mode decision architecture for H.264/AVC

Guilherme Corrêa; Cláudio Machado Diniz; Sergio Bampi; Daniel Palomino; Roger Endrigo Carvalho Porto; Luciano Volcan Agostini

In order to achieve the best coding performance, the H.264/AVC encoder must choose the best coding mode and the best block size in terms of bit-rate and distortion. The H.264/AVC reference software applies the Rate-Distortion Optimization (RDO) technique, which makes encoding too complex for applications that require real-time operation. This paper presents a fast intra decision process and its architecture, where the mode and block size decisions are based on block distortion and homogeneity. Tests show that the method achieves PSNR results similar to the RDO technique with a low bit-rate increase, while reducing complexity by a factor of nearly 148 compared to the RDO method. The implemented architecture is also capable of processing 1080p video in real time.

Pacific-Rim Symposium on Image and Video Technology | 2007

A pipelined 8×8 2-D forward DCT hardware architecture for H.264/AVC high profile encoder

Thaísa Leal da Silva; Cláudio Machado Diniz; João Alberto Vortmann; Luciano Volcan Agostini; Altamiro Amadeu Susin; Sergio Bampi

This paper presents the hardware design of an 8×8 two-dimensional Forward Discrete Cosine Transform used in the high profiles of the H.264/AVC video coding standard. The designed DCT is computed separably, as two 1-D transforms. It uses only add and shift operations, avoiding multiplications. The architecture contains one datapath for each 1-D DCT with a transpose buffer between them. The complete architecture was synthesized to Xilinx Virtex-II Pro and Altera Stratix II FPGAs and to TSMC 0.35 µm standard-cell technology. The synthesis results show that the 2-D DCT architecture reaches the throughput necessary to encode high-definition video in real time on all target technologies.
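The row-column structure described above (a 1-D transform, a transpose buffer, then a second 1-D transform) can be sketched in software as follows. Note the substitution: this example uses the textbook orthonormal floating-point DCT-II, not the integer add-and-shift transform of H.264/AVC that the paper implements, and the function names are illustrative only. It demonstrates the separability, not the paper's arithmetic.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (floating point stand-in)."""
    k = np.arange(n).reshape(-1, 1)   # frequency index, one per matrix row
    i = np.arange(n).reshape(1, -1)   # sample index, one per matrix column
    c = np.cos((2 * i + 1) * k * np.pi / (2 * n)) * np.sqrt(2.0 / n)
    c[0, :] = np.sqrt(1.0 / n)        # DC row uses a different scale factor
    return c

def dct2d_separable(block):
    """2-D DCT computed as two 1-D passes with a transpose in between."""
    c = dct_matrix(block.shape[0])
    stage1 = block @ c.T        # first datapath: 1-D DCT of every row
    buffered = stage1.T         # transpose buffer: rows become columns
    stage2 = buffered @ c.T     # second datapath: 1-D DCT of the former columns
    return stage2.T             # restore the original orientation

# Usage: compare against the direct matrix form C · X · Cᵀ.
rng = np.random.default_rng(1)
pixels = rng.integers(0, 256, size=(8, 8)).astype(np.float64)
separable = dct2d_separable(pixels)
direct = dct_matrix(8) @ pixels @ dct_matrix(8).T
```

Splitting the transform this way means both passes can reuse the same 1-D structure, with the transpose buffer as the only extra storage between them.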

Latin American Symposium on Circuits and Systems | 2011

Synthesis and comparison of low-power high-throughput architectures for SAD calculation

Fábio Leandro Walter; Cláudio Machado Diniz; Sergio Bampi

This paper presents the standard-cell synthesis and comparison of parallel hardware architectures for the Sum of Absolute Differences (SAD) datapath, focusing on different design points such as high performance (maximum throughput) and the tradeoff between high performance and low power dissipation (isoperformance target). Clock gating and different combinations of parallelism and pipelining were explored in this work. To generate the results, we used the TSMC 0.18 µm standard-cell library with standard, slow and fast cells, and the Cadence back-end tools, e.g. Power Analysis, for the power measurements. We achieved significant power reduction for the architectures with low frequency and high parallelism, slow cells and, above all, only one pipeline stage.

International Conference on Multimedia and Expo | 2009

A real time H.264/AVC intra frame prediction hardware architecture for HDTV 1080P video

Cláudio Machado Diniz; Bruno Zatt; Luciano Volcan Agostini; Altamiro Amadeu Susin; Sergio Bampi

This work presents an intra-frame prediction hardware architecture for an H.264/AVC baseline/main profile encoder that processes HDTV 1080p video in real time. This is achieved by exploiting the parallelism of intra prediction and by reducing the latency of Intra 4×4 processing, which is the intra-encoding bottleneck. Synthesis results on a Xilinx Virtex-II Pro FPGA and in TSMC 0.18 µm standard cells indicate that the architecture can encode HDTV 1080p video in real time when operating at 110 MHz. Our architecture can encode HD1080p, 720p and SD video in real time at a frequency 25% lower than similar works.

Design, Automation, and Test in Europe | 2015

A deblocking filter hardware architecture for the high efficiency video coding standard

Cláudio Machado Diniz; Muhammad Shafique; Felipe Vogel Dalcin; Sergio Bampi; Jörg Henkel

The new deblocking filter (DF) tool of the next-generation High Efficiency Video Coding (HEVC) standard is one of the most time-consuming algorithms in video decoding. In order to achieve real-time performance at low power consumption, we developed a hardware accelerator for this filter. This paper proposes a high-throughput hardware architecture for the HEVC deblocking filter that employs hardware reuse to accelerate the filtering decision units at a low area cost. Our architecture achieves either higher or equivalent throughput (4096×2048 @ 60 fps) with 5× to 6× lower area compared to state-of-the-art deblocking filter architectures.

Symposium on Integrated Circuits and Systems Design | 2011

Algorithm and hardware design of a fast intra-frame mode decision module for H.264/AVC encoders

Daniel Palomino; Guilherme Corrêa; Cláudio Machado Diniz; Sergio Bampi; Luciano Volcan Agostini; Altamiro Amadeu Susin

In the Rate-Distortion Optimization (RDO) technique for video encoding, the best prediction mode is chosen through exhaustive executions of the whole encoding process, which significantly increases the computational complexity of the encoder. In H.264/AVC intra-frame prediction there are several modes available to encode each macroblock (MB). To reduce the number of calculations necessary to determine the best intra-frame mode, this work proposes an algorithm and the hardware design for a fast intra-frame mode decision module for H.264/AVC encoders. The proposed algorithm reduces the number of encoding iterations needed to choose the best intra-frame mode by more than ten times compared with RDO-based decision, at the cost of a relatively small bit-rate increase (5% on average) and image quality loss (0.2 dB in PSNR). The architecture takes 36 clock cycles to perform the intra-frame decision for one MB and achieved an operating frequency of 130 MHz when synthesized for TSMC 0.18 µm, making it able to process more than 400 HD1080p frames per second. With this approach, we achieved an order-of-magnitude performance improvement over RDO-based approaches, which matters not only for performance but also for energy consumption, given the need to improve video encoding efficiency for battery-operated devices. Compared with the best previously reported results, the implemented architecture achieves a complexity reduction of five times, a processing capability increase of 14 times and a reduction in the number of clock cycles per MB of 11 times.
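The frame-rate figure above follows from the cycle count and clock frequency given in the abstract. The short calculation below is a back-of-the-envelope check, assuming 16×16 macroblocks, a 1080-line frame padded up to a whole number of macroblock rows, and one mode decision completing every 36 cycles with no stalls. The function name and defaults are illustrative.

```python
def frames_per_second(clock_hz, cycles_per_mb, width=1920, height=1080, mb_size=16):
    """Frames per second sustained by a module that spends a fixed
    number of clock cycles on every macroblock of every frame."""
    mbs_x = -(-width // mb_size)    # ceiling division: 1920 / 16 = 120
    mbs_y = -(-height // mb_size)   # ceiling division: 1080 / 16 -> 68 rows
    mbs_per_frame = mbs_x * mbs_y   # 8160 macroblocks per HD1080p frame
    return clock_hz / (cycles_per_mb * mbs_per_frame)

# Figures taken from the abstract: 36 cycles per MB at 130 MHz.
fps = frames_per_second(clock_hz=130e6, cycles_per_mb=36)
print(f"{fps:.1f} frames per second")  # about 442.5
```

Under these assumptions the result is roughly 442 frames per second, in line with the "more than 400" stated in the abstract.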
