Daniel Palomino
Universidade Federal do Rio Grande do Sul
Publication
Featured research published by Daniel Palomino.
international conference on image processing | 2012
Daniel Palomino; Felipe Sampaio; Luciano Volcan Agostini; Sergio Bampi; Altamiro Amadeu Susin
This work proposes a hardware architecture for the Intra Frame Prediction of the emerging High Efficiency Video Coding (HEVC) standard. The architecture was designed considering all innovative features of the Intra Prediction included in HEVC, i.e., all modes and all Prediction Unit (PU) sizes. Performance and memory accesses are a problem in HEVC intra prediction, and hardware architecture designs are a good alternative for solving these issues, especially when energy-efficient solutions are targeted. Buffers and internal memories were used in the designed architecture to decrease the number of external memory accesses. Two independent data paths processing eight samples in parallel and a deep, multiplierless pipeline were designed to increase the throughput. The architecture was synthesized using an IBM 65 nm CMOS technology. The results show that the architecture is able to process 30 HD720p frames per second and 13 HD1080p frames per second when running at 500 MHz, while reducing external memory accesses by 95%.
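As a rough illustration of what "multiplierless" can mean for intra prediction, the sketch below computes the HEVC angular interpolation `((32 - frac) * a + frac * b + 16) >> 5` using only shifts and additions. The function names `shift_add_mul` and `interp_sample` are hypothetical, and the abstract does not describe the actual datapath, so this is an assumption about the general technique rather than the authors' design.

```python
def shift_add_mul(x, w):
    """Multiply x by a small non-negative integer w using only shifts and adds."""
    acc = 0
    bit = 0
    while w:
        if w & 1:
            acc += x << bit  # add x * 2**bit for every set bit of w
        w >>= 1
        bit += 1
    return acc


def interp_sample(a, b, frac):
    """Two-tap angular interpolation with a 1/32-sample fraction (0 <= frac <= 31).

    Equivalent to ((32 - frac) * a + frac * b + 16) >> 5, but written
    without a multiplication operator (illustrative only).
    """
    return (shift_add_mul(a, 32 - frac) + shift_add_mul(b, frac) + 16) >> 5


if __name__ == "__main__":
    # Reference samples and fraction are arbitrary example values.
    print(interp_sample(100, 200, 16))
```

In hardware, each conditional addition would correspond to a fixed shifter and an adder stage, which is one reason such datapaths pipeline well.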
international symposium on circuits and systems | 2011
Cláudio Machado Diniz; Bruno Zatt; Cristiano Thiele; Altamiro Amadeu Susin; Sergio Bampi; Felipe Sampaio; Daniel Palomino; Luciano Volcan Agostini
In this work we present a high-throughput hardware architecture for the H.264/AVC intra-frame encoder that exploits the parallelism of intra prediction, forward and inverse transforms, and quantization. Since there is a strong data dependency between the intra prediction and the image reconstruction loop, the latency of this path is a key design issue for high-performance coding. Considering that 77% of the total intra-encoding computation is spent in these modules, our architecture uses a 4-pixel-wide intra prediction module and a 16-pixel-wide reconstruction loop. Compared to the state of the art, our approach reduces the number of cycles needed to process a macroblock by 47%. Running at 150 MHz, our architecture guarantees encoding of 61 HD1080p frames per second. The developed architecture requires 73.4 MHz to encode HD1080p in real time, which is a 46% reduction of the frequency requirement compared to the state of the art.
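The frequency and frame-rate figures in this abstract can be cross-checked with simple proportionality. The sketch below assumes a fixed number of cycles per frame, so that frame rate scales linearly with clock frequency, and assumes that "real time" means 30 frames per second; neither assumption is stated in the abstract, and the helper names are hypothetical.

```python
def cycles_per_frame(clock_hz, fps):
    """Average clock cycles spent on one frame at a given clock and frame rate."""
    return clock_hz / fps


def required_clock_hz(cycles, target_fps):
    """Clock frequency needed to sustain target_fps at a fixed cycles-per-frame cost."""
    return cycles * target_fps


if __name__ == "__main__":
    cpf = cycles_per_frame(150e6, 61)   # figures from the abstract
    req = required_clock_hz(cpf, 30)    # 30 fps is an assumed real-time target
    print(f"cycles per frame: {cpf:,.0f}")
    print(f"clock for 30 fps: {req / 1e6:.1f} MHz")
```

Under these assumptions the result is about 73.8 MHz, close to the reported 73.4 MHz; the small gap is what one would expect if the 61 frames per second figure is rounded down.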
design, automation, and test in europe | 2016
Daniel Palomino; Muhammad Shafique; Altamiro Amadeu Susin; Jörg Henkel
This paper presents a thermal optimization technique that adaptively employs varying degrees of approximation at both the algorithm and data levels in order to reduce the temperature associated with the High Efficiency Video Coding process while maintaining good quality results. The technique evaluates, at run time, the regions of a video sequence, frame by frame, in terms of their tolerance to imprecise computations. It adapts the amount of approximation error based on the video sequence properties and application-specific knowledge. The proposed technique adaptively controls the strength of the approximations (at both the algorithm and data levels) depending on the varying resilience of regions with different texture and motion properties. Our content-driven approximate computing technique demonstrates the potential to improve the thermal profile of a chip. Experimental results show that our technique reduces the on-chip temperature by about 10 °C on average while maintaining good quality results.
international conference on electronics, circuits, and systems | 2010
Guilherme Corrêa; Cláudio Machado Diniz; Sergio Bampi; Daniel Palomino; Roger Endrigo Carvalho Porto; Luciano Volcan Agostini
In order to achieve the best coding performance, the H.264/AVC encoder must choose the best coding mode and the best block size in terms of bit rate and distortion. The H.264/AVC reference software applies the Rate-Distortion Optimization (RDO) technique, which makes the encoding process too complex for applications that require real-time operation. This paper presents a fast intra decision process and its architecture, where the mode and block size decisions are based on block distortion and homogeneity. The tests performed show that the method achieves PSNR results similar to those of the RDO technique, with a low bit-rate increase. The complexity, in turn, is reduced by nearly 148 times compared to the RDO method. In addition, the implemented architecture is capable of processing 1080p video in real time.
picture coding symposium | 2013
Daniel Palomino; Eduardo Cavichioli; Altamiro Amadeu Susin; Luciano Volcan Agostini; Muhammad Shafique; Jörg Henkel
This paper presents a fast mode decision algorithm for HEVC intra prediction. A new evaluation order in the Coding Tree Block (CTB) allows the modes chosen for lower-level PUs to be used as a reference for the current PU decision. In this paper we use this idea to develop a fast intra mode decision algorithm that can be configured to run in two different complexity modes, relaxed and aggressive. Experimental results show that our algorithm achieves encoding time savings of almost 60% with negligible loss in compression efficiency when compared to the full RDO-based decision. In addition, our mode decision algorithm presents the best result in terms of time saving per compression efficiency when compared with all related works.
design, automation, and test in europe | 2014
Daniel Palomino; Muhammad Shafique; Hussam Amrouch; Altamiro Amadeu Susin; Jörg Henkel
This paper presents an application-driven algorithm for Dynamic Thermal Management (DTM) for High Efficiency Video Coding (HEVC). To design such a DTM policy efficiently, we perform an offline thermal analysis of an HEVC encoder and demonstrate the impact of different video sequences and different coding configurations on the processor temperature. Our thermal analysis is leveraged to develop an efficient application-driven DTM policy that performs temperature-aware coding along with an application-driven control of DTM knobs (e.g., frequency scaling) in order to meet the temperature constraints while still providing high video quality (i.e., PSNR loss < 0.01 dB). For accurate thermal analysis and evaluation, we deploy an infrared camera-based thermal measurement setup that, in contrast to state-of-the-art setups, does not require adding any extra layer on top of the measured chip, thus allowing the camera to accurately capture the infrared emissions from the die.
symposium on integrated circuits and systems design | 2011
Daniel Palomino; Guilherme Corrêa; Cláudio Machado Diniz; Sergio Bampi; Luciano Volcan Agostini; Altamiro Amadeu Susin
In the Rate-Distortion Optimization (RDO) technique for video encoding, the best prediction mode is chosen through exhaustive executions of the whole encoding process, which significantly increases the encoder's computational complexity. In H.264/AVC intra-frame prediction there are several modes available to encode each macroblock (MB). In order to reduce the number of calculations necessary to determine the best intra-frame mode, this work proposes an algorithm and the hardware design for a fast intra-frame mode decision module for H.264/AVC encoders. The proposed algorithm reduces by more than ten times the number of encoding iterations needed to choose the best intra-frame mode when compared with the RDO-based decision, at the cost of a relatively small bit-rate increase (5% on average) and image quality loss (0.2 dB in PSNR). The architecture takes 36 clock cycles to perform the intra-frame decision for one MB and achieved an operating frequency of 130 MHz when synthesized for TSMC 0.18 μm, being able to process more than 400 HD1080p frames per second. With this approach, we achieved a one-order-of-magnitude performance improvement compared with RDO-based approaches, which is very important not only for performance but also for energy consumption, given the need to improve video encoding efficiency for battery-operated devices. Compared with the best previously reported results, the implemented architecture achieves a complexity reduction of five times, a processing capability increase of 14 times, and a reduction in the number of clock cycles per MB of 11 times.
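The throughput claim in this abstract follows from the cycle count and clock frequency. The sketch below is a back-of-the-envelope check that assumes an HD1080p frame is coded as 1920x1088 samples (the height padded to a multiple of 16), giving 8160 macroblocks of 16x16 samples, and that the decision module spends exactly 36 cycles on every macroblock with no stalls. The function names are hypothetical.

```python
MB_SIZE = 16  # H.264/AVC macroblocks cover 16x16 luma samples


def macroblocks_per_frame(width, height):
    """Number of 16x16 macroblocks, rounding each dimension up to a multiple of 16."""
    mbs_wide = -(-width // MB_SIZE)   # ceiling division
    mbs_high = -(-height // MB_SIZE)
    return mbs_wide * mbs_high


def frames_per_second(clock_hz, cycles_per_mb, width, height):
    """Frame rate sustained when every macroblock costs a fixed number of cycles."""
    cycles_per_frame = cycles_per_mb * macroblocks_per_frame(width, height)
    return clock_hz / cycles_per_frame


if __name__ == "__main__":
    mbs = macroblocks_per_frame(1920, 1080)
    fps = frames_per_second(130e6, 36, 1920, 1080)
    print(f"macroblocks per frame: {mbs}")
    print(f"frames per second:     {fps:.1f}")
```

Under these assumptions the module sustains roughly 442 frames per second, which agrees with the "more than 400 HD1080p frames per second" reported above.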
international symposium on low power electronics and design | 2014
Daniel Palomino; Muhammad Shafique; Altamiro Amadeu Susin; Jörg Henkel
This paper presents an adaptive temperature optimization technique for the next generation video encoders. It exploits both application-specific knowledge (i.e. video encoding configurations) and video content properties in order to efficiently manage the temperature of advanced video coding systems at the software layer. For designing an efficient technique, we perform an extensive offline analysis to understand the impact of different video properties and configurations on the CPU thermal profiles when processing the next generation video encoder. Our temperature optimization technique performs an application-level prediction of the temperature trend followed by an application-level thermal management policy. The policy dynamically manages the temperature by performing an adaptive encoder configuration selection while providing minimum penalties in terms of bit rate and video quality. The experimental results show that our policy meets temperature constraints with negligible encoding performance loss. Moreover, when compared to state-of-the-art techniques, our policy provides a relatively reduced video quality loss while still meeting the temperature constraints.
VLSI Design | 2012
Guilherme Corrêa; Daniel Palomino; Cláudio Machado Diniz; Sergio Bampi; Luciano Volcan Agostini
In H.264/AVC, the encoding process can occur according to one of the 13 intraframe coding modes or according to one of the 8 available interframe block sizes, besides the SKIP mode. In the Joint Model reference software, the choice of the best mode is performed through exhaustive executions of the entire encoding process, which significantly increases the encoder's computational complexity and sometimes even prevents its use in real-time applications. Considering this context, this work proposes a set of heuristic algorithms targeting hardware architectures that lead to earlier selection of one encoding mode. The number of repetitions of the encoding process is reduced by 47 times, at the cost of a relatively small loss in compression performance. When compared to other works, the fast hierarchical mode decision results are significantly more satisfactory in terms of computational complexity reduction, quality, and bit rate. The low-complexity mode decision architecture proposed is thus a very good option for real-time coding of high-resolution videos. The solution is especially interesting for embedded and mobile applications with support for multimedia systems, since it yields good compression rates and image quality with a very high reduction in encoder complexity.
symposium on integrated circuits and systems design | 2009
Robson Dornelles; Felipe Sampaio; Daniel Palomino; Luciano Volcan Agostini
This paper presents an architecture for a dedicated transforms and quantization loop targeting the Intra Prediction of the H.264/AVC video coding standard. The transforms and quantization loop is a bottleneck for the Intra Prediction, since the prediction process cannot start before the transforms and quantization loop finishes the reconstruction of the reference blocks. This work presents a low-latency, high-throughput architecture for the transforms and quantization loop, intended to reduce the impact of this loop on the Intra Prediction performance. The architecture was described in VHDL and synthesized for an Altera Stratix II FPGA and for the TSMC 0.18 μm CMOS technology. The architecture reaches an operating frequency of 129.3 MHz when mapped to standard cells, and takes 4 cycles to process an I4MB block, 24 cycles to process an I16MB block, and 22 cycles to process a chroma block. Those are the best results when compared with all related works considering the Intra Prediction constraints.