Is this you? Create Your Porfile

Felipe Sampaio

Universidade Federal do Rio Grande do Sul

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Felipe Sampaio is active.

Explore More

Publication

Featured researches published by Felipe Sampaio.

design automation conference | 2011

Run-time adaptive energy-aware motion and disparity estimation in multiview video coding

Bruno Zatt; Muhammad Shafique; Felipe Sampaio; Luciano Volcan Agostini; Sergio Bampi; Jörg Henkel

This paper presents a novel run-time adaptive energy-aware Motion and Disparity Estimation (ME, DE) architecture for Multiview Video Coding (MVC). It incorporates efficient memory access and data prefetching techniques for jointly reducing the on/off-chip memory energy consumption. A dynamically expanding search window is constructed at run time to reduce the off-chip memory accesses. Considering the multi-stage processing nature of advanced fast ME/DE schemes, a reduced-sized multi-bank on-chip memory is employed which can be power-gated depending upon the video properties. As a result, when tested for various video sequence, our approach provides a dynamic energy reduction of 82–96% for the off-chip memory and a leakage energy reduction of 57–75% for the on-chip memory compared to the Level-C and Level-C+ [7] prefetching techniques (which are the prominent data reuse and prefetching techniques in ME for video coding). The proposed ME/DE architecture is synthesized using a 65nm IBM low power technology. Compared to state-of-the-art MVC ME/DE hardware [14], our architecture provides 66% and 72% reduction in the area and power consumption, respectively. Moreover, our scheme achieves 30fps ME/DE 4-view HD1080p encoding with a power consumption of 74mW.

international conference on multimedia and expo | 2012

Motion Vectors Merging: Low Complexity Prediction Unit Decision Heuristic for the Inter-prediction of HEVC Encoders

Felipe Sampaio; Sergio Bampi; Mateus Grellert; Luciano Volcan Agostini; Júlio C. B. de Mattos

This paper presents the Motion Vectors Merging (MVM) heuristic, which is a method to reduce the HEVC inter-prediction complexity targeting the PU partition size decision. In the HM test model of the emerging HEVC standard, computational complexity is mostly concentrated in the inter-frame prediction step (up to 96% of the total encoder execution time, considering common test conditions). The goal of this work is to avoid several Motion Estimation (ME) calls during the PU inter-prediction decision in order to reduce the execution time in the overall encoding process. The MVM algorithm is based on merging NxN PU partitions in order to compose larger ones. After the best PU partition is decided, ME is called to produce the best possible rate-distortion results for the selected partitions. The proposed method was implemented in the HM test model version 3.4 and provides an execution time reduction of up to 34% with insignificant rate-distortion losses (0.08 dB drop and 1.9% bitrate increase in the worst case). Besides, there is no related work in the literature that proposes PU-level decision optimizations. When compared with works that target CU-level fast decision methods, the MVM shows itself competitive, achieving results as good as those works.

international conference on image processing | 2012

A memory aware and multiplierless VLSI architecture for the complete Intra Prediction of the HEVC emerging standard

Daniel Palomino; Felipe Sampaio; Luciano Volcan Agostini; Sergio Bampi; Altamiro Amadeu Susin

This work proposes a hardware architecture for the Intra Frame Prediction of the emerging High Efficiency Video Coding (HEVC) standard. The architecture was designed considering all innovative features of the Intra Prediction included in the HEVC, i.e. all modes and all Prediction Units (PU) sizes. Performance and memory accesses are a problem in the HEVC intra prediction and hardware architecture designs are good alternative to solve these issues, especially when energy-efficient solutions are targeted. Buffers and internal memories were used in the designed architecture to decrease the number of external memory accesses. Two independent data paths processing eight samples in parallel and a deep and multiplierless pipeline were designed to increase the throughput. The architecture was synthesized using an IBM 65nm CMOS technology. The results have shown that the architecture is able to process 30 HD720p frames per second and 13 HD1080p frames per second when running at 500 MHz, reducing in 95% the accesses to the external memory.

international symposium on circuits and systems | 2011

A high throughput H.264/AVC intra-frame encoding loop architecture for HD1080p

Cláudio Machado Diniz; Bruno Zatt; Cristiano Thiele; Altamiro Amadeu Susin; Sergio Bampi; Felipe Sampaio; Daniel Palomino; Luciano Volcan Agostini

In this work we present a high throughput hardware architecture for the H.264/AVC intra-frame encoder exploiting the parallelism of intra prediction, forward and inverse transforms and quantization. Since there is a strong data dependency between the intra prediction and the image reconstruction loop, the latency of this path is a key design issue in order to provide high performance coding. Considering that 77% of the total intra-encoding computation is spent in these modules, our architecture handles a 4-pixel wide intra prediction module and a 16-pixel wide reconstruction loop. Compared to the state-of-the-art our approach reduces by 47% the number of cycles to process a macroblock. Running at 150 MHz our architecture guarantees encoding of 61 HD1080p frames per second. The developed architecture requires 73.4 MHz to real-time encode HD1080p, which is a 46% reduction of the frequency requirement compared to the state-of-the-art.

design, automation, and test in europe | 2014

dSVM: Energy-efficient distributed Scratchpad Video Memory Architecture for the next-generation High Efficiency Video Coding

Felipe Sampaio; Muhammad Shafique; Bruno Zatt; Sergio Bampi; Jörg Henkel

An energy-efficient distributed Scratchpad Video Memory Architecture (dSVM) for the next-generation parallel High Efficiency Video Coding is presented. Our dSVM combines private and overlapping (shared) Scratchpad Memories (SPMs) to support data reuse within and across different cores concurrently executing multiple parallel HEVC threads. We developed a statistical method to size and design the organization of the SPMs along with a supporting memory reading policy for energy efficiency. The key is to leverage the HEVC and video content knowledge. Furthermore, we integrate an adaptive power management policy for SPMs to manage the power states of different memory parts at run time depending upon the varying video content properties. Our experimental results illustrate that our dSVM architecture reduces the overall memory energy consumption by up to 51%-61% compared to parallelized state-of-the-art solutions [11]. The dSVM external memory energy savings increase with an increasing number of parallel HEVC threads and size of search window. Moreover, our SPM power management reacts to the current video properties and achieves up to 54% on-chip leakage energy savings.

design, automation, and test in europe | 2013

Energy-efficient memory hierarchy for motion and disparity estimation in multiview video coding

Felipe Sampaio; Bruno Zatt; Muhammad Shafique; Luciano Volcan Agostini; Sergio Bampi; Jörg Henkel

This work presents an energy-efficient memory hierarchy for Motion and Disparity Estimation on Multiview Video Coding employing a Reference Frames-Centered Data Reuse (RCDR) scheme. In RCDR the reference search window becomes the center of the motion/disparity estimation processing flow and calls for processing all blocks requesting its data. By doing so, RCDR avoids multiple search window retransmissions leading to reduced number of external memory accesses, thus memory energy reduction. To deal with out-of-order processing and further reduce external memory traffic, a statistics-based partial results compressor is developed. The on-chip video memory energy is reduced by employing a statistical power gating scheme and candidate blocks reordering. Experimental results show that our reference-centered memory hierarchy outperforms the state-of-the-art [7][13] by providing reduction of up to 71% for external memory energy, 88% on-chip memory static energy, and 65% on-chip memory dynamic energy.

international conference on computer aided design | 2014

Energy-efficient architecture for advanced video memory

Felipe Sampaio; Muhammad Shafique; Bruno Zatt; Sergio Bampi; Jörg Henkel

An energy-efficient hybrid on-chip video memory architecture (enHyV) is presented that combines private and shared memories using a hybrid design (i.e., SRAM and emerging STT-RAM). The key is to leverage the application-specific properties to efficiently design and manage the enHyV. To increase STT-RAM lifetime, we propose a data management technique that alleviates the bit-toggling write occurrences. An adaptive power management is also proposed for static-energy savings. Experimental results illustrate that enHyV reduces on-chip static memory energy compared to SRAM-only version of enHyV and to state-of-art AMBER hybrid video memory [9] by 66%-75% and 55%-76%, respectively. Furthermore, negligible external memory energy consumption is required for reference frames communication (98% lower than state-of-the-art Level C+ technique [18]). Our data management significantly improves the enHyV STT-RAM lifetime, achieving 0.83 of normalized lifetime (near to the optimal case). Our hybrid memory design and management incur low overhead in terms of latency and dynamic energy.

international conference on image processing | 2012

Real-time block matching motion estimation onto GPGPU

Eduarda Monteiro; Marilena Maule; Felipe Sampaio; Cláudio Machado Diniz; Bruno Zatt; Sergio Bampi

This work presents an efficient method to map Motion Estimation (ME) algorithms onto General Purpose Graphic Processing Unit (GPGPU) architectures using CUDA programming model. Our method jointly exploits the massive parallelism available in current GPGPU devices and the parallelization potential of ME algorithms: Full Search (FS) and Diamond Search (DS). Our main goal is to evaluate the feasibility of achieving real-time high-definition video encoding performance running on GPUs. For comparison reasons, multi-core parallel and distributed versions of these algorithms were developed using OpenMP and MPI (Message Passing Interface) libraries, respectively. The CUDA-based solutions achieve the highest speed-up in comparison with OpenMP and MPI versions for both algorithms and, when compared to the state-of-the-art, our FS and DS solutions reach up to 18x and 11x speed-up, respectively.

International Journal of Reconfigurable Computing | 2012

DMPDS: a fast motion estimation algorithm targeting high resolution videos and its FPGA implementation

Gustavo Sanchez; Felipe Sampaio; Marcelo Schiavon Porto; Sergio Bampi; Luciano Volcan Agostini

This paper presents a new fastmotion estimation (ME) algorithm targeting high resolution digital videos and its efficient hardware architecture design. The new Dynamic Multipoint Diamond Search (DMPDS) algorithm is a fast algorithm which increases the ME quality when compared with other fast ME algorithms. The DMPDS achieves a better digital video quality reducing the occurrence of local minima falls, especially in high definition videos. The quality results show that the DMPDS is able to reach an average PSNR gain of 1.85 dB when compared with the well-known Diamond Search (DS) algorithm. When compared to the optimum results generated by the Full Search (FS) algorithm the DMPDS shows a lose of only 1.03 dB in the PSNR. On the other hand, the DMPDS reached a complexity reduction higher than 45 times when compared to FS. The quality gains related to DS caused an expected increase in the DMPDS complexity which uses 6.4-times more calculations than DS. The DMPDS architecture was designed focused on high performance and low cost, targeting to process Quad Full High Definition (QFHD) videos in real time (30 frames per second). The architecture was described in VHDL and synthesized to Altera Stratix 4 and Xilinx Virtex 5 FPGAs. The synthesis results show that the architecture is able to achieve processing rates higher than 53 QFHD fps, reaching the real-time requirements. The DMPDS architecture achieved the highest processing rate when compared to related works in the literature. This high processing rate was obtained designing an architecture with a high operation frequency and low numbers of cycles necessary to process each block.

international symposium on circuits and systems | 2011

A multilevel data reuse scheme for Motion Estimation and its VLSI design

Mateus Grellert; Felipe Sampaio; Júlio C. B. de Mattos; Luciano Volcan Agostini

Motion Estimation (ME) in video coding is a vital component that excels not only in computational complexity, but off-chip memory bandwidth as well. These two issues are considered critical constraints in terms of High Definition (HD) video coding, since a large volume of data must be processed. The multilevel data reuse scheme proposed in this paper is able to reduce the off-chip memory bandwidth, with direct impact in throughput and energy consumption. This scheme explores the concept of overlapped Search Windows (SW) in more than one level and poses no harm to video quality. Comparisons with related works show that this solution provides the best tradeoff between the use of on-chip memory and reduction of the off-chip memory bandwidth. The data reuse scheme was applied in a ME architecture and the synthesis results show that this solution presented the lowest use of hardware resources and the highest operation frequency among related works. The proposed architecture is able to process 1080p videos at 25 fps, and the reduction ratio of off-chip memory access achieved by the architecture is greater than 95% when compared to the traditional method.

Explore More