Nuno Roma | Researchain

Publication

Featured research published by Nuno Roma.

IEEE Transactions on Circuits and Systems for Video Technology | 2002

Efficient and configurable full-search block-matching processors

Nuno Roma; Leonel Sousa

Efficient VLSI architectures for motion estimation using the full-search block-matching algorithm are proposed in this paper. These structures are based on an improved and more efficient two-dimensional single-array architecture with minimum latency, maximum throughput, and full utilization of the hardware resources. This optimized architecture is extended to a class of fully parameterizable multiple-array architectures that combine pipelining and parallel processing techniques and allow the processors to be configured according to the setup parameters and the specified limits on processing time and circuit area. The development of a single-array processor on a single chip, based on a 0.25-μm CMOS technology process, proves the practical interest of the proposed architecture for implementing real-time motion estimators.
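For readers unfamiliar with the algorithm these processors implement, the following is a minimal software sketch of full-search block matching with the sum of absolute differences (SAD) as the matching criterion. It is only an illustration of the algorithm, not of the proposed hardware; the function names, the frame layout (lists of rows) and the choice of SAD are assumptions made for this example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, p):
    """Test every displacement in [-p, p] x [-p, p] that keeps the candidate
    block inside `ref`; return (best_cost, dx, dy)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            inside = (0 <= by + dy and by + dy + n <= h and
                      0 <= bx + dx and bx + dx + n <= w)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Hypothetical 8x8 frames: every reference pixel is distinct, and the current
# frame is the reference shifted by 2 pixels horizontally and 1 vertically.
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[ref[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)] for y in range(8)]

print(full_search(cur, ref, 2, 2, 4, 2))  # (0, 2, 1)
```

The two nested displacement loops and the two nested pixel loops are the regular, data-independent computation that array architectures of this kind map onto hardware.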

Multimedia Signal Processing | 1999

Low-power array architectures for motion estimation

Leonel Sousa; Nuno Roma

This paper proposes new, efficient low-power systolic architectures for full-search block-matching (FS-BM) motion estimation. These architectures eliminate unnecessary computations, reducing the power consumption while preserving the optimal solution and the throughput. The new and traditional systolic architectures for motion estimation are compared with respect to required hardware and power consumption.
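The abstract does not say how the unnecessary computations are identified. One well-known software technique with the same property (fewer operations, same optimal result) is early termination of the SAD accumulation once the partial sum can no longer beat the best candidate found so far. The sketch below illustrates that general idea only; it is an assumption for this example and not necessarily the mechanism used in the paper. All names are hypothetical.

```python
def search(cur, ref, bx, by, n, p, early_exit):
    """Full search over [-p, p] x [-p, p]. When `early_exit` is True, a
    candidate is abandoned as soon as its partial SAD reaches the current
    best cost. Returns (best_cost, (dx, dy), pixel_ops)."""
    h, w = len(ref), len(ref[0])
    best_cost, best_mv, ops = None, None, 0
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            if by + dy < 0 or bx + dx < 0 or by + dy + n > h or bx + dx + n > w:
                continue
            cost, aborted = 0, False
            for y in range(n):
                for x in range(n):
                    cost += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
                    ops += 1
                # The SAD only grows, so a partial sum that already equals or
                # exceeds the best cost can never produce a strictly better match.
                if early_exit and best_cost is not None and cost >= best_cost:
                    aborted = True
                    break
            if not aborted and (best_cost is None or cost < best_cost):
                best_cost, best_mv = cost, (dx, dy)
    return best_cost, best_mv, ops


# Hypothetical test frames: the current frame is the reference shifted by (2, 1).
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[ref[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)] for y in range(8)]

full = search(cur, ref, 2, 2, 4, 2, early_exit=False)
fast = search(cur, ref, 2, 2, 4, 2, early_exit=True)
print(full[:2] == fast[:2], full[2], fast[2])
```

Both runs return the same cost and motion vector; only the number of pixel operations differs, which is the kind of saving that translates into lower switching activity in hardware.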

Signal Processing | 2011

Review: A tutorial overview on the properties of the discrete cosine transform for encoded image and video processing

Nuno Roma; Leonel Sousa

Discrete trigonometric transforms, such as the discrete cosine transform (DCT) and the discrete sine transform (DST), have been extensively used in signal processing for transform-based coding. The even type-II DCT, used in image and video coding, became especially popular for decorrelating the pixel data and minimizing the spatial redundancy. Although this DCT tends to be the most often used, it belongs to a broader family of transforms composed of eight DCTs and eight DSTs. Most applications require little more than the actual DCT definition and its inverse, but it is widely recognized that implementing more complex operations on transformed data sequences (transcoding) requires more in-depth knowledge of the precise definitions and formal mathematical properties of these transforms. One such relation is the multiplication-convolution property, often required to implement more specific and complex manipulations. Considering that such information is still scattered across several documents and manuscripts, the main purpose of this article is to provide a broad set of practical and useful information in a single, self-contained source, embracing a wide range of definitions and properties related to the DCT and DST families, with a special emphasis on their application to image and video processing.
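As a concrete reference point for "the actual DCT definition and its inverse", the sketch below implements the one-dimensional even type-II DCT and its inverse directly from the textbook formula, using orthonormal scaling. The scaling convention is an assumption for this example (several conventions are in use), and the function names are hypothetical; this direct O(N²) form is meant for clarity, not speed.

```python
import math


def dct2(x):
    """Orthonormal type-II DCT of a sequence:
    X[k] = s(k) * sum_i x[i] * cos(pi * (2i + 1) * k / (2N)),
    with s(0) = sqrt(1/N) and s(k) = sqrt(2/N) for k > 0."""
    n = len(x)
    out = []
    for k in range(n):
        acc = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * acc)
    return out


def idct2(c):
    """Inverse of dct2 (an orthonormal type-III DCT)."""
    n = len(c)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            acc += scale * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(acc)
    return out


# A constant row of pixels has all of its energy in the DC coefficient,
# which is the decorrelation effect exploited by image and video coders.
coeffs = dct2([1.0] * 8)
print([round(v, 6) for v in coeffs])
print([round(v, 6) for v in idct2(coeffs)])
```

With this scaling the transform is orthogonal, so the inverse is simply the transpose of the forward transform and signal energy is preserved.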

EURASIP Journal on Advances in Signal Processing | 2007

Efficient hybrid DCT-domain algorithm for video spatial downscaling

Nuno Roma; Leonel Sousa

A highly efficient video downscaling algorithm for any arbitrary integer scaling factor, performed in a hybrid pixel-transform domain, is proposed. This algorithm receives the encoded DCT coefficient blocks of the input video sequence and efficiently computes the DCT coefficients of the scaled video stream. The involved steps are tailored so that all operations are performed using the block structure of the encoding standard, independently of the adopted scaling factor. As a result, the proposed algorithm significantly reduces the computational cost without compromising the output video quality, by taking into account the scaling mechanism and by restricting the involved operations in order to avoid useless computations. To meet different system needs, an optional combination of the presented algorithm with techniques that discard high-order AC frequency DCT coefficients is also proposed, providing a flexible and often required complexity scalability feature and giving rise to an adaptable tradeoff between the computational cost and the resulting video quality and bit rate. Experimental results have shown that the proposed algorithm provides significant advantages over the usual DCT decimation approaches in terms of the involved computational cost, the output video quality, and the resulting bit rate. Such advantages are even more significant for scaling factors other than integer powers of 2 and may lead to quite high PSNR gains.

IEEE Transactions on Multimedia | 2014

Dynamic Load Balancing for Real-Time Video Encoding on Heterogeneous CPU+GPU Systems

Svetislav Momcilovic; Aleksandar Ilic; Nuno Roma; Leonel Sousa

The high computational demands and overall encoding complexity make real-time processing of high-definition video sequences hard to achieve. In this manuscript, we target an efficient parallelization and RD performance analysis of H.264/AVC inter-loop modules and their collaborative execution in hybrid multi-core CPU and multi-GPU systems. The proposed dynamic load balancing algorithm allows efficient and concurrent video encoding across several heterogeneous devices by relying on realistic run-time performance modeling and module-device execution affinities when distributing the computations. Due to an online adjustment of load balancing decisions, this approach is also self-adaptable to different execution scenarios. Experimental results show the proposed algorithm's ability to achieve real-time encoding for different resolutions of high-definition sequences on various heterogeneous platforms. Speed-up values of up to 2.6 were obtained when compared to the video inter-loop encoding on a single GPU device, and up to 8.5 when compared to a highly optimized multi-core CPU execution. Moreover, the proposed algorithm also provides an automatic tuning of the encoding parameters, in order to meet strict encoding constraints.
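The general idea of balancing load from run-time performance measurements can be sketched as follows: each device receives a share of the work proportional to its measured throughput, and the throughput estimates are refreshed after every frame. This is a deliberately simplified illustration, not the algorithm from the paper; the device names, the unit of work (rows of macroblocks), the smoothing factor `alpha` and both function names are assumptions for this example.

```python
def distribute(total_rows, speeds):
    """Split `total_rows` among devices in proportion to `speeds`
    (rows per second), using largest-remainder rounding."""
    total_speed = sum(speeds.values())
    shares = {d: total_rows * s / total_speed for d, s in speeds.items()}
    alloc = {d: int(share) for d, share in shares.items()}
    leftover = total_rows - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda d: shares[d] - alloc[d], reverse=True)
    for d in by_remainder[:leftover]:
        alloc[d] += 1
    return alloc


def update_speeds(speeds, alloc, elapsed, alpha=0.5):
    """Blend the previous speed estimate with the throughput measured on the
    last frame (exponential smoothing with weight `alpha`)."""
    updated = dict(speeds)
    for d, rows in alloc.items():
        if rows > 0 and elapsed[d] > 0:
            measured = rows / elapsed[d]
            updated[d] = (1 - alpha) * speeds[d] + alpha * measured
    return updated


# Hypothetical scenario: start with no knowledge (equal speeds), then observe
# that the GPU processed its share three times faster than the CPU.
speeds = {"cpu": 1.0, "gpu": 1.0}
alloc = distribute(60, speeds)                       # {'cpu': 30, 'gpu': 30}
speeds = update_speeds(speeds, alloc, {"cpu": 3.0, "gpu": 1.0}, alpha=1.0)
print(distribute(60, speeds))                        # {'cpu': 15, 'gpu': 45}
```

Because the estimates are refreshed every frame, the split follows changes in device performance, which is the self-adaptation property the abstract describes at a much more detailed, per-module level.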

EURASIP Journal on Embedded Systems | 2007

Adaptive motion estimation processor for autonomous video devices

Tiago Dias; Svetislav Momcilovic; Nuno Roma; Leonel Sousa

Motion estimation is the most demanding operation of a video encoder, corresponding to at least 80% of the overall computational cost. As a consequence, with the proliferation of autonomous and portable handheld devices that support digital video coding, data-adaptive motion estimation algorithms have been required to dynamically configure the search pattern, not only to avoid unnecessary computations and memory accesses but also to save energy. This paper proposes an application-specific instruction set processor (ASIP) to implement data-adaptive motion estimation algorithms, characterized by a specialized datapath and a minimal, optimized instruction set. Due to its low-power nature, this architecture is highly suitable for developing motion estimators for portable, mobile, and battery-supplied devices. Based on the proposed architecture and the considered adaptive algorithms, several motion estimators were synthesized both for a Virtex-II Pro XC2VP30 FPGA from Xilinx, integrated within an ML310 development platform, and using a StdCell library based on a 0.18 μm CMOS process. Experimental results show that the proposed architecture is able to estimate motion vectors in real time for QCIF and CIF video sequences with very low power consumption. Moreover, it is also able to adapt its operation to the available energy level at run time. By adjusting the search pattern and setting up a more convenient operating frequency, it can vary the power consumption between 1.6 mW and 15 mW.

Network and Operating System Support for Digital Audio and Video | 2010

p264: open platform for designing parallel H.264/AVC video encoders on multi-core systems

António Rodrigues; Nuno Roma; Leonel Sousa

A highly modular and configurable platform for designing parallel H.264 video encoders on multi-core processors is presented. Starting from the H.264/AVC reference software, preliminary optimizations were conducted and new data structures were developed, in order to support the encoder's parallelization and to make the platform flexible, user-configurable and highly scalable with respect to the number of cores used in the target implementation. After a careful assessment using different instantiations of the platform, the experimental results have shown that significant, close to linear speedups of the achieved frame rate can be obtained by simultaneously exploiting the several different parallelization models that are made available by this platform.

Field-Programmable Logic and Applications | 2003

Customisable Core-Based Architectures for Real-Time Motion Estimation on FPGAs

Nuno Roma; Tiago Dias; Leonel Sousa

This paper proposes new core-based architectures for motion estimation that are customisable for different coding parameters and hardware resources. These new cores are derived from an efficient and fully parameterisable 2-D single-array systolic structure for full-search block-matching motion estimation and inherit its configurability with respect to the macroblock dimension, the search area and the parallelism level. The proposed architectures require significantly fewer hardware resources, by reducing the spatial and pixel resolutions rather than restricting the set of considered candidate motion vectors. Low-cost and low-power regular architectures suitable for field-programmable logic implementation are obtained without compromising the quality of the coded video sequences. Experimental results show that, despite the significant complexity of motion estimation processors, it is still possible to implement fast and low-cost versions of the original core-based architecture using general-purpose FPGA devices.

International Conference on Acoustics, Speech, and Signal Processing | 2014

Cooperative CPU+GPU deblocking filter parallelization for high performance HEVC video codecs

Diego F. de Souza; Nuno Roma; Leonel Sousa

Heterogeneous platforms integrating several CPU cores and GPU accelerators have become established in several application domains, from desktop and server to mobile. To take full advantage of such platforms, video encoders/decoders have to exploit a broader design space, by cooperatively executing on all the available CPU and GPU cores. To attain this objective, three novel contributions that aim to exploit the maximum parallelism level in an HEVC deblocking filter are presented: i) a highly optimized CPU parallel implementation, which outperforms the current state of the art; ii) the first known GPU implementation of the HEVC deblocking filter; and iii) a hybrid and load-balanced CPU+GPU implementation, where all the available resources cooperatively execute, in order to maximize the attained performance. The experimental results demonstrate the ability to achieve processing times as low as 0.8 ms and 0.5 ms to filter 1080p I-type and B-type frames, respectively, corresponding to speedup factors as high as 17 and 9.

Conference on Ph.D. Research in Microelectronics and Electronics | 2007

An ASIP approach for adaptive AVC Motion Estimation

Svetislav Momcilovic; Nuno Roma; Leonel Sousa

A new algorithm and an adapted hardware architecture of an ASIP are proposed in this paper. When compared with other hardware ASIP implementations, this architecture significantly speeds up the motion estimation procedure and substantially decreases the memory requirements. Moreover, it requires significantly fewer memory accesses, while maintaining its coding quality in terms of both the obtained bit rate and PSNR. As a consequence, the proposed algorithm proves to be especially suitable for implementation in most embedded systems with restricted computational and power resources, which are often adopted by portable and battery-supplied devices.
