Publication


Featured research published by Gregorio Bernabé.


Parallel, Distributed and Network-Based Processing | 2009

A Parallel Implementation of the 2D Wavelet Transform Using CUDA

Joaquín Franco; Gregorio Bernabé; Juan C. Fernandez; Manuel E. Acacio

One multicore platform is currently attracting enormous attention due to its tremendous potential in terms of sustained performance: the NVIDIA Tesla boards. These cards, intended for general-purpose computing on graphics processing units (GPGPU), are used as data-parallel computing devices. They are based on the Compute Unified Device Architecture (CUDA), which is common to the latest NVIDIA GPUs. The bottom line is a multicore platform that provides an enormous potential performance benefit, driven by a non-traditional programming model. In this paper we provide some insight into the peculiarities of CUDA for scientific computing by means of a specific example. In particular, we show that the parallelization of the two-dimensional fast wavelet transform for the NVIDIA Tesla C870 achieves a speedup of 20.8 for an image size of 8192x8192, compared with the fastest host-only implementation using OpenMP, including the data transfers between main memory and device memory.
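
The abstract gives no source listing; purely as a rough illustration of the kind of data-parallel CUDA kernel such a parallelization involves, here is a minimal sketch of one horizontal Haar-like filtering pass, with one thread per output coefficient pair. The kernel name, the simple 2-tap filter, and the launch configuration are assumptions for illustration, not the authors' implementation.

    #include <cstdio>
    #include <cuda_runtime.h>

    // One horizontal decomposition step of a Haar-like wavelet transform:
    // each thread reads a pair of adjacent pixels in a row and writes one
    // approximation and one detail coefficient.
    __global__ void haar_rows(const float* in, float* out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // pair index along the row
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        int half = width / 2;
        if (x >= half || y >= height) return;

        float a = in[y * width + 2 * x];
        float b = in[y * width + 2 * x + 1];
        out[y * width + x]        = (a + b) * 0.70710678f;   // low-pass (approximation)
        out[y * width + half + x] = (a - b) * 0.70710678f;   // high-pass (detail)
    }

    int main()
    {
        const int W = 8192, H = 8192;                 // image size used in the paper
        size_t bytes = size_t(W) * H * sizeof(float);
        float *d_in, *d_out;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);
        cudaMemset(d_in, 0, bytes);                   // placeholder input image

        dim3 block(16, 16);
        dim3 grid((W / 2 + block.x - 1) / block.x, (H + block.y - 1) / block.y);
        haar_rows<<<grid, block>>>(d_in, d_out, W, H);
        cudaDeviceSynchronize();
        printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }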


IEEE International Conference on Information Technology and Applications in Biomedicine | 2000

A new lossy 3-D wavelet transform for high-quality compression of medical video

Gregorio Bernabé; José González; José M. García; José Duato

The authors present a new compression scheme based on applying the 3D fast wavelet transform to code medical video. This type of video has special features, such as its representation in gray scale, the small amount of interframe variation, and the quality requirements of the reconstructed images. These characteristics, as well as the social impact of the desired applications, call for the design and implementation of coding schemes especially oriented to exploit them. We analyze different parameters of the codification process, such as the use of different wavelet functions, the number of decomposition steps, the way the thresholds are chosen, and the methods selected for the quantization and entropy encoder. Our coder achieves a good trade-off between compression ratio and quality of the reconstructed video, and its results are better than those of MPEG-2 without the complexity of motion compensation.
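
As a minimal sketch of one of the codec stages the abstract mentions (thresholding followed by quantization ahead of the entropy encoder), the following host-side fragment hard-thresholds wavelet detail coefficients and quantizes the survivors uniformly. The function name, threshold, and step values are hypothetical; the paper's actual threshold-selection strategies are among the parameters it analyzes.

    #include <cstdio>
    #include <cmath>
    #include <vector>

    // Hypothetical illustration of one codec stage: hard-threshold the wavelet
    // detail coefficients and quantize the survivors uniformly before entropy
    // coding. Names and parameter values are assumptions, not the authors' code.
    static void threshold_and_quantize(const std::vector<float>& coeffs,
                                       std::vector<int>& symbols,
                                       float threshold, float step)
    {
        symbols.resize(coeffs.size());
        for (size_t i = 0; i < coeffs.size(); ++i) {
            float c = std::fabs(coeffs[i]) < threshold ? 0.0f : coeffs[i];  // hard threshold
            symbols[i] = int(std::lround(c / step));                        // uniform quantizer
        }
    }

    int main()
    {
        std::vector<float> coeffs = { 0.02f, -3.1f, 0.7f, 12.4f, -0.05f };
        std::vector<int> symbols;
        threshold_and_quantize(coeffs, symbols, /*threshold=*/0.5f, /*step=*/0.25f);
        for (int s : symbols) printf("%d ", s);   // zeros feed the run-length/entropy stage
        printf("\n");
        return 0;
    }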


Journal of Real-Time Image Processing | 2012

The 2D wavelet transform on emerging architectures: GPUs and multicores

Joaquín Franco; Gregorio Bernabé; Juan C. Fernandez; Manuel Ujaldon

Because of the computational power of today's GPUs, they are increasingly being harnessed to assist CPUs in high-performance computing. In addition, a growing number of today's state-of-the-art supercomputers include commodity GPUs to deliver unprecedented levels of performance in terms of raw GFLOPS and GFLOPS/cost. In this work, we present a GPU implementation of an image processing application of growing popularity: the 2D fast wavelet transform (2D-FWT). Based on a pair of quadrature mirror filters, a complete set of application-specific optimizations is developed from a CUDA perspective to achieve outstanding speedups over a highly optimized version of the 2D-FWT run on the CPU. An alternative approach based on the lifting scheme is described in Franco et al. (Acceleration of the 2D wavelet transform for CUDA-enabled devices, 2010). We then investigate hardware improvements such as multicore CPUs and exploit their thread-level parallelism using the OpenMP API and pthreads. Overall, the GPU exhibits better scalability and parallel performance on large-scale images, making it a solid alternative for computing the 2D-FWT versus those thread-level methods run on emerging multicore architectures.
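
For the multicore side of the comparison, the following is a minimal OpenMP sketch of a horizontal filtering pass in which each thread processes a disjoint set of image rows with a low-pass/high-pass quadrature mirror filter pair. The filter taps, names, and scheduling are illustrative assumptions rather than the authors' code.

    #include <cstdio>
    #include <vector>
    #include <omp.h>

    // Each OpenMP thread filters a disjoint set of image rows, applying a
    // low-pass/high-pass quadrature mirror filter pair per row. Filter taps
    // and names are illustrative assumptions.
    static void fwt_rows_omp(const std::vector<float>& in, std::vector<float>& out,
                             int width, int height)
    {
        const float lo[2] = { 0.70710678f,  0.70710678f };   // low-pass taps
        const float hi[2] = { 0.70710678f, -0.70710678f };   // high-pass taps
        int half = width / 2;

        #pragma omp parallel for schedule(static)
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < half; ++x) {
                float a = in[y * width + 2 * x];
                float b = in[y * width + 2 * x + 1];
                out[y * width + x]        = lo[0] * a + lo[1] * b;
                out[y * width + half + x] = hi[0] * a + hi[1] * b;
            }
        }
    }

    int main()
    {
        const int W = 1024, H = 1024;
        std::vector<float> in(W * H, 1.0f), out(W * H, 0.0f);
        fwt_rows_omp(in, out, W, H);
        printf("threads available: %d, out[0] = %f\n", omp_get_max_threads(), out[0]);
        return 0;
    }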


Signal Processing Systems | 2005

Reducing 3D Fast Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions

Gregorio Bernabé; José M. García; Jose Gonzalez

The video compression algorithms based on the 3D wavelet transform obtain excellent compression rates at the expense of huge memory requirements, which drastically affect the execution time of such applications. Our objective is to enable real-time video compression based on the 3D fast wavelet transform. We show the hardware and software interaction for this multimedia application on a general-purpose processor. First, we mitigate the memory problem by exploiting the memory hierarchy of the processor using several techniques; for instance, we implement and evaluate the blocking technique. We present two blocking approaches in particular, cube and rectangular, which differ in the way the original working set is divided. We also put forward the reuse of previous computations in order to decrease the number of memory accesses and floating point operations. Afterwards, we present several optimizations that cannot be applied by the compiler due to the characteristics of the algorithm. On the one hand, the Streaming SIMD Extensions (SSE) are used for some of the dimensions of the sequence (y and time) to reduce the number of floating point instructions, exploiting data-level parallelism. Then, we apply loop unrolling and data prefetching to specific parts of the code. On the other hand, the algorithm is vectorized by columns, allowing the use of SIMD instructions for the y dimension. Results show speedups of 5x in execution time over a version compiled with the maximum optimizations of the Intel C/C++ compiler, while maintaining the compression ratio and the video quality (PSNR) of the original encoder based on the 3D wavelet transform. Our experiments also show that allowing the compiler to perform some of these optimizations (i.e. automatic code vectorization) causes a performance slowdown, demonstrating the effectiveness of our optimizations.
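
As a minimal sketch of the column-vectorization idea (filtering several adjacent columns of the y dimension with one SIMD instruction), the fragment below processes four columns at a time with 128-bit SSE intrinsics. The 2-tap filter, sizes, and names are assumptions for illustration, not the paper's kernels.

    #include <cstdio>
    #include <xmmintrin.h>   // SSE intrinsics

    // Process four adjacent columns at once along the y dimension with 128-bit
    // SSE registers, so a single instruction filters four pixels. The 2-tap
    // filter and names are illustrative assumptions.
    static void filter_columns_sse(const float* in, float* out, int width, int height)
    {
        __m128 c0 = _mm_set1_ps(0.70710678f);
        __m128 c1 = _mm_set1_ps(0.70710678f);
        for (int y = 0; y + 1 < height; y += 2) {
            for (int x = 0; x + 4 <= width; x += 4) {
                __m128 a = _mm_loadu_ps(&in[y * width + x]);         // row y, 4 columns
                __m128 b = _mm_loadu_ps(&in[(y + 1) * width + x]);   // row y+1, 4 columns
                __m128 low = _mm_add_ps(_mm_mul_ps(c0, a), _mm_mul_ps(c1, b));
                _mm_storeu_ps(&out[(y / 2) * width + x], low);       // 4 low-pass outputs
            }
        }
    }

    int main()
    {
        const int W = 64, H = 64;
        static float in[W * H], out[W * H / 2];
        for (int i = 0; i < W * H; ++i) in[i] = 1.0f;
        filter_columns_sse(in, out, W, H);
        printf("out[0] = %f\n", out[0]);   // 2 * 0.7071 = 1.4142...
        return 0;
    }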


Journal of Systems and Software | 2009

A lossy 3D wavelet transform for high-quality compression of medical video

Gregorio Bernabé; José M. García; Jose Gonzalez

In this paper, we present a lossy compression scheme based on the application of the 3D fast wavelet transform to code medical video. This type of video has special features, such as its representation in gray scale, its very few interframe variations, and the quality requirements of the reconstructed images. These characteristics, as well as the social impact of the desired applications, demand the design and implementation of coding schemes especially oriented to exploit them. We analyze different parameters of the codification process, such as the use of different wavelet functions, the number of decomposition steps, the way the thresholds are chosen, and the methods selected for the quantization and entropy encoder. In order to enhance our original encoder, we propose several improvements in the entropy encoder: 3D-conscious run-length coding, hexadecimal coding, and the application of arithmetic coding instead of Huffman. Our coder achieves a good trade-off between compression ratio and quality of the reconstructed video. We have also compared our scheme with MPEG-2 and EZW, obtaining compression ratios up to 119% and 46% better, respectively, for the same PSNR.
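
A minimal sketch of a run-length stage of the kind the paper places before the entropy coder is shown below: runs of equal (typically zero) quantized coefficients are collapsed into (value, run) pairs. The traversal order that makes the scheme "3D-conscious" is abstracted away, and the names are assumptions.

    #include <cstdio>
    #include <utility>
    #include <vector>

    // Collapse runs of equal quantized coefficients into (value, run) pairs.
    // The traversal order of the coefficient volume is abstracted away here;
    // names are assumptions.
    static std::vector<std::pair<int, int>> run_length_encode(const std::vector<int>& q)
    {
        std::vector<std::pair<int, int>> out;
        for (size_t i = 0; i < q.size(); ) {
            size_t j = i;
            while (j < q.size() && q[j] == q[i]) ++j;   // extent of the current run
            out.push_back({ q[i], int(j - i) });
            i = j;
        }
        return out;
    }

    int main()
    {
        std::vector<int> quantized = { 0, 0, 0, 5, 0, 0, -2, 0, 0, 0, 0 };
        for (const auto& p : run_length_encode(quantized))
            printf("(value=%d, run=%d) ", p.first, p.second);
        printf("\n");   // these pairs would then feed the arithmetic/Huffman coder
        return 0;
    }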


International Conference on Conceptual Structures | 2013

Optimization Techniques for 3D-FWT on Systems with Manycore GPUs and Multicore CPUs☆

Gregorio Bernabé; Javier Cuenca; Domingo Giménez

Programming manycore GPUs or multicore CPUs for high performance requires a careful balance of several hardware-specific factors, which is typically achieved by expert users through trial and error. To reduce the amount of hand-made optimization time required to achieve optimal performance, general guidelines can be followed or different metrics can be considered to predict performance, but ultimately a trial and error process still prevails. In this paper, we present an optimization method to run the 3D fast wavelet transform (3D-FWT) on hybrid systems. The optimization engine detects the different platforms found on a system and executes the appropriate kernel, implemented in both CUDA and OpenCL for GPUs and with pthreads for the CPU. Moreover, the proposed method automatically selects parameters such as the block size, the work-group size or the number of threads in order to reduce the execution time, obtaining optimal performance in many cases. Finally, the optimization engine sends proportionally different parts of a video sequence to run concurrently on all platforms of the system. With respect to a normal user, who sends all frames to a GPU running a version of the 3D-FWT implemented in CUDA or OpenCL, the method obtains average speedups of up to 7.93.
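
A minimal sketch of the platform-detection step of such an optimization engine is shown below: it enumerates the CUDA devices present and derives a per-device block size from the device properties, falling back to a CPU path when no GPU is found. The heuristic and names are assumptions for illustration; the paper's engine additionally covers OpenCL and pthreads kernels and splits the video frames across all detected platforms.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Enumerate the CUDA devices present and derive a per-device block size
    // from the device properties. The heuristic and names are hypothetical;
    // the engine in the paper also handles OpenCL and pthreads kernels.
    int main()
    {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            printf("No CUDA device found: fall back to the CPU (pthreads) kernel.\n");
            return 0;
        }
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            // Hypothetical heuristic: keep several resident blocks per multiprocessor.
            int block = (prop.maxThreadsPerBlock >= 1024) ? 256 : 128;
            printf("GPU %d (%s): %d SMs, chosen block size = %d\n",
                   d, prop.name, prop.multiProcessorCount, block);
        }
        return 0;
    }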


Parallel, Distributed and Network-Based Processing | 2003

Reducing 3D wavelet transform execution time through the Streaming SIMD Extensions

Gregorio Bernabé; José M. García; Jose Gonzalez

This paper focuses on reducing the execution time of video compression algorithms based on the 3D wavelet transform. We present several optimizations that could not be applied by the compiler due to the characteristics of the algorithm. First, we use the Streaming SIMD Extensions (SSE) for some of the dimensions of the sequence (y and time) in order to reduce the number of floating point instructions, exploiting data-level parallelism. Then, we apply loop unrolling and data prefetching to critical parts of the code, and finally the algorithm is vectorized by columns, allowing the use of SIMD instructions for the y dimension. Results show improvements of up to 1.54x over a version compiled with the maximum optimizations of the Intel C/C++ compiler. Our experiments also show that allowing the compiler to perform some of these optimizations (i.e. automatic code vectorization) causes a performance slowdown, which demonstrates the effectiveness of our optimizations.
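
As a minimal sketch of the loop-unrolling and data-prefetching optimizations mentioned above, the fragment below unrolls a simple filtering loop four ways and issues a software prefetch a fixed distance ahead. The unroll factor, prefetch distance, and names are illustrative assumptions, not the paper's values.

    #include <cstdio>
    #include <xmmintrin.h>   // _mm_prefetch

    // A simple filtering loop, unrolled four ways, with a software prefetch a
    // fixed distance ahead to hide memory latency. Unroll factor, prefetch
    // distance and names are illustrative assumptions.
    static void scale_unrolled(const float* in, float* out, int n, float k)
    {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            int ahead = (i + 64 < n) ? i + 64 : i;                    // stay in bounds
            _mm_prefetch((const char*)&in[ahead], _MM_HINT_T0);
            out[i]     = k * in[i];                                   // 4-way unrolled body
            out[i + 1] = k * in[i + 1];
            out[i + 2] = k * in[i + 2];
            out[i + 3] = k * in[i + 3];
        }
        for (; i < n; ++i) out[i] = k * in[i];                        // remainder loop
    }

    int main()
    {
        const int N = 1 << 16;
        static float in[N], out[N];
        for (int i = 0; i < N; ++i) in[i] = float(i);
        scale_unrolled(in, out, N, 0.5f);
        printf("out[10] = %f\n", out[10]);   // 5.0
        return 0;
    }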


Proceedings of the 28th Euromicro Conference | 2002

Memory conscious 3D wavelet transform

Gregorio Bernabé; José González; José M. García; José Duato

The video compression algorithms based on the 3D wavelet transform obtain excellent compression rates at the expense of huge memory requirements, which drastically affect the execution time of such applications. The goal of this work is to mitigate the memory problem by exploiting the memory hierarchy of the processor through blocking. In particular, we present two blocking approaches, cube and rectangular, which differ in the way the original working set is divided. We also propose the reuse of previous computations in order to decrease the number of memory accesses and floating point operations. Results show that the rectangular overlapped approach with computation reuse obtains the best results in terms of execution time, with a speedup of 2.42 over the non-blocking, non-overlapped wavelet transform, while maintaining the compression ratio and the video quality (PSNR) of the original encoder based on the 3D wavelet transform.
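
A minimal sketch of the blocking idea is shown below: rather than streaming entire frames, the loop nest visits the data in small tiles that fit in the cache. The tile dimensions, names, and the trivial per-pixel work are assumptions for illustration; the paper's cube and rectangular approaches differ in how the 3D working set is cut.

    #include <cstdio>
    #include <vector>

    // Visit the frame in small tiles that fit in the cache instead of
    // streaming whole rows. Tile dimensions and names are assumptions; the
    // per-pixel work stands in for the wavelet filtering.
    static void filter_blocked(const std::vector<float>& in, std::vector<float>& out,
                               int width, int height)
    {
        const int TX = 64, TY = 64;   // tile chosen to fit in L1/L2
        for (int by = 0; by < height; by += TY)
            for (int bx = 0; bx < width; bx += TX)
                for (int y = by; y < by + TY && y < height; ++y)
                    for (int x = bx; x < bx + TX && x < width; ++x)
                        out[y * width + x] = 0.5f * in[y * width + x];
    }

    int main()
    {
        const int W = 512, H = 512;
        std::vector<float> in(W * H, 2.0f), out(W * H, 0.0f);
        filter_blocked(in, out, W, H);
        printf("out[0] = %f\n", out[0]);   // 1.0
        return 0;
    }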


Latin American Symposium on Circuits and Systems | 2012

CUDA and OpenCL implementations of 3D Fast Wavelet Transform

Gregorio Bernabé; Ginés D. Guerrero; Juan C. Fernandez

We present in this paper several implementations of the 3D fast wavelet transform (3D-FWT) in CUDA and OpenCL running on the new Fermi Tesla architecture. We evaluate these proposals and compare them with optimized versions executed on multicore CPUs and on the NVIDIA Tesla C870. The CUDA version on the Fermi architecture obtains the best results, improving on the CPU execution times by factors ranging from 5.3× to 7.4× for different image sizes, and by up to 81× when communications are neglected. Meanwhile, OpenCL obtains solid gains, ranging from 2× factors on small frame sizes to 3× factors on larger ones.
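
The distinction between the speedups with and without communications can be illustrated by timing the kernel alone versus the transfers plus the kernel with CUDA events; the sketch below shows that methodology with a trivial stand-in kernel. Sizes and names are assumptions, not the paper's benchmark.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Trivial stand-in kernel; the point of the sketch is the timing methodology.
    __global__ void dummy_fwt(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 0.5f;
    }

    int main()
    {
        const int N = 1 << 20;
        size_t bytes = N * sizeof(float);
        float* h = (float*)malloc(bytes);
        for (int i = 0; i < N; ++i) h[i] = 1.0f;
        float* d;
        cudaMalloc(&d, bytes);

        cudaEvent_t t0, t1, t2, t3;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventCreate(&t2); cudaEventCreate(&t3);

        cudaEventRecord(t0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // host-to-device transfer
        cudaEventRecord(t1);
        dummy_fwt<<<(N + 255) / 256, 256>>>(d, N);
        cudaEventRecord(t2);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // device-to-host transfer
        cudaEventRecord(t3);
        cudaEventSynchronize(t3);

        float kernel_ms = 0.0f, total_ms = 0.0f;
        cudaEventElapsedTime(&kernel_ms, t1, t2);          // kernel only
        cudaEventElapsedTime(&total_ms, t0, t3);           // transfers + kernel
        printf("kernel: %.3f ms, with transfers: %.3f ms\n", kernel_ms, total_ms);

        cudaFree(d);
        free(h);
        return 0;
    }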


Parallel Computing | 2007

An efficient implementation of a 3D wavelet transform based encoder on hyper-threading technology

Gregorio Bernabé; Ricardo Fernandez; José M. García; Manuel E. Acacio; Jose Gonzalez

Medical video compression algorithms based on the 3D wavelet transform obtain both excellent compression rates and very good quality, at the expense of a higher execution time. The goal of this work is to improve the execution time of our 3D wavelet transform encoder. We examine and exploit the characteristics and advantages of a hyper-threading processor. Intel Hyper-Threading Technology (HT) is a technique based on simultaneous multi-threading (SMT) that allows several independent threads to issue instructions to multiple functional units in a single cycle. In particular, we present two approaches, data-domain and functional, which differ in the way the decomposition of the application is performed. The first approach is based on data division, where the same task is performed simultaneously by each thread on an independent part of the data. In the second approach, the processing is divided into different tasks that are executed concurrently on the same data set. Based on the latter approach, we present three proposals that differ in the way the tasks of the application are divided between the threads. Results show speedups of up to 7% and 34% for the data-domain and functional decomposition, respectively, over a version executed without hyper-threading technology. Finally, we design several implementations of the best method with pthreads and OpenMP using functional decomposition, and compare them in terms of execution speed, ease of implementation, and maintainability of the resulting code.
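
A minimal sketch of the data-domain decomposition is shown below: two pthreads (for example, the two hardware contexts of a hyper-threaded core) run the same filtering task on disjoint halves of a frame. Names and the trivial per-row work are assumptions; a functional decomposition would instead assign different pipeline stages to each thread.

    #include <cstdio>
    #include <pthread.h>
    #include <vector>

    // Two threads run the same filtering task on disjoint halves of the rows
    // of a frame (data-domain decomposition). Names and the trivial per-row
    // work are illustrative assumptions.
    struct Slice { float* data; int width; int row_begin; int row_end; };

    static void* filter_rows(void* arg)
    {
        Slice* s = (Slice*)arg;
        for (int y = s->row_begin; y < s->row_end; ++y)
            for (int x = 0; x < s->width; ++x)
                s->data[y * s->width + x] *= 0.5f;   // stand-in for the wavelet filter
        return nullptr;
    }

    int main()
    {
        const int W = 256, H = 256;
        std::vector<float> frame(W * H, 2.0f);

        Slice top    = { frame.data(), W, 0,     H / 2 };
        Slice bottom = { frame.data(), W, H / 2, H     };

        pthread_t t1, t2;
        pthread_create(&t1, nullptr, filter_rows, &top);
        pthread_create(&t2, nullptr, filter_rows, &bottom);
        pthread_join(t1, nullptr);
        pthread_join(t2, nullptr);

        printf("frame[0] = %f\n", frame[0]);   // 1.0
        return 0;
    }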

Collaboration


Dive into Gregorio Bernabé's collaboration.

Top Co-Authors

Juan C. Fernandez

Los Alamos National Laboratory


José Duato

Polytechnic University of Valencia


José González

Polytechnic University of Catalonia
