Jacobo Lobeiras | Researchain

Publication

Featured researches published by Jacobo Lobeiras.

Concurrency and Computation: Practice and Experience | 2013

A multi‐GPU shallow‐water simulation with transport of contaminants

Moisés Viñas; Jacobo Lobeiras; Basilio B. Fraguela; Manuel Arenaz; Margarita Amor; José A. García; Manuel J. Castro; Ramón Doallo

This work presents cost‐effective multi‐graphics processing unit (GPU) parallel implementations of a finite‐volume numerical scheme for solving pollutant transport problems in bidimensional domains. The fluid is modeled by 2D shallow‐water equations, whereas the transport of pollutant is modeled by a transport equation. The 2D domain is discretized using a first‐order Roe finite‐volume scheme. Specifically, this paper presents multi‐GPU implementations of both a solution that exploits recomputation on the GPU and an optimized solution that is based on a ghost cell decoupling approach. Our multi‐GPU implementations have been optimized using nonblocking communications, overlapping of communications and computations, and the application of ghost cell expansion to minimize communications. The fastest one reached a speedup of 78× using four GPUs on an InfiniBand network with respect to a parallel execution on a multicore CPU with six cores and two‐way hyperthreading per core. Such performance, measured using a realistic problem, enabled the calculation of solutions not only in real time but also orders of magnitude faster than the simulated time.

Parallel, Distributed and Network-Based Processing | 2011

FFT Implementation on a Streaming Architecture

Jacobo Lobeiras; Margarita Amor; Ramón Doallo

The Fast Fourier Transform (FFT) is a useful tool for applications requiring signal analysis and processing. However, its high computational cost requires efficient implementations, especially in real-time applications, where response time is a decisive factor. Thus, the computational cost and wide application range of the FFT have motivated research into efficient implementations. Recently, GPU computing has become more and more relevant because of its high computational power and low cost, but due to its novelty there is some lack of tools and libraries. In this paper we propose an efficient implementation of the FFT with AMD's Brook+ language. We describe several features and optimization strategies, analyzing the scalability and performance compared to other well-known existing solutions.

IEEE International Conference on High Performance Computing Data and Analytics | 2013

Parallelization of shallow water simulations on current multi-threaded systems

Jacobo Lobeiras; Moisés Viñas; Margarita Amor; Basilio B. Fraguela; Manuel Arenaz; José Antonio Orosa García; Manuel J. Castro

In this work, several parallel implementations of a numerical model of pollutant transport on a shallow water system are presented. These parallel implementations are developed in two phases. First, the sequential code is rewritten to exploit the stream programming model. Second, the streamed code is targeted at current multi-threaded systems, in particular, multi-core CPUs and modern GPUs. The performance is evaluated on a multi-core CPU using OpenMP, and on a GPU using the streaming-oriented programming language Brook+, as well as the standard language for heterogeneous systems, OpenCL.

The Journal of Supercomputing | 2013

Influence of memory access patterns to small-scale FFT performance

Jacobo Lobeiras; Margarita Amor; Ramón Doallo

Modern GPUs (Graphics Processing Units) offer very high computing power at relatively low cost. To take advantage of their computing resources and develop efficient implementations, it is essential to have some knowledge of the architecture and memory hierarchy. In this paper, we use the FFT (Fast Fourier Transform) as a benchmark tool to analyze different aspects of GPU architectures, like the influence of the memory access pattern or the impact of the register pressure. The FFT is a good tool for performance analysis because it is used in many digital signal processing applications and has a good balance between computational cost and memory bandwidth requirements.

International Conference on High Performance Computing and Simulation | 2011

Simulation of pollutant transport in shallow water on a CUDA architecture

Moisés Viñas; Jacobo Lobeiras; Basilio B. Fraguela; Manuel Arenaz; Margarita Amor; Ramón Doallo

Shallow water simulation enables the study of problems such as dam break, river, canal and coastal hydrodynamics, as well as the transport of inert substances, such as pollutants, in a fluid. This article describes an efficient and cost-effective CUDA implementation on a GPU of a finite volume numerical scheme for solving pollutant transport problems in bidimensional domains. The fluid is modeled by 2D shallow water equations, while the transport of pollutant is modeled by a transport equation. The 2D domain is discretized using a first order finite volume scheme. The evaluation using a realistic problem shows that the implementation makes good use of the computational resources, being very efficient for real-life complex simulations. The speedup reached allowed us to complete a simulation in 2 hours in contrast with the 239 hours (10 days) required by a sequential execution on a standard CPU.

IEEE Transactions on Parallel and Distributed Systems | 2016

Designing Efficient Index-Digit Algorithms for CUDA GPU Architectures

Jacobo Lobeiras; Margarita Amor; Ramón Doallo

Modern graphics processing units (GPUs) offer very high computing power at relatively low cost. Nevertheless, designing efficient algorithms for GPUs normally requires additional time and effort, even for experienced programmers. In this work we present a tuning methodology that allows the design for CUDA-enabled GPU architectures of index-digit algorithms, that is, algorithms where the data movement can be described as permutations of the digits comprising the indices of the data elements. This methodology, based on two stages identified as GPU resource analysis and operators string manipulation, is applied to FFT and tridiagonal systems solver algorithms, analyzing the performance features and the most adequate solutions. The resulting implementation is compact and outperforms other well-known and commonly used state-of-the-art libraries, with an improvement of up to 19.2 percent over NVIDIA's complex CUFFT, and more than 3000 percent over NVIDIA's CUDPP for real data tridiagonal systems.

International Journal of Parallel Programming | 2015

BPLG: A Tuned Butterfly Processing Library for GPU Architectures

Jacobo Lobeiras; Margarita Amor; Ramón Doallo

In order to increase the efficiency of existing software, many works are incorporating GPU processing. However, despite the current advances in GPU languages and tools, taking advantage of their parallel architecture is still far more complex than programming standard multi-core CPUs. In this work, we present a library based on a set of building blocks that make it possible to design well-known algorithms with little effort. More specifically, we implement butterfly algorithms with this library, that is, a set of orthogonal signal transforms and an algorithm to solve tridiagonal equation systems. Thanks to the parametrization of the building blocks, the library can be easily tuned depending on the desired GPU architecture. This generic approach can be used to easily design these GPU algorithms while obtaining competitive performance on two recent NVIDIA GPU architectures, which is especially interesting from the productivity point of view.

Proceedings of the Second Workshop on Accelerator Programming using Directives | 2015

Experiences in extending Parallware to support OpenACC

Jacobo Lobeiras; Manuel Arenaz; Oscar R. Hernandez

Porting scientific codes to accelerator-based computers using OpenACC and OpenMP is an important topic for the HPC community. Programmability, performance portability and developer productivity are key issues for the widespread use of these systems. In the scope of general-purpose parallel computing, Parallware is a new commercial OpenMP-enabling source-to-source compiler that automatically adds OpenMP capabilities in scientific programs. Thus, extending Parallware with OpenACC or OpenMP 4.x support would contribute to improving programmability and developer productivity. In contrast, the performance portability of such an approach needs to be demonstrated in practice. This paper presents a preliminary study to extend Parallware with OpenACC support for GPU devices. A simple benchmark suite has been designed to mimic important features and computational patterns of real scientific applications. Handcoded OpenACC versions are compared to OpenMP versions automatically generated by Parallware. Performance is evaluated with the PGI OpenACC compiler on systems accelerated with NVIDIA GPUs.

European Conference on Parallel Processing | 2010

Streaming-oriented parallelization of domain-independent irregular kernels

Jacobo Lobeiras; Margarita Amor; Manuel Arenaz; Basilio B. Fraguela

Current parallelizing and optimizing compilers use techniques for the recognition of computational kernels to improve the quality of the target code. Domain-independent kernels characterize the computations carried out in an application, independently of the implementation details of a given programming language. This paper presents streaming-oriented parallelizing transformations for irregular assignment and irregular reduction kernels. The advantage of these code transformations is that they enable the parallelization of many algorithms with little effort and without in-depth knowledge of the particular application. The experimental results show the efficiency on current GPUs, although the main goal of the proposed techniques is not performance, but to assist the programmer in the parallelization for better productivity.

Archive | 2012