Publication


Featured research published by Pedro Valero-Lara.


International Conference on Conceptual Structures | 2017

The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems

Jack J. Dongarra; Sven Hammarling; Nicholas J. Higham; Samuel D. Relton; Pedro Valero-Lara; Mawussi Zounon

A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on general matrix-matrix multiplication (GEMM), to explore their effect on performance. In particular, we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications, our code outperforms both MKL and cuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and NVIDIA Kepler GPU architectures, for large numbers of double precision GEMM operations using matrices of size 2 × 2 to 20 × 20.
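To make the interleaved layout discussed above concrete, the CUDA sketch below assumes a storage scheme in which element (i, j) of matrix k sits at offset (i * n + j) * batch + k, so consecutive threads (consecutive batch indices) touch consecutive addresses. The kernel and its names are illustrative only, not the batched BLAS interface or the authors' code.

```cuda
// Interleaved layout: element (i, j) of matrix k sits at (i * n + j) * batch + k,
// so consecutive threads (consecutive k) read consecutive addresses and loads coalesce.
__device__ __forceinline__ int idx(int i, int j, int k, int n, int batch) {
    return (i * n + j) * batch + k;
}

// One thread computes one small C_k = alpha * A_k * B_k + beta * C_k.
__global__ void gemm_interleaved_batch(int n, int batch, double alpha,
                                       const double *A, const double *B,
                                       double beta, double *C) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // batch index
    if (k >= batch) return;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int l = 0; l < n; ++l)
                acc += A[idx(i, l, k, n, batch)] * B[idx(l, j, k, n, batch)];
            C[idx(i, j, k, n, batch)] = alpha * acc + beta * C[idx(i, j, k, n, batch)];
        }
}
```

For matrices as small as 2 × 2 to 20 × 20, giving each thread its own problem while keeping accesses coalesced through the interleaved layout is one straightforward way to keep the memory system busy.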


International Conference on Parallel Processing | 2012

MRF Satellite Image Classification on GPU

Pedro Valero-Lara

One of the stages in the analysis of satellite images is a classification based on the Markov Random Field (MRF) method. Several packages that carry out this analysis, including the classification task, can be found in the literature; one of them is the Orfeo Tool Box (OTB). The analysis of satellite images is a computationally expensive task that often requires real-time execution or automation, so parallelism techniques can be used to reduce its execution time. Currently, Graphics Processing Units (GPUs) are becoming a good choice for reducing the execution time of many applications at low cost. In this paper, the author presents a GPU-based MRF classification derived from the sequential algorithm included in the OTB package. The experimental results show a dramatic reduction of the execution time for the GPU-based algorithm, which is up to 225 times faster than the sequential algorithm included in the OTB package. This improvement is also reflected in the total power consumption, which is reduced by a significant amount.
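The abstract does not spell out the classification scheme; purely as an illustration, MRF classification is often implemented with iterated conditional modes (ICM) under a Potts prior, which maps naturally to one GPU thread per pixel. The sketch below rests on that assumption; the data-cost layout, beta, and kernel name are hypothetical and not taken from the OTB port.

```cuda
// Hypothetical ICM relabeling step for an MRF classifier with a Potts prior:
// each thread updates one pixel's label to the class that minimizes
// dataCost + beta * (number of disagreeing 4-neighbours).
__global__ void icm_step(const float *dataCost,    // [numClasses * width * height]
                         const int *labelsIn, int *labelsOut,
                         int width, int height, int numClasses, float beta) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int p = y * width + x;

    int best = labelsIn[p];
    float bestE = 3.402823e38f;                      // FLT_MAX
    for (int c = 0; c < numClasses; ++c) {
        float e = dataCost[c * width * height + p];  // likelihood (data) term
        // Potts smoothness term over the 4-neighbourhood.
        if (x > 0          && labelsIn[p - 1]     != c) e += beta;
        if (x < width - 1  && labelsIn[p + 1]     != c) e += beta;
        if (y > 0          && labelsIn[p - width] != c) e += beta;
        if (y < height - 1 && labelsIn[p + width] != c) e += beta;
        if (e < bestE) { bestE = e; best = c; }
    }
    labelsOut[p] = best;  // double-buffered: swap labelsIn/labelsOut and iterate
}
```

In practice such a kernel would be launched repeatedly, swapping labelsIn and labelsOut, until few or no labels change.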


Concurrency and Computation: Practice and Experience | 2017

Reducing memory requirements for large size LBM simulations on GPUs

Pedro Valero-Lara

In its never-ending pursuit of larger and more efficient computational resources, the scientific community needs implementations that can adapt efficiently to current parallel platforms. Graphics processing units (GPUs) are a suitable platform that meets some of these demands, offering high performance at reduced cost and power consumption. However, the memory capacity of these devices is limited, so expensive memory transfers become necessary when dealing with large problems. The lattice-Boltzmann method (LBM) has established itself as an efficient approach for Computational Fluid Dynamics simulations. Although the method is particularly amenable to efficient parallelization, it requires considerable memory capacity, which causes a dramatic fall in performance for large simulations. In this work, we propose techniques to minimize this memory demand, allowing larger simulations to run on the same platform without additional memory transfers while keeping high performance. In particular, we present two new implementations, LBM-Ghost and LBM-Swap, which are analyzed in depth, presenting the pros and cons of each.
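The abstract does not detail LBM-Ghost or LBM-Swap; as a rough sketch of the general memory-saving idea behind a swap-style scheme, the D2Q9 kernels below keep a single distribution array (one copy of the lattice instead of the usual two) and exchange values in place. Collision physics and boundary conditions are omitted, and the code is an illustration of the idea under these assumptions rather than the paper's implementation.

```cuda
// Single-lattice D2Q9 sketch: one array f of size 9 * nx * ny replaces the
// usual two full copies of the lattice.
__constant__ int cx[9]      = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
__constant__ int cy[9]      = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };
__constant__ int opp[9]     = { 0, 3, 4, 1, 2, 7, 8, 5, 6 };
__constant__ int halfdir[4] = { 1, 2, 5, 6 };     // one direction per opposite pair

__device__ __forceinline__ int at(int q, int x, int y, int nx, int ny) {
    return (q * ny + y) * nx + x;                 // SoA: one plane per direction
}

// Step 1: collide, then swap each pair (q, opp[q]) locally inside the cell,
// so the value that must travel along direction q temporarily lives in slot opp[q].
__global__ void collide_and_swap(double *f, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    // ... BGK collision on f(0..8, x, y) omitted for brevity ...
    for (int i = 0; i < 4; ++i) {
        int q = halfdir[i];
        double tmp = f[at(q, x, y, nx, ny)];
        f[at(q, x, y, nx, ny)]      = f[at(opp[q], x, y, nx, ny)];
        f[at(opp[q], x, y, nx, ny)] = tmp;
    }
}

// Step 2: finish streaming by swapping slot opp[q] of each cell with slot q of
// its neighbour in direction q; each (cell, slot) is touched exactly once, so
// the kernel is race free. Boundary handling is omitted here.
__global__ void stream_swap(double *f, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    for (int i = 0; i < 4; ++i) {
        int q = halfdir[i];
        int xn = x + cx[q], yn = y + cy[q];
        if (xn < 0 || xn >= nx || yn < 0 || yn >= ny) continue;
        double tmp = f[at(opp[q], x, y, nx, ny)];
        f[at(opp[q], x, y, nx, ny)] = f[at(q, xn, yn, nx, ny)];
        f[at(q, xn, yn, nx, ny)]    = tmp;
    }
}
```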


International Conference on Conceptual Structures | 2017

cuHinesBatch: Solving Multiple Hines Systems on GPUs (Human Brain Project)

Pedro Valero-Lara; Ivan Martínez-Pérez; Antonio J. Peña; Xavier Martorell; Raül Sirvent; Jesús Labarta

The simulation of the behavior of the Human Brain is one of the most important challenges in computing today. The main problem is finding efficient ways to manipulate and compute the huge volume of data that this kind of simulation needs using current technology. In this sense, this work focuses on one of the main steps of such a simulation: computing the voltage on the neurons' morphology, which is carried out using the Hines algorithm. Although this algorithm is optimal in terms of number of operations, it requires non-trivial modifications to be efficiently parallelized on NVIDIA GPUs. We propose several optimizations to accelerate this algorithm on GPU-based architectures, exploring the limitations of both the method and the architecture in order to solve a high number of Hines systems (neurons) efficiently. Each of the optimizations is analyzed and described in depth. To evaluate their impact on real inputs, we use six morphologies that differ in size and branching. Our studies show that the proposed optimizations achieve high performance on computations with a large number of neurons, our GPU implementations being about 4× and 8× faster than the OpenMP multicore implementation (16 cores) using one and two NVIDIA K80 GPUs, respectively. It is also important to highlight that these optimizations can continue scaling even when dealing with a very high number of neurons.
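For context, the Hines algorithm is a direct solver for the sparse, tree-structured (tridiagonal-like) system produced by a neuron morphology: a backward sweep eliminates each compartment into its parent, followed by a forward substitution from the root. A minimal one-thread-per-neuron CUDA sketch is shown below; the flat per-system layout, the shared morphology, and all names are simplifications for illustration, not the cuHinesBatch kernels themselves.

```cuda
// One-thread-per-neuron Hines solve (illustrative). Each system is the
// symmetric, tree-structured matrix of a neuron morphology:
//   d[i]   : diagonal of compartment i
//   u[i]   : coupling between compartment i and its parent (u[0] unused)
//   rhs[i] : right-hand side; overwritten with the solution
//   parent : parent index of each compartment, with parent[i] < i (root = 0);
//            here, for simplicity, all neurons share one morphology.
// Layout: one contiguous block of length n per system; the paper additionally
// interleaves systems across memory to obtain coalesced accesses.
__global__ void hines_batch(double *u, double *d, double *rhs,
                            const int *parent, int n, int nsys) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per neuron
    if (s >= nsys) return;
    double *U = u   + (size_t)s * n;
    double *D = d   + (size_t)s * n;
    double *R = rhs + (size_t)s * n;

    // Backward sweep: eliminate every compartment into its parent.
    for (int i = n - 1; i > 0; --i) {
        int p = parent[i];
        double factor = U[i] / D[i];
        D[p] -= factor * U[i];
        R[p] -= factor * R[i];
    }
    // Forward substitution from the root down the tree.
    R[0] /= D[0];
    for (int i = 1; i < n; ++i) {
        R[i] = (R[i] - U[i] * R[parent[i]]) / D[i];
    }
}
```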


International Conference on Parallel Processing | 2017

NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems: Implementation of cuThomasBatch

Pedro Valero-Lara; Ivan Martínez-Pérez; Raül Sirvent; Xavier Martorell; Antonio J. Peña

Solving tridiagonal systems is one of the most computationally expensive parts of many applications, so multiple studies have explored the use of NVIDIA GPUs to accelerate such computation. However, these studies have mainly focused on parallel algorithms that can efficiently exploit shared memory and saturate the GPU's capacity with a low number of systems, but that present poor scalability when dealing with a relatively high number of systems. We propose a new implementation (cuThomasBatch) based on the Thomas algorithm. To achieve good scalability with this approach, it is necessary to transform the way the inputs are stored in memory so as to exploit coalescence (contiguous threads accessing contiguous memory locations). The results of this study prove that our implementation is able to beat the reference code when dealing with a relatively large number of tridiagonal systems (2,000–256,000), being close to 3× (in double precision) and 4× (in single precision) faster using one NVIDIA Kepler GPU.
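A minimal sketch of the approach described above: one CUDA thread runs the sequential Thomas algorithm over one system, with the inputs interleaved so that element i of system s lives at i * nsys + s and consecutive threads access consecutive addresses. This illustrates the idea, not the released cuThomasBatch routine.

```cuda
// Illustrative cuThomasBatch-style kernel: one CUDA thread solves one
// tridiagonal system with the sequential Thomas algorithm. Inputs are
// interleaved so element i of system s lives at i * nsys + s, making
// consecutive threads access consecutive memory locations (coalescence).
//   l = lower diagonal, d = main diagonal, u = upper diagonal, rhs = RHS.
// The solution overwrites rhs; u is used as scratch during the sweep.
__global__ void thomas_batch(const double *l, const double *d, double *u,
                             double *rhs, int n, int nsys) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per system
    if (s >= nsys) return;

    // Forward elimination.
    u[s]   /= d[s];
    rhs[s] /= d[s];
    for (int i = 1; i < n; ++i) {
        int k = i * nsys + s, km = k - nsys;
        double m = 1.0 / (d[k] - l[k] * u[km]);
        u[k]  *= m;
        rhs[k] = (rhs[k] - l[k] * rhs[km]) * m;
    }
    // Backward substitution.
    for (int i = n - 2; i >= 0; --i) {
        int k = i * nsys + s;
        rhs[k] -= u[k] * rhs[k + nsys];
    }
}
```

With this layout, the forward and backward sweeps issue coalesced loads and stores as long as the threads of a warp advance through systems of the same size.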


Proceedings of the 25th European MPI Users' Group Meeting | 2018

MPI+OpenMP Tasking Scalability for the Simulation of the Human Brain: Human Brain Project

Pedro Valero-Lara; Raül Sirvent; Antonio J. Peña; Xavier Martorell; Jesús Labarta

The simulation of the behavior of the Human Brain is one of the most ambitious challenges today, with no end of important applications. Many different initiatives in the USA, Europe, and Japan attempt to achieve such a challenging target. In this work, we focus on the most important European initiative (the Human Brain Project) and on one of its tools (Arbor). This tool simulates the spikes triggered in a neuronal network by computing the voltage capacitance on the neurons' morphology, and is one of the most precise simulators today. In the present work, we have evaluated the use of MPI+OpenMP tasking on top of the Arbor simulator. We present the main characteristics of the Arbor tool and how they can be efficiently managed by using MPI+OpenMP tasking. We prove that this approach achieves good scaling even when computing a relatively low workload (number of neurons) per node, using up to 32 nodes. Our target is not only to achieve a highly scalable implementation based on MPI, but also to develop a tool with a high degree of abstraction, without losing control and performance, by using MPI+OpenMP tasking.


Concurrency and Computation: Practice and Experience | 2018

cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs

Pedro Valero-Lara; Ivan Martínez-Pérez; Raül Sirvent; Xavier Martorell; Antonio J. Peña

Solving tridiagonal systems is one of the most computationally expensive parts of many applications, so multiple studies have explored the use of NVIDIA GPUs to accelerate such computation. However, these studies have mainly focused on parallel algorithms that can efficiently exploit shared memory and saturate the GPU's capacity with a low number of systems, but that present poor scalability when dealing with a relatively high number of systems. The gtsvStridedBatch routine in NVIDIA's cuSPARSE package is one such example and is used as the reference in this article. We propose a new implementation (cuThomasBatch) based on the Thomas algorithm. Unlike those algorithms, the Thomas algorithm is sequential, so a coarse-grained approach is implemented in which one CUDA thread solves a complete tridiagonal system, instead of one CUDA block as in gtsvStridedBatch. To achieve good scalability with this approach, it is necessary to transform the way the inputs are stored in memory so as to exploit coalescence (contiguous threads accessing contiguous memory locations). Different variants of this data transformation are explored in detail. We also explore variants for the variable-batch case, in which the systems of the batch have different sizes (cuThomasVBatch). The results of this study prove that the implementations carried out in this work are able to beat the reference code, being up to 5× (in double precision) and 6× (in single precision) faster using the latest NVIDIA GPU architecture, the Pascal P100.
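For the variable-batch case, one simple option, sketched below under the same interleaved layout with every system padded to the maximum size, is to pass a per-system size array and let each thread loop only up to its own size; this is an illustrative variant, not cuThomasVBatch itself.

```cuda
// Illustrative variable-batch variant: systems may have different sizes.
// The arrays are assumed to be allocated for the maximum system size, with
// the interleaved layout (element i of system s at i * nsys + s); each
// thread simply stops at its own size n[s].
__global__ void thomas_vbatch(const double *l, const double *d, double *u,
                              double *rhs, const int *n, int nsys) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per system
    if (s >= nsys) return;
    int ns = n[s];                        // size of this thread's system

    u[s]   /= d[s];
    rhs[s] /= d[s];
    for (int i = 1; i < ns; ++i) {        // forward elimination
        int k = i * nsys + s, km = k - nsys;
        double m = 1.0 / (d[k] - l[k] * u[km]);
        u[k]  *= m;
        rhs[k] = (rhs[k] - l[k] * rhs[km]) * m;
    }
    for (int i = ns - 2; i >= 0; --i) {   // backward substitution
        int k = i * nsys + s;
        rhs[k] -= u[k] * rhs[k + nsys];
    }
}
```

The price of this simplicity is warp divergence when systems in the same warp differ greatly in size, one of the trade-offs the variants mentioned above have to balance.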


Archive | 2016

A Proposed API for Batched Basic Linear Algebra Subprograms

Jack J. Dongarra; Iain S. Duff; Mark Gates; Azzam Haidar; Sven Hammarling; Nicholas J. Higham; Jonathon Hogg; Pedro Valero-Lara; Samuel D. Relton; Stanimire Tomov; Mawussi Zounon


Archive | 2016

A Comparison of Potential Interfaces for Batched BLAS Computations

Samuel D. Relton; Pedro Valero-Lara; Mawussi Zounon


Parallel, Distributed and Network-Based Processing | 2018

Variable Batched DGEMM

Pedro Valero-Lara; Ivan Martínez-Pérez; Sergi Mateo; Raül Sirvent; Vicenç Beltran; Xavier Martorell; Jesús Labarta

Collaboration


Dive into Pedro Valero-Lara's collaborations.

Top Co-Authors

Raül Sirvent (Barcelona Supercomputing Center)
Xavier Martorell (Polytechnic University of Catalonia)
Antonio J. Peña (Barcelona Supercomputing Center)
Ivan Martínez-Pérez (Barcelona Supercomputing Center)
Jesús Labarta (Barcelona Supercomputing Center)
Mawussi Zounon (University of Manchester)
Sven Hammarling (Numerical Algorithms Group)