Juan Carlos Pichel
University of Santiago de Compostela
Publications
Featured research published by Juan Carlos Pichel.
Parallel, Distributed and Network-Based Processing | 2004
Juan Carlos Pichel; Dora Blanco Heras; José Carlos Cabaleiro; Francisco F. Rivera
We extend a model of locality, and the subsequent process of locality improvement, previously developed for sparse algebra codes on monoprocessors to the case of NUMA shared memory multiprocessors (SMPs). In particular, the product of a sparse matrix by a dense vector (SpM×V) is studied. In the model, locality is established at run-time considering parameters that describe the structure of the sparse matrix involved in the computations. The problem of increasing the locality is formulated as a graph problem, whose solution indicates an appropriate reordering of the rows and columns of the sparse matrix. The reordering algorithms were tested on a broad set of matrices, and a comparison with other reordering algorithms was performed. The results lead to general conclusions about improving SMP performance for other sparse algebra codes.
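As a rough illustration of the data reordering step (not the authors' heuristic, which is built at run-time from the matrix structure), the following sketch uses SciPy's reverse Cuthill-McKee as a stand-in permutation before an SpM×V; the matrix is synthetic.

```python
# A minimal sketch, assuming SciPy: reorder rows and columns of a sparse
# matrix to cluster its nonzeros before an SpMxV. Reverse Cuthill-McKee is a
# stand-in for the paper's run-time locality heuristic.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sp.random(10000, 10000, density=1e-3, format="csr")
sym = (A + A.T).tocsr()                 # RCM expects a symmetric structure
perm = reverse_cuthill_mckee(sym)       # clusters nonzeros near the diagonal
A_reordered = A[perm][:, perm].tocsr()  # same reordering for rows and columns

x = np.ones(A.shape[1])
y = A_reordered @ x                     # SpMxV on the reordered matrix
```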
Microprocessors and Microsystems | 2012
Juan Carlos Pichel; Francisco F. Rivera; Marcos Fernández; Aurelio Rodriguez
It is well known that reordering techniques applied to sparse matrices are a common strategy to improve the performance of sparse matrix operations, particularly the sparse matrix-vector multiplication (SpMV), on CPUs. In this paper, we have evaluated some of the most successful reordering techniques on two different GPUs. In addition, a number of sparse matrix storage formats were considered in our study, and executions for both single and double precision arithmetic were performed. We have found that SpMV is very sensitive to the application of reordering techniques on GPUs. In particular, several characteristics of the reordered matrices that have a significant impact on SpMV performance have been detected. In most cases, reordered matrices outperform the original ones, showing noticeable speedups of up to 2.6x. We have also observed that no single storage format is preferred over the others.
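A minimal sketch of this kind of GPU measurement, assuming CuPy and a CUDA device; reverse Cuthill-McKee and the synthetic matrix are stand-ins for the techniques and test matrices actually evaluated in the paper.

```python
# SpMV on a GPU before and after reordering, timed over repeated runs.
import time
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee
import cupy as cp
import cupyx.scipy.sparse as cpsp

A = sp.random(20000, 20000, density=5e-4, format="csr")
perm = reverse_cuthill_mckee((A + A.T).tocsr())
variants = {"original": A, "reordered": A[perm][:, perm].tocsr()}

x = cp.ones(A.shape[1], dtype=cp.float64)  # double precision; float32 for single
for name, mat in variants.items():
    d_mat = cpsp.csr_matrix(mat)           # copy the CSR matrix to the device
    cp.cuda.Stream.null.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        y = d_mat @ x
    cp.cuda.Stream.null.synchronize()
    print(name, (time.perf_counter() - t0) / 100)
```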
Bioinformatics | 2015
José Manuel Abuín; Juan Carlos Pichel; Tomás F. Pena; Jorge Amigo
BigBWA is a new tool that uses the Big Data technology Hadoop to boost the performance of the Burrows-Wheeler Aligner (BWA). Important reductions in execution time were observed when using this tool. In addition, BigBWA is fault tolerant and does not require any modification of the original BWA source code. Availability and implementation: BigBWA is available at the project GitHub repository: https://github.com/citiususc/BigBWA.
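The core idea can be sketched as a hypothetical Hadoop Streaming mapper: Hadoop splits the input reads, and each mapper pipes its split through the stock bwa binary. This is not BigBWA's actual implementation (which builds on the Hadoop Java API); the reference path and BWA options are illustrative.

```python
# Hypothetical Hadoop Streaming mapper: align this mapper's share of the
# reads with the unmodified bwa binary. A real tool must keep the four lines
# of each FASTQ record together (default line-based splits do not guarantee
# this); that bookkeeping is omitted here.
import subprocess
import sys
import tempfile

def main():
    # Collect the FASTQ lines assigned to this mapper into a chunk file.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".fq", delete=False) as chunk:
        for line in sys.stdin:
            chunk.write(line)
        chunk_path = chunk.name

    # Run stock BWA on the chunk; no changes to the BWA source are needed.
    subprocess.run(["bwa", "mem", "reference.fa", chunk_path],
                   stdout=sys.stdout, check=True)

if __name__ == "__main__":
    main()
```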
Parallel Computing | 2005
Juan Carlos Pichel; Dora Blanco Heras; José Carlos Cabaleiro; Francisco F. Rivera
The combination of data reordering techniques with classic code restructuring techniques for increasing locality in the execution of sparse algebra codes is studied in this paper. The reordering techniques are based on first modeling the locality at run-time and then applying a heuristic to increase it. After this, a code restructuring technique specially tuned for sparse algebra codes, called register blocking, is applied. The product of a sparse matrix by a dense vector (SpM×V) is the code studied, on different monoprocessors and distributed memory multiprocessors. The combination of both techniques was tested on a broad set of matrices from real problems and well-known repositories. The results, expressed in terms of execution time, show that an adequate reordering of the data improves the efficiency of register blocking, thereby reducing the execution time of the sparse algebra code considered.
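A minimal sketch of the combined pipeline, assuming SciPy: reverse Cuthill-McKee stands in for the locality-driven reordering, and SciPy's BSR format plays the role of the register-blocked layout.

```python
# Reorder first, then register-block: after reordering, nonzeros cluster, so
# small dense blocks become denser and BSR wastes fewer explicit zeros.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sp.random(8192, 8192, density=1e-3, format="csr")
perm = reverse_cuthill_mckee((A + A.T).tocsr())
A_reordered = A[perm][:, perm].tocsr()

A_blocked = A_reordered.tobsr(blocksize=(2, 2))  # register-blocked layout

x = np.ones(A.shape[1])
y = A_blocked @ x   # SpMxV on the reordered, register-blocked matrix
```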
PLOS ONE | 2016
José Manuel Abuín; Juan Carlos Pichel; Tomás F. Pena; Jorge Amigo
Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This has a major impact on the DNA sequence alignment process, which nowadays requires mapping billions of small DNA sequences onto a reference genome. As a result, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, existing solutions show limited scalability and have complex implementations. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which ensures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios, showing notable results in terms of performance and scalability. A comparison with other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, under a GPL3 license.
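A minimal sketch of the two-layer idea, assuming PySpark and an installed bwa binary: Spark distributes the data while the unmodified aligner does the work. SparkBWA itself couples the layers differently; this is only an illustration, with illustrative paths.

```python
# Spark handles distribution; the stock bwa binary handles alignment.
# Note: textFile is line-oriented, so a real pipeline must first group the
# four lines of each FASTQ record into one element; that step is omitted.
from pyspark import SparkContext

sc = SparkContext(appName="bwa-pipe-sketch")

reads = sc.textFile("hdfs:///data/sample.fq")

# Each partition is streamed through BWA on stdin; the SAM output lines
# become the records of the resulting RDD.
aligned = reads.pipe("bwa mem reference.fa -")

aligned.saveAsTextFile("hdfs:///data/sample_aligned.sam")
```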
High Performance Computing and Communications | 2008
Juan Carlos Pichel; David E. Singh; Jesús Carretero
In order to efficiently exploit the available parallelism, multicore processors must address contention for shared resources such as the cache hierarchy. This becomes even more important when irregular codes, such as sparse matrix codes, are executed on them. In this paper a technique for increasing the locality of sparse matrix codes on multicore platforms is presented. The technique consists of reorganizing the data guided by a locality model which introduces the concept of windows of locality. The evaluation of the reordering technique has been performed on two different leading multicore platforms: Intel Core2Duo and Intel Xeon. Experimental results show important performance improvements when using our reordered matrices with respect to the original ones. In particular, an average execution time reduction of about 30% is achieved considering different numbers of running threads. These results are due to improved overall cache behavior. Likewise, a comparison of our proposal with some standard reordering techniques is included in the paper. The results point out that our reordering technique always outperforms the standard algorithms and is effective for matrices with any structure.
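As a crude stand-in for the windows-of-locality model, the sketch below (Python/SciPy) groups rows whose nonzeros fall in nearby column ranges, so consecutive rows reuse the same stretch of the source vector; the grouping criterion here (mean column index) is an illustration, not the paper's model.

```python
# Group rows with similar column footprints so neighboring rows touch the
# same window of x during SpMV.
import numpy as np
import scipy.sparse as sp

A = sp.random(4096, 4096, density=2e-3, format="csr")

mean_col = np.zeros(A.shape[0])
for i in range(A.shape[0]):
    cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
    mean_col[i] = cols.mean() if cols.size else 0.0

row_order = np.argsort(mean_col)      # rows with similar windows become neighbors
A_windowed = A[row_order, :].tocsr()  # row reordering only; columns untouched

x = np.ones(A.shape[1])
y = A_windowed @ x
```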
Journal of Parallel and Distributed Computing | 2013
Juan Carlos Pichel; Francisco F. Rivera
The microprocessor industry has responded to the memory, power and ILP walls by turning to many-core processors, making increased parallelism the primary method of improving processor performance. These processors are expected to consist of tens or even hundreds of cores. One such processor is the 48-core experimental Single-Chip Cloud Computer (SCC), created by Intel Labs as a platform for many-core software research. In this work we study the behavior of an important irregular application, the Sparse Matrix-Vector multiplication (SpMV), on the SCC processor in terms of performance and power efficiency. In addition, some of the most successful optimization techniques for this kernel are evaluated; in particular, reordering, blocking and data compression techniques have been considered. Our experiments give some key insights that can serve as guidelines for understanding and optimizing the SpMV kernel on this architecture. Furthermore, an architectural comparison of the SCC processor with several leading multicore processors and GPUs is performed, including the new Intel Xeon Phi coprocessor. The SCC only outperforms the Itanium2 multicore processor. The best performance results are observed for the high-end GPUs and the Phi, although they reach only a small fraction of their peak performance. In terms of power efficiency, the good behavior of the ATI GPUs stands out.
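Of the evaluated techniques, data compression is the easiest to sketch: column indices within a row are usually close together, so storing them as small deltas shrinks the index array and reduces memory traffic. The following is a generic illustration, not the exact scheme used on the SCC.

```python
# Delta-encode CSR column indices row by row; the first index in each row
# stays absolute, the rest become (usually small) differences.
import numpy as np
import scipy.sparse as sp

A = sp.random(4096, 4096, density=2e-3, format="csr")

deltas = A.indices.astype(np.int64)
for i in range(A.shape[0]):
    start, end = A.indptr[i], A.indptr[i + 1]
    deltas[start + 1:end] = np.diff(A.indices[start:end])

# Assumes all values fit in 16 bits (true for this matrix size); real
# compressed formats add an escape path for larger deltas.
compressed = deltas.astype(np.int16)
print("index bytes:", A.indices.nbytes, "->", compressed.nbytes)
```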
The Journal of Supercomputing | 2009
David E. Singh; Florin Isaila; Juan Carlos Pichel; Jesús Carretero
In this paper, we present a novel multiple-phase collective I/O technique for generic block-cyclic distributions. The technique is divided into two stages: inspector and executor. During the inspector stage, the communication pattern is computed and the required datatypes are automatically generated. This information is used during the executor stage to perform the communication and file accesses. The two stages are decoupled, so that for repetitive file access patterns the computations of the inspector stage can be performed once and reused several times by the executor. This strategy allows the inspector cost to be amortized over several I/O operations. In this paper, we evaluate the performance of the multiple-phase collective I/O technique and compare it with other state-of-the-art approaches. Experimental results show that, for small access granularities, our method outperforms other parallel I/O optimization techniques in the large majority of cases.
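A minimal sketch of the inspector/executor split, using mpi4py; the block-cyclic parameters and file name are illustrative. The inspector commits the derived datatype describing this process's file view once, and the executor reuses it across repeated collective reads.

```python
# Inspector/executor collective I/O for a 1-D block-cyclic distribution.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

BLOCK = 1024    # elements per block of the cyclic distribution
NBLOCKS = 64    # blocks owned by each process

# --- Inspector: compute the access pattern and commit the datatype once. ---
filetype = MPI.DOUBLE.Create_vector(NBLOCKS, BLOCK, BLOCK * nprocs)
filetype.Commit()
disp = rank * BLOCK * MPI.DOUBLE.Get_size()   # this process's first block

# --- Executor: reuse the committed view for repeated collective accesses. ---
fh = MPI.File.Open(comm, "data.bin", MPI.MODE_RDONLY)
fh.Set_view(disp, MPI.DOUBLE, filetype)
buf = np.empty(NBLOCKS * BLOCK, dtype=np.float64)
for _ in range(10):   # repetitive pattern amortizes the inspector cost
    fh.Read_all(buf)
fh.Close()
filetype.Free()
```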
High Performance Computing for Computational Science (Vector and Parallel Processing) | 2008
Rosa Filgueira; David E. Singh; Juan Carlos Pichel; Florin Isaila; Jesús Carretero
This paper presents Locality-Aware Two-Phase (LATP) I/O, an optimization of the Two-Phase collective I/O technique from ROMIO, the most popular MPI-IO implementation. In order to increase the locality of the file accesses, LATP solves a Linear Assignment Problem (LAP) to find an optimal distribution of data to processes, an aspect that is not considered in the original technique. This assignment is based on the local data that each process stores, and its main purpose is to reduce the number of communication operations involved in the collective I/O operation and, therefore, to improve the overall execution time. Compared with Two-Phase I/O, LATP I/O obtains important improvements in most of the considered scenarios.
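The LAP step is easy to sketch with SciPy's linear_sum_assignment standing in for the LAP solver; the overlap matrix below is random for brevity, where a real run would fill it with the amount of each file domain already resident on each process.

```python
# Assign each file domain to the process that already holds most of its
# data, so the shuffle phase of Two-Phase I/O moves less data.
import numpy as np
from scipy.optimize import linear_sum_assignment

nprocs = 8
rng = np.random.default_rng(0)

# overlap[p, d] = bytes of file domain d already resident on process p.
overlap = rng.integers(0, 1_000_000, size=(nprocs, nprocs))

procs, domains = linear_sum_assignment(overlap, maximize=True)
for p, d in zip(procs, domains):
    print(f"process {p} aggregates file domain {d} "
          f"({overlap[p, d]} bytes already local)")
```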
International Symposium on Parallel Architectures, Algorithms and Networks | 2005
Juan Carlos Pichel; Dora Blanco Heras; José Carlos Cabaleiro; Francisco F. Rivera
In this paper a technique to deal with the problems of poor locality and false sharing in irregular codes on shared memory multiprocessors (SMPs) is proposed. The technique is based on the locality model for irregular codes previously developed and extensively validated by the authors on monoprocessors and multiprocessors. In the model, locality is established at run-time considering parameters that describe the structure of the sparse matrix which characterizes the irregular accesses. As an example of an irregular code with false sharing, a particular implementation of the sparse matrix-vector product (SpM×V) was selected. The problem of increasing locality and decreasing false sharing for an irregular problem is formulated as a graph problem. An adequate distribution of the graph among processors, followed by a reordering of the nodes inside each processor, produces the solution. The results show important improvements in the behavior of the irregular accesses: reductions in execution time and improved program scalability.
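A minimal sketch of the two-step solution, assuming SciPy: a contiguous block distribution of rows among processors followed by a per-partition reverse Cuthill-McKee reordering, both stand-ins for the paper's graph-based distribution and reordering.

```python
# Distribute rows among processors, then reorder inside each partition so
# that each processor's rows access a compact region of the source vector.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sp.random(4096, 4096, density=2e-3, format="csr")
nprocs = 4
bounds = np.linspace(0, A.shape[0], nprocs + 1, dtype=int)

row_order = []
for p in range(nprocs):
    block = A[bounds[p]:bounds[p + 1], :]
    local = (block @ block.T).tocsr()    # row-similarity graph of the block
    perm = reverse_cuthill_mckee(local)  # reorder nodes inside the partition
    row_order.extend(bounds[p] + np.asarray(perm))

# Rows grouped per processor, locally reordered for locality.
A_dist = A[np.array(row_order), :].tocsr()
```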