
Publication


Featured research published by José Carlos Cabaleiro.


Parallel, Distributed and Network-Based Processing | 2004

Improving the locality of the sparse matrix-vector product on shared memory multiprocessors

Juan Carlos Pichel; Dora Blanco Heras; José Carlos Cabaleiro; Francisco F. Rivera

We extend a model of locality and the subsequent process of locality improvement, previously developed for sparse algebra codes on monoprocessors, to NUMA shared memory multiprocessors (SMPs). In particular, the product of a sparse matrix by a dense vector (SpM×V) is studied. In the model, locality is established at run-time considering parameters that describe the structure of the sparse matrix involved in the computations. The problem of increasing the locality is formulated as a graph problem, whose solution indicates an appropriate reordering of the rows and columns of the sparse matrix. The reordering algorithms were tested for a broad set of matrices, and a comparison with other reordering algorithms was also performed. The results lead to general conclusions about improving SMP performance for other sparse algebra codes.
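
To make the setting concrete, the following minimal sketch (in Python with SciPy, not the authors' code) shows the sparse matrix-vector product in CSR form and how a row/column permutation produced by a locality heuristic would be applied before the product; the permutation here is random, purely for illustration.

import numpy as np
from scipy.sparse import random as sparse_random

# Toy sparse matrix and dense vector (sizes and density are arbitrary).
A = sparse_random(1000, 1000, density=0.01, format="csr")
x = np.random.rand(1000)

# Baseline product: y = A x
y = A @ x

# A locality-improving reordering is just a permutation of rows and columns;
# 'perm' is random here only for illustration.
perm = np.random.permutation(A.shape[0])
A_perm = A[perm, :][:, perm]      # reorder rows and columns
y_perm = A_perm @ x[perm]         # product on the reordered data

# The result is the original y, permuted the same way.
assert np.allclose(y_perm, y[perm])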


Parallel Computing | 2005

Performance optimization of irregular codes based on the combination of reordering and blocking techniques

Juan Carlos Pichel; Dora Blanco Heras; José Carlos Cabaleiro; Francisco F. Rivera

The combination of techniques based on reordering data with classic code restructuring techniques for increasing the locality in the execution of sparse algebra codes is studied in this paper. The reordering techniques are based on first modeling the locality at run-time and then applying a heuristic to increase it. After this, a code restructuring technique specially tuned for sparse algebra codes, called register blocking, is applied. The product of a sparse matrix by a dense vector (SpM×V) is the code studied, on different monoprocessors and distributed memory multiprocessors. The combination of both techniques was tested for a broad set of matrices from real problems and known repositories. The results, expressed in terms of execution time, show that an adequate reordering of the data improves the efficiency of register blocking, thereby reducing the execution time of the sparse algebra code considered.
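
A hedged sketch of the register blocking step, using SciPy's block sparse (BSR) storage as a stand-in for the hand-tuned blocked kernels of the paper; the 2×2 block size is illustrative only.

import numpy as np
from scipy.sparse import random as sparse_random

A = sparse_random(1024, 1024, density=0.01, format="csr")
x = np.random.rand(1024)

# Store the (possibly reordered) matrix in fixed-size dense blocks so each
# small block multiply can keep its operands in registers.
A_bsr = A.tobsr(blocksize=(2, 2))

y_csr = A @ x
y_bsr = A_bsr @ x                 # same product, blocked traversal
assert np.allclose(y_csr, y_bsr)

# Reordering rows/columns first tends to cluster the nonzeros, so fewer blocks
# are padded with explicit zeros and the blocking pays off more.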


Parallel Computing | 2001

Modeling data locality for the sparse matrix-vector product using distance measures

Dora Blanco Heras; José Carlos Cabaleiro; Francisco F. Rivera

In this work, we model the data locality in the execution of codes with irregular accesses. We focus on the product of a sparse matrix by a dense vector (SpM×V). In the model, locality is established taking into account pairs of rows or columns of sparse matrices. In order to evaluate this locality, three functions are introduced based on two parameters: the number of entry matches and the number of block matches. The model is generalized by considering windows of locality (groups of consecutive rows/columns of the matrix). We show results for a broad set of matrices measuring the goodness of our locality predictions.
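
As an illustration of the kind of measure involved, the sketch below counts "entry matches" (shared column indices) between consecutive rows of a CSR matrix; the actual locality functions and the block-match parameter of the paper are not reproduced here.

import numpy as np
from scipy.sparse import random as sparse_random

A = sparse_random(200, 200, density=0.05, format="csr")

def entry_matches(csr, i, j):
    # Number of column indices that rows i and j have in common.
    cols_i = set(csr.indices[csr.indptr[i]:csr.indptr[i + 1]])
    cols_j = set(csr.indices[csr.indptr[j]:csr.indptr[j + 1]])
    return len(cols_i & cols_j)

# Average matches between consecutive rows: a crude proxy for the locality
# exploited when the rows are traversed in their current order.
matches = [entry_matches(A, i, i + 1) for i in range(A.shape[0] - 1)]
print("mean entry matches between consecutive rows:", np.mean(matches))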


Advances in Engineering Software | 2002

Parallel iterative solvers involving fast wavelet transforms for the solution of BEM systems

Patricia González; José Carlos Cabaleiro; Tomás F. Pena

This paper describes the parallelization of a strategy to speed up the convergence of iterative methods applied to boundary element method (BEM) systems arising from problems with non-smooth boundaries and mixed boundary conditions. The aim of the work is the application of fast wavelet transforms as a black-box transformation in existing boundary element codes. A new strategy is proposed, applying wavelet transforms on the interval, so that it can be used in the case of non-smooth coefficient matrices. Here, we describe the parallel iterative scheme and present some of the results obtained.
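
The sketch below illustrates the black-box use of a wavelet transform on an existing system A x = b: transform the system, solve in the wavelet basis, and map the solution back. A single-level orthonormal Haar transform and a direct solve stand in for the wavelet-on-the-interval transform and the parallel iterative solver of the paper.

import numpy as np

def haar_matrix(n):
    # One-level orthonormal Haar transform for even n (averages, then details).
    W = np.zeros((n, n))
    s = 1.0 / np.sqrt(2.0)
    for k in range(n // 2):
        W[k, 2 * k], W[k, 2 * k + 1] = s, s                      # smooth part
        W[n // 2 + k, 2 * k], W[n // 2 + k, 2 * k + 1] = s, -s    # detail part
    return W

n = 64
A = np.random.rand(n, n) + n * np.eye(n)   # toy, well-conditioned stand-in for a BEM matrix
b = np.random.rand(n)

W = haar_matrix(n)
A_w = W @ A @ W.T                          # coefficient matrix in the wavelet basis
b_w = W @ b
y = np.linalg.solve(A_w, b_w)              # solve in the wavelet basis
x = W.T @ y                                # back-transform the solution
assert np.allclose(A @ x, b)

# In the paper the transformed matrix would be thresholded to a sparse one and
# the system solved with a parallel iterative method rather than a direct solve.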


Future Generation Computer Systems | 2001

Modeling and improving locality for the sparse-matrix-vector product on cache memories

Dora Blanco Heras; Vicente Blanco; José Carlos Cabaleiro; Francisco F. Rivera

A model for representing and improving the locality exhibited by the execution of sparse irregular problems is developed in this work. We focus on the product of a sparse matrix by a dense vector (SpM×V). We consider the cache memory as a representative level of the memory hierarchy. Locality is evaluated through four functions based on two parameters called entry matches and line matches. In order to increase the locality, two algorithms are applied: one based on the construction of minimum spanning trees and the other on the nearest-neighbor heuristic. These techniques were tested and compared with some standard ordering algorithms.
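
As a rough illustration of the nearest-neighbor idea (not the authors' implementation, and the MST-based variant is omitted), the sketch below reorders rows greedily: starting from an arbitrary row, it repeatedly picks the unvisited row sharing the most column entries with the current one.

import numpy as np
from scipy.sparse import random as sparse_random

A = sparse_random(100, 100, density=0.05, format="csr")

def row_cols(csr, i):
    return set(csr.indices[csr.indptr[i]:csr.indptr[i + 1]])

def nearest_neighbor_order(csr):
    n = csr.shape[0]
    remaining = set(range(1, n))
    order = [0]
    while remaining:
        cur = row_cols(csr, order[-1])
        # Pick the remaining row with the largest overlap ("entry matches").
        nxt = max(remaining, key=lambda r: len(cur & row_cols(csr, r)))
        order.append(nxt)
        remaining.remove(nxt)
    return np.array(order)

perm = nearest_neighbor_order(A)
A_reordered = A[perm, :]          # rows now visited in a locality-friendly order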


The Journal of Supercomputing | 2001

Parallel Computation of Wavelet Transforms Using the Lifting Scheme

Patricia González; José Carlos Cabaleiro; Tomás F. Pena

The lifting scheme [14] is a method for the construction of biorthogonal wavelets and fast computation of the corresponding wavelet transforms. This paper describes a message-passing parallel implementation in which high efficiency is achieved by a modified data-swapping approach that allows communications to overlap computations. The method is illustrated by application to Haar and Daubechies (D4) wavelets. Timing and speed-up results for the Cray T3E and the Fujitsu AP3000 are presented.
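
The lifting formulation of the Haar transform is small enough to show in full; the sketch below is a serial reference version, whereas the paper distributes these per-sample loops across processors and overlaps the boundary data exchange with computation.

import numpy as np

def haar_lifting_forward(x):
    even, odd = x[0::2].astype(float), x[1::2].astype(float)
    detail = odd - even             # predict step
    approx = even + detail / 2.0    # update step (preserves the running average)
    return approx, detail

def haar_lifting_inverse(approx, detail):
    even = approx - detail / 2.0
    odd = detail + even
    x = np.empty(2 * len(approx))
    x[0::2], x[1::2] = even, odd
    return x

x = np.arange(8.0)
a, d = haar_lifting_forward(x)
assert np.allclose(haar_lifting_inverse(a, d), x)   # perfect reconstruction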


Advances in Engineering Software | 2000

Dual BEM for crack growth analysis on distributed-memory multiprocessors

P. González; Tomás F. Pena; José Carlos Cabaleiro; Francisco F. Rivera

The Dual Boundary Element Method (DBEM) has been presented as an effective numerical technique for the analysis of linear elastic crack problems [Portela A, Aliabadi MH. The dual boundary element method: effective implementation for crack problems. Int J Num Meth Engng 1992;33:1269–1287]. Analysis of large structural integrity problems may require large computational resources, both in terms of CPU time and memory. This paper reports a message-passing implementation of the DBEM formulation for the analysis of crack growth in structures. We have analyzed the construction of the system and its resolution. Different data distribution techniques have been studied on several problems, and results in terms of scalability and load balance for these two stages are presented.
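
The following toy sketch contrasts two of the row distributions one might consider when assembling such a dense DBEM system over P processes, contiguous block rows versus cyclic rows; only the ownership map is shown, and the actual assembly, solver, and MPI calls of the paper are omitted.

import numpy as np

def block_row_owner(n_rows, n_procs):
    # Contiguous chunks: good locality during assembly, but can unbalance later stages.
    bounds = np.linspace(0, n_rows, n_procs + 1, dtype=int)
    owner = np.empty(n_rows, dtype=int)
    for p in range(n_procs):
        owner[bounds[p]:bounds[p + 1]] = p
    return owner

def cyclic_row_owner(n_rows, n_procs):
    # Round-robin rows: better load balance for row-wise work.
    return np.arange(n_rows) % n_procs

print(block_row_owner(10, 3))    # [0 0 0 1 1 1 2 2 2 2]
print(cyclic_row_owner(10, 3))   # [0 1 2 0 1 2 0 1 2 0]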


International Parallel and Distributed Processing Symposium | 2009

Accurate analytical performance model of communications in MPI applications

Diego Martínez; José Carlos Cabaleiro; Tomás F. Pena; Francisco F. Rivera; Vicente Blanco

This paper presents a new LogP-based model, called LoOgGP, which allows an accurate characterization of MPI applications based on microbenchmark measurements. This new model is an extension of LogP for long messages in which both the overhead and gap parameters depend linearly on the message size. The LoOgGP model has been fully integrated into a modelling framework that obtains statistical models of parallel applications, providing the analyst with an easy and automatic tool for assessing the LoOgGP parameter set that characterizes communications. The use of the LoOgGP model to obtain a statistical performance model of an image deconvolution application is illustrated as a case study.
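
A toy illustration of the underlying idea, with the overhead and gap modelled as linear functions of the message size and fitted from hypothetical, made-up microbenchmark data; the parameter names and the exact parameterization used in the paper may differ.

import numpy as np

# Hypothetical microbenchmark data: message size (bytes) versus measured
# per-message overhead and gap (seconds). These numbers are invented.
sizes = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
overhead = np.array([2.1e-6, 5.3e-6, 3.6e-5, 3.2e-4, 3.1e-3])
gap = np.array([3.0e-6, 8.0e-6, 6.1e-5, 5.8e-4, 5.6e-3])

# Linear fits: o(m) = o0 + O*m and g(m) = g0 + G*m.
O, o0 = np.polyfit(sizes, overhead, 1)
G, g0 = np.polyfit(sizes, gap, 1)

def send_time(m, L=5e-6):
    # Time to inject one m-byte message: latency plus the size-dependent
    # overhead (a simplification of the full model).
    return L + o0 + O * m

print(f"o(m) = {o0:.2e} + {O:.2e}*m,  g(m) = {g0:.2e} + {G:.2e}*m")
print("predicted send time for a 1 MB message:", send_time(1e6))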


Future Generation Computer Systems | 2003

AVISPA: visualizing the performance prediction of parallel iterative solvers

Vicente Blanco; Patricia González; José Carlos Cabaleiro; Dora Blanco Heras; Tomás F. Pena; Juan J. Pombo; Francisco F. Rivera

The selection of the best method and preconditioner for solving a sparse linear system is as decisive as the efficient parallelization of the selected method. We propose a tool that helps to solve both problems on distributed memory multiprocessors using iterative methods. Based on a previously developed library of HPF and message-passing interface (MPI) codes, a performance prediction model is developed and a visualization tool (AVISPA) is proposed. The tool combines theoretical features of the methods and preconditioners with practical considerations and predictions about aspects of the execution performance (computational cost, communications overhead, etc.). It offers detailed information about all the topics that can be useful for selecting the most suitable method and preconditioner. It can also report on different parallel implementations of the code (HPF and MPI) while varying the number of available processors.


High Performance Computing for Computational Science (Vector and Parallel Processing) | 1996

Principal Component Analysis on Vector Computers

Dora Blanco Heras; José Carlos Cabaleiro; Vicente Blanco Pérez; Pablo Costas; Francisco F. Rivera

Principal component analysis is a classical multivariate technique used as a basic tool in the field of image processing. Due to the iterative character and the high computational cost of these algorithms on conventional computers, they are good candidates for pipelined processing. In this work we analyse these algorithms from the viewpoint of vectorization and present an efficient implementation on the Fujitsu VP-2400/10. We systematically applied different code transformations to the algorithm, making use of the vector capabilities of the system. In particular, we have tested a number of vectorization techniques that optimize the reuse of the vector registers, exploit all levels of the memory hierarchy, and utilize the pipelined units in parallel (concurrency between them). We have considered images of 32×32 pixels and have divided the algorithm into three stages. The speedups obtained with the native vectorizing compiler were 1.3, 1.3 and 7.9 for the different stages. These speedups were multiplied by factors of 5, 50 and 55, respectively, after applying our code transformations. The best improvement was achieved in the third stage of the algorithm, which is the most time-consuming.
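
For reference, the core computation is standard PCA via an eigendecomposition of the covariance matrix of the flattened 32×32 images, sketched below in plain NumPy; the paper's contribution concerns how these loops are mapped onto the vector units, which is not reflected here.

import numpy as np

rng = np.random.default_rng(0)
images = rng.random((100, 32 * 32))        # 100 images, flattened to 1024-vectors

mean = images.mean(axis=0)
centered = images - mean
cov = centered.T @ centered / (len(images) - 1)   # 1024 x 1024 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric eigendecomposition
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:10]]        # ten leading principal components

projected = centered @ components          # images in the reduced basis
print(projected.shape)                     # (100, 10)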

Collaboration


Dive into José Carlos Cabaleiro's collaborations.

Top Co-Authors

Tomás F. Pena, University of Santiago de Compostela

Francisco F. Rivera, University of Santiago de Compostela

Dora Blanco Heras, University of Santiago de Compostela

Juan Carlos Pichel, University of Santiago de Compostela

Diego Martínez, University of Santiago de Compostela

Oscar G. Lorenzo, University of Santiago de Compostela

David López Vilariño, University of Santiago de Compostela