Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where José R. Herrero is active.

Publication


Featured research published by José R. Herrero.


Concurrency and Computation: Practice and Experience | 2009

Parallelizing dense and banded linear algebra libraries using SMPSs

Rosa M. Badia; José R. Herrero; Jesús Labarta; Josep M. Perez; Enrique S. Quintana-Ortí; Gregorio Quintana-Ortí

The promise of future many-core processors, with hundreds of threads running concurrently, has led the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain better load balance, and pay careful attention to the critical path of computation. In this paper we describe how existing serial libraries such as (C)LAPACK and FLAME can be easily parallelized using the SMPSs tools, consisting of a few OpenMP-like pragmas and a run-time system. In the LAPACK case, this usually requires the development of blocked algorithms for simple BLAS-level operations, which expose concurrency at a finer grain. For better performance, our experimental results indicate that column-major order, as employed by this library, needs to be abandoned in favor of a block data layout. This will require a deeper rewrite of LAPACK or, alternatively, a dynamic conversion of the storage pattern at run-time. The parallelization of FLAME routines using SMPSs is simpler, as this library includes blocked algorithms (or algorithms-by-blocks in the FLAME argot) for most operations and storage-by-blocks (or block data layout) is already in place.
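For readers unfamiliar with the programming model, the sketch below annotates a tiled matrix product with SMPSs-style task pragmas. The block size, function names and exact pragma clauses are illustrative assumptions rather than code from the paper; with a plain C compiler the pragmas are ignored and the loops run serially, while under the SMPSs compiler and runtime each annotated call becomes a task whose dependences are inferred from the input/inout directions.

```c
#include <stddef.h>

#define NB 256          /* illustrative block size (assumption) */

/* One task per block update C += A * B; the runtime builds the task
   dependence graph from the input/inout annotations of the arguments. */
#pragma css task input(A, B) inout(C)
void gemm_block(double A[NB][NB], double B[NB][NB], double C[NB][NB])
{
    for (size_t i = 0; i < NB; i++)
        for (size_t k = 0; k < NB; k++)
            for (size_t j = 0; j < NB; j++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Tiled C = A * B over nt x nt tiles stored by blocks; every call spawns
   a task and the runtime schedules ready tasks on idle cores. */
void gemm_tiled(size_t nt,
                double A[nt][nt][NB][NB],
                double B[nt][nt][NB][NB],
                double C[nt][nt][NB][NB])
{
    for (size_t i = 0; i < nt; i++)
        for (size_t j = 0; j < nt; j++)
            for (size_t k = 0; k < nt; k++)
                gemm_block(A[i][k], B[k][j], C[i][j]);
#pragma css barrier
}
```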


Applied Mathematics Letters | 2012

On new computational local orders of convergence

Miquel Grau-Sánchez; Miquel Noguera; Àngela Grau; José R. Herrero

Four new variants of the Computational Order of Convergence (COC) of a one-point iterative method with memory for solving nonlinear equations are presented. Furthermore, how the new variants approximate the local order of convergence is analyzed. Three of the new definitions given here do not involve the unknown root. Numerical experiments using adaptive arithmetic with multiple precision and a stopping criterion are implemented without using any known root.
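As background (these are the standard definitions, not the four new variants introduced in the paper), the classical computational order of convergence and a root-free approximation of the kind referred to above can be written as:

```latex
% Classical (error-based) order of convergence, which needs the root \alpha:
\rho_n = \frac{\ln\bigl(|x_{n+1}-\alpha| / |x_n-\alpha|\bigr)}
              {\ln\bigl(|x_n-\alpha| / |x_{n-1}-\alpha|\bigr)}

% Root-free approximation built only from consecutive iterates:
\hat\rho_n = \frac{\ln\bigl(|x_{n+1}-x_n| / |x_n-x_{n-1}|\bigr)}
                  {\ln\bigl(|x_n-x_{n-1}| / |x_{n-1}-x_{n-2}|\bigr)}
```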


Applicable Algebra in Engineering, Communication and Computing | 2007

Analysis of a sparse hypermatrix Cholesky with fixed-sized blocking

José R. Herrero; Juan J. Navarro

We present the way in which we have constructed an implementation of a sparse Cholesky factorization based on a hypermatrix data structure. This data structure is a storage scheme which produces a recursive 2D partitioning of a sparse matrix. It can be useful on some large sparse matrices. Subblocks are stored as dense matrices. Thus, efficient BLAS3 routines can be used. However, since we are dealing with sparse matrices, some zeros may be stored in those dense blocks. The overhead introduced by the operations on zeros can become large and considerably degrade performance. We present the ways in which we deal with this overhead. Using matrices from different areas (Interior Point Methods of linear programming and Finite Element Methods), we evaluate our sequential in-core hypermatrix sparse Cholesky implementation. We compare its performance with several other codes and analyze the results. In spite of using a simple fixed-size partitioning of the matrix, our code obtains competitive performance.
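A minimal sketch of the kind of recursive structure described above is given below; the type and function names and the first-touch allocation policy are assumptions for illustration only, not the paper's implementation.

```c
#include <stdlib.h>

/* Sketch of a hypermatrix node. Each level is a small 2D grid of pointers:
   a NULL entry means the whole submatrix is zero, inner entries point to
   finer grids, and entries at the last level point to dense column-major
   blocks that can be handed directly to BLAS3 routines. */
typedef struct hm_node {
    int    rows, cols;      /* grid dimensions at this level           */
    int    is_leaf_level;   /* children are dense blocks if nonzero    */
    void **child;           /* rows*cols pointers: hm_node* or double* */
} hm_node;

hm_node *hm_new(int rows, int cols, int is_leaf_level)
{
    hm_node *n = malloc(sizeof *n);
    n->rows = rows;
    n->cols = cols;
    n->is_leaf_level = is_leaf_level;
    n->child = calloc((size_t)rows * cols, sizeof *n->child); /* all zero */
    return n;
}

/* Fetch a dense leaf block, allocating it on first touch. */
double *hm_leaf(hm_node *n, int bi, int bj, int nb)
{
    void **slot = &n->child[(size_t)bi * n->cols + bj];
    if (*slot == NULL)
        *slot = calloc((size_t)nb * nb, sizeof(double));
    return *slot;
}
```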


european conference on parallel processing | 2003

Improving Performance of Hypermatrix Cholesky Factorization

José R. Herrero; Juan J. Navarro

This paper shows how a sparse hypermatrix Cholesky factorization can be improved. This is accomplished by means of efficient codes which operate on very small dense matrices. Different matrix sizes or target platforms may require different codes to obtain good performance. We write a set of codes for each matrix operation using different loop orders and unroll factors. Then, for each matrix size, we automatically compile each code fixing matrix leading dimensions and loop sizes, run the resulting executable, and record its Mflops rate. The best-performing variant is then used to produce the object file included in a library. Thus, a routine for each desired matrix size is available from the library. The large overhead incurred by the hypermatrix Cholesky factorization of sparse matrices can therefore be lessened by reducing the block size when those routines are used. Using the routines in our small-matrix library, e.g. matrix multiplication, produced important speed-ups in our sparse Cholesky code.
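The sketch below illustrates the selection step under simplified assumptions: two loop-order variants of a small fixed-size matrix product are timed and their Mflops compared. The real generator described above also varies unroll factors and fixes leading dimensions per code; the matrix size and function names here are illustrative.

```c
#include <stdio.h>
#include <time.h>

#define N 16    /* matrix size baked in at compile time (example value) */

/* Candidate kernels for C += A*B with different loop orders. */
static void mxm_ijk(const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i*N + j] += A[i*N + k] * B[k*N + j];
}

static void mxm_jki(const double *A, const double *B, double *C)
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                C[i*N + j] += A[i*N + k] * B[k*N + j];
}

/* Time one candidate and report Mflops; the best variant would then be
   compiled into the small-matrix library for this size. */
static double mflops(void (*kernel)(const double *, const double *, double *),
                     int reps)
{
    static double A[N*N], B[N*N], C[N*N];
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        kernel(A, B, C);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return 2.0 * N * N * N * reps / secs / 1e6;
}

int main(void)
{
    printf("ijk: %.1f Mflops\n", mflops(mxm_ijk, 100000));
    printf("jki: %.1f Mflops\n", mflops(mxm_jki, 100000));
    return 0;
}
```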


international conference on supercomputing | 1996

Data prefetching and multilevel blocking for linear algebra operations

Juan J. Navarro; Elena García-Diego; José R. Herrero

Much effort has been directed towards obtaining near peak performance for linear algebra operations on current high performance workstations. The large number of data accesses, however, makes performance highly dependent on the behavior of the memory hierarchy. Techniques such as Multilevel Blocking (Tiling), Data Precopying, Software Pipelining and Software Prefetching have been applied in order to improve performance. Nevertheless, to our knowledge, no other work has been done considering the relation between these techniques when applied together. In this paper we analyze the behavior of matrix multiplication algorithms for large matrices on a superscalar and superpipelined processor with a multilevel memory hierarchy when these techniques are applied together. We study and model the performance and limitations of different codes. We also compare two different approaches to data prefetching, binding versus non-binding, and find the latter remarkably more effective than the former, due mainly to its flexibility. Results are obtained on a workstation with 200 MFlops peak performance. The initial 15 MFlops of the simple jki form can be improved up to 128 MFlops when Multilevel Blocking (at the register level and 2 cache levels), Data Precopying and Software Pipelining techniques are used. Computations can be further sped up to 166 MFlops when Non-binding Prefetch is combined with all of the previous techniques.
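The following sketch shows one level of cache blocking around jki-style inner loops together with a non-binding prefetch hint. The block size and the __builtin_prefetch intrinsic (a GCC/Clang extension that post-dates the paper) are illustrative assumptions, not the codes evaluated above; n is assumed to be a multiple of the block size for brevity.

```c
#define BS 64   /* cache block size (assumption) */

/* One level of blocking for C += A*B, row-major storage. */
void mxm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int i = 0; i < n; i++)
                for (int k = kk; k < kk + BS; k++) {
                    double a = A[i*n + k];
                    for (int j = jj; j < jj + BS; j++) {
                        /* Non-binding hint: fetch ahead on B without
                           committing the value to a register. */
                        __builtin_prefetch(&B[k*n + j + 8], 0, 1);
                        C[i*n + j] += a * B[k*n + j];
                    }
                }
}
```

A second level of blocking and register-level tiling would be added around and inside these loops in the same fashion.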


Pattern Analysis and Applications | 2007

Exploiting computer resources for fast nearest neighbor classification

José R. Herrero; Juan J. Navarro

Modern computers provide excellent opportunities for performing fast computations. They are equipped with powerful microprocessors and large memories. However, programs are not necessarily able to exploit those computer resources effectively. In this paper, we present the way in which we have implemented a nearest neighbor classifier. We show how performance can be improved by exploiting the ability of superscalar processors to issue multiple instructions per cycle and by using the memory hierarchy adequately. This is accomplished by the use of floating-point arithmetic, which usually outperforms integer arithmetic, and block (tiled) algorithms, which exploit the data locality of programs, allowing for an efficient use of the data stored in the cache memory. Our results are validated with both an analytical model and empirical results. We show that regular codes can run faster than more complex irregular codes on standard data sets.
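A minimal sketch of a tiled 1-nearest-neighbour search in the spirit described above is shown below; the block sizes, array layout and function name are assumptions, not the paper's implementation. A tile of training vectors stays in cache while a block of queries reuses it.

```c
#include <float.h>

#define QB 32   /* query block size (assumption)    */
#define TB 64   /* training block size (assumption) */

void nn_blocked(int nq, int nt, int d,
                const float *q,      /* nq x d queries, row-major   */
                const float *train,  /* nt x d training, row-major  */
                int *nearest)        /* best training index per query */
{
    float best[QB];

    for (int q0 = 0; q0 < nq; q0 += QB) {
        int qb = (nq - q0 < QB) ? nq - q0 : QB;
        for (int i = 0; i < qb; i++) { best[i] = FLT_MAX; nearest[q0 + i] = -1; }

        for (int t0 = 0; t0 < nt; t0 += TB) {
            int tb = (nt - t0 < TB) ? nt - t0 : TB;
            for (int i = 0; i < qb; i++)
                for (int j = 0; j < tb; j++) {
                    float dist = 0.0f;   /* squared Euclidean distance */
                    for (int k = 0; k < d; k++) {
                        float diff = q[(q0 + i)*d + k] - train[(t0 + j)*d + k];
                        dist += diff * diff;
                    }
                    if (dist < best[i]) { best[i] = dist; nearest[q0 + i] = t0 + j; }
                }
        }
    }
}
```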


international conference on computational science and its applications | 2006

Compiler-optimized kernels: an efficient alternative to hand-coded inner kernels

José R. Herrero; Juan J. Navarro

The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. In this paper, however, we present an alternative way to produce efficient matrix multiplication kernels based on a set of simple codes which can be parameterized at compilation time. Using the resulting kernels we have been able to produce high performance sparse and dense linear algebra codes on a variety of platforms.
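The sketch below shows the parameterization mechanism under assumed macro and file names: the dimensions are fixed on the compile command line, so every trip count is a compile-time constant and the compiler can unroll completely and specialize the kernel for each size.

```c
/* kernel.c (illustrative): compile one specialized object per size, e.g.
   cc -O3 -funroll-loops -DM=4 -DN=4 -DK=64 -c kernel.c -o mxm_4x4x64.o */
#ifndef M
#define M 4
#endif
#ifndef N
#define N 4
#endif
#ifndef K
#define K 64
#endif

void mxm_fixed(const double A[M][K], const double B[K][N], double C[M][N])
{
    /* All loop bounds and leading dimensions are constants here, so the
       compiler can keep the C tile in registers and unroll fully. */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            double c = C[i][j];
            for (int k = 0; k < K; k++)
                c += A[i][k] * B[k][j];
            C[i][j] = c;
        }
}
```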


parallel processing and applied mathematics | 2007

New data structures for matrices and specialized inner kernels: low overhead for high performance

José R. Herrero

Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This approach, however, achieves suboptimal performance due to the overheads associated with such calls. Taking as an example the dense Cholesky factorization of a symmetric positive definite matrix, we show that the potential of non-canonical data structures for dense linear algebra can be better exploited with the use of specialized inner kernels. The use of non-canonical data structures together with specialized inner kernels has low overhead and can produce excellent performance.
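As an illustration of a specialized inner kernel on a block data layout (the block size and routine name are assumptions), the routine below performs the symmetric rank-k style update that a blocked Cholesky applies to a contiguous tile; because the tile is contiguous and its size is fixed, there is no leading-dimension bookkeeping or argument checking at the call site.

```c
#define NB 64   /* tile size (assumption) */

/* C := C - A * A^T on the lower triangle of a contiguous NB x NB tile. */
void syrk_tile(const double A[NB][NB], double C[NB][NB])
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j <= i; j++) {
            double s = C[i][j];
            for (int k = 0; k < NB; k++)
                s -= A[i][k] * A[j][k];
            C[i][j] = s;
        }
}
```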


parallel computing | 2006

Using non-canonical array layouts in dense matrix operations

José R. Herrero; Juan J. Navarro

We present two implementations of dense matrix multiplication based on two different non-canonical array layouts: one based on a hypermatrix data structure (HM) where data submatrices are stored using a recursive layout; the other based on a simple block data layout with square blocks (SB) where blocks are arranged in column-major order. We show that the iterative code using SB outperforms a recursive code using HM and obtains competitive results on a variety of platforms.
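A minimal sketch of SB addressing under simplifying assumptions (square matrix, dimension a multiple of the block size; names are illustrative) is:

```c
#include <stddef.h>

#define SB 64   /* square block size (assumption) */

/* Map element (i, j) to its address in a square-block layout: tiles are
   stored in column-major order and elements inside each tile are also
   column-major, so every tile is contiguous in memory. */
static inline double *sb_elem(double *A, size_t n, size_t i, size_t j)
{
    size_t nbr   = n / SB;                   /* tiles per column of tiles   */
    size_t bi    = i / SB, bj = j / SB;      /* tile coordinates            */
    size_t block = bj * nbr + bi;            /* tile index, column-major    */
    size_t off   = (j % SB) * SB + (i % SB); /* column-major inside tile    */
    return &A[block * (size_t)SB * SB + off];
}
```

The iterative blocked multiply then walks tiles with plain loops and passes contiguous tile pointers to the inner kernel.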


workshop on computer architecture education | 2015

Tareador: a tool to unveil parallelization strategies at undergraduate level

Eduard Ayguadé; Rosa M. Badia; Daniel Jimenez; José R. Herrero; Jesús Labarta; Vladimir Subotic; Gladys Utrera

This paper presents a methodology and framework designed to assist students in the process of finding appropriate task decomposition strategies for their sequential program, as well as identifying bottlenecks in the later execution of the parallel program. One of the main components of this framework is Tareador, which provides a simple API to specify potential task decomposition strategies for a sequential program. Once the student proposes how to break the sequential code into tasks, Tareador 1) provides information about the dependences between tasks that should be honored when implementing that task decomposition using a parallel programming model; 2) estimates the potential parallelism that could be achieved in an ideal parallel architecture with infinite processors; and 3) simulates the parallel execution on an ideal architecture, estimating the potential speed-up that could be achieved on a given number of processors. The methodology is currently applied to teach parallelism in a third-year compulsory subject in the Bachelor Degree in Informatics Engineering at the Barcelona School of Informatics of the Universitat Politècnica de Catalunya (UPC) - BarcelonaTech.
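A hedged sketch of how a candidate decomposition might be marked with the Tareador API is shown below; the header name and exact call signatures are recalled from the related course material and should be treated as assumptions and checked against the tool's documentation.

```c
#include "tareador.h"   /* Tareador instrumentation API (header name assumed) */

#define N 1024
static double x[N], y[N], acc;

int main(void)
{
    tareador_ON();                          /* start instrumentation */

    for (int i = 0; i < N; i++) {
        /* Mark one potential task per iteration; Tareador records the memory
           accesses of each region and reports the dependences between them. */
        tareador_start_task("dot_chunk");
        acc += x[i] * y[i];                 /* accumulation creates a dependence chain */
        tareador_end_task("dot_chunk");
    }

    tareador_OFF();                         /* stop instrumentation */
    return 0;
}
```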

Collaboration


Dive into José R. Herrero's collaborations.

Top Co-Authors

Juan J. Navarro (Polytechnic University of Catalonia)
David López (Polytechnic University of Catalonia)
Eduard Ayguadé (Barcelona Supercomputing Center)
Jesús Labarta (Barcelona Supercomputing Center)
Rosa M. Badia (Barcelona Supercomputing Center)
Diego Marron (Barcelona Supercomputing Center)
Fermín Sánchez (Polytechnic University of Catalonia)