José R. Herrero
Polytechnic University of Catalonia
Publication
Featured research published by José R. Herrero.
Concurrency and Computation: Practice and Experience | 2009
Rosa M. Badia; José R. Herrero; Jesús Labarta; Josep M. Perez; Enrique S. Quintana-Ortí; Gregorio Quintana-Ortí
The promise of future many-core processors, with hundreds of threads running concurrently, has led the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain better load balance, and pay careful attention to the critical path of computation. In this paper we describe how existing serial libraries such as (C)LAPACK and FLAME can be easily parallelized using the SMPSs tools, consisting of a few OpenMP-like pragmas and a run-time system. In the LAPACK case, this usually requires the development of blocked algorithms for simple BLAS-level operations, which expose concurrency at a finer grain. For better performance, our experimental results indicate that column-major order, as employed by this library, needs to be abandoned in favor of a block data layout. This will require a deeper rewrite of LAPACK or, alternatively, a dynamic conversion of the storage pattern at run-time. The parallelization of FLAME routines using SMPSs is simpler as this library includes blocked algorithms (or algorithms-by-blocks in the FLAME argot) for most operations and storage-by-blocks (or block data layout) is already in place.
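To make the task-decomposition idea concrete, the following is a minimal sketch of a right-looking blocked Cholesky factorization written with standard OpenMP task depend clauses instead of the SMPSs pragmas used in the paper; the block data layout, the block size NB and the helper names are assumptions of the sketch, and the LAPACKE/CBLAS kernels merely stand in for the library routines discussed above.

/* Sketch: task-parallel blocked Cholesky over a block data layout.
 * OpenMP task depend clauses are used here to mimic the SMPSs-style
 * decomposition described in the paper (the actual SMPSs syntax differs).
 * A[i*nb + j] points to an NB x NB column-major block of the lower triangle.
 */
#include <cblas.h>
#include <lapacke.h>

void chol_blocked(int nb, int NB, double **A)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nb; k++) {
        double *Akk = A[k*nb + k];
        #pragma omp task depend(inout: Akk[0:NB*NB])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB, Akk, NB);   /* factor diagonal block */

        for (int i = k + 1; i < nb; i++) {
            double *Aik = A[i*nb + k];
            #pragma omp task depend(in: Akk[0:NB*NB]) depend(inout: Aik[0:NB*NB])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, NB, NB, 1.0, Akk, NB, Aik, NB);
        }
        for (int i = k + 1; i < nb; i++) {
            double *Aik = A[i*nb + k];
            double *Aii = A[i*nb + i];
            #pragma omp task depend(in: Aik[0:NB*NB]) depend(inout: Aii[0:NB*NB])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        NB, NB, -1.0, Aik, NB, 1.0, Aii, NB);  /* update diagonal block */
            for (int j = k + 1; j < i; j++) {
                double *Ajk = A[j*nb + k];
                double *Aij = A[i*nb + j];
                #pragma omp task depend(in: Aik[0:NB*NB], Ajk[0:NB*NB]) depend(inout: Aij[0:NB*NB])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            NB, NB, NB, -1.0, Aik, NB, Ajk, NB, 1.0, Aij, NB);
            }
        }
    }
}

The run-time builds the task graph from the depend clauses, which is essentially what the SMPSs run-time system does from its pragmas.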
Applied Mathematics Letters | 2012
Miquel Grau-Sánchez; Miquel Noguera; Àngela Grau; José R. Herrero
Four new variants of the Computational Order of Convergence (COC) of a one-point iterative method with memory for solving nonlinear equations are presented. Furthermore, the way in which the new variants approximate the local order of convergence is analyzed. Three of the new definitions given here do not involve the unknown root. Numerical experiments using adaptive arithmetic with multiple precision and a stopping criterion are implemented without using any known root.
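For context, the classical computational order of convergence and its well-known root-free approximation are the standard starting points; the paper's four new variants are not reproduced here. In LaTeX:

% x_k are the iterates and \alpha the sought root of f(x) = 0.
% Classical COC (needs \alpha) and a root-free approximation (ACOC).
\rho_k \approx \frac{\ln\bigl(|x_{k+1}-\alpha| \,/\, |x_k-\alpha|\bigr)}
                    {\ln\bigl(|x_k-\alpha| \,/\, |x_{k-1}-\alpha|\bigr)},
\qquad
\hat{\rho}_k \approx \frac{\ln\bigl(|x_{k+1}-x_k| \,/\, |x_k-x_{k-1}|\bigr)}
                          {\ln\bigl(|x_k-x_{k-1}| \,/\, |x_{k-1}-x_{k-2}|\bigr)}.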
Applicable Algebra in Engineering, Communication and Computing | 2007
José R. Herrero; Juan J. Navarro
We present the way in which we have constructed an implementation of a sparse Cholesky factorization based on a hypermatrix data structure. This data structure is a storage scheme which produces a recursive 2D partitioning of a sparse matrix. It can be useful on some large sparse matrices. Subblocks are stored as dense matrices. Thus, efficient BLAS3 routines can be used. However, since we are dealing with sparse matrices some zeros may be stored in those dense blocks. The overhead introduced by the operations on zeros can become large and considerably degrade performance. We present the ways in which we deal with this overhead. Using matrices from different areas (Interior Point Methods of linear programming and Finite Element Methods), we evaluate our sequential in-core hypermatrix sparse Cholesky implementation. We compare its performance with several other codes and analyze the results. In spite of using a simple fixed-size partitioning of the matrix our code obtains competitive performance.
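As an illustration only, a hypermatrix of the kind described above can be pictured as a tree whose pointer levels recursively partition the matrix in 2D and whose leaves are small dense blocks; the struct below is a hypothetical sketch, not the paper's actual data structure.

/* Illustrative hypermatrix node: pointer levels recursively partition the
 * matrix in 2D, NULL children mark submatrices containing no nonzeros, and
 * leaves hold small dense blocks on which BLAS3 routines can operate (those
 * blocks may still store some zeros, which is the overhead analyzed above).
 */
#include <stddef.h>

enum hm_kind { HM_POINTER, HM_DENSE };

typedef struct hm_node {
    enum hm_kind kind;
    int rows, cols;                 /* dimensions of this submatrix          */
    union {
        struct {                    /* pointer level: 2D array of children   */
            int br, bc;             /* number of block rows / block columns  */
            struct hm_node **child; /* br*bc entries; NULL means all zeros   */
        } ptr;
        double *dense;              /* leaf: rows*cols, column-major         */
    } u;
} hm_node;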
european conference on parallel processing | 2003
José R. Herrero; Juan J. Navarro
This paper shows how a sparse hypermatrix Cholesky factorization can be improved. This is accomplished by means of efficient codes which operate on very small dense matrices. Different matrix sizes or target platforms may require different codes to obtain good performance. We write a set of codes for each matrix operation using different loop orders and unroll factors. Then, for each matrix size, we automatically compile each code fixing matrix leading dimensions and loop sizes, run the resulting executable and record its Mflops. The best combination is then used to produce the object file included in a library. Thus, a routine for each desired matrix size is available from the library. The large overhead incurred by the hypermatrix Cholesky factorization of sparse matrices can therefore be lessened by reducing the block size when those routines are used. Using the routines in our small matrix library, e.g. matrix multiplication, produced important speed-ups in our sparse Cholesky code.
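A toy example of what "fixing matrix leading dimensions and loop sizes" at compile time can look like is shown below; the macro names and the build line are invented for the illustration, and the real library generates and times many such variants before keeping the fastest one for each block size.

/* Toy fixed-size kernel: C += A * B with dimensions and leading dimensions
 * known at compile time, e.g.
 *   cc -O3 -DM_FIX=4 -DN_FIX=4 -DK_FIX=4 -DLDA=4 -DLDB=4 -DLDC=4 -c kernel.c
 * With constant trip counts the compiler can fully unroll and schedule the
 * loops; one object file per size goes into the small matrix library.
 */
void mxm_fixed(const double *A, const double *B, double *C)
{
    for (int j = 0; j < N_FIX; j++)
        for (int k = 0; k < K_FIX; k++) {
            double b = B[k + j*LDB];
            for (int i = 0; i < M_FIX; i++)
                C[i + j*LDC] += A[i + k*LDA] * b;
        }
}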
international conference on supercomputing | 1996
Juan J. Navarro; Elena García-Diego; José R. Herrero
Much effort has been directed towards obtaining near peak performance for linear algebra operations on current high performance workstations. The large number of data accesses, however, makes performance highly dependent on the behavior of the memory hierarchy. Techniques such as Multilevel Blocking (Tiling), Data Precopying, Software Pipelining and Software Prefetching have been applied in order to improve performance. Nevertheless, to our knowledge, no other work has been done considering the relation between these techniques when applied together. In this paper we analyze the behavior of matrix multiplication algorithms for large matrices on a superscalar and superpipelined processor with a multilevel memory hierarchy when these techniques are applied together. We study and model the performance and limitations of different codes. We also compare two different approaches to data prefetching, binding versus non-binding, and find the latter remarkably more effective than the former, due mainly to its flexibility. Results are obtained on a workstation with 200 MFlops peak performance. The initial 15 MFlops of the simple jki form can be improved up to 128 MFlops when Multilevel Blocking (at the register level and 2 cache levels), Data Precopying and Software Pipelining techniques are used. Computations can be further sped up to 166 MFlops when Non-binding Prefetch is combined with all of the previous techniques.
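As a rough illustration of combining cache blocking with non-binding prefetching, the sketch below uses the GCC/Clang __builtin_prefetch hint as a modern stand-in for the machine-specific prefetch instructions of that era; the tile size and prefetch distance are placeholders, and only one level of blocking is shown.

/* Sketch: one level of cache blocking plus non-binding software prefetch.
 * A non-binding prefetch only hints the memory system (no register is bound
 * to the loaded value), which is what gives the scheduler its flexibility.
 */
#define TB      64   /* cache tile size (placeholder)    */
#define PF_DIST  8   /* prefetch distance, in iterations */

void mxm_tiled_prefetch(int n, const double *A, const double *B, double *C)
{
    for (int jj = 0; jj < n; jj += TB)
        for (int kk = 0; kk < n; kk += TB)
            for (int j = jj; j < jj + TB && j < n; j++)
                for (int k = kk; k < kk + TB && k < n; k++) {
                    double b = B[k + j*n];          /* column-major storage */
                    const double *a = &A[k*n];
                    double *c = &C[j*n];
                    for (int i = 0; i < n; i++) {
                        if (i + PF_DIST < n)
                            __builtin_prefetch(&a[i + PF_DIST]); /* non-binding hint */
                        c[i] += a[i] * b;
                    }
                }
}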
Pattern Analysis and Applications | 2007
José R. Herrero; Juan J. Navarro
Modern computers provide excellent opportunities for performing fast computations. They are equipped with powerful microprocessors and large memories. However, programs are not necessarily able to exploit those computer resources effectively. In this paper, we present the way in which we have implemented a nearest neighbor classifier. We show how performance can be improved by exploiting the ability of superscalar processors to issue multiple instructions per cycle and by using the memory hierarchy adequately. This is accomplished by the use of floating-point arithmetic, which usually outperforms integer arithmetic, and block (tiled) algorithms, which exploit the data locality of programs allowing for an efficient use of the data stored in the cache memory. Our results are validated with both an analytical model and empirical results. We show that regular codes can run faster than more complex irregular codes on standard data sets.
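A minimal sketch of the blocked (tiled) distance computation underlying such a nearest neighbor classifier is given below, assuming dense double-precision vectors stored row by row; the tile size and function names are placeholders.

/* Sketch: brute-force 1-NN with tiling over the training set, so that a tile
 * of training vectors is reused from cache across all test vectors, and with
 * floating-point (squared Euclidean) distances as discussed above.
 */
#include <float.h>
#include <stddef.h>
#include <stdlib.h>

void nn_classify(int ntest, int ntrain, int dim,
                 const double *test,   /* ntest  x dim, row by row */
                 const double *train,  /* ntrain x dim, row by row */
                 int *label)           /* out: index of nearest training row */
{
    const int TB = 64;                 /* training tile size (placeholder) */
    double *best = malloc((size_t)ntest * sizeof *best);
    for (int q = 0; q < ntest; q++) { best[q] = DBL_MAX; label[q] = -1; }

    for (int tt = 0; tt < ntrain; tt += TB) {      /* one tile of training rows */
        int tend = tt + TB < ntrain ? tt + TB : ntrain;
        for (int q = 0; q < ntest; q++) {          /* reuse the tile from cache */
            const double *x = &test[(size_t)q * dim];
            for (int t = tt; t < tend; t++) {
                const double *row = &train[(size_t)t * dim];
                double d = 0.0;
                for (int k = 0; k < dim; k++) {
                    double diff = row[k] - x[k];
                    d += diff * diff;
                }
                if (d < best[q]) { best[q] = d; label[q] = t; }
            }
        }
    }
    free(best);
}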
international conference on computational science and its applications | 2006
José R. Herrero; Juan J. Navarro
The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. In this paper, however, we present an alternative way to produce efficient matrix multiplication kernels based on a set of simple codes which can be parameterized at compilation time. Using the resulting kernels we have been able to produce high performance sparse and dense linear algebra codes on a variety of platforms.
parallel processing and applied mathematics | 2007
José R. Herrero
Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This approach, however, achieves suboptimal performance due to the overheads associated with such calls. Taking as an example the dense Cholesky factorization of a symmetric positive definite matrix, we show that the potential of non-canonical data structures for dense linear algebra can be better exploited with the use of specialized inner kernels. The use of non-canonical data structures together with specialized inner kernels has low overhead and can produce excellent performance.
parallel computing | 2006
José R. Herrero; Juan J. Navarro
We present two implementations of dense matrix multiplication based on two different non-canonical array layouts: one based on a hypermatrix data structure (HM) where data submatrices are stored using a recursive layout; the other based on a simple block data layout with square blocks (SB) where blocks are arranged in column-major order. We show that the iterative code using SB outperforms a recursive code using HM and obtains competitive results on a variety of platforms.
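The address arithmetic behind such a square block (SB) layout can be sketched as follows, assuming n is a multiple of the block size NB and that blocks, as well as the elements inside each block, are stored in column-major order; the helper name is illustrative.

/* Illustrative element lookup in a square block (SB) data layout:
 * the n x n matrix is split into NB x NB blocks, blocks are stored one after
 * another in column-major block order, and each block is itself column-major.
 */
#include <stddef.h>

static inline double *sb_addr(double *A, int n, int NB, int i, int j)
{
    int nb = n / NB;                                  /* blocks per dimension         */
    int bi = i / NB, bj = j / NB;                     /* block coordinates            */
    int ii = i % NB, jj = j % NB;                     /* coordinates inside the block */
    size_t block = (size_t)(bj * nb + bi) * NB * NB;  /* column-major block order     */
    return A + block + (size_t)jj * NB + ii;          /* column-major inside block    */
}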
workshop on computer architecture education | 2015
Eduard Ayguadé; Rosa M. Badia; Daniel Jimenez; José R. Herrero; Jesús Labarta; Vladimir Subotic; Gladys Utrera
This paper presents a methodology and framework designed to assist students in the process of finding appropriate task decomposition strategies for their sequential program, as well as identifying bottlenecks in the later execution of the parallel program. One of the main components of this framework is Tareador, which provides a simple API to specify potential task decomposition strategies for a sequential program. Once the student proposes how to break the sequential code into tasks, Tareador 1) provides information about the dependences between tasks that should be honored when implementing that task decomposition using a parallel programming model; 2) estimates the potential parallelism that could be achieved in an ideal parallel architecture with infinite processors; and 3) simulates the parallel execution on an ideal architecture, estimating the potential speed-up that could be achieved on a given number of processors. The pedagogical style of the methodology is currently applied to teach parallelism in a third-year compulsory subject in the Bachelor Degree in Informatics Engineering at the Barcelona School of Informatics of the Universitat Politècnica de Catalunya (UPC) - BarcelonaTech.
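To illustrate the kind of annotation such an API implies, the sketch below wraps candidate tasks around pieces of a sequential loop so the tool can record the data they touch; the header name and the tareador_* call names are assumptions taken from course materials, not a verified reproduction of the tool's API.

/* Sketch: marking candidate tasks in a sequential loop so the tool can
 * analyse dependences and potential parallelism. The tareador_* names
 * below are assumptions for illustration only.
 */
#include "tareador.h"                       /* assumed header name */

void compute_row(double *row, int n);       /* hypothetical sequential kernel */

void annotated(double *A, int n)
{
    tareador_ON();                          /* start instrumented region */
    for (int i = 0; i < n; i++) {
        tareador_start_task("row_update");  /* candidate task begins */
        compute_row(&A[i * n], n);
        tareador_end_task("row_update");    /* candidate task ends   */
    }
    tareador_OFF();
}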