Publications


Featured research published by Javier Cuenca.


Parallel Computing | 2004

Architecture of an automatically tuned linear algebra library

Javier Cuenca; Domingo Giménez; José González

This paper presents an approach to a hierarchical architecture for a set of linear algebra libraries with self-optimisation capacity. In previous works the optimisation of several routines was studied separately; here, the ideas applied to individual routines are combined with the classical hierarchy of linear algebra libraries. Each self-optimised library consists of the original routines of the library plus additional special routines which obtain information on the characteristics of the system and tune certain parameters of the original routines accordingly. The relationship between libraries at the different levels of the hierarchy is also strengthened: just as each routine calls routines at lower levels, it also uses the self-optimisation information of those routines to generate its own. Experiments have been carried out with routines at different levels, on different kinds of platforms, and with constant, variable and heterogeneous load. The results allow us to conclude that the proposed methodology is valid for obtaining self-optimised linear algebra libraries.
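
A minimal sketch in C of the hierarchical idea described above, with hypothetical names (tuned_info, install_gemm and install_lu are illustrations, not the paper's API): each library level stores the parameters obtained at installation time, and a higher-level installation derives its own tuning information from the information already generated for the routines it calls, instead of re-measuring them.

    #include <stdio.h>

    typedef struct {
        int    block_size;   /* tuned algorithmic parameter */
        double flops_rate;   /* measured speed of the system for this routine */
    } tuned_info;

    /* level 1: information generated when the basic routine was installed */
    tuned_info install_gemm(void) {
        tuned_info t = { 64, 4.2e9 };   /* illustrative values only */
        return t;
    }

    /* level 2: the LU installation inherits the level-1 information and
     * derives its own parameters from it */
    tuned_info install_lu(const tuned_info *gemm) {
        tuned_info t;
        t.block_size = gemm->block_size;
        t.flops_rate = 0.9 * gemm->flops_rate;  /* assumed efficiency factor */
        return t;
    }

    int main(void) {
        tuned_info gemm = install_gemm();
        tuned_info lu   = install_lu(&gemm);
        printf("LU tuned with block size %d\n", lu.block_size);
        return 0;
    }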


Parallel Computing | 2005

Heuristics for work distribution of a homogeneous parallel dynamic programming scheme on heterogeneous systems

Javier Cuenca; Domingo Giménez; Juan-Pedro Martínez

In this paper the possibility of including automatic optimization techniques in the design of parallel dynamic programming algorithms for heterogeneous systems is analyzed. The main idea is to automatically approach the optimum values of a number of algorithmic parameters (number of processes, number of processors, processes per processor), and thus obtain low execution times. Hence, users can be provided with routines which execute efficiently regardless of the user's experience in heterogeneous computing and dynamic programming, and which adapt automatically to a new network of processors or a new network configuration.
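
A hedged sketch of the heuristic idea, under an assumed execution-time model (the coefficients and the relative processor speeds below are made up, not taken from the paper): processors are ordered by relative speed and processes are added greedily while the modelled time keeps decreasing, which prunes the search space instead of testing every combination.

    #include <stdio.h>

    #define P 6
    /* relative processor speeds, sorted in descending order (assumed) */
    static const double speed[P] = { 1.0, 1.0, 0.8, 0.5, 0.3, 0.2 };

    /* assumed model: work shared among the selected processors plus a
     * per-process coordination overhead */
    static double model_time(int nprocs, double work, double overhead) {
        double s = 0.0;
        for (int i = 0; i < nprocs; i++) s += speed[i];
        return work / s + overhead * nprocs;
    }

    int main(void) {
        double work = 100.0, overhead = 1.5;
        int best = 1;
        for (int p = 2; p <= P; p++) {      /* stop as soon as time worsens */
            if (model_time(p, work, overhead) < model_time(best, work, overhead))
                best = p;
            else
                break;
        }
        printf("use %d processes (modelled time %.2f)\n",
               best, model_time(best, work, overhead));
        return 0;
    }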


Parallel, Distributed and Network-Based Processing | 2002

Towards the design of an automatically tuned linear algebra library

Javier Cuenca; Domingo Giménez; José González

In this work we propose the architecture of an automatically tuned linear algebra library, composed of a set of linear algebra routines along with their installation routines. During the installation process on a system, the linear algebra routines are tuned automatically to the system conditions: the hardware characteristics and the basic libraries used by the linear algebra routines. The design methodology is analysed with a block LU factorisation. Variants for a sequential and a parallel version of this routine on a logical rectangular mesh of processors are considered. An analytical model of the algorithm is developed as the basis of our methodology, and the behaviour of the algorithm is analysed with message-passing using MPI on several platforms (a network of SUN workstations, an SGI Origin 2000 and an IBM SP2) and with different basic linear algebra libraries (reference BLAS, machine-specific BLAS and ATLAS). The experiments show that it is possible to make a good automatic choice of the configurable parameters of the linear algebra routines during the installation process. The average execution time of the linear algebra routine is reduced by about 15% with respect to the non-tuned version.
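
A small sketch of the installation step, under the assumption that the configurable parameter is a block size chosen by exhaustively timing a few candidate values; run_lu() is a stand-in workload, not the paper's block LU code.

    #include <stdio.h>
    #include <time.h>

    /* stand-in for the real block LU routine: a workload whose cost depends
     * on the problem size n and the block size b */
    static void run_lu(int n, int b) {
        volatile double x = 0.0;
        for (int k = 0; k < n; k += b)
            for (int i = 0; i < n * b; i++) x += 1e-9 * i;
    }

    static double elapsed(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) + 1e-9 * (b.tv_nsec - a.tv_nsec);
    }

    int main(void) {
        const int n = 1024;
        const int candidates[] = { 16, 32, 64, 128, 256 };
        int best_b = candidates[0];
        double best_t = 1e300;
        for (int i = 0; i < 5; i++) {       /* time every candidate value */
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            run_lu(n, candidates[i]);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double t = elapsed(t0, t1);
            if (t < best_t) { best_t = t; best_b = candidates[i]; }
        }
        printf("installation chose block size %d\n", best_b); /* stored for later runs */
        return 0;
    }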


International Conference on Cluster Computing | 2005

Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters

Javier Cuenca; Luis-Pedro García; Domingo Giménez; Jack J. Dongarra

This paper presents a self-optimization methodology for parallel linear algebra routines on heterogeneous systems. For each routine, a series of decisions is taken automatically in order to obtain an execution time close to the optimum, without rewriting the routine's code. These decisions include the number of processes to generate, the heterogeneous distribution of these processes over the network of processors, and the logical topology of the generated processes. Different heuristics have been used to reduce the search space of these decisions. The experiments have been performed with a parallel LU factorization routine similar to the ScaLAPACK one, and good results have been obtained on different heterogeneous platforms.
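
One of the decisions named above, the heterogeneous distribution of processes, can be illustrated with a proportional allocation; the node speeds below are assumed values (in the methodology they would come from installation measurements).

    #include <stdio.h>

    int main(void) {
        /* relative node speeds, fastest first (assumed) */
        const double speed[] = { 2.0, 1.0, 1.0, 0.5 };
        const int nodes = 4, nprocs = 9;
        int assigned[4] = { 0 }, placed = 0;
        double total = 0.0;
        for (int i = 0; i < nodes; i++) total += speed[i];
        for (int i = 0; i < nodes; i++) {   /* proportional share per node */
            assigned[i] = (int)(nprocs * speed[i] / total);
            placed += assigned[i];
        }
        /* processes lost to rounding go to the fastest nodes first */
        for (int i = 0; placed < nprocs; i = (i + 1) % nodes, placed++)
            assigned[i]++;
        for (int i = 0; i < nodes; i++)
            printf("node %d: %d processes\n", i, assigned[i]);
        return 0;
    }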


International Conference on Conceptual Structures | 2013

Optimization Techniques for 3D-FWT on Systems with Manycore GPUs and Multicore CPUs

Gregorio Bernabé; Javier Cuenca; Domingo Giménez

Programming manycore GPUs or multicore CPUs for high performance requires a careful balance of several hardware-specific factors, which is typically achieved by expert users through trial and error. To reduce the amount of hand-made optimization time required to achieve optimal performance, general guidelines can be followed or different metrics can be considered to predict performance, but ultimately a trial-and-error process is still prevalent. In this paper, we present an optimization method to run the 3D Fast Wavelet Transform (3D-FWT) on hybrid systems. The optimization engine detects the different platforms found on a system and executes the appropriate kernel, implemented in both CUDA and OpenCL for GPUs and with pthreads for CPUs. Moreover, the proposed method automatically selects parameters such as the block size, the work-group size or the number of threads to reduce the execution time, obtaining the optimal performance in many cases. Finally, the optimization engine sends proportionally different parts of a video sequence to run concurrently on all platforms of the system. Speedups with respect to a normal user, who sends all frames to a GPU with a version of the 3D-FWT implemented in CUDA or OpenCL, show average gains of up to 7.93.
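
A hedged sketch of the dispatch idea: one kernel per detected platform and a split of the frames proportional to each platform's throughput. Platform detection, the real 3D-FWT kernels and the throughput figures are stubbed or assumed here.

    #include <stdio.h>

    typedef void (*fwt_kernel)(int first_frame, int nframes);

    /* stubs for the per-platform kernels */
    static void fwt_cuda(int f, int n)     { printf("CUDA:     frames %d..%d\n", f, f + n - 1); }
    static void fwt_opencl(int f, int n)   { printf("OpenCL:   frames %d..%d\n", f, f + n - 1); }
    static void fwt_pthreads(int f, int n) { printf("pthreads: frames %d..%d\n", f, f + n - 1); }

    int main(void) {
        fwt_kernel kernels[] = { fwt_cuda, fwt_opencl, fwt_pthreads };
        double throughput[]  = { 6.0, 3.0, 1.0 };   /* assumed frames/s per platform */
        double total = 10.0;
        int total_frames = 100, first = 0;
        for (int i = 0; i < 3; i++) {
            /* proportional share of the sequence; the remainder goes last */
            int share = (i == 2) ? total_frames - first
                                 : (int)(total_frames * throughput[i] / total);
            kernels[i](first, share);
            first += share;
        }
        return 0;
    }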


IEEE International Conference on High Performance Computing, Data and Analytics | 2004

Designing polylibraries to speed up linear algebra computations

Pedro V. Alberti; Pedro Alonso; Antonio M. Vidal; Javier Cuenca; Domingo Giménez

In this paper, we analyse the design of polylibraries, where programs call routines from different libraries according to the characteristics of the problem and of the system used to solve it. An architecture for this type of library is proposed. Our aim is to develop a methodology which can be used in the design of parallel libraries. To evaluate the viability of the proposed method, the typical hierarchy of linear algebra libraries has been considered. Experiments have been performed on different systems and with linear algebra routines from different levels of the hierarchy. The results confirm the design of polylibraries as a good technique for speeding up computations.
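
A minimal sketch of the polylibrary dispatch, assuming the installation produced a size-to-library table; the library names and size limits below are illustrative, not measured.

    #include <stdio.h>

    typedef struct {
        const char *library;    /* e.g. reference BLAS, ATLAS, vendor BLAS */
        int         size_limit; /* fastest choice for n <= size_limit */
    } dispatch_entry;

    /* table built at installation time from the measured execution times */
    static const dispatch_entry table[] = {
        { "reference BLAS",         128 },
        { "ATLAS",                 1024 },
        { "machine-specific BLAS", 1 << 30 },
    };

    static const char *select_library(int n) {
        for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
            if (n <= table[i].size_limit) return table[i].library;
        return table[0].library;   /* unreachable with the sentinel above */
    }

    int main(void) {
        const int sizes[] = { 64, 512, 4096 };
        for (int i = 0; i < 3; i++)
            printf("n = %4d -> %s\n", sizes[i], select_library(sizes[i]));
        return 0;
    }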


Euromicro Workshop on Parallel and Distributed Processing | 2001

Modeling the behaviour of linear algebra algorithms with message-passing

Javier Cuenca; Domingo Giménez; José González

Modelling the behaviour of linear algebra algorithms is very useful when designing linear algebra software for high performance computers, since such a model enables the execution time of the routines to be predicted as a function of a number of parameters. There are two groups of parameters: the first contains those whose values can be chosen by the user (number of processors, processor grid configuration, distribution of the data in the system, block size); the second contains those that specify the characteristics of the target architecture (arithmetic cost, and the start-up and word-sending costs of a communication operation). Thus, a linear algebra library could be designed in such a way that each routine takes the values of the first group of parameters that provide the expected optimum execution time, and solves the problem. Such a library could therefore be employed by a non-expert user to solve scientific or engineering problems, because the user does not need to determine the values of these parameters. The design methodology is analysed with one-sided block Jacobi methods for the symmetric eigenvalue problem. Variants for a logical ring and a logical rectangular mesh of processors are considered. An analytical model of the algorithm is developed, and the behaviour of the algorithm is analysed with message-passing using MPI on an SGI Origin 2000. With the parameters chosen by our model, the execution time drops from about 50% above the optimum to just 2% above it.
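
The two parameter groups fit the usual message-passing cost model; a generic form of such a model is sketched below in LaTeX (the concrete operation counts for the Jacobi variants are in the paper; F, M and V here are placeholders).

    % generic execution-time model; F, M and V are placeholder counts
    T(n, p) = t_c \, F(n, p) + t_s \, M(n, p) + t_w \, V(n, p)
    % t_c : cost of one arithmetic operation
    % t_s : start-up cost of a communication operation
    % t_w : word-sending cost of a communication operation
    % F, M, V : numbers of arithmetic operations, messages and words sent,
    %           which depend on the problem size n, the number of processors p,
    %           the grid configuration, the data distribution and the block size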


International Conference on Parallel Processing | 2003

Empirical Modelling of Parallel Linear Algebra Routines

Javier Cuenca; Luis-Pedro García; Domingo Giménez; José González; Antonio M. Vidal

This paper shows some ways of combining empirical studies with the theoretical modelling of parallel linear algebra routines. With this combination the accuracy of the model is improved, and the model can be used to take decisions which help reduce the execution time. Experiments with the QR and Cholesky factorizations are shown.
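
A short sketch of the empirical side, assuming the theoretical model predicts T(n) = a * f(n) with a known operation count f; the coefficient a is then estimated from a few measured runs instead of nominal hardware figures (the measurements and the LU-like count below are made up).

    #include <stdio.h>

    int main(void) {
        /* (n, measured seconds) pairs from installation runs (made up) */
        const double n[] = { 256, 512, 1024 };
        const double t[] = { 0.011, 0.084, 0.660 };
        double a = 0.0;
        for (int i = 0; i < 3; i++) {
            double flops = (2.0 / 3.0) * n[i] * n[i] * n[i];  /* LU-like count */
            a += t[i] / flops;
        }
        a /= 3.0;  /* a proper fit would use least squares; averaging keeps it short */
        printf("estimated cost per flop: %.3e s\n", a);
        printf("predicted time for n = 2048: %.3f s\n",
               a * (2.0 / 3.0) * 2048.0 * 2048.0 * 2048.0);
        return 0;
    }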


Parallel Computing | 2014

Auto-tuned nested parallelism: A way to reduce the execution time of scientific software in NUMA systems

Jesús Cámara; Javier Cuenca; Luis-Pedro García; Domingo Giménez

The most computationally demanding scientific problems are solved with large parallel systems. In some cases these systems are Non-Uniform Memory Access (NUMA) multiprocessors made up of a large number of cores which share a hierarchically organized memory. The main basic component of these scientific codes is often matrix multiplication, and the efficient development of other linear algebra packages is directly based on the matrix multiplication routine implemented in the BLAS library. The BLAS library is used either as packages implemented by the vendors or as free implementations. The latest versions of this library are multithreaded and can be used efficiently in multicore systems, but when they are used inside parallel codes, the two parallelism levels can interfere and degrade performance. In this work, an auto-tuning method is proposed to select automatically the optimum number of threads to use at each parallel level when multithreaded linear algebra routines are called from OpenMP parallel codes. The method is based on a simple but effective theoretical model of the execution time of the two-level routines. The methodology is applied to a two-level matrix-matrix multiplication and to different matrix factorizations (LU, QR and Cholesky) by blocks. Traditional schemes which directly use the multithreaded BLAS routine dgemm are compared with schemes combining the multithreaded dgemm with OpenMP.
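
A hedged sketch of the two-level selection: every pair (OpenMP threads, BLAS threads) whose product fits the available cores is timed and the best pair is kept. set_blas_threads() stands in for the vendor call (e.g. openblas_set_num_threads or mkl_set_num_threads), and run_two_level() for the OpenMP code whose iterations would call the multithreaded dgemm.

    #include <stdio.h>
    #include <omp.h>

    /* placeholder for the vendor-specific call that sets the BLAS threads */
    static void set_blas_threads(int t) { (void)t; }

    /* stand-in for the OpenMP code calling multithreaded dgemm inside */
    static void run_two_level(int omp_threads) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s) num_threads(omp_threads)
        for (int i = 0; i < (1 << 22); i++) s += 1e-9 * i;
        (void)s;
    }

    int main(void) {
        const int cores = 8;
        int best_o = 1, best_b = 1;
        double best_t = 1e300;
        for (int o = 1; o <= cores; o *= 2)           /* outer OpenMP threads */
            for (int b = 1; o * b <= cores; b *= 2) { /* inner BLAS threads */
                set_blas_threads(b);
                double t0 = omp_get_wtime();
                run_two_level(o);
                double t = omp_get_wtime() - t0;
                if (t < best_t) { best_t = t; best_o = o; best_b = b; }
            }
        printf("best: %d OpenMP threads x %d BLAS threads\n", best_o, best_b);
        return 0;
    }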


International Journal of Parallel Programming | 2014

Empirical Installation of Linear Algebra Shared-Memory Subroutines for Auto-Tuning

Jesús Cámara; Javier Cuenca; Domingo Giménez; Luis-Pedro García; Antonio M. Vidal

The introduction of auto-tuning techniques in linear algebra shared-memory routines is analyzed. Information obtained during the installation of the routines is used at run time to take decisions that reduce the total execution time. The study is carried out with routines at different levels (matrix multiplication, LU and Cholesky factorizations, and routines for symmetric or general linear systems) and with calls to routines in the LAPACK and PLASMA libraries with multithreaded implementations. Medium NUMA and large cc-NUMA systems are used in the experiments. This variety of routines, libraries and systems allows us to obtain general conclusions about the methodology to use for the auto-tuning of linear algebra shared-memory routines. Satisfactory execution times are obtained with the proposed methodology.
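
A minimal sketch of the install-then-run scheme, assuming the installation writes the best configuration found for a few problem sizes and the routine picks the entry for the nearest installed size at run time (the table contents are illustrative).

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int n, threads, block; } config;

    /* table written by the installation phase (contents are illustrative) */
    static const config installed[] = {
        {  512,  4,  64 },
        { 2048,  8, 128 },
        { 8192, 16, 256 },
    };

    /* at run time: pick the configuration of the nearest installed size */
    static config lookup(int n) {
        config best = installed[0];
        for (int i = 1; i < 3; i++)
            if (abs(installed[i].n - n) < abs(best.n - n)) best = installed[i];
        return best;
    }

    int main(void) {
        config c = lookup(3000);
        printf("n = 3000 -> %d threads, block size %d\n", c.threads, c.block);
        return 0;
    }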

Collaboration


Dive into Javier Cuenca's collaboration.

Top Co-Authors

Antonio M. Vidal

Polytechnic University of Valencia
