Francisco D. Igual | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Francisco D. Igual is active.

Explore More

Publication

Featured researches published by Francisco D. Igual.

european conference on parallel processing | 2009

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

Eduard Ayguadé; Rosa M. Badia; Francisco D. Igual; Jesús Labarta; Rafael Mayo; Enrique S. Quintana-Ortí

While general-purpose homogeneous multi-core architectures are becoming ubiquitous, there are clear indications that, for a number of important applications, a better performance/power ratio can be attained using specialized hardware accelerators. These accelerators require specific SDK or programming languages which are not always easy to program. Thus, the impact of the new programming paradigms on the programmers productivity will determine their success in the high-performance computing arena. In this paper we present GPU Superscalar (GPUSs), an extension of the Star Superscalar programming model that targets the parallelization of applications on platforms consisting of a general-purpose processor connected with multiple graphics processors. GPUSs deals with architecture heterogeneity and separate memory address spaces, while preserving simplicity and portability. Preliminary experimental results for a well-known operation in numerical linear algebra illustrate the correct adaptation of the runtime to a multi-GPU system, attaining notable performance results.

acm sigplan symposium on principles and practice of parallel programming | 2009

Solving dense linear systems on platforms with multiple hardware accelerators

Gregorio Quintana-Ortí; Francisco D. Igual; Enrique S. Quintana-Ortí; Robert A. van de Geijn

In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory coherence techniques in computer architecture to improve the performance. Our experimental evaluation on a Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performances around 550 and 450 (single-precision) GFLOPS for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.

european conference on parallel processing | 2008

Solving Dense Linear Systems on Graphics Processors

Sergio Barrachina; Maribel Castillo; Francisco D. Igual; Rafael Mayo; Enrique S. Quintana-Ortí

We present several algorithms to compute the solution of a linear system of equations on a GPU, as well as general techniques to improve their performance, such as padding and hybrid GPU-CPU computation. We also show how iterative refinement with mixed-precision can be used to regain full accuracy in the solution of linear systems. Experimental results on a G80 using CUBLAS 1.0, the implementation of BLAS for NVIDIA® GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed.

international workshop on openmp | 2009

A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures

Eduard Ayguadé; Rosa M. Badia; Daniel Cabrera; Alejandro Duran; Marc Gonzàlez; Francisco D. Igual; Daniel Jimenez; Jesús Labarta; Xavier Martorell; Rafael Mayo; Josep M. Perez; Enrique S. Quintana-Ortí

OpenMP has evolved recently towards expressing unstructured parallelism, targeting the parallelization of a broader range of applications in the current multicore era. Homogeneous multicore architectures from major vendors have become mainstream, but with clear indications that a better performance/power ratio can be achieved using more specialized hardware (accelerators), such as SSE-based units or GPUs, clearly deviating from the easy-to-understand shared-memory homogeneous architectures. This paper investigates if OpenMP could still survive in this new scenario and proposes a possible way to extend the current specification to reasonably integrate heterogeneity while preserving simplicity and portability. The paper leverages on a previous proposal that extended tasking with dependencies. The runtime is in charge of data movement, tasks scheduling based on these data dependencies and the appropriate selection of the target accelerator depending on system configuration and resource availability.

international parallel and distributed processing symposium | 2008

Evaluation and tuning of the Level 3 CUBLAS for graphics processors

Sergio Barrachina; María Isabel Castillo; Francisco D. Igual; Rafael Mayo; Enrique S. Quintana-Ortí

The increase in performance of the last generations of graphics processors (GPUs) has made this class of platform a coprocessing tool with remarkable success in certain types of operations. In this paper we evaluate the performance of the Level 3 operations in CUBLAS, the implementation of BIAS for NVIDIAreg GPUs with unified architecture. From this study, we gain insights on the quality of the kernels in the library and we propose several alternative implementations that are competitive with those in CUBLAS. Experimental results on a GeForce 8800 Ultra compare the performance of CUBLAS and the new variants.

International Journal of Parallel Programming | 2010

Extending OpenMP to Survive the Heterogeneous Multi-Core Era

Eduard Ayguadé; Rosa M. Badia; Pieter Bellens; Daniel Cabrera; Alejandro Duran; Roger Ferrer; Marc Gonzàlez; Francisco D. Igual; Daniel Jiménez-González; Jesus Labarta; Luis Martinell; Xavier Martorell; Rafael Mayo; Josep M. Perez; Judit Planas; Enrique S. Quintana-Ortí

This paper advances the state-of-the-art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired in the StarSs programming model. The proposed extensions allow the programmer to write portable code easily for a number of different platforms, relieving him/her from developing the specific code to off-load tasks to the accelerators and the synchronization of tasks. Our results obtained from the StarSs instantiations for SMPs, the Cell, and GPUs report reasonable parallel performance. However, the real impact of our approach in is the productivity gains it yields for the programmer.

Concurrency and Computation: Practice and Experience | 2009

Exploiting the capabilities of modern GPUs for dense matrix computations: EXPLOITING THE CAPABILITIES OF MODERN GPUS

Sergio Barrachina; Maribel Castillo; Francisco D. Igual; Rafael Mayo; Enrique S. Quintana-Ortí; Gregorio Quintana-Ortí

We present several algorithms to compute the solution of a linear system of equations on a graphics processor (GPU), as well as general techniques to improve their performance, such as padding and hybrid GPU‐CPU computation. We compare single and double precision performance of a modern GPU with unified architecture, and show how iterative refinement with mixed precision can be used to regain full accuracy in the solution of linear systems, exploiting the potential of the processor for single precision arithmetic. Experimental results on a GTX280 using CUBLAS 2.0, the implementation of BLAS for NVIDIA® GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed. Copyright

ACM Transactions on Mathematical Software | 2016

The BLIS Framework: Experiments in Portability

Field G. Van Zee; Tyler M. Smith; Bryan Marker; Tze Meng Low; Robert A. van de Geijn; Francisco D. Igual; Mikhail Smelyanskiy; Xianyi Zhang; Michael Kistler; Vernon Austel; John A. Gunnels; Lee Killough

BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we demonstrate the framework include state-of-the-art general-purpose, low-power, and many-core architectures. We show, with very little effort, how the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS), and commercial vendor implementations such as AMD’s ACML, IBM’s ESSL, and Intel’s MKL libraries. Although most of this article focuses on single-core implementation, we also provide compelling results that suggest the framework’s leverage extends to the multithreaded domain.

parallel processing and applied mathematics | 2009

Reduction to condensed forms for symmetric eigenvalue problems on multi-core architectures

Paolo Bientinesi; Francisco D. Igual; Daniel Kressner; Enrique S. Quintana-Ortí

We investigate the performance of the routines in LAPACK and the Successive Band Reduction (SBR) toolbox for the reduction of a dense matrix to tridiagonal form, a crucial preprocessing stage in the solution of the symmetric eigenvalue problem, on general-purpose multicore processors. In response to the advances of hardware accelerators, we also modify the code in SBR to accelerate the computation by offloading a significant part of the operations to a graphics processor (GPU). Performance results illustrate the parallelism and scalability of these algorithms on current high-performance multi-core architectures.

symposium on computer architecture and high performance computing | 2012

Level-3 BLAS on the TI C6678 Multi-core DSP

Murtaza Ali; Eric J. Stotzer; Francisco D. Igual; Robert A. van de Geijn

Digital Signal Processors (DSP) are commonly employed in embedded systems. The increase of processing needs in cellular base-stations, radio controllers and industrial/medical imaging systems, has led to the development of multi-core DSPs as well as inclusion of floating point operations while maintaining low power dissipation. The eight-core DSP from Texas Instruments, codenamed TMS320C6678, provides a peak performance of 128 GFLOPS (single precision) and an effective 32 GFLOPS(double precision) for only 10 watts. In this paper, we present the first complete implementation and report performance of the Level-3 Basic Linear Algebra Subprograms(BLAS) routines for this DSP. These routines are first optimized for single core and then parallelized over the different cores using OpenMP constructs. The results show that we can achieve about 8 single precision GFLOPS/watt and 2.2double precision GFLOPS/watt for General Matrix-Matrix multiplication (GEMM). The performance of the rest of theLevel-3 BLAS routines is within 90% of the corresponding GEMM routines.

Explore More