Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Francisco de Sande is active.

Publication


Featured research published by Francisco de Sande.


International Conference on Parallel Processing | 2012

accULL: an OpenACC implementation with CUDA and OpenCL support

Ruymán Reyes; Iván López-Rodríguez; Juan J. Fumero; Francisco de Sande

The irruption of hardware accelerators, such as GPUs, into the HPC scene has made unprecedented performance available to developers. However, even expert developers may not be ready to exploit the new complex processor hierarchies. We need to find a way to leverage the programming effort for these devices at the programming-language level; otherwise, developers will spend most of their time focusing on device-specific code instead of implementing algorithmic enhancements. The recent advent of the OpenACC standard for heterogeneous computing represents an effort in this direction. This initiative, combined with future releases of the OpenMP standard, will converge into a fully heterogeneous framework that will cope with the programming requirements of future computer architectures. In this work we present accULL, a novel implementation of the OpenACC standard based on the combination of a source-to-source compiler and a runtime library. To our knowledge, our approach is the first to provide support for both OpenCL and CUDA platforms under this new standard.


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

Directive-based Programming for GPUs: A Comparative Study

Ruymán Reyes; Ivan Lopez; Juan J. Fumero; Francisco de Sande

GPUs and other accelerators are available in many different devices, and GPGPU has been massively adopted by the HPC research community. Although a plethora of libraries and applications providing GPU support are available, the need to implement new algorithms from scratch, or to adapt sequential programs to accelerators, will always exist. Writing CUDA or OpenCL code, although easier than using their predecessors, is not trivial. Obtaining performance is even harder, as it requires a deep understanding of the underlying architecture. Some efforts have been directed toward automatic code generation for GPU devices, with varying results. In particular, several directive-oriented programming models, taking advantage of the success of OpenMP, have been created. Although future OpenMP releases will integrate accelerators into the standard, tools are needed in the meantime. In this work, we present a comparison between three directive-based programming models: hiCUDA, PGI Accelerator and OpenACC, using our novel accULL implementation for the last. With this comparison, we aim to showcase the evolution of directive-based programming models and how users can guide these tools toward better performance.


Parallel Computing | 2006

Basic skeletons in llc

Antonio J. Dorta; Pablo López; Francisco de Sande

llc is a high-level parallel language that provides support for some of the most widely used algorithmic skeletons. The language has a syntax based on OpenMP-like directives, and the compiler uses direct translation to MPI to produce parallel code. To evaluate the performance of our prototype compiler, we present computational results for some of the skeletons available in the language on different architectures. Particularly in the presence of coarse-grain parallelism, the results reflect similar performance for the llc-compiled version and ad hoc MPI or OpenMP implementations. In all cases, the performance loss with respect to a direct MPI implementation is clearly compensated by a significantly smaller development effort.


Parallel Processing Letters | 2003

LLC: A PARALLEL SKELETAL LANGUAGE

Antonio J. Dorta; Jesus A. González; Casiano Rodríguez; Francisco de Sande

The skeletal approach to the development of parallel applications has proven to be one of the most successful and has been widely explored in recent years. The goal of this approach is to develop a methodology of parallel programming based on a restricted set of parallel constructs. This paper presents llc, a parallel skeletal language, the theoretical model that supports the language, and a prototype implementation of its compiler. The language is based on directives, uses a C-like syntax, and supports the most widely used skeletal constructs. llCoMP is a source-to-source compiler for the language built on top of MPI. We evaluate the performance of our prototype compiler using four different parallel architectures and three algorithms, and present results obtained on both shared and distributed memory architectures. Our model guarantees the portability of the language to any platform, and its simplicity greatly eases its implementation.


The Journal of Supercomputing | 2013

A preliminary evaluation of OpenACC implementations

Ruymán Reyes; Ivan Lopez; Juan J. Fumero; Francisco de Sande

During the last few years, the availability of hardware accelerators, such as GPUs, has rapidly increased. However, the entry cost to GPU programming is high and requires a considerable porting and tuning effort. Some research groups and vendors have attempted to ease the situation by defining APIs and languages that simplify these tasks. In the wake of the success of OpenMP, industry and academia are working toward defining a new standard of compiler directives to leverage the GPU programming effort. Support from vendors and similarities with the upcoming OpenMP 4.0 standard lead us to believe that OpenACC is a good alternative for developers who want to port existing codes to accelerators. In this paper, we evaluate three OpenACC implementations, two commercial (PGI and CAPS) and our own research implementation, accULL, in order to assess the current status and future directions of the standard.


Microprocessors and Microsystems | 2012

Optimization strategies in different CUDA architectures using llCoMP

Ruymán Reyes; Francisco de Sande

Due to the current proliferation of GPU devices in HPC environments, scientists and engineers spend much of their time optimizing codes for these platforms. At the same time, manufacturers produce new versions of their devices every few years, each one more powerful than the last. The question that arises is: is the optimization effort worthwhile? In this paper, we present a review of the different CUDA architectures, including Fermi, and optimize a set of algorithms for each using widely known optimization techniques. This work would require a tremendous coding effort if done manually; however, using our fast prototyping tool, it is an effortless process. The results of our analysis will guide developers on the right path toward efficient code optimization. Preliminary results show that some optimizations recommended for older CUDA architectures may not be useful for newer ones.


The Journal of Supercomputing | 2011

Automatic code generation for GPUs in llc

Ruymán Reyes; Francisco de Sande

llc is a C-based language where parallelism is expressed using compiler directives. In this paper, we present a new backend of the llc compiler that produces code for GPUs. We have also implemented a software architecture that eases the development of new backends. Our design represents an intermediate layer between a high-level parallel language and different hardware architectures. We evaluate our development by comparing the OpenMP and llc parallelizations of three different algorithms. In every case, the probable performance loss with respect to a direct CUDA implementation is clearly compensated by a significantly smaller development effort.


Lecture Notes in Computer Science | 2005

Implementing OpenMP for clusters on top of MPI

Antonio J. Dorta; José M. Badía; Enrique S. Quintana; Francisco de Sande

llc is a language designed to extend OpenMP to distributed memory systems. We present work in progress on a compiler that translates llc code and targets distributed memory platforms. Our approach generates code for communications directly on top of MPI. We present computational results for two different benchmark applications on a PC-cluster platform. The results reflect similar performance for the llc-compiled version and an ad hoc MPI implementation, even for applications with fine-grain parallelism.


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2009

Automatic Hybrid MPI+OpenMP Code Generation with llc

Ruymán Reyes; Antonio J. Dorta; Francisco Almeida; Francisco de Sande

The architecture of massively parallel computers is evolving toward systems with a hierarchical hardware design, where each node is a shared memory system with several multi-core CPUs. There is consensus in the HPC community on the need to increase the efficiency of the programming paradigms used to exploit massive parallelism if we are to make full use of the advantages offered by these new architectures. llc is a language where parallelism is expressed using OpenMP compiler directives. In this paper we focus our attention on the automatic generation of hybrid MPI+OpenMP code from llc-annotated source code.


Parallel Computing | 2000

A new parallel model for the analysis of asynchronous algorithms

Casiano Rodríguez; José L. Roda; Francisco de Sande; Domingo Morales; Francisco Almeida

The barrier synchronization of the BSP model imposes limits both on the range of available algorithms and on their performance. Although BSP programs can be translated into MPI/PVM programs, the converse is not true: the asynchronous nature of some MPI/PVM programs does not easily fit the BSP model. Through the suppression of barriers and the generalization of the concept of superstep, we propose two new models: the BSP-like model and the BSP without barriers (BSPWB) model. While the BSP-like model extends the BSP* model to programs written using collective operations, the more general BSPWB model admits the asynchronous MPI/PVM parallel programming style. The parameters of the models and their quality are evaluated on four standard parallel platforms: the CRAY T3E, the IBM SP2, the Origin 2000 and the Digital AlphaServer 8400. The study shows that the time spent in an h-relation depends more on the communication pattern than on the number of processors. We illustrate the use of these BSP extensions through two problem-solving paradigms: the Nested Parallel Recursive Divide and Conquer paradigm and the Virtual Pipeline Dynamic Programming paradigm. The proposed paradigms show how nested parallelism and processor virtualization can be introduced in MPI and PVM without any negative impact on performance or model accuracy. The prediction of communication times is robust even for problems where communication is dominated by small messages.
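The models above extend Valiant's standard BSP cost model, in which a superstep with maximum local work w, maximum h-relation h (the largest number of words any processor sends or receives), per-word communication gap g, and barrier latency l costs (standard BSP, not a formula from this paper):

```latex
T_{\text{superstep}} = w + g \cdot h + l
```

Removing the barrier term l and generalizing the superstep is precisely the step the BSP-like and BSPWB models take to admit asynchronous MPI/PVM programs.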

Collaboration


Dive into Francisco de Sande's collaborations.

Top Co-Authors


Enrique S. Quintana

Polytechnic University of Valencia
