Ruymán Reyes
University of La Laguna
Publication
Featured research published by Ruymán Reyes.
international conference on parallel processing | 2012
Pedro Alonso; Rosa M. Badia; Jesús Labarta; Maria Barreda; Manuel F. Dolz; Rafael Mayo; Enrique S. Quintana-Ortí; Ruymán Reyes
Understanding power usage in parallel workloads is crucial to developing the energy-aware software that will run on future Exascale systems. In this paper, we contribute towards this goal by introducing an integrated framework to profile, monitor, model and analyze power dissipation in parallel MPI and multi-threaded scientific applications. The framework includes a custom-designed device to measure internal DC power consumption and a package offering a simple interface to interact both with this device and with commercial power meters. Combined with the instrumentation package Extrae and the graphical analysis tool Paraver, the result is a useful environment for identifying sources of power inefficiency directly in the application source code. For task-parallel codes, we also offer a statistical software module that inspects the execution trace of the application to calculate the parameters of an accurate model of the global energy consumption, which can then be decomposed into the average power usage per task or the power dissipated per core on each node.
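The energy model rests on the fact that energy is the time integral of power. As a minimal sketch, and not the paper's actual measurement package (whose interface is not reproduced here), the energy of a run can be recovered from a timestamped power trace like this:

    #include <stddef.h>

    /* Energy as the time integral of sampled power, approximated with the
     * trapezoidal rule: t[] holds sample timestamps in seconds, p[] the
     * corresponding power readings in watts; the result is in joules. */
    static double energy_from_trace(const double *t, const double *p, size_t n)
    {
        double e = 0.0;
        for (size_t i = 1; i < n; ++i)
            e += 0.5 * (p[i - 1] + p[i]) * (t[i] - t[i - 1]);
        return e;
    }

Dividing such per-task energies by task duration yields the kind of average power per task that the statistical module reports.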
ieee international conference on high performance computing data and analytics | 2012
Ruymán Reyes; Ivan Lopez; Juan J. Fumero; Francisco de Sande
GPUs and other accelerators are available in many different devices, and GPGPU has been massively adopted by the HPC research community. Although a plethora of libraries and applications providing GPU support is available, the need to implement new algorithms from scratch, or to adapt sequential programs to accelerators, will always exist. Writing CUDA or OpenCL codes, although an easier task than using their predecessors, is not trivial. Obtaining performance is even harder, as it requires a deep understanding of the underlying architecture. Some efforts have been directed toward automatic code generation for GPU devices, with varying results. In particular, several directive-oriented programming models, building on the success of OpenMP, have been created. Although future OpenMP releases will integrate accelerators into the standard, tools are needed in the meantime. In this work, we present a comparison between three directive-based programming models: hiCUDA, PGI Accelerator and OpenACC, using our novel accULL implementation for the last. With this comparison, we aim to showcase the evolution of directive-based programming models and show how users can guide these tools toward better performance results.
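For reference, this is roughly what a directive-based port looks like in the OpenACC style the paper compares; the kernel below is a generic vector addition, not taken from the paper's benchmark set:

    /* A single pragma offloads the loop to the accelerator; the copy
     * clauses describe the data movement for a, b and c. */
    void vadd(const float *a, const float *b, float *c, int n)
    {
        #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

hiCUDA and PGI Accelerator express the same intent with their own directive sets; the comparison turns on how much extra annotation each model needs before good performance follows.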
The Journal of Supercomputing | 2013
Ruymán Reyes; Ivan Lopez; Juan J. Fumero; Francisco de Sande
During the last few years, the availability of hardware accelerators, such as GPUs, has rapidly increased. However, the entry cost of GPU programming is high and requires considerable porting and tuning effort. Some research groups and vendors have tried to ease the situation by defining APIs and languages that simplify these tasks. In the wake of the success of OpenMP, industry and academia are working toward defining a new standard of compiler directives to ease the GPU programming effort. Support from vendors and similarities with the upcoming OpenMP 4.0 standard lead us to believe that OpenACC is a good alternative for developers who want to port existing codes to accelerators. In this paper, we compare three OpenACC implementations, two commercial (PGI and CAPS) and our own research implementation, accULL, to assess the current status and future directions of the standard.
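Much of the porting and tuning effort that such an evaluation exercises comes down to data locality. A typical OpenACC idiom, shown here as an illustrative sketch rather than one of the paper's test codes, is an explicit data region that keeps arrays resident on the device across several kernels:

    /* The data region keeps x and y on the device for both loops,
     * avoiding a redundant host<->device transfer between them. */
    void scale_then_add(float *x, float *y, float s, int n)
    {
        #pragma acc data copy(x[0:n], y[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                x[i] *= s;

            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                y[i] += x[i];
        }
    }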
Microprocessors and Microsystems | 2012
Ruymán Reyes; Francisco de Sande
Due to the current proliferation of GPU devices in HPC environments, scientists and engineers spend much of their time optimizing codes for these platforms. At the same time, manufacturers produce new versions of their devices every few years, each one more powerful than the last. The question that arises is: is the optimization effort worthwhile? In this paper, we present a review of the different CUDA architectures, including Fermi, and optimize a set of algorithms for each using widely known optimization techniques. This work would require a tremendous coding effort if done manually; however, using our fast prototyping tool, it is an effortless process. The results of our analysis will guide developers on the right path towards efficient code optimization. Preliminary results show that some optimizations recommended for older CUDA architectures may not be useful for the newer ones.
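A representative member of that set of widely known techniques is shared-memory tiling. The textbook CUDA transpose kernel below (not necessarily one of the paper's benchmarks) shows both the optimization and why its payoff is architecture-dependent: Fermi's L1 cache recovers part of the benefit automatically, which is exactly the kind of shift the analysis looks for.

    #define TILE 16

    /* Staging a tile in shared memory makes both the reads and the
     * writes coalesced; the +1 padding avoids bank conflicts. */
    __global__ void transpose_tiled(float *out, const float *in, int n)
    {
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < n && y < n)
            out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }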
The Journal of Supercomputing | 2011
Ruymán Reyes; Francisco de Sande
llc is a C-based language where parallelism is expressed using compiler directives. In this paper, we present a new backend of the llc compiler that produces code for GPUs. We have also implemented a software architecture that eases the development of new backends. Our design represents an intermediate layer between a high-level parallel language and different hardware architectures. We evaluate our development by comparing the OpenMP and llc parallelizations of three different algorithms. In every case, the probable performance loss with respect to a direct CUDA implementation is clearly compensated by a significantly smaller development effort.
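As a minimal sketch of the programming model (a generic example, not code from the paper), the programmer writes ordinary sequential C annotated with OpenMP directives, and the new backend emits the corresponding CUDA code; llc's own additional clauses for marking result data are omitted here:

    /* Standard OpenMP annotation; the llc compiler backend turns this
     * loop into a CUDA kernel plus the necessary data transfers. */
    void saxpy(float a, const float *x, float *y, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }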
Journal of Computational Physics | 2013
José Ramón López-Blanco; Ruymán Reyes; José Ignacio Aliaga; Rosa M. Badia; Pablo Chacón; Enrique S. Quintana-Ortí
Normal modes in internal coordinates (IC) furnish an excellent way to model functional collective motions of macromolecular machines, but exhibit a high computational cost when applied to large-sized macromolecules. In this paper, we significantly extend the applicability of this approach towards much larger systems by effectively solving the computational bottleneck of these methods, the diagonalization step and associated large-scale eigenproblem, on a small cluster of nodes equipped with multicore technology. Our experiments show the superior performance of iterative Krylov-subspace methods for the solution of the dense generalized eigenproblems arising in these biological applications over more traditional direct solvers implemented on top of state-of-the-art libraries. The presented approach expedites the study of the collective conformational changes of large macromolecules, opening a new window for exploring the functional motions of such relevant systems.
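For context, the eigenproblem in question has the generalized symmetric-definite form K v = λ M v, with K the stiffness matrix and M the mass matrix in internal coordinates. The traditional direct route the paper improves upon amounts, in its simplest dense form, to a single LAPACK call; the sketch below is illustrative only, as the paper's solvers build on state-of-the-art parallel libraries:

    #include <lapacke.h>

    /* Dense direct solution of K v = lambda M v (itype = 1). Eigenvalues
     * are returned in w; eigenvectors overwrite K. This O(n^3) step is
     * the bottleneck that iterative Krylov-subspace methods avoid when
     * only a few low-frequency modes are needed. */
    int solve_modes_direct(double *K, double *M, double *w, int n)
    {
        return LAPACKE_dsygvd(LAPACK_ROW_MAJOR, 1, 'V', 'L',
                              n, K, n, M, n, w);
    }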
european pvm mpi users group meeting on recent advances in parallel virtual machine and message passing interface | 2009
Ruymán Reyes; Antonio J. Dorta; Francisco Almeida; Francisco de Sande
The evolution of the architecture of massively parallel computers is progressing toward systems with a hierarchical hardware design, where each node is a shared memory system with several multi-core CPUs. There is consensus in the HPC community about the need to increase the efficiency of the programming paradigms used to exploit massive parallelism, if we are to make full use of the advantages offered by these new architectures. llc is a language where parallelism is expressed using OpenMP compiler directives. In this paper we focus our attention on the automatic generation of hybrid MPI + OpenMP code from the llc annotated source code.
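The shape of the hybrid code such a translation targets is the usual two-level pattern, sketched generically below (this is illustrative, not actual llc compiler output): MPI distributes work across nodes while OpenMP threads exploit the cores within each node.

    #include <mpi.h>
    #include <omp.h>

    /* Two-level parallel sum, assuming v is replicated on every rank:
     * MPI splits the index range across processes, and OpenMP
     * parallelizes each process's chunk across its cores. */
    double hybrid_sum(const double *v, int n, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int chunk = n / size;
        int lo = rank * chunk;
        int hi = (rank == size - 1) ? n : lo + chunk;

        double local = 0.0, total = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = lo; i < hi; ++i)
            local += v[i];

        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
        return total;
    }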
international symposium on parallel and distributed processing and applications | 2012
Ruymán Reyes; Ivan Lopez; Juan J. Fumero; Francisco de Sande
The world of HPC is undergoing rapid changes, and the range of computer architectures capable of achieving high performance has broadened. The arrival on the scene of computational accelerators, like GPUs, is increasing performance while maintaining a low cost per GFLOP, thus expanding the popularity of HPC. However, it is still difficult to exploit the new complex processor hierarchies. Adapting the message-passing model to program heterogeneous CPU+GPU environments is not an easy task, and message passing does not seem to be the best choice from the programmer's point of view. Traditional shared memory approaches like OpenMP are appealing as a way to popularize these platforms, but GPU devices are connected to the CPU through a bus and have a separate memory space. We need to find a way to deal with this issue at the programming language level; otherwise, developers will spend most of their time on low-level code details instead of algorithmic enhancements. The recent advent of the OpenACC standard for heterogeneous computing represents an effort to reduce this development burden. This initiative, combined with future releases of the OpenMP standard, will converge into a fully heterogeneous framework that copes with the programming requirements of future computer architectures. In this work we present preliminary results of accULL, a novel implementation of the OpenACC standard, based on a source-to-source compiler and a runtime library. To our knowledge, our approach is the first to provide support for both OpenCL and CUDA platforms under this new standard.
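The kind of OpenACC input accULL accepts is plain annotated C, such as the generic dot product below; the source-to-source compiler can lower it to either a CUDA or an OpenCL kernel, with the runtime library handling device selection and data movement:

    float dot(const float *a, const float *b, int n)
    {
        float sum = 0.0f;
        /* The reduction clause tells the compiler to generate the
         * device-side parallel reduction. */
        #pragma acc parallel loop reduction(+:sum) copyin(a[0:n], b[0:n])
        for (int i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;
    }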
international symposium on parallel and distributed processing and applications | 2012
Maria Barreda; Manuel F. Dolz; Rafael Mayo; Enrique S. Quintana-Ortí; Ruymán Reyes
In this paper we combine a powerful tracing framework with a power measurement setup to perform a visual analysis of the computational performance and the power consumption of tuned implementations for three key dense linear algebra operations: the LU factorization, the Cholesky factorization, and the reduction to tridiagonal form. Our results using 6 and 12 cores of an AMD Opteron-based platform reveal the serial/concurrent phases of the algorithms, and their connection to periods of low/high power consumption, as well as the linear dependency between execution time and energy for this class of operations.
parallel, distributed and network-based processing | 2011
Ruymán Reyes; Francisco de Sande
Over the last few years, we have witnessed the proliferation of GPU devices in HPC environments. Manufacturers, though, produce new versions of their devices every few years, posing a recurring problem for scientists and engineers using their technology: is it worth the time and effort spent optimizing codes for the current version, or is it better to wait until a new architecture appears? In this paper, we compare several CUDA architecture versions and optimize codes for each. This work would require a tremendous coding effort if done manually; however, using fast prototyping tools like llCoMP, it is an effortless process. Using loop optimization techniques, we evaluate three different algorithms: to each one we apply a set of optimizations and show the resulting performance benefit or penalty on three CUDA architecture versions, including Fermi. The results of these techniques will guide developers on the right path towards efficient code optimization. Preliminary results show that some optimizations recommended for older CUDA architectures may not be useful on Fermi.
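One of the loop techniques in such a set is explicit unrolling, shown here on a generic grid-stride kernel rather than actual llCoMP output; on pre-Fermi architectures manual unrolling often paid off, while Fermi's compiler and cache hierarchy can make the same transformation neutral or even harmful, which is precisely the per-architecture effect the study measures.

    /* Grid-stride loop with 4-way unrolling requested from the
     * compiler; whether this helps depends on the architecture. */
    __global__ void scale(float *x, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        #pragma unroll 4
        for (int k = i; k < n; k += blockDim.x * gridDim.x)
            x[k] *= s;
    }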