Rodrigo C. O. Rocha
University of Edinburgh
Publications
Featured research published by Rodrigo C. O. Rocha.
Concurrency and Computation: Practice and Experience | 2017
Rodrigo C. O. Rocha; Alyson D. Pereira; Luiz E. Ramos; Luís Fabrício Wanderley Góes
The stencil pattern is important in many scientific and engineering domains, spurring great interest from researchers and industry. In recent years, various optimizations have been proposed for parallel stencil applications running on graphics processing units (GPUs). In particular, tiling is a technique that can significantly enhance application performance by improving data locality and by reducing the volume of communication between host memory and GPU. In addition, tiling enables stencil applications to process inputs that are larger than the physical GPU memory. However, implementing tiling efficiently is complex, time-consuming, and error-prone. In this paper, we propose transparently optimized automatic stencil tiling (TOAST), an automatic tiling mechanism for iterative stencil computations running on GPUs. TOAST has three main benefits: (1) it incorporates an optimization model that seeks to maximize data reuse within tiles while respecting the amount of dynamically available GPU memory; (2) it offers a virtualized GPU memory for stencil computations, allowing for large input data; and (3) it performs optimal tiling transparently to the developer of the parallel stencil application. The current implementation of TOAST augments the PSkel framework with an internal solver based on genetic algorithms. Our experimental results show that TOAST improves the performance of iterative stencil applications by up to 13× compared with their multithreaded (central processing unit–based) optimized versions and by up to 48× compared with a naive tiling approach on GPU. The TOAST mechanism automatically achieves a low percentage overhead of data management relative to the actual stencil computation.
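To make the tiling idea in this abstract concrete, here is a minimal sketch of overlapped tiling for a 1D iterative stencil. It is not TOAST or the PSkel API. The function names, the 3-point averaging stencil, and the fixed `tile_size` parameter are all assumptions chosen for illustration (the abstract says TOAST picks tile sizes with a genetic-algorithm solver, which is not reproduced here). Each tile is loaded with a halo of `iterations` cells per side so that all iterations can run on the tile without further data exchange.

```python
def stencil_step(grid):
    """One iteration of a 3-point average; the two end cells stay fixed."""
    out = list(grid)
    for i in range(1, len(grid) - 1):
        out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out


def run_untiled(grid, iterations):
    """Reference version: apply the stencil to the whole grid each iteration."""
    for _ in range(iterations):
        grid = stencil_step(grid)
    return grid


def run_tiled(grid, iterations, tile_size):
    """Hypothetical tiled version: run every iteration on one tile at a time.

    Each tile is copied together with a halo of `iterations` cells per side.
    Halo cells become stale by one cell per iteration, so after `iterations`
    steps only the halo is invalid and the tile interior is still exact.
    """
    n = len(grid)
    result = list(grid)
    for start in range(0, n, tile_size):
        end = min(start + tile_size, n)
        lo = max(0, start - iterations)   # halo clamped at the true boundary
        hi = min(n, end + iterations)
        local = grid[lo:hi]               # stands in for a host-to-GPU copy
        for _ in range(iterations):
            local = stencil_step(local)
        result[start:end] = local[start - lo:end - lo]  # keep interior only
    return result
```

The halo cells are recomputed redundantly by neighbouring tiles. That extra work is what buys the reduction in transfers, since each tile moves to the device once rather than once per iteration.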
Symposium on Code Generation and Optimization | 2018
Vasileios Porpodas; Rodrigo C. O. Rocha; Luís Fabrício Wanderley Góes
Auto-vectorizing compilers automatically generate vector (SIMD) instructions out of scalar code. The state-of-the-art algorithm for straight-line code vectorization is Superword-Level Parallelism (SLP). In this work we identify a major limitation at the core of the SLP algorithm, in the performance-critical step of collecting the vectorization candidate instructions that form the SLP-graph data structure. SLP lacks global knowledge when building its vectorization graph, which negatively affects its local decisions when it encounters commutative instructions. We propose LSLP, an improved algorithm that can plug into existing SLP implementations and can effectively vectorize code with arbitrarily long chains of commutative operations. LSLP relies on short-depth look-ahead for better-informed local decisions. Our evaluation on a real machine shows that LSLP can significantly improve the performance of real-world code with little compilation-time overhead.
Symposium on Computer Architecture and High Performance Computing | 2017
Alyson D. Pereira; Rodrigo C. O. Rocha; Luiz E. Ramos; Márcio Castro; Luís Fabrício Wanderley Góes
The stencil pattern is important in many scientific and engineering domains, spurring great interest from researchers and industry. In recent years, various optimizations have been proposed for parallel stencil applications running on GPUs. However, most of the runtime systems that execute those applications fail to fully utilize the parallelism of modern heterogeneous systems. In this paper, we propose a mechanism based on machine learning that automatically partitions stencil computations across CPU and GPU. We implemented it in the PSkel framework and found that the mechanism can boost the performance of stencil applications on average by 17.9x compared to their sequential CPU-only counterparts, by 1.34x compared to a GPU-only version, and by 1.48x compared to a parallel CPU-only version.
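As a rough illustration of what partitioning a stencil across CPU and GPU means, the sketch below splits the rows of a grid into two contiguous ranges. It does not reproduce the paper's machine-learning mechanism: the partition ratio here comes from a simple throughput ratio, which is a stand-in assumption, and both function names are hypothetical rather than part of PSkel.

```python
def cpu_fraction_from_throughput(cpu_cells_per_sec, gpu_cells_per_sec):
    """Assumed heuristic: give each device work proportional to its speed."""
    total = cpu_cells_per_sec + gpu_cells_per_sec
    if total <= 0:
        raise ValueError("at least one device must have positive throughput")
    return cpu_cells_per_sec / total


def partition_rows(num_rows, cpu_fraction):
    """Split rows [0, num_rows) into a CPU range and a GPU range.

    Returns two half-open (start, end) ranges; either one may be empty.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    cpu_rows = int(round(num_rows * cpu_fraction))
    return (0, cpu_rows), (cpu_rows, num_rows)
```

For example, a device pair with measured throughputs in a 1:3 ratio would give the CPU the first quarter of the rows. In a real stencil run the two ranges would also need to exchange the boundary rows between iterations, which this sketch leaves out.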
International Conference on High Performance Computing and Simulation | 2017
Alyson D. Pereira; Márcio Castro; Mario A. R. Dantas; Rodrigo C. O. Rocha; Luís Fabrício Wanderley Góes
The OpenACC programming model simplifies the programming for accelerator devices such as GPUs. Its abstract accelerator model defines a least common denominator for accelerator devices, thus it cannot represent architectural specifics of these devices without losing portability. Therefore, this general-purpose approach delivers good performance on average, but it misses optimization opportunities for code generation and execution of specific classes of applications. In this paper, we propose OpenACC extensions to enable efficient code generation and execution of stencil applications by parallel skeleton frameworks such as PSkel. Our results show that our stencil extensions may improve the performance of OpenACC by up to 28% and 45% on GPU and CPU, respectively. Moreover, we show that the work-partitioning mechanism offered by the skeleton framework, which splits the computation across CPU and GPU, may further improve the performance of the applications by up to 18%.
Concurrency and Computation: Practice and Experience | 2016
Rodrigo C. O. Rocha; Bruno Hott; Vinícius Vitor dos Santos Dias; Renato Ferreira; Wagner Meira; Dorgival O. Guedes
Most high-performance data processing (a.k.a. big data) systems allow users to express their computation using abstractions (like MapReduce), which simplify the extraction of parallelism from applications. Most frameworks, however, do not allow users to specify how communication must take place: that element is deeply embedded into the run-time system abstractions, making changes hard to implement. In this work, we describe Watershed-ng, our re-engineering of the Watershed system, a framework based on the filter–stream paradigm and originally focused on continuous stream processing. Like other big-data environments, Watershed provided object-oriented abstractions to express computation (filters), but the implementation of streams was a run-time system element. By isolating stream functionality into appropriate classes, combination of communication patterns and reuse of common message handling functions (like compression and blocking) become possible. The new architecture even allows the design of new communication patterns, for example, allowing users to choose MPI, TCP, or shared memory implementations of communication channels as their problem demands. Applications designed for the new interface showed reductions in code size on the order of 50% and above in some cases. The performance results also showed significant improvements, because some implementation bottlenecks were removed in the re-engineering process.
Brazilian Symposium on Programming Languages | 2016
Rodrigo C. O. Rocha; Luís Fabrício Wanderley Góes; Fernando Magno Quintão Pereira
The main challenge faced by automatic parallelization tools in functional languages is the fact that parallelism is often hidden under the syntax of complex recursive functions. In this paper, we propose an algebraic framework for automatically parallelizing two special classes of recursive functions. We show that these classes are comprehensive enough to include several well-known instances. We have used our ideas to implement a source-to-source compiler in Python to parallelize Haskell code. We have applied this prototype to six different recursive functions, achieving, on a 4-core machine, speedups of up to 2.7x.
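The abstract does not say which two classes of recursive functions the framework handles, so the sketch below only illustrates the general idea of parallelism hidden in recursion. It assumes a list recursion of the form f(x:xs) = g(x) + f(xs), where the combining operator is associative, and rewrites it as independent reductions over chunks. The function names are hypothetical, and the example is in Python rather than the Haskell the paper targets.

```python
import operator
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def recursive_sum_squares(xs):
    """Sequential recursive definition: f([]) = 0, f(x:xs) = x*x + f(xs)."""
    if not xs:
        return 0
    return xs[0] * xs[0] + recursive_sum_squares(xs[1:])


def _reduce_chunk(chunk):
    """Fold one chunk with the same map (square) and operator (+)."""
    return reduce(operator.add, (x * x for x in chunk), 0)


def chunked_sum_squares(xs, chunks=4):
    """Same result, restructured as independent per-chunk reductions.

    This is valid because + is associative with identity 0, so the chain of
    additions can be regrouped freely.
    """
    size = max(1, -(-len(xs) // chunks))  # ceiling division
    parts = [xs[i:i + size] for i in range(0, len(xs), size)]
    with ThreadPoolExecutor(max_workers=chunks) as pool:
        partials = list(pool.map(_reduce_chunk, parts))
    return reduce(operator.add, partials, 0)
```

The thread pool only shows the structure of the rewrite. CPython threads will not speed up this arithmetic, and the speedups quoted in the abstract come from the authors' Haskell prototype, not from code like this.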
Symposium on Computer Architecture and High Performance Computing | 2014
Rodrigo C. O. Rocha; Renato Ferreira; Wagner Meira; Dorgival O. Guedes
Most high-performance data processing (aka big-data) systems allow users to express their computation using abstractions (like map-reduce) that simplify the extraction of parallelism from applications. Most frameworks, however, do not allow users to specify how communication must take place: that element is deeply embedded into the run-time system (RTS), making changes hard to implement. In this work we describe our reengineering of the Watershed system, a framework based on the filter-stream paradigm and focused on continuous stream processing. Like other big-data environments, Watershed provided object-oriented abstractions to express computation (filters), but the implementation of streams was an RTS element. By isolating stream functionality into appropriate classes, combination of communication patterns and reuse of common message handling functions (like compression and blocking) become possible. The new architecture even allows the design of new communication patterns, for example, allowing users to choose MPI, TCP or shared memory implementations of communication channels as their problem demands. Applications designed for the new interface showed reductions in code size on the order of 50% and above in some cases, with no significant performance penalty.
International Conference on Parallel Architectures and Compilation Techniques | 2018
Vasileios Porpodas; Rodrigo C. O. Rocha; Luís Fabrício Wanderley Góes
Auto-vectorization techniques allow the compiler to automatically generate SIMD vector code out of scalar code. SLP is a commonly used algorithm for converting straight-line code into vector code, which complements the traditional loop-based vectorizers. It works by scanning the input code looking for groups of instructions that can be combined into vectors and replacing them with the corresponding vector instructions. The state-of-the-art SLP algorithm works by attempting to vectorize blocks of code with a fixed vector width and falling back to smaller widths for the whole block upon failure. In this work we remove this limitation and introduce Variable-Width SLP (VW-SLP), a novel algorithm that is capable of adjusting the vector width at an instruction granularity. This allows the algorithm to better adapt to the code's SIMD parallelism characteristics, thus exposing more vector parallelism than before. We implemented VW-SLP in LLVM and our evaluation on a real system shows that it considerably improves the performance of real benchmark code, with a small increase in compilation time.
International Conference on Conceptual Structures | 2017
Alyson D. Pereira; Rodrigo C. O. Rocha; Márcio Castro; Luís Fabrício Wanderley Góes; Mario A. R. Dantas
The OpenACC programming model simplifies the programming for accelerator devices such as GPUs. Its abstract accelerator model defines a least common denominator for accelerator devices, thus it cannot represent architectural specifics of these devices without losing portability. Therefore, this general-purpose approach delivers good performance on average, but it misses optimization opportunities for code generation and execution of specific classes of applications. In this paper, we propose stencil extensions to enable efficient code generation in OpenACC. Our results show that our stencil extensions may improve the performance of OpenACC by up to 28% and 45% on GPU and CPU, respectively.
Revista Ceres | 2000
M. de A. Botrel; R. de P. Ferreira; C. D. Cruz; A. V. Pereira; Maria Celuta Machado Viana; Rodrigo C. O. Rocha; M. Miranda