Marcio Machado Pereira
State University of Campinas
Publication
Featured research published by Marcio Machado Pereira.
ACM Transactions on Architecture and Code Optimization | 2017
Gleison Souza Diniz Mendonca; Breno Campos Ferreira Guimarães; Péricles Alves; Marcio Machado Pereira; Guido Araujo; Fernando Magno Quintão Pereira
Directive-based programming models, such as OpenACC and OpenMP, allow developers to convert a sequential program into a parallel one with minimal human intervention. However, inserting pragmas into production code is a difficult and error-prone task, often requiring familiarity with the target program. This difficulty restricts the ability of developers to annotate code that they have not written themselves. This article provides a suite of compiler-related methods to mitigate this problem. These techniques rely on symbolic range analysis, a well-known static technique, to achieve two purposes: to populate source code with data-transfer primitives and to disambiguate pointers that could hinder automatic parallelization due to aliasing. We have materialized our ideas in a tool, DawnCC, which can be used stand-alone or through an online interface. To demonstrate its effectiveness, we show how DawnCC can annotate the programs available in PolyBench without any intervention from users. Such annotations lead to speedups of over 100× on an Nvidia architecture and over 50× on an ARM architecture.
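To make the two purposes concrete, the sketch below shows the kind of annotation such a tool could emit on a simple loop (illustrative only, not DawnCC's actual output): `restrict` qualifiers resolve the aliasing question, and the `map` clauses, derived from the symbolic range [0, n-1] of the accesses, insert the data transfers.

```c
#include <stddef.h>

/* Sequential kernel: without annotations, the compiler cannot tell
 * whether `a` and `b` alias, nor which array regions must be copied
 * to the accelerator.  The pragmas and restrict qualifiers below
 * illustrate the kind of annotation the paper automates. */
void saxpy(float *restrict a, const float *restrict b,
           float s, size_t n) {
    #pragma omp target map(tofrom: a[0:n]) map(to: b[0:n])
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + s * b[i];
}
```

Without OpenMP support the pragmas are ignored and the function runs sequentially with the same result, which is what makes this style of annotation safe to insert mechanically.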
Symposium on Computer Architecture and High Performance Computing | 2016
Gleison Souza Diniz Mendonca; Breno Campos Ferreira Guimarães; Péricles Alves; Fernando Magno Quintão Pereira; Marcio Machado Pereira; Guido Araujo
Directive-based programming models, such as OpenACC and OpenMP, have emerged as promising techniques to support the development of parallel applications. These systems allow developers to convert a sequential program into a parallel one with minimal human intervention. However, inserting pragmas into production code is a difficult and error-prone task, often requiring familiarity with the target program. This difficulty restricts the ability of developers to annotate code that they have not written themselves. This paper provides one fundamental component of the solution to this problem. We introduce a static program analysis that infers the bounds of memory regions referenced in source code. Such bounds allow us to automatically insert data-transfer primitives, which are needed when the parallelized code is meant to be executed on an accelerator device, such as a GPU. To validate our ideas, we applied them to Polybench on two different architectures, one Nvidia-based and one Qualcomm-based. We successfully analyzed 98% of all the memory accesses in Polybench. This result enabled us to insert annotations automatically into those benchmarks, leading to speedups of over 100x.
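A small example of why inferring bounds matters: in the loop below the offsets read from `src` are not [0, n-1] but [1, n], and the analysis must discover this to emit a correct transfer clause. The annotation shown is illustrative of the idea, not the paper's exact output (OpenMP array sections are written `base[lower:length]`).

```c
#include <stddef.h>

/* Symbolic range analysis infers that this loop reads src at
 * offsets [1, n] and writes dst at offsets [0, n-1]; those bounds
 * become the data-transfer clauses below.  The caller must supply
 * at least n+1 elements in src. */
void shift_left(float *dst, const float *src, size_t n) {
    #pragma omp target map(to: src[1:n]) map(from: dst[0:n])
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i + 1];
}
```

Transferring only `src[1:n]` rather than a conservative over-approximation is exactly the saving that precise bounds enable.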
International Workshop on OpenMP | 2017
Marcio Machado Pereira; Rafael Cardoso Fernandes Sousa; Guido Araujo
Given their massively parallel computing capabilities, heterogeneous architectures comprising CPUs and accelerators have been increasingly used to speed up scientific and engineering applications. Nevertheless, programming such architectures is a challenging task for most non-expert programmers, as typical accelerator programming languages (e.g., CUDA and OpenCL) demand a thorough understanding of the underlying hardware to enable an effective application speed-up. To achieve that, programmers are usually required to significantly change and adapt program structures and algorithms, thus impacting both performance and productivity. A simpler alternative is to use high-level directive-based programming models like OpenACC and OpenMP. These models allow programmers to insert both directives and runtime calls into existing source code, thus providing hints to the compiler and runtime to perform certain transformations and optimizations on the annotated code regions. In this paper, we present ACLang, an open-source LLVM/Clang compiler framework (http://www.aclang.org) that implements the recently released OpenMP 4.X Accelerator Programming Model. ACLang automatically converts OpenMP 4.X annotated program regions into OpenCL/SPIR kernels, while providing a set of polyhedral-based optimizations like tiling and vectorization. OpenCL kernels resulting from ACLang can be executed on any OpenCL/SPIR-compatible acceleration device: not only GPUs, but also FPGA accelerators like those found in the Intel HARP architecture. To the best of our knowledge, at the time this paper was written this was the first LLVM/Clang implementation of the OpenMP 4.X Accelerator Model to provide a source-to-target OpenCL conversion.
Experiments using ACLang on the Polybench benchmark reveal speed-ups of up to 30x on an Exynos 8890 Octacore CPU with an ARM Mali-T880 MP12 GPU, up to 62x on a 2.4 GHz dual-core Intel Core i5 processor equipped with an Intel Iris GPU, and up to 112x on a 2.1 GHz 32-core Intel Xeon processor equipped with a Tesla K40c GPU.
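To illustrate the kind of region ACLang lowers, the sketch below shows an OpenMP 4.X accelerator region; conceptually, the loop body becomes an OpenCL kernel whose global work-item id plays the role of the loop index (the kernel text in the comment is illustrative, not ACLang's actual output).

```c
#include <stddef.h>

/* An OpenMP 4.X region of the kind a source-to-OpenCL compiler
 * would translate.  Roughly, the body becomes a kernel such as:
 *   __kernel void body(__global float *c, __global const float *a,
 *                      __global const float *b)
 *   { size_t i = get_global_id(0); c[i] = a[i] + b[i]; }
 * (kernel sketch is an assumption for illustration) */
void vec_add(float *c, const float *a, const float *b, size_t n) {
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```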
Parallel Computing | 2016
Marcio Machado Pereira; Matthew Gaudet; J. Nelson Amaral; Guido Araujo
Highlights: We evaluate the strengths and weaknesses of Intel's hardware transactional memory extension (TSX); describe features that are likely to yield performance gains when using TSX; explore the performance of TSX with the aid of a new tool called htm-pBuilder; introduce an efficient policy for guaranteeing forward progress on top of TSX; and explore various fallback-policy tunings and transaction properties of TSX.

This paper presents an extensive performance study of the implementation of Hardware Transactional Memory (HTM) in the Haswell generation of Intel x86 core processors. It evaluates the strengths and weaknesses of this new architecture by exploring several dimensions in the space of Transactional Memory (TM) application characteristics, using the Eigenbench (Hong et al., 2010 [1]) and CLOMP-TM (Schindewolf et al., 2012 [2]) benchmarks. This paper also introduces a new tool, called htm-pBuilder, that tailors fallback policies and allows independent exploration of their parameters. This detailed performance study provides insights on the constraints imposed by Intel's Transactional Synchronization Extensions (Intel TSX) and introduces a simple but efficient policy for guaranteeing forward progress on top of Intel's best-effort HTM, which was critical to achieving performance. The evaluation also shows that there are a number of potential improvements for designers of TM applications and software systems that use Intel's TM, and provides recommendations to extract maximum benefit from the current TM support available in Haswell.
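The forward-progress policy discussed above can be sketched as a bounded-retry loop over a best-effort transaction followed by a fallback lock. Below, `try_htm_begin` is a stub standing in for the `_xbegin()` intrinsic (it "aborts" until a set attempt count is reached), so the policy logic can be exercised without TSX hardware; all names and constants are illustrative, not the paper's implementation.

```c
#include <stdbool.h>

static int attempts_made = 0;

/* Stub for _xbegin(): fails (aborts) until the attempt counter
 * exceeds `succeed_after`. */
static bool try_htm_begin(int succeed_after) {
    return ++attempts_made > succeed_after;
}

/* Bounded-retry forward-progress policy: returns the number of HTM
 * attempts used on success, or -1 once the policy gives up and
 * would take the global fallback lock. */
int run_transaction(int max_retries, int succeed_after) {
    attempts_made = 0;
    for (int retry = 0; retry < max_retries; retry++) {
        if (try_htm_begin(succeed_after)) {
            /* transactional body and _xend() would go here */
            return attempts_made;
        }
        /* transaction aborted: back off and retry */
    }
    /* retries exhausted: acquire global fallback lock (elided) */
    return -1;
}
```

Bounding the retries is what turns a best-effort HTM, which gives no completion guarantee, into a mechanism with guaranteed forward progress.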
Symposium on Computer Architecture and High Performance Computing | 2014
Marcio Machado Pereira; Matthew Gaudet; José Nelson Amaral; Guido Araujo
This paper presents an extensive performance study of the implementation of Hardware Transactional Memory (HTM) in the Haswell generation of Intel x86 core processors. This study evaluates the strengths and weaknesses of this new architecture by exploring several dimensions in the space of Transactional Memory (TM) application characteristics, using the Eigenbench [1] and CLOMP-TM [2] benchmarks. This detailed performance study provides insights on the constraints imposed by Intel's Transactional Synchronization Extensions (Intel TSX) and introduces a simple but efficient policy for guaranteeing forward progress on top of Intel's best-effort HTM, which was also critical to achieving performance. The evaluation also shows that there are a number of potential improvements for designers of TM applications and software systems that use Intel's TM, and provides recommendations to extract maximum benefit from the current TM support available in Haswell.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2013
Marcio Machado Pereira; Alexandro Baldassin; Guido Araujo; Luiz Eduardo Buzato
In the last few years, Transactional Memories (TMs) have been shown to be a parallel programming model that can effectively combine performance improvement with ease of programming. Moreover, the recent introduction of TM-based ISA extensions by major microprocessor manufacturers also seems to endorse TM as a programming model for today's parallel applications. One of the central issues in designing Software TM (STM) systems is to identify mechanisms/heuristics that can minimize the contention arising from conflicting transactions. Although a number of mechanisms have been proposed to tackle contention, such techniques have a limited scope, as conflict is avoided by either interrupting or serializing transaction execution, thus considerably impacting performance. To deal with this limitation, we previously proposed an effective transaction scheduler, along with a conflict-avoidance heuristic, implementing a fully cooperative scheduler that swaps a conflicting transaction for another with a lower conflict probability. This paper extends that framework and introduces a new heuristic, built from the combination of our previous conflict-avoidance technique with the Contention Intensity heuristic proposed by Yoo and Lee. Experimental results, obtained using the STMBench7 and STAMP benchmarks atop tinySTM, show that the proposed heuristic produces significant speedups when compared to four other solutions.
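The Contention Intensity heuristic of Yoo and Lee maintains an exponential moving average of recent aborts and serializes transactions once it crosses a threshold. A minimal sketch, with illustrative parameter values (the paper combines this with its own conflict-avoidance technique, which is not shown here):

```c
#include <stdbool.h>

#define CI_ALPHA     0.7   /* weight given to past history       */
#define CI_THRESHOLD 0.5   /* serialize once CI exceeds this     */

static double ci = 0.0;    /* contention intensity, starts at 0  */

/* After each transaction ends, fold its outcome into the moving
 * average: an abort contributes 1, a commit contributes 0. */
void ci_update(bool aborted) {
    ci = CI_ALPHA * ci + (1.0 - CI_ALPHA) * (aborted ? 1.0 : 0.0);
}

/* Under high contention, fall back to serialized execution. */
bool ci_should_serialize(void) {
    return ci > CI_THRESHOLD;
}
```

Because commits decay the average, the scheduler automatically returns to optimistic concurrent execution once contention subsides.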
Symposium on Computer Architecture and High Performance Computing | 2017
Maicol Zegarra; Marcio Machado Pereira; Xavier Martorell; Guido Araujo
Prefix Scan (or simply scan) is an operator that computes all the partial sums of a vector. A scan operation results in a vector in which each element is the sum of the preceding elements of the original vector, up to the corresponding position. Scan is a key operation in many relevant problems such as sorting, lexical analysis, string comparison, and image filtering, among others. Although there are libraries that provide hand-parallelized implementations of scan in CUDA and OpenCL, no automatic parallelization solution exists for this operator in OpenMP. This paper proposes a new clause for OpenMP that enables the automatic synthesis of parallel scans. By using the proposed clause, a programmer can considerably reduce the complexity of designing scan-based algorithms, allowing them to focus on the problem rather than on learning new parallel programming models or languages. The clause was implemented in AClang, an open-source LLVM/Clang compiler framework that implements the recently released OpenMP 4.X Accelerator Programming Model. Experiments running a set of typical scan-based algorithms on NVIDIA, Intel, and ARM GPUs reveal that the performance of the proposed OpenMP clause is equivalent to that achieved with OpenCL library calls, with the advantage of lower programming complexity.
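The semantics the clause must preserve are those of a sequential inclusive scan, sketched below; the pragma in the comment suggests how the parallel version might be requested (the clause syntax is illustrative, not necessarily the paper's exact proposal).

```c
#include <stddef.h>

/* Inclusive prefix scan: out[i] = in[0] + ... + in[i].
 * With a scan clause, a parallel synthesis might be requested as
 * (illustrative syntax):
 *   #pragma omp parallel for scan(+: out)
 * The sequential loop below defines the reference semantics. */
void inclusive_scan(const int *in, int *out, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += in[i];
        out[i] = acc;
    }
}
```

The loop carries a dependence on `acc`, which is precisely why scan cannot be parallelized by an ordinary `parallel for` and needs dedicated synthesis.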
Symposium on Computer Architecture and High Performance Computing | 2017
Rafael Cardoso Fernandes Sousa; Marcio Machado Pereira; Fernando Magno Quintão Pereira; Guido Araujo
Although heterogeneous computing has enabled impressive program speed-ups, knowledge about the architecture of the target device is still critical to reap the full hardware benefits. Programming such architectures is complex and is usually done by means of specialized languages (e.g., CUDA, OpenCL). The cost of moving and keeping host/device data coherent may easily eliminate any performance gains achieved by acceleration. Although this problem has been extensively studied for multicore architectures and was recently tackled in discrete GPUs through CUDA 8, no generic solution exists for integrated CPU/GPU architectures like those found in mobile devices (e.g., ARM Mali). This paper proposes Data Coherence Analysis (DCA), a set of two data-flow analyses that determine how variables are used by the host/device at each program point. It also introduces Data Coherence Optimization (DCO), a code optimization technique that uses DCA information to: (a) allocate OpenCL shared buffers between host and devices; and (b) insert appropriate OpenCL function calls at program points so as to minimize the number of data coherence operations. DCO was implemented in AClang LLVM (www.aclang.org), a compiler capable of translating OpenMP 4.X annotated loops to OpenCL kernels, thus hiding the complexity of directly programming in OpenCL. Experimental results using DCA and DCO in AClang to compile programs from the Parboil, Polybench and Rodinia benchmarks reveal performance speed-ups of up to 5.25x on an Exynos 8890 Octacore CPU with an ARM Mali-T880 MP12 GPU and up to 2.03x on a 2.4 GHz dual-core Intel Core i5 processor equipped with an Intel Iris GPU.
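A toy model of the underlying idea: track, per buffer, which side holds the valid copy, and emit a coherence operation (an OpenCL map/unmap in the real compiler) only when the next access happens on the other side. The states and counting below are illustrative, not the paper's actual lattice.

```c
/* Per-buffer coherence state: who last touched the data. */
typedef enum { VALID_HOST, VALID_DEVICE } side_t;

typedef struct {
    side_t valid_on;
    int coherence_ops;  /* operations a compiler would emit */
} buffer_t;

/* An access on the side that already holds the valid copy costs
 * nothing; only a side switch requires a coherence operation.
 * Placing calls only at switches is the saving DCO exploits. */
void access_buffer(buffer_t *b, side_t who) {
    if (b->valid_on != who) {
        b->coherence_ops++;   /* map/unmap or explicit copy */
        b->valid_on = who;
    }
}
```

Counting operations this way makes it easy to see why a naive "synchronize around every kernel" strategy is wasteful when consecutive accesses stay on one side.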
Brazilian Conference on Intelligent Systems | 2014
Marcio Machado Pereira; José Nelson Amaral; Guido Araujo
One of the greatest challenges of modern computing is the development of software optimized for parallel execution in multi-core processors. Transactional Memory (TM) is a new trend in concurrency control that has emerged to address these challenges. TM promises the performance of finer grain locks combined with lower programming complexity. However, transactional memories are speculative and rely on contention managers to resolve conflicts between transactions. This paper explores a complementary approach to boost the performance of TM through the use of schedulers. A TM scheduler is a software component that decides when a particular transaction should be executed. TM scheduling mechanisms are typically restricted to either serialization or yielding. Moreover, their effectiveness is very sensitive to the accuracy of the metric used to predict transaction behavior, particularly in high-contention scenarios. This paper proposes a new Dynamic Transaction Scheduler (DTS) to select a transaction to execute next, based on a new policy that rewards success and uses an improved metric that measures the amount of effective work performed by a transaction. An experimental evaluation indicates that scheduling transactions based on DTS can provide good average-case performance.
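The "reward success" idea behind DTS can be sketched as a per-transaction score that grows with the effective work of each commit and shrinks on each abort, with the scheduler picking the highest-scoring transaction next. The constants and scoring function below are illustrative assumptions, not the paper's exact metric.

```c
#include <stdbool.h>

#define N_TX 4              /* illustrative number of transactions */

static double score[N_TX];  /* all start at 0.0 */

/* Feed each transaction outcome back into its score: commits are
 * rewarded in proportion to the effective work performed, aborts
 * are penalized by a fixed amount. */
void dts_report(int tx, bool committed, double work) {
    if (committed)
        score[tx] += work;
    else
        score[tx] -= 1.0;
}

/* Scheduling decision: run the transaction with the best record. */
int dts_pick_next(void) {
    int best = 0;
    for (int i = 1; i < N_TX; i++)
        if (score[i] > score[best])
            best = i;
    return best;
}
```

Weighting rewards by work done, rather than counting commits alone, keeps a long transaction that commits rarely but accomplishes much from being starved by many short ones.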
Field-Programmable Custom Computing Machines | 2018
Ciro Ceissler; Ramon Nepomuceno; Marcio Machado Pereira; Guido Araujo