
Publication


Featured research published by Fernando Magno Quintão Pereira.


Programming Language Design and Implementation | 2008

Register allocation by puzzle solving

Fernando Magno Quintão Pereira; Jens Palsberg

We show that register allocation can be viewed as solving a collection of puzzles. We model the register file as a puzzle board and the program variables as puzzle pieces; pre-coloring and register aliasing fit in naturally. For architectures such as PowerPC, x86, and StrongARM, we can solve the puzzles in polynomial time, and we have augmented the puzzle solver with a simple heuristic for spilling. For SPEC CPU2000, the compilation time of our implementation is as fast as that of the extended version of linear scan used by LLVM, the JIT compiler in the OpenGL stack of Mac OS 10.5. Our implementation produces x86 code that is of similar quality to the code produced by the slower, state-of-the-art iterated register coalescing of George and Appel with the extensions proposed by Smith, Ramsey, and Holloway in 2004.
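The puzzle view can be illustrated with a toy solver. This is a minimal sketch under assumed simplifications (a board of two-square areas, as in x86's AH:AL pairs, and pieces of size 1 or 2), not the paper's allocator: placing the larger pieces first is what makes each puzzle solvable greedily in polynomial time.

```python
def solve_puzzle(areas, pieces):
    """areas: number of two-square register areas; pieces: list of sizes (1 or 2).
    Returns a map piece-index -> (area, offset), or None if the puzzle is unsolvable."""
    free = [[True, True] for _ in range(areas)]  # two squares per area
    placement = {}
    # Size-2 pieces need a whole area, so place them before the size-1 pieces.
    order = sorted(range(len(pieces)), key=lambda i: -pieces[i])
    for i in order:
        size = pieces[i]
        placed = False
        for a in range(areas):
            if size == 2 and free[a][0] and free[a][1]:
                free[a] = [False, False]
                placement[i] = (a, 0)
                placed = True
                break
            if size == 1:
                for off in (0, 1):
                    if free[a][off]:
                        free[a][off] = False
                        placement[i] = (a, off)
                        placed = True
                        break
                if placed:
                    break
        if not placed:
            return None  # in the real allocator, this triggers the spilling heuristic
    return placement
```

With two areas, one double piece and two singles fit; with one area, a double plus a single cannot, which is where a spilling heuristic would step in.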


International Conference on Parallel Architectures and Compilation Techniques | 2011

Divergence Analysis and Optimizations

Bruno Coutinho; Diogo Sampaio; Fernando Magno Quintão Pereira; Wagner Meira

The growing interest in GPU programming has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers tremendous computational power; however, the model also brings restrictions. In particular, processing elements (PEs) execute in lock-step, and may lose performance due to divergences caused by conditional branches. In the face of divergences, some PEs execute while others wait, and this alternation ends when they reach a synchronization point. In this paper we introduce divergence analysis, a static analysis that determines which program variables will have the same values for every PE. This analysis is useful in three different ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers to manually improve their SIMD applications, and it also guides the compiler in the optimization of SIMD programs. We demonstrate this last point by introducing branch fusion, a new compiler optimization that identifies, via a gene sequencing algorithm, chains of similarities between divergent program paths, and weaves these paths together as much as possible. Our implementation has been accepted in the Ocelot open-source CUDA compiler, and is publicly available. We have tested it on many industrial-strength GPU benchmarks, including Rodinia and the Nvidia SDK. Our divergence analysis has a 34% false-positive rate, compared to the results of a dynamic profiler. Our automatic optimization adds a 3% speed-up to parallel quicksort, a heavily optimized benchmark. Our manual optimizations extend this number to over 10%.
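The core of such an analysis can be sketched as a forward fixed-point propagation. This is an illustrative toy, not the paper's implementation: anything data-dependent on the thread index is marked divergent, and everything else is uniform, so a branch on a uniform variable cannot diverge.

```python
def divergent_vars(instructions, seeds=("tid",)):
    """instructions: list of (dest, [operands]) in order.
    Returns the set of variables that may differ across PEs."""
    div = set(seeds)  # the thread index is divergent by definition
    changed = True
    while changed:
        changed = False
        for dest, ops in instructions:
            # A definition is divergent if any of its operands is.
            if dest not in div and any(o in div for o in ops):
                div.add(dest)
                changed = True
    return div

prog = [("i", ["tid"]), ("a", ["n", "m"]), ("b", ["a", "i"]), ("c", ["a"])]
# "b" depends on the thread id through "i"; "a" and "c" stay uniform,
# so a branch on "c" needs no re-convergence handling.
```

A real divergence analysis must also account for control dependences (divergent branches making φ-functions divergent); the sketch covers only the data-flow core.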


Symposium on Code Generation and Optimization | 2009

Wave Propagation and Deep Propagation for Pointer Analysis

Fernando Magno Quintão Pereira; Daniel Berlin

This paper describes two new algorithms for solving inclusion-based points-to analysis. The first algorithm, the Wave Propagation Method, is a modified version of an early technique presented by Pearce et al.; however, it greatly improves on the running time of its predecessor. The second algorithm, the Deep Propagation Method, is a lighter-weight analysis that requires less memory. We have compared these algorithms with three state-of-the-art techniques by Hardekopf-Lin, Heintze-Tardieu and Pearce-Kelly-Hankin. Our experiments show that Deep Propagation has the best average execution time across a suite of 17 well-known benchmarks, the lowest memory requirements in absolute numbers, and the fastest absolute times for benchmarks under 100,000 lines of code. The memory-hungry Wave Propagation has the fastest absolute running times in a memory-rich execution environment, matching the speed of the best known points-to analysis algorithms in large benchmarks.
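The problem both algorithms solve is Andersen-style inclusion-based points-to analysis: a constraint "a ⊇ b" propagates b's points-to set into a's until a fixed point. The paper's contribution is how to schedule that propagation (in waves or along deep paths); the sketch below uses a plain worklist instead, as a baseline illustration.

```python
def points_to(addr_of, copies):
    """addr_of: list of (p, x) for statements p = &x.
    copies: list of (a, b) for statements a = b, i.e. pts(a) ⊇ pts(b)."""
    pts = {}
    for p, x in addr_of:
        pts.setdefault(p, set()).add(x)
    succ = {}
    for a, b in copies:
        succ.setdefault(b, set()).add(a)  # b's set flows into a's
    work = [p for p, _ in addr_of]
    while work:
        n = work.pop()
        for s in succ.get(n, ()):
            before = len(pts.setdefault(s, set()))
            pts[s] |= pts.get(n, set())
            if len(pts[s]) != before:
                work.append(s)  # s grew: re-propagate from it
    return pts

# p = &x; q = p; r = q  =>  p, q and r all point to {x}
```

Wave Propagation batches this work by topologically ordering strongly connected components of the constraint graph, which is why it trades memory for speed.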


Symposium on Code Generation and Optimization | 2013

A fast and low-overhead technique to secure programs against integer overflows

Fernando Magno Quintão Pereira; Raphael Ernani Rodrigues; Victor Hugo Sperle Campos

The integer primitive type has upper and lower bounds in many programming languages, including C and Java. These limits might lead programs that manipulate large integer numbers to produce unexpected results due to overflows. There exists a plethora of works that instrument programs to track the occurrence of these overflows. In this paper we present an algorithm that uses static range analysis to avoid this instrumentation whenever possible. Our range analysis contains novel techniques, such as a notion of "future" bounds to handle comparisons between variables. We have used this algorithm to avoid some checks created by a dynamic instrumentation library that we have implemented in LLVM. This framework has been used to detect overflows in hundreds of C/C++ programs. As a testament to its effectiveness, our range analysis has been able to avoid 25% of all the overflow checks necessary to secure the C programs in the LLVM test suite. This optimization has reduced the runtime overhead of instrumentation by 50%.
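The check-elision idea can be sketched with plain interval arithmetic. This is an illustrative toy, not the paper's LLVM implementation: if the statically inferred interval of "a + b" fits inside the type's bounds, the runtime overflow check is provably dead and can be removed.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # bounds of a 32-bit signed integer

def needs_overflow_check(range_a, range_b):
    """range_a, range_b: (lo, hi) intervals inferred for the operands of an add.
    Returns False only when the sum provably fits in 32 bits."""
    lo = range_a[0] + range_b[0]
    hi = range_a[1] + range_b[1]
    return not (INT_MIN <= lo and hi <= INT_MAX)

# A loop counter known to stay in [0, 1000] plus a constant in [0, 5]
# can never overflow, so no instrumentation is emitted for that add.
```

The "future bounds" of the paper go further, letting the analysis compare two variables symbolically even when neither has concrete limits.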


Symposium on Code Generation and Optimization | 2013

Just-in-time value specialization

Henrique Nazaré Santos; Péricles Alves; Igor Rafael de Assis Costa; Fernando Magno Quintão Pereira

JavaScript emerges today as one of the most important programming languages for the development of client-side web applications. Therefore, it is essential that browsers be able to execute JavaScript programs efficiently. However, the dynamic nature of this programming language makes it very challenging to achieve this much-needed efficiency. In this paper we propose parameter-based value specialization as a way to improve the quality of the code produced by JIT engines. We have empirically observed that almost 60% of the JavaScript functions found in the world's 100 most popular websites are called only once, or are called with the same parameters. Capitalizing on this observation, we adapt a number of classic compiler optimizations to specialize code based on the runtime values of functions' actual parameters. We have implemented the techniques proposed in this paper in IonMonkey, an industrial-quality JavaScript JIT compiler developed at the Mozilla Foundation. Our experiments, run across three popular JavaScript benchmarks, SunSpider, V8 and Kraken, show that, in spite of its highly speculative nature, our optimization pays for itself. As an example, we have been able to speed up SunSpider by 5.38%, and to reduce the size of its native code by 16.72%.
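The specialization step can be sketched source-to-source. This is a hypothetical helper, not IonMonkey's machinery: when a function is observed to always receive the same argument, the JIT can clone its body with that value bound as a constant and let constant folding simplify the clone.

```python
import ast

def specialize(src, arg_name, value):
    """Return the source of a clone of the function in `src`, with the
    parameter arg_name removed and bound to the observed runtime value."""
    tree = ast.parse(src).body[0]
    tree.args.args = [a for a in tree.args.args if a.arg != arg_name]
    # Prepend "arg_name = value" so the constant reaches every use in the body.
    bind = ast.parse(f"{arg_name} = {value!r}").body[0]
    tree.body.insert(0, bind)
    tree.name += "_spec"
    return ast.unparse(tree)

src = "def scale(x, k):\n    return x * k + k"
print(specialize(src, "k", 3))
# The clone takes only x; a real JIT would now fold 3 through the body,
# and deoptimize back to the generic version if k ever changes.
```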


Conference on Object-Oriented Programming Systems, Languages, and Applications | 2014

Validation of memory accesses through symbolic analyses

Henrique Nazaré; Izabela Maffra; Willer Santos; Leonardo Barbosa; Laure Gonnord; Fernando Magno Quintão Pereira

The C programming language does not prevent out-of-bounds memory accesses. There exist several techniques to secure C programs; however, these methods tend to slow down these programs substantially, because they populate the binary code with runtime checks. To deal with this problem, we have designed and tested two static analyses - symbolic region and range analysis - which we combine to remove the majority of these guards. In addition to the analyses themselves, we bring two other contributions. First, we describe live range splitting strategies that improve the efficiency and the precision of our analyses. Second, we show how to deal with integer overflows, a phenomenon that can compromise the correctness of static algorithms that validate memory accesses. We validate our claims by incorporating our findings into AddressSanitizer. We generate SPEC CINT 2006 code that is 17% faster and 9% more energy efficient than the code produced originally by this tool. Furthermore, our approach is 50% more effective than Pentagons, a state-of-the-art analysis to sanitize memory accesses.
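The guard-elimination idea can be sketched with a deliberately simple symbolic comparison, far cruder than the paper's analyses: bounds are either integers or pairs (sym, off) meaning "sym + off", and an access is provably safe when its index range sits inside [0, size - 1]; only then is the runtime check removed.

```python
def guard_needed(index_lo, index_hi, size):
    """Bounds are ints or (sym, off) pairs meaning sym + off. Returns False
    only when 0 <= index_lo and index_hi <= size - 1 are both provable."""
    def leq(x, y):  # is x <= y provable?
        if isinstance(x, int) and isinstance(y, int):
            return x <= y
        if isinstance(x, tuple) and isinstance(y, tuple) and x[0] == y[0]:
            return x[1] <= y[1]  # same symbol: compare offsets
        return False  # incomparable symbols: cannot prove safety
    upper = size - 1 if isinstance(size, int) else (size[0], size[1] - 1)
    return not (leq(0, index_lo) and leq(index_hi, upper))

# for (i = 0; i < n; i++) a[i] ... over a region of n cells:
# the index range is [0, n - 1], so the guard is removed.
```

Anything the comparison cannot prove keeps its guard, which is the safe default an AddressSanitizer-style tool needs.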


International Conference on Parallel Architectures and Compilation Techniques | 2014

Compiler support for selective page migration in NUMA architectures

Guilherme Piccoli; Henrique Nazaré Santos; Raphael Ernani Rodrigues; Christiane Pousa; Edson Borin; Fernando Magno Quintão Pereira

Current high-performance multicore processors provide users with a non-uniform memory access model (NUMA). These systems perform better when threads access data on memory banks next to the core where they run. However, ensuring data locality is difficult. In this paper, we propose compiler analyses and code generation methods to support a lightweight runtime system that dynamically migrates memory pages to improve data locality. Our technique combines static and dynamic analyses and is capable of identifying the most promising pages to migrate. Statically, we infer the size of arrays, plus the amount of reuse of each memory access instruction in a program. These estimates rely on a simple, yet accurate, trip count predictor of our own design. This knowledge lets us build templates of dynamic checks, to be filled with values known only at runtime. These checks determine when it is profitable to migrate data closer to the processors where this data is used. Our static analyses are quadratic on the number of variables in a program, and the dynamic checks are O(1) in practice. Our technique does not require any form of user intervention, the support of third-party middleware, or modifications to the operating system's kernel. We have applied our technique on several parallel algorithms, which are completely oblivious to the asymmetric memory topology, and have observed speedups of up to 4×, compared to static heuristics. We compare our approach against Minas, a middleware that supports NUMA-aware data allocation, and show that we can outperform it by up to 50% in some cases.
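The dynamic-check template can be sketched with a hypothetical cost model (the constants below are assumptions, not the paper's numbers): the compiler emits the check with the array-size expression and trip-count estimate as holes, and at runtime the filled-in check decides in O(1) whether migration pays for itself.

```python
PAGE = 4096
MIGRATION_COST_PER_PAGE = 1000   # assumed cycles to migrate one page
REMOTE_ACCESS_PENALTY = 50       # assumed extra cycles per remote access

def should_migrate(array_bytes, predicted_accesses):
    """Filled at runtime with the actual array size and trip-count estimate."""
    pages = (array_bytes + PAGE - 1) // PAGE
    saved = predicted_accesses * REMOTE_ACCESS_PENALTY
    return saved > pages * MIGRATION_COST_PER_PAGE

# A 1 MiB array reused across a million accesses pays for the migration;
# the same array touched only a few thousand times does not.
```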


Compiler Construction | 2016

Sparse representation of implicit flows with applications to side-channel detection

Bruno Rodrigues; Fernando Magno Quintão Pereira; Diego F. Aranha

Information flow analyses traditionally use the Program Dependence Graph (PDG) as a supporting data structure. This graph relies on Ferrante et al.'s notion of control dependences to represent implicit flows of information. A limitation of this approach is that it may create O(|I| × |E|) implicit flow edges in the PDG, where I is the set of instructions in a program and E is the set of edges in its control flow graph. This paper shows that it is possible to compute information flow analyses using a different notion of implicit dependence, which yields a number of edges linear in the number of definitions plus uses of variables. Our algorithm computes these dependences in a single traversal of the program's dominance tree. This efficiency is possible due to a key property of programs in Static Single Assignment form: the definition of a variable dominates all its uses. Our algorithm correctly implements Hunt and Sands' system of security types. Contrary to their original formulation, which required O(|I| × |I|) space and time for structured programs, we require only O(|I|). We have used our ideas to build FlowTracker, a tool that uncovers side-channel vulnerabilities in cryptographic algorithms. FlowTracker handles programs with over one million assembly instructions in less than 200 seconds, and creates 24% fewer implicit flow edges than Ferrante et al.'s technique. FlowTracker has detected an issue in a constant-time implementation of Elliptic Curve Cryptography; it has found several time-variant constructions in OpenSSL, one issue in TrueCrypt, and it has validated the isochronous behavior of the NaCl library.
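The sparse treatment of implicit flows can be sketched as follows; this is an illustrative toy, not FlowTracker itself. Besides ordinary data-flow edges, each φ-function gets one extra edge from the predicate that controls which definition reaches it, so the number of implicit edges stays linear in the number of definitions.

```python
def tainted(data_deps, phi_preds, sources):
    """data_deps: var -> set of operand vars; phi_preds: phi var -> controlling
    predicate var; sources: initially tainted vars. Returns all tainted vars."""
    deps = {v: set(ops) for v, ops in data_deps.items()}
    for phi, pred in phi_preds.items():
        deps.setdefault(phi, set()).add(pred)  # the implicit-flow edge
    taint = set(sources)
    changed = True
    while changed:
        changed = False
        for v, ops in deps.items():
            if v not in taint and ops & taint:
                taint.add(v)
                changed = True
    return taint

# if (secret) x1 = 0 else x2 = 1; x = phi(x1, x2) controlled by "secret":
# x becomes tainted through the implicit edge, even though 0 and 1 are public,
# which is exactly how a secret-dependent branch leaks through timing.
```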


Foundations of Software Science and Computation Structure | 2006

Register allocation after classical SSA elimination is NP-Complete

Fernando Magno Quintão Pereira; Jens Palsberg

Chaitin proved that register allocation is equivalent to graph coloring and hence NP-complete. Recently, Bouchez, Brisk, and Hack have proved independently that the interference graph of a program in static single assignment (SSA) form is chordal and therefore colorable in linear time. Can we use the result of Bouchez et al. to do register allocation in polynomial time by first transforming the program to SSA form, then performing register allocation, and finally doing the classical SSA elimination that replaces φ-functions with copy instructions? In this paper we show that the answer is no, unless P = NP: register allocation after classical SSA elimination is NP-complete. Chaitin's proof technique does not work for programs after classical SSA elimination; instead we use a reduction from the graph coloring problem for circular arc graphs.


Compiler Construction | 2009

SSA Elimination after Register Allocation

Fernando Magno Quintão Pereira; Jens Palsberg

Compilers such as gcc use static single assignment (SSA) form as an intermediate representation and usually perform SSA elimination before register allocation. But the order could just as well be the opposite: the recent approach of SSA-based register allocation performs SSA elimination after register allocation. SSA elimination before register allocation is straightforward and standard, while previously described approaches to SSA elimination after register allocation have shortcomings; in particular, they have problems with implementing copies between memory locations. We present spill-free SSA elimination, a simple and efficient algorithm for SSA elimination after register allocation that avoids increasing the number of spilled variables. We also present three optimizations of the core algorithm. Our experiments show that spill-free SSA elimination takes less than five percent of the total compilation time of a JIT compiler. Our optimizations reduce the number of memory accesses by more than 9% and improve the program execution time by more than 1.8%.
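The key subproblem can be sketched independently of the paper's full algorithm: a φ-function's parallel copy (r1, r2, ...) := (s1, s2, ...) over allocated registers must become sequential moves without clobbering a source before it is read. This illustrative sketch orders the acyclic part naturally and breaks permutation cycles with register swaps, so no extra spill location is needed.

```python
def sequentialize(parallel_copy):
    """parallel_copy: dict dest -> src over register names. Returns a list of
    ('move', dst, src) and ('swap', a, b) instructions with the same effect."""
    todo = {d: s for d, s in parallel_copy.items() if d != s}
    out = []
    while todo:
        # A move is safe when its destination is no longer needed as a source.
        ready = [d for d in todo if d not in todo.values()]
        if ready:
            d = ready[0]
            out.append(("move", d, todo.pop(d)))
        else:
            # Only cycles remain: break one with a swap.
            d = next(iter(todo))
            s = todo.pop(d)
            out.append(("swap", d, s))
            # d's old value now lives in s, so redirect pending reads of d,
            # and drop copies that the swap already satisfied.
            todo = {k: (s if v == d else v) for k, v in todo.items()}
            todo = {k: v for k, v in todo.items() if k != v}
    return out

# {a<-b, b<-a} needs one swap; {a<-b, b<-c} is two ordered moves.
```

Swaps can be implemented with an exchange instruction or three XORs, which is what makes the elimination spill-free even when no register is available as a scratch.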

Collaboration


Dive into Fernando Magno Quintão Pereira's collaborations.

Top Co-Authors

Leonardo B. Oliveira, Universidade Federal de Minas Gerais
Péricles Alves, Universidade Federal de Minas Gerais
Mariza Andrade da Silva Bigonha, Universidade Federal de Minas Gerais
Jens Palsberg, University of California
Diogo Sampaio, Universidade Federal de Minas Gerais
Vitor Paisante, Universidade Federal de Minas Gerais
Breno Campos Ferreira Guimarães, Universidade Federal de Minas Gerais
Raphael Ernani Rodrigues, Universidade Federal de Minas Gerais
Roberto da Silva Bigonha, Pontifícia Universidade Católica de Minas Gerais