
Publication


Featured research published by Guilherme Ottoni.


International Symposium on Microarchitecture | 2005

Automatic Thread Extraction with Decoupled Software Pipelining

Guilherme Ottoni; Ram Rangan; Adam Stoler; David I. August

Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have led microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called decoupled software pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promising future for this approach.
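To make the pipelining idea concrete, here is a minimal Python sketch, not the paper's compiler algorithm: a loop over a linked list is split into two decoupled stages connected by a queue, keeping the pointer-chasing recurrence in one thread while another does the per-node work. The `Node` class and the squaring "work" are illustrative stand-ins.

```python
import queue
import threading

class Node:
    def __init__(self, value, nxt=None):
        self.value, self.next = value, nxt

def traverse(head, q):
    # Stage 1: the pointer-chasing recurrence stays in a single thread.
    node = head
    while node is not None:
        q.put(node.value)
        node = node.next
    q.put(None)  # sentinel marking the end of the stream

def compute(q, results):
    # Stage 2: per-node work, decoupled from traversal by the queue.
    while (value := q.get()) is not None:
        results.append(value * value)  # stand-in for real per-node work

head = None
for v in range(9, -1, -1):  # build the list 0..9
    head = Node(v, head)

q, results = queue.Queue(maxsize=32), []
stage1 = threading.Thread(target=traverse, args=(head, q))
stage2 = threading.Thread(target=compute, args=(q, results))
stage1.start(); stage2.start()
stage1.join(); stage2.join()
print(results)  # [0, 1, 4, ..., 81]
```

Because the queue buffers values, a stall in either stage does not immediately stall the other, which is the decoupling the abstract refers to.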


International Symposium on Microarchitecture | 2004

RIFLE: An Architectural Framework for User-Centric Information-Flow Security

Neil Vachharajani; Matthew J. Bridges; Jonathan Chang; Ram Rangan; Guilherme Ottoni; Jason A. Blome; George A. Reis; Manish Vachharajani; David I. August

Even as modern computing systems allow the manipulation and distribution of massive amounts of information, users of these systems are unable to manage the confidentiality of their data in a practical fashion. Conventional access control security mechanisms cannot prevent the illegitimate use of privileged data once access is granted. For example, information provided by a user during an online purchase may be covertly delivered to malicious third parties by an untrustworthy web browser. Existing information-flow security mechanisms do provide this assurance, but only for programmer-specified policies enforced during program development as a static analysis on special-purpose type-safe languages. Not only are these techniques not applicable to many commonly used programs, but they leave the user with no defense against malicious programmers or altered binaries. In this paper, we propose RIFLE, a runtime information-flow security system designed from the user's perspective. By addressing information-flow security using architectural support, RIFLE gives users a practical way to enforce their own information-flow security policy on all programs. We prove that, contrary to statements in the literature, run-time systems like RIFLE are no less secure than existing language-based techniques. Using a model of the architectural framework and a binary translator, we demonstrate RIFLE's correctness and illustrate that the performance cost is reasonable.
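The architectural details are beyond a short example, but the core idea of dynamic information-flow tracking can be sketched in a few lines of Python: every value carries a security label, operations join the labels of their inputs, and the user's policy is checked where data leaves the program. Everything here (the `Tagged` wrapper, the policy check) is illustrative, not RIFLE's mechanism.

```python
class Tagged:
    """A value paired with a taint label (True means it derives from secret data)."""
    def __init__(self, value, tainted=False):
        self.value, self.tainted = value, tainted

def add(a, b):
    # Data flow: the result is secret if either input is secret.
    return Tagged(a.value + b.value, a.tainted or b.tainted)

def output(v, channel_is_public):
    # The user's policy is enforced at run time, at the output boundary.
    if channel_is_public and v.tainted:
        raise PermissionError("secret data would reach a public channel")
    print(v.value)

secret = Tagged(42, tainted=True)   # e.g., a credit card number
public = Tagged(8)
total = add(secret, public)
output(total, channel_is_public=False)  # permitted
output(total, channel_is_public=True)   # raises PermissionError
```

RIFLE does the analogous tracking in hardware on unmodified binaries, which is what lets the user's policy apply to all programs rather than only those written in special type-safe languages.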


Symposium on Code Generation and Optimization | 2008

Parallel-stage decoupled software pipelining

Easwaran Raman; Guilherme Ottoni; Arun Raman; Matthew J. Bridges; David I. August

In recent years, the microprocessor industry has embraced chip multiprocessors (CMPs), also known as multi-core architectures, as the dominant design paradigm. For existing and new applications to make effective use of CMPs, it is desirable that compilers automatically extract thread-level parallelism from single-threaded applications. DOALL is a popular automatic technique for loop-level parallelization employed successfully in the domains of scientific and numeric computing. While DOALL generally scales well with the number of iterations of the loop, its applicability is limited by the presence of loop-carried dependences. A parallelization technique with greater applicability is decoupled software pipelining (DSWP), which parallelizes loops even in the presence of loop-carried dependences. However, the scalability of DSWP is limited by the size of the loop body and the number of recurrences it contains, which are usually smaller than the loop iteration count. This work proposes a novel non-speculative compiler parallelization technique called parallel-stage decoupled software pipelining (PS-DSWP). The goal of PS-DSWP is to combine the applicability of DSWP with the scalability of DOALL parallelization. A key insight of PS-DSWP is that, after isolating the recurrences in their own stages in DSWP, portions of the loop suitable for DOALL parallelization may be exposed. PS-DSWP extends DSWP to benefit from these opportunities, utilizing multiple threads to execute the same stage of a DSWPed loop in parallel. This paper describes the PS-DSWP transformation in detail and discusses its implementation in a research compiler. PS-DSWP produces an average speedup of 114% (up to a maximum of 155%) with 6 threads on loops from a set of 5 applications. Our experiments also demonstrate that PS-DSWP achieves better scalability with the number of threads than DSWP.
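A rough Python rendering of the transformation's effect, under the same illustrative setup as the DSWP sketch above: the stage carrying the recurrence stays sequential, while a dependence-free stage is replicated across several worker threads, DOALL-style. The worker count and the squaring "work" are arbitrary.

```python
import queue
import threading

NUM_WORKERS = 3  # replicas of the parallel stage (arbitrary)

def sequential_stage(items, q):
    # The recurrence (here, simply ordered production) is isolated in
    # one stage, as PS-DSWP isolates recurrences in their own stages.
    for item in items:
        q.put(item)
    for _ in range(NUM_WORKERS):
        q.put(None)  # one end-of-stream sentinel per worker

def parallel_stage(q, out):
    # No cross-iteration dependence here, so this stage can be
    # replicated across threads like a DOALL loop.
    while (item := q.get()) is not None:
        out.append(item * item)  # stand-in for heavy per-item work

q, out = queue.Queue(maxsize=64), []
producer = threading.Thread(target=sequential_stage, args=(range(10), q))
workers = [threading.Thread(target=parallel_stage, args=(q, out))
           for _ in range(NUM_WORKERS)]
producer.start()
for w in workers:
    w.start()
producer.join()
for w in workers:
    w.join()
print(sorted(out))  # order across workers is nondeterministic
```

Unlike plain DSWP, whose thread count is capped by the number of pipeline stages, the replicated stage lets the thread count grow with the iteration count, which is the scalability claim in the abstract.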


International Conference on Parallel Architectures and Compilation Techniques | 2007

Speculative Decoupled Software Pipelining

Neil Vachharajani; Ram Rangan; Easwaran Raman; Matthew J. Bridges; Guilherme Ottoni; David I. August

In recent years, microprocessor manufacturers have shifted their focus from single-core to multi-core processors. To avoid burdening programmers with the responsibility of parallelizing their applications, some researchers have advocated automatic thread extraction. A recently proposed technique, decoupled software pipelining (DSWP), has demonstrated promise by partitioning loops into long-running, fine-grained threads organized into a pipeline. Using a pipeline organization and execution decoupled by inter-core communication queues, DSWP offers increased execution efficiency that is largely independent of inter-core communication latency. This paper proposes adding speculation to DSWP and evaluates an automatic approach for its implementation. By speculating past infrequent dependences, the benefit of DSWP is increased by making it applicable to more loops, facilitating better balanced threads, and enabling parallelized loops to be run on more cores. Unlike prior speculative threading proposals, speculative DSWP focuses on breaking dependence recurrences. By speculatively breaking these recurrences, instructions that were formerly restricted to a single thread to ensure decoupling are now free to span multiple threads. Using an initial automatic compiler implementation and a validated processor model, this paper demonstrates significant gains using speculation for 4-core chip multiprocessor models running a variety of codes.
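The rollback machinery is the hard part in practice, but the speculation itself is easy to sketch. The hypothetical Python below ignores a rare early-exit dependence while running iterations ahead, then validates and discards any work past the first misspeculated iteration; the condition and work functions are made up.

```python
def rare_exit(x):
    return x == 7  # an infrequent condition that would normally break the loop

def work(x):
    return x * x

def speculative_loop(data):
    # Speculate: run every iteration as if the early exit never fires.
    results = [work(x) for x in data]
    # Validate: find the first iteration where the speculation failed.
    for i, x in enumerate(data):
        if rare_exit(x):
            # "Rollback": keep only what sequential execution (which
            # breaks before processing x) would have produced.
            return results[:i]
    return results

print(speculative_loop([1, 2, 3]))  # no misspeculation: [1, 4, 9]
print(speculative_loop([1, 7, 3]))  # exit fires at 7: [1]
```

Breaking the exit's dependence recurrence is what frees the remaining instructions to be distributed across more threads, as the abstract describes.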


International Symposium on Microarchitecture | 2007

Global Multi-Threaded Instruction Scheduling

Guilherme Ottoni; David I. August

The microprocessor industry has moved toward chip multiprocessor (CMP) designs as a means of utilizing the increasing transistor counts in the face of physical and micro-architectural limitations. Despite this move, CMPs do not directly improve the performance of single-threaded codes, a characteristic of most applications. In order to support parallelization of general-purpose applications, computer architects have proposed CMPs with lightweight scalar communication mechanisms (R. Rangan et al., 2004), (K. Sankaralingam, et al., 2003), (M.B. Taylor et al., 2005). Despite such support, most existing compiler multi-threading techniques have generally demonstrated little effectiveness in extracting parallelism from non-scientific applications (W. Lee et al., 1998), (W. Lee et al., 2002), (K. Rich and M. Farrens, 2004). The main reason for this is that such techniques are mostly restricted to extracting parallelism within straight-line regions of code. In this paper, we first propose a framework that enables global multi-threaded instruction scheduling in general. We then describe GREMIO, a scheduler built using this framework. GREMIO operates at a global scope, at the procedure level, and uses control dependence analysis to extract non-speculative thread-level parallelism from sequential codes. Using a fully automatic compiler implementation of GREMIO and a validated processor model, this paper demonstrates gains for a dual-core CMP model running a variety of codes. Our experiments demonstrate the advantage of exploiting global scheduling for multithreaded architectures, and present gains in a detailed comparison with the decoupled software pipelining (DSWP) multi-threading technique (G. Ottoni et al., 2005). Furthermore, our experiments show that adding GREMIO to a compiler with DSWP improves the average speedup from 16.5% to 32.8% for important benchmark functions when utilizing two cores, indicating the importance of this technique in making compilers extract threads effectively.
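GREMIO's hierarchical, control-dependence-aware algorithm does not reduce to a few lines, but the underlying idea of list-scheduling one region's dependence graph across cores can be sketched. In the hypothetical Python below, each instruction is greedily placed on whichever of two cores finishes it earliest, charging extra latency when an operand must cross cores; the DAG and all latencies are invented.

```python
COMM = 3  # cycles to forward a value between cores (assumed)
LAT = 1   # uniform instruction latency (assumed)

# Dependence DAG: instruction -> predecessors it consumes values from.
deps = {'a': [], 'b': [], 'c': ['a'], 'd': ['b'], 'e': ['c', 'd']}

core_free = [0, 0]  # next free cycle on each core
placed = {}         # instruction -> (core, finish cycle)

for instr in ('a', 'b', 'c', 'd', 'e'):  # any topological order
    best = None
    for core in (0, 1):
        # Operands arrive later if they were produced on the other core.
        ready = max([0] + [placed[p][1] + (COMM if placed[p][0] != core else 0)
                           for p in deps[instr]])
        finish = max(ready, core_free[core]) + LAT
        if best is None or finish < best[1]:
            best = (core, finish)
    placed[instr] = best
    core_free[best[0]] = best[1]

for instr, (core, finish) in placed.items():
    print(f"{instr}: core {core}, finishes at cycle {finish}")
```

The point of operating globally, as GREMIO does, is that the scheduler sees whole control regions rather than a single straight-line block, so the independent chains being separated onto cores can come from across the procedure's control flow.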


Symposium on Code Generation and Optimization | 2005

Practical and Accurate Low-Level Pointer Analysis

Bolei Guo; Matthew J. Bridges; Spyridon Triantafyllis; Guilherme Ottoni; Easwaran Raman; David I. August

Pointer analysis is traditionally performed once, early in the compilation process, upon an intermediate representation (IR) with source-code semantics. However, performing pointer analysis only once at this level imposes a phase-ordering constraint, causing alias information to become stale after subsequent code transformations. Moreover, high-level pointer analysis cannot be used at link time or run time, where the source code is unavailable. This paper advocates performing pointer analysis on a low-level intermediate representation. We present the first context-sensitive and partially flow-sensitive points-to analysis designed to operate at the assembly level. As we will demonstrate, low-level pointer analysis can be as accurate as high-level analysis. Additionally, our low-level pointer analysis also enables a quantitative comparison of propagating high-level pointer analysis results through subsequent code transformations, versus recomputing them at the low level. We show that, for C programs, the former practice is considerably less accurate than the latter.
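To make the flavor of the problem concrete, here is a tiny flow-insensitive, inclusion-based (Andersen-style) points-to analysis over a made-up three-address IR. The paper's analysis is context-sensitive, partially flow-sensitive, and operates on real assembly, so this is only a distant sketch.

```python
from collections import defaultdict

def points_to(program):
    """Iterate inclusion constraints to a fixed point.

    Statements: ('addr', a, b) is a = &b; ('copy', a, b) is a = b;
    ('load', a, b) is a = *b; ('store', a, b) is *a = b.
    """
    pts = defaultdict(set)
    changed = True
    while changed:
        changed = False
        for op, a, b in program:
            if op == 'addr':
                new, targets = {b}, [a]
            elif op == 'copy':
                new, targets = set(pts[b]), [a]
            elif op == 'load':   # a may point to whatever *b may point to
                new = set().union(*(pts[t] for t in pts[b])) if pts[b] else set()
                targets = [a]
            else:                # 'store': everything a may point to absorbs pts(b)
                new, targets = set(pts[b]), list(pts[a])
            for t in targets:
                if not new <= pts[t]:
                    pts[t] |= new
                    changed = True
    return pts

program = [('addr', 'p', 'x'), ('addr', 'q', 'y'),
           ('copy', 'r', 'p'), ('store', 'r', 'q'), ('load', 's', 'p')]
for var, targets in sorted(points_to(program).items()):
    print(var, '->', sorted(targets))  # e.g. x -> ['y'], s -> ['y']
```

At the assembly level the same machinery must also model registers, the stack, and explicit address arithmetic, which is much of the difficulty the paper addresses.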


Programming Language Design and Implementation | 2006

A framework for unrestricted whole-program optimization

Spyridon Triantafyllis; Matthew J. Bridges; Easwaran Raman; Guilherme Ottoni; David I. August

Procedures have long been the basic units of compilation in conventional optimization frameworks. However, procedures are typically formed to serve software engineering rather than optimization goals, arbitrarily constraining code transformations. Techniques such as aggressive inlining and interprocedural optimization have been developed to alleviate this problem, but, due to code growth and compile time issues, these can be applied only sparingly. This paper introduces the Procedure Boundary Elimination (PBE) compilation framework, which allows unrestricted whole-program optimization. PBE allows all intra-procedural optimizations and analyses to operate on arbitrary subgraphs of the program, regardless of the original procedure boundaries and without resorting to inlining. In order to control compilation time, PBE also introduces novel extensions of region formation and encapsulation. PBE enables targeted code specialization, which recovers the specialization benefits of inlining while keeping code growth in check. This paper shows that PBE attains better performance than inlining with half the code growth.
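Region formation is central to controlling compile time here, and its spirit fits in a short sketch: starting from the hottest block of a whole-program CFG whose procedure boundaries have already been dissolved, grow a compilation region along the hottest edges until a size budget is hit. The CFG, weights, and budget below are all invented for illustration; PBE's actual region formation and encapsulation are considerably more sophisticated.

```python
# Whole-program CFG: block -> [(successor, edge weight)].
cfg = {
    'main.entry': [('f.entry', 90), ('main.cold', 10)],
    'f.entry':    [('f.loop', 100)],
    'f.loop':     [('f.loop', 95), ('f.exit', 5)],
    'f.exit':     [('main.cold', 5)],
    'main.cold':  [],
}
block_weight = {'main.entry': 100, 'f.entry': 90, 'f.loop': 9000,
                'f.exit': 5, 'main.cold': 15}

BUDGET = 3  # maximum blocks per region (stands in for a code-size budget)

seed = max(block_weight, key=block_weight.get)  # hottest block
region = {seed}
while len(region) < BUDGET:
    # Hottest edge leaving the region, ignoring which procedure "owns"
    # the successor block: the boundary has no meaning here.
    frontier = [(w, succ) for b in region for succ, w in cfg[b]
                if succ not in region]
    if not frontier:
        break
    region.add(max(frontier)[1])

print(sorted(region))  # ['f.exit', 'f.loop', 'main.cold']
```

Note that the grown region freely crosses from f into main, which is exactly the freedom the abstract claims over procedure-based compilation units.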


International Symposium on Microarchitecture | 2006

Support for High-Frequency Streaming in CMPs

Ram Rangan; Neil Vachharajani; Adam Stoler; Guilherme Ottoni; David I. August; George Cai

As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or fully independent threads to improve performance. Unfortunately, developers and compilers alike often fail to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown significant promise for both manual and automatic parallelization. These techniques have wide-scale applicability because they embrace inter-thread dependences (albeit acyclic dependences) and tolerate long-latency communication of these dependences. This paper addresses the lack of architectural support for this type of concurrency, which has blocked its adoption and hindered related language and compiler research. We observe that both manual and automatic techniques create high-frequency streaming threads, with communication occurring every 5 to 20 instructions. Even while easily tolerating inter-thread transit delays, high-frequency communication makes thread performance very sensitive to intra-thread delays from the repeated execution of the communication operations. Using this observation, we define the design space and evaluate several mechanisms to find a better trade-off between performance and operating system, hardware, and design costs. From this, we find a light-weight streaming-aware enhancement to conventional memory subsystems that doubles the speed of these codes and is within 2% of the best-performing, but heavy-weight, hardware solution.
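The communication operations in question are queue produce/consume; a hypothetical single-producer/single-consumer ring buffer shows why their per-operation cost matters when they run every handful of instructions. In CPython the GIL makes this demo safe; the paper studies OS and hardware support aimed at making the real operations nearly as cheap as the few loads and stores below.

```python
import threading

class SPSCQueue:
    """Single-producer/single-consumer ring buffer (illustrative only)."""
    def __init__(self, capacity=1024):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def produce(self, value):
        while (self.tail + 1) % self.capacity == self.head:
            pass  # busy-wait while full (back-pressure on the producer)
        self.buf[self.tail] = value
        self.tail = (self.tail + 1) % self.capacity

    def consume(self):
        while self.head == self.tail:
            pass  # busy-wait while empty
        value = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        return value

q = SPSCQueue(capacity=8)
producer = threading.Thread(target=lambda: [q.produce(i) for i in range(20)])
producer.start()
print([q.consume() for _ in range(20)])  # [0, 1, ..., 19]
producer.join()
```

With communication every 5 to 20 instructions, even a few extra instructions per produce or consume dominate intra-thread delay, which is why the paper's light-weight memory-subsystem enhancement pays off.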


IEEE Computer Architecture Letters | 2006

From sequential programs to concurrent threads

Guilherme Ottoni; Ram Rangan; Adam Stoler; Matthew J. Bridges; David I. August

Chip multiprocessors are of increasing importance due to difficulties in achieving higher clock frequencies in uniprocessors, but their success depends on finding useful work for the processor cores. This paper addresses this challenge by presenting a simple compiler approach that extracts non-speculative thread-level parallelism from sequential codes. We present initial results from this technique targeting a validated dual-core processor model, achieving speedups ranging from 9% to 48%, with an average of 25%, for important benchmark loops over their single-threaded versions. We also identify important next steps found during our pursuit of higher degrees of automatic threading.


Compiler Construction | 2001

Optimal Live Range Merge for Address Register Allocation in Embedded Programs

Guilherme Ottoni; Sandro Rigo; Guido Araujo; Subramanian Rajagopalan; Sharad Malik

The increasing demand for wireless devices running mobile applications has renewed interest in research on high-performance, low-power processors that can be programmed using very compact code. One way to achieve this goal is to design specialized processors with short instruction formats and shallow pipelines. Because it enables such architectural features, indirect addressing is the most widely used addressing mode in embedded programs. This paper analyzes the problem of allocating address registers to array references in loops using the auto-increment addressing mode. It builds on previous work based on a heuristic that merges address register live ranges. We prove, for the first time, that the merge operation is NP-hard in general, and show the existence of an optimal linear-time algorithm, based on dynamic programming, for a special case of the problem.
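The feasibility test at the heart of live-range merging can be illustrated with a hypothetical check: references may share one address register only if, in execution order, each step between consecutive offsets fits the machine's auto-increment range, so no separate address arithmetic is needed. The offset streams and the one-word increment range below are invented.

```python
AUTO_INC_RANGE = (-1, 1)  # e.g. post-increment/decrement by one word (assumed)

def can_share_register(offsets):
    """offsets: the merged references' offsets, in execution order."""
    lo, hi = AUTO_INC_RANGE
    return all(lo <= b - a <= hi for a, b in zip(offsets, offsets[1:]))

# a[i] then a[i+1] in each iteration: every step is +1, so mergeable.
print(can_share_register([0, 1, 2, 3]))   # True
# a[i] then a[i+4]: the +4 and -3 steps need explicit address arithmetic.
print(can_share_register([0, 4, 1, 5]))   # False
```

Choosing which live ranges to merge so that the leftover address arithmetic is minimized is the operation the paper proves NP-hard in general, with a linear-time dynamic-programming solution for its special case.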

Collaboration


Dive into Guilherme Ottoni's collaborations.

Top Co-Authors

Guido Araujo

State University of Campinas
