Easwaran Raman | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Easwaran Raman is active.

Explore More

Publication

Featured researches published by Easwaran Raman.

symposium on code generation and optimization | 2008

Parallel-stage decoupled software pipelining

Easwaran Raman; Guilherme Ottoni; Arun Raman; Matthew J. Bridges; David I. August

In recent years, the microprocessor industry has embraced chip multiprocessors (CMPs), also known as multi-core architectures, as the dominant design paradigm. For existing and new applications to make effective use of CMPs, it is desirable that compilers automatically extract thread-level parallelism from single-threaded applications. DOALL is a popular automatic technique for loop-level parallelization employed successfully in the domains of scientific and numeric computing. While DOALL generally scales well with the number of iterations of the loop, its applicability is limited by the presence of loop-carried dependences. A parallelization technique with greater applicability is decoupled software pipelining (DSWP), which parallelizes loops even in the presence of loop-carried dependences. However, the scalability of DSWP is limited by the size of the loop body and the number of recurrences it contains, which are usually smaller than the loop iteration count. This work proposes a novel non-speculative compiler parallelization technique called parallel-stage decoupled software pipelining (PS-DSWP). The goal of PS-DSWP is to combine the applicability of DSWP with the scalability of DOALL parallelization. A key insight of PS-DSWP is that, after isolating the recurrences in their own stages in DSWP, portions of the loop suitable for DOALL parallelization may be exposed. PS-DSWP extends DSWP to benefit from these opportunities, utilizing multiple threads to execute the same stage of a DSWPed loop in parallel. This paper describes the PS-DSWP transformation in detail and discusses its implementation in a research compiler. PS-DSWP produces an average speedup of 114% (up to a maximum of 155%) with 6 threads on loops from a set of 5 applications. Our experiments also demonstrate that PS-DSWP achieves better scalability with the number of threads than DSWP.

international conference on parallel architectures and compilation techniques | 2007

Speculative Decoupled Software Pipelining

Neil Vachharajani; Ram Rangan; Easwaran Raman; Matthew J. Bridges; Guilherme Ottoni; David I. August

In recent years, microprocessor manufacturers have shifted their focus from single-core to multi-core processors. To avoid burdening programmers with the responsibility of parallelizing their applications, some researchers have advocated automatic thread extraction. A recently proposed technique, Decoupled software pipelining (DSWP), has demonstrated promise by partitioning loops into long-running, fine-grained threads organized into a pipeline. Using a pipeline organization and execution decoupled by inter-core communication queues, DSWP offers increased execution efficiency that is largely independent of inter-core communication latency. This paper proposes adding speculation to DSWP and evaluates an automatic approach for its implementation. By speculating past infrequent dependences, the benefit of DSWP is increased by making it applicable to more loops, facilitating better balanced threads, and enabling parallelized loops to be run on more cores. Unlike prior speculative threading proposals, speculative DSWP focuses on breaking dependence recurrences. By speculatively breaking these recurrences, instructions that were formerly restricted to a single thread to ensure decoupling are now free to span multiple threads. Using an initial automatic compiler implementation and a validated processor model, this paper demonstrates significant gains using speculation for 4-core chip multiprocessor models running a variety of codes.

symposium on code generation and optimization | 2008

Spice: speculative parallel iteration chunk execution

Easwaran Raman; Neil Va hharajani; Ram Rangan; David I. August

The recent trend in the processor industry of packing multiple processor cores in a chip has increased the importance of automatic techniques for extracting thread level parallelism. A promising approach for extracting thread level parallelism in general purpose applications is to apply memory alias or value speculation to break dependences amongst threads and executes them concurrently. In this work, we present a speculative parallelization technique called Speculative Parallel Iteration Chunk execution (Spice) which relies on a novel software-only value prediction mechanism. Our value prediction technique predicts the loop live-ins of only a few iterations of a given loop, enabling speculative threads to start from those iterations. It also increases the probability of successful speculation by only predicting that the values will be used as live-ins in some future iterations of the loop. These twin properties enable our value prediction scheme to have high prediction accuracies while exposing significant coarse-grained thread-level parallelism. Spice has been implemented as an automatic transformation in a research compiler. The technique results in up to 157% speedup (101% on average) with 4 threads.

symposium on code generation and optimization | 2005

Practical and Accurate Low-Level Pointer Analysis

Bolei Guo; Matthew J. Bridges; Spyridon Triantafyllis; Guilherme Ottoni; Easwaran Raman; David I. August

Pointer analysis is traditionally performed once, early in the compilation process, upon an intermediate representation (IR) with source-code semantics. However, performing pointer analysis only once at this level imposes a phase-ordering constraint, causing alias information to become stale after subsequent code transformations. Moreover, high-level pointer analysis cannot be used at link time or run time, where the source code is unavailable. This paper advocates performing pointer analysis on a low-level intermediate representation. We present the first context-sensitive and partially flow-sensitive points-to analysis designed to operate at the assembly level. As we will demonstrate, low-level pointer analysis can be as accurate as high-level analysis. Additionally, our low-level pointer analysis also enables a quantitative comparison of propagating high-level pointer analysis results through subsequent code transformations, versus recomputing them at the low level. We show that, for C programs, the former practice is considerably less accurate than the latter.

programming language design and implementation | 2006

A framework for unrestricted whole-program optimization

Spyridon Triantafyllis; Matthew J. Bridges; Easwaran Raman; Guilherme Ottoni; David I. August

Procedures have long been the basic units of compilation in conventional optimization frameworks. However, procedures are typically formed to serve software engineering rather than optimization goals, arbitrarily constraining code transformations. Techniques, such as aggressive inlining and interprocedural optimization, have been developed to alleviate this problem, but, due to code growth and compile time issues, these can be applied only sparingly.This paper introduces the Procedure Boundary Elimination (PBE) compilation framework, which allows unrestricted whole-program optimization. PBE allows all intra-procedural optimizations and analyses to operate on arbitrary subgraphs of the program, regardless of the original procedure boundaries and without resorting to inlining. In order to control compilation time, PBE also introduces novel extensions of region formation and encapsulation. PBE enables targeted code specialization, which recovers the specialization benefits of inlining while keeping code growth in check. This paper shows that PBE attains better performance than inlining with half the code growth.

Proceedings of the 2005 workshop on Memory system performance | 2005

Recursive data structure profiling

Easwaran Raman; David I. August

As the processor-memory performance gap increases, so does the need for aggressive data structure optimizations to reduce memory access latencies. Such optimizations require a better understanding of the memory behavior of programs. We propose a profiling technique called Recursive Data Structure Profiling to help better understand the memory access behavior of programs that use recursive data structures (RDS) such as lists, trees, etc. An RDS profile captures the runtime behavior of the individual instances of recursive data structures. RDS profiling differs from other memory profiling techniques in its ability to aggregate information pertaining to an entire data structure instance, rather than merely capturing the behavior of individual loads and stores, thereby giving a more global view of a programs memory accesses.This paper describes a method for collecting RDS profile without requiring any high-level program representation or type information. RDS profiling achieves this with manageable space and time overhead on a mixture of pointer intensive benchmarks from the SPEC, Olden and other benchmark suites. To illustrate the potential of the RDS profile in providing a better understanding of memory accesses, we introduce a metric to quantify the notion of stability of an RDS instance. A stable RDS instance is one that undergoes very few changes to its structure between its initial creation and final destruction, making it an attractive candidate to certain data structure optimizations.

symposium on code generation and optimization | 2007

Structure Layout Optimization for Multithreaded Programs

Easwaran Raman; Robert Hundt; Sandya Srivilliputtur Mannarswamy

Structure layout optimizations seek to improve runtime performance by improving data locality and reuse. The structure layout heuristics for single-threaded benchmarks differ from those for multi-threaded applications running on multiprocessor machines, where the effects of false sharing need to be taken into account. In this paper we propose a technique for structure layout transformations for multithreaded applications that optimizes both for improved spatial locality and reduced false sharing, simultaneously. We develop a semi-automatic tool that produces actual structure layouts for multi-threaded programs and outputs the key factors contributing to the layout decisions. We apply this tool on the HP-UX kernel and demonstrate the effects of these transformations for a variety of already highly hand-tuned key structures with different set of properties. We show that naive heuristics can result in massive performance degradations on such a highly tuned application, while our technique generally avoids those pitfalls. The improved structures produced by our tool improve performance by up to 3.2% over a highly tuned baseline

symposium on code generation and optimization | 2004

Exposing memory access regularities using object-relative memory profiling

Qiang Wu; Artem Pyatakov; Alexey Spiridonov; Easwaran Raman; Douglas W. Clark; David I. August

Memory profiling is the process of characterizing a programs memory behavior by observing and recording its response to specific input sets. Relevant aspects of the programs memory behavior may then be used to guide memory optimizations in an aggressively optimizing compiler. In general, memory access behavior has eluded meaningful characterization because of confounding artifacts from memory allocators, linker data layout, and OS memory management. Since these artifacts may change from run to run, memory access patterns may appear different in each run even for the same input set. Worse, regular memory access behavior such as linked list traversals appear to have no structure. We present object-relative translation and decomposition techniques to eliminate these artifacts and to expose previously obscured memory access patterns. To demonstrate the potential of these ideas, we implement two different memory profilers targeted at different sets of applications. These profilers outperform the existing ones in terms of profile size and useful information per byte of data. The first profiler is a lossless profiler, called WHOMP, which uses object-relativity to achieve a 22% better compression than the previously best known scheme. The second profiler, called LEAP, uses lossy compression to get highly compact profiles while providing useful information to the targeted applications. LEAP correctly characterizes the memory alias rates for 56% more instruction pairs than the previously best known scheme with a practical running time.

ieee international conference on high performance computing data and analytics | 2005

Integrating a new cluster assignment and scheduling algorithm into an experimental retargetable code generation framework

K. Vasanta Lakshmi; Deepak Sreedhar; Easwaran Raman; Priti Shankar

This paper presents a new unified algorithm for cluster assignment and region scheduling, and its integration into an experimental retargetable code generation framework. The components of the framework are an instruction selector generator based on a recent technique, the IMPACT front end, a machine description module which uses a modification of the HMDES machine description language to include cluster information, a combined cluster allocator and an acyclic region scheduler, and a register allocator. Experiments have been carried out on the targeting of the tool to the Texas Instruments TMS320c62x architecture. We report preliminary results on a set of TI benchmarks.

Archive | 2007