Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where J. Ramanujam is active.

Publication


Featured research published by J. Ramanujam.


Programming Language Design and Implementation | 2008

A practical automatic polyhedral parallelizer and locality optimizer

Uday Bondhugula; Albert Hartono; J. Ramanujam; P. Sadayappan

We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model -- far beyond what is possible by current production compilers. Unlike previous works, our approach is an end-to-end fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. The framework has been implemented into a tool to automatically generate OpenMP parallel code from C program sections. Experimental results from the tool show very high speedups for local and parallel execution on multi-cores over state-of-the-art compiler frameworks from the research community as well as the best native production compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
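To make the flavor of the transformation concrete, here is a minimal hand-written sketch (illustrative only, not the tool's actual output) of what tiling a loop nest for parallelism and locality looks like in generated OpenMP code; the problem size N and tile size T are arbitrary choices for the example.

```c
#define N 1024
#define T 32   /* tile size: a tunable knob, not the tool's computed value */

/* Tiled matrix multiplication: the (ii, jj) tile loops are independent,
 * so they can be marked parallel; the intra-tile loops keep the working
 * set small enough to stay in cache. */
void matmul_tiled(double A[N][N], double B[N][N], double C[N][N])
{
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int kk = 0; kk < N; kk += T)
                for (int i = ii; i < ii + T; i++)
                    for (int j = jj; j < jj + T; j++)
                        for (int k = kk; k < kk + T; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Each C[i][j] is updated only within the one (ii, jj) tile column that owns it, so parallelizing over the two outer tile loops introduces no races.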


Design Automation Conference | 2001

Dynamic management of scratch-pad memory space

Mahmut T. Kandemir; J. Ramanujam; J. Irwin; Narayanan Vijaykrishnan; Ismail Kadayif; A. Parikh

Optimizations aimed at improving the efficiency of on-chip memories are extremely important. We propose a compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework that uses both loop and data transformations. Experimental results obtained using a generic cost model indicate significant reductions in data transfer activity between SPM and off-chip memory.
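In rough outline, a compiler-managed SPM scheme stages tiles of off-chip data through the on-chip buffer: copy in, compute, copy out. The sketch below is mine, under assumptions: spm_buf stands in for a region of scratch-pad memory (in a real system its placement is fixed by the linker or a runtime allocator), and memcpy stands in for DMA transfers.

```c
#include <string.h>

#define N    4096
#define TILE 256           /* chosen so one tile fits in the SPM */

static double spm_buf[TILE];   /* placeholder for on-chip scratch-pad */

void scale_via_spm(double *a, double s)
{
    for (int t = 0; t < N; t += TILE) {
        /* copy-in: stage one tile of off-chip data into the SPM */
        memcpy(spm_buf, &a[t], TILE * sizeof(double));

        /* compute entirely out of fast on-chip memory */
        for (int i = 0; i < TILE; i++)
            spm_buf[i] *= s;

        /* copy-out: write the tile back to off-chip memory */
        memcpy(&a[t], spm_buf, TILE * sizeof(double));
    }
}
```

The loop and data transformations the framework applies aim precisely at making such staged tiles reusable, so that fewer copy-in/copy-out transfers are needed.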


IEEE Transactions on Parallel and Distributed Systems | 1991

Compile-time techniques for data distribution in distributed memory machines

J. Ramanujam; P. Sadayappan

A solution to the problem of partitioning data for distributed memory machines is discussed. The solution uses a matrix notation to describe array accesses in fully parallel loops, which allows the derivation of sufficient conditions for communication-free partitioning (decomposition) of arrays. A series of examples illustrates the effectiveness of the technique for linear references, the use of loop transformations in deriving the necessary data decompositions, and a formulation that aids in deriving heuristics for minimizing communication when communication-free partitions are not feasible.
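A toy instance of that matrix view of accesses (my example, not one from the paper):

```latex
% Loop body: A[i][j] = B[i][j-1] + B[i][j+1], a fully parallel (i,j) nest.
% Each access is an affine map F x + f of the iteration vector x = (i, j):
\[
A[i][j]:\ F_A = I,\ f_A = \begin{pmatrix}0\\0\end{pmatrix}; \qquad
B[i][j\mp 1]:\ F_B = I,\ f_B = \begin{pmatrix}0\\\mp 1\end{pmatrix}.
\]
% A row partition has normal h = (1,0)^T. Since h^T F_A = h^T F_B = (1,0)
% and both offsets have a zero first component, every element touched by
% iteration (i,j) lies in row i, so the row partition is communication-free.
% A column partition h = (0,1)^T is not: the -1/+1 offsets cross block
% boundaries, which is where the communication-minimizing heuristics apply.
```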


Journal of Parallel and Distributed Computing | 1992

Tiling multidimensional iteration spaces for multicomputers

J. Ramanujam; P. Sadayappan

This paper addresses the problem of compiling perfectly nested loops for multicomputers (distributed-memory machines). The relatively high communication start-up costs in these machines render frequent communication very expensive. Motivated by this concern, we present a method of aggregating a number of loop iterations into tiles, where the tiles execute atomically: a processor executing the iterations belonging to a tile receives all the data it needs before executing any one of the iterations in the tile, executes all the iterations in the tile, and then sends the data needed by other processors. Since synchronization is not allowed during the execution of a tile, partitioning the iteration space into tiles must not result in deadlock. We first show the equivalence between the problem of finding partitions and the problem of determining the cone for a given set of dependence vectors. We then present an approach to partitioning the iteration space into deadlock-free tiles so that communication volume is minimized. In addition, we discuss a method for optimizing the size of tiles for nested loops on multicomputers. This work differs from other approaches to tiling in that we present a method of optimizing the grain size of tiles for multicomputers.
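The deadlock-freedom requirement can be restated compactly (a standard formulation, paraphrasing rather than quoting the paper): every tiling hyperplane must carry all dependences forward.

```latex
\[
\text{Hyperplane normals } h_1,\dots,h_k \text{ define deadlock-free (atomic) tiles iff}
\quad h_i^{\,T} d \,\ge\, 0 \ \text{ for every dependence vector } d.
\]
% Example: with dependences d1 = (1,0) and d2 = (0,1), the normals
% h1 = (1,0) and h2 = (0,1) satisfy the condition, so rectangular tiles
% are valid; h = (1,-1) would cut d2 backward (h^T d2 = -1) and create a
% cyclic dependence between tiles, i.e., deadlock.
```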


Programming Language Design and Implementation | 2007

Effective automatic parallelization of stencil computations

Sriram Krishnamoorthy; Muthu Manikandan Baskaran; Uday Bondhugula; J. Ramanujam; Atanas Rountev; P. Sadayappan

Performance optimization of stencil computations has been widely studied in the literature, since they occur in many computationally intensive scientific and engineering applications. Compiler frameworks have also been developed that can transform sequential stencil codes for optimization of data locality and parallelism. However, loop skewing is typically required in order to tile stencil codes along the time dimension, resulting in load imbalance in pipelined parallel execution of the tiles. In this paper, we develop an approach for automatic parallelization of stencil codes that explicitly addresses the issue of load-balanced execution of tiles. Experimental results are provided that demonstrate the effectiveness of the approach.
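For concreteness, consider a minimal 1-D Jacobi stencil (my example, not the paper's benchmark), shown below. Iteration (t, i) reads (t-1, i-1), (t-1, i), and (t-1, i+1), so rectangular tiling of the (t, i) space is illegal; skewing the space (i' = i + t) makes all dependence components non-negative and time tiling valid, but the resulting wavefront start-up and drain are exactly the load imbalance this paper targets.

```c
#define N     1000
#define STEPS 100

/* Ping-pong between two rows of a[][]: a[t % 2] holds time step t. */
void jacobi_1d(double a[2][N])
{
    for (int t = 1; t <= STEPS; t++)
        for (int i = 1; i < N - 1; i++)
            a[t % 2][i] = (a[(t - 1) % 2][i - 1] +
                           a[(t - 1) % 2][i]     +
                           a[(t - 1) % 2][i + 1]) / 3.0;
}
```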


Compiler Construction | 2008

Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model

Uday Bondhugula; Muthu Manikandan Baskaran; Sriram Krishnamoorthy; J. Ramanujam; Atanas Rountev; P. Sadayappan

The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant body of research has addressed affine scheduling and partitioning, automatically finding good affine transforms for communication-optimized coarse-grained parallelization together with locality optimization for the general case of arbitrarily nested loop sequences remains challenging. We propose an automatic transformation framework to optimize arbitrarily nested loop sequences with affine dependences for parallelism and locality simultaneously. The approach finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation. These tiling hyperplanes are used for communication-minimized coarse-grained parallelization as well as for locality optimization. The approach enables the minimization of inter-tile communication volume in the processor space, and minimization of reuse distances for local execution at each node. Programs requiring one-dimensional versus multi-dimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, permutable loops, or pipelined parallelism at various levels can be detected. Preliminary studies of the framework show promising results.
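In rough outline (notation paraphrased, not quoted from the paper), the formulation bounds, for each dependence e, the distance along a candidate hyperplane between dependent iterations by an affine function of the program parameters, and the ILP minimizes that bound:

```latex
\[
\delta_e \;=\; \phi(\vec{t}\,) - \phi(\vec{s}\,) \;\le\; \vec{u}\cdot\vec{n} + w
\qquad \text{for all dependent iteration pairs } (\vec{s}, \vec{t}\,) \text{ of } e .
\]
% Minimizing (u, w) favors hyperplanes along which dependent iterations
% stay close, i.e., small inter-tile communication and short reuse
% distances; delta_e = 0 for every e corresponds to synchronization-free
% parallelism.
```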


Proceedings of the IEEE | 2005

Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models

Gerald Baumgartner; Alexander A. Auer; David E. Bernholdt; Alina Bibireata; Venkatesh Choppella; Daniel Cociorva; Xiaoyang Gao; Robert J. Harrison; So Hirata; Sriram Krishnamoorthy; Sandhya Krishnan; Chi-Chung Lam; Qingda Lu; Marcel Nooijen; Russell M. Pitzer; J. Ramanujam; P. Sadayappan; Alexander Sibiryakov

This paper provides an overview of a program synthesis system for a class of quantum chemistry computations. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. The input to the system is a high-level specification of the computation, from which the system can synthesize high-performance parallel code tailored to the characteristics of the target architecture. Several components of the synthesis system are described, focusing on performance optimization issues that they address.
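As an illustration of the input domain (a made-up contraction, not one of the system's actual quantum chemistry inputs), a tensor contraction is a generalized matrix product; the synthesis system starts from such a specification and derives loop structure, intermediates, and data distribution, rather than executing the naive direct-loop form below.

```c
#define NI 16
#define NJ 16
#define NK 16
#define NL 16

/* C[i][j] = sum over k, l of A[i][k][l] * B[k][l][j] */
void contract(double A[NI][NK][NL],
              double B[NK][NL][NJ],
              double C[NI][NJ])
{
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++) {
            double sum = 0.0;
            for (int k = 0; k < NK; k++)
                for (int l = 0; l < NL; l++)
                    sum += A[i][k][l] * B[k][l][j];
            C[i][j] = sum;
        }
}
```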


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Muthu Manikandan Baskaran; Uday Bondhugula; Sriram Krishnamoorthy; J. Ramanujam; Atanas Rountev; P. Sadayappan

Several parallel architectures such as GPUs and the Cell processor have fast explicitly managed on-chip memories, in addition to slow off-chip memory. They also have very high computational power with multiple levels of parallelism. A significant challenge in programming these architectures is to effectively exploit the parallelism available in the architecture and manage the fast memories to maximize performance. In this paper we develop an approach to effective automatic data management for on-chip memories, including creation of buffers in on-chip (local) memories for holding portions of data accessed in a computational block, automatic determination of array access functions of local buffer references, and generation of code that moves data between slow off-chip memory and fast local memories. We also address the problem of mapping computation in regular programs to multi-level parallel architectures using a multi-level tiling approach, and study the impact of on-chip memory availability on the selection of tile sizes at various levels. Experimental results on a GPU demonstrate the effectiveness of the proposed approach.
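A hand sketch of the buffer creation and reference rewriting that the approach automates (illustrative code, not the tool's generated output): a tile of a global array is staged into a local buffer, and the global reference A[i][j] is re-indexed by the tile origin to buf[i - ii][j - jj]. On a GPU, buf would live in shared memory and the copies would be cooperative loads by the threads of a block.

```c
#define N  1024
#define TB 32    /* tile extent, an illustrative choice */

void smooth_tile(double A[N][N], double Out[N][N], int ii, int jj)
{
    double buf[TB][TB];   /* stand-in for fast on-chip (local) memory */

    /* copy-in: global -> local, with access function (i,j) -> (i-ii, j-jj) */
    for (int i = ii; i < ii + TB; i++)
        for (int j = jj; j < jj + TB; j++)
            buf[i - ii][j - jj] = A[i][j];

    /* compute entirely out of the local buffer (interior points only) */
    for (int i = ii + 1; i < ii + TB - 1; i++)
        for (int j = jj + 1; j < jj + TB - 1; j++)
            Out[i][j] = 0.25 * (buf[i - ii - 1][j - jj] +
                                buf[i - ii + 1][j - jj] +
                                buf[i - ii][j - jj - 1] +
                                buf[i - ii][j - jj + 1]);
}
```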


IEEE Transactions on Computers | 1999

Improving cache locality by a combination of loop and data transformations

Mahmut T. Kandemir; J. Ramanujam; Alok N. Choudhary

Exploiting locality of reference is key to realizing high levels of performance on modern processors. This paper describes a compiler algorithm for optimizing cache locality in scientific codes on uniprocessor and multiprocessor machines. A distinctive characteristic of our algorithm is that it considers loop and data layout transformations in a unified framework. Our approach is very effective at reducing cache misses and can optimize some nests for which optimization techniques based on loop transformations alone are not successful. An important special case is one in which data layouts of some arrays are fixed and cannot be changed. We show how our algorithm can accommodate this case and demonstrate how it can be used to optimize multiple loop nests. Experiments on several benchmarks show that the techniques presented in this paper result in substantial improvement in cache performance.
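A toy case of that situation (my example, not the paper's): in the copy A[i][j] = B[j][i], either loop order walks one of the two arrays column-wise in row-major C, so no loop transformation alone can make both references stride-1; changing the memory layout of B, here by storing its transpose Bt, can.

```c
#define N 1024

/* After the data-layout change, the reference B[j][i] becomes Bt[i][j];
 * both references now have unit stride in the inner loop. */
void copy_transposed(double A[N][N], double Bt[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = Bt[i][j];
}
```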


International Symposium on Microarchitecture | 1998

Improving locality using loop and data transformations in an integrated framework

Mahmut T. Kandemir; Alok N. Choudhary; J. Ramanujam; Prithviraj Banerjee

This paper presents a new integrated compiler framework for improving the cache performance of scientific applications. In addition to applying loop transformations, the method includes data layout optimizations, i.e., those that change the memory layouts of data structures (arrays in this case). A key characteristic of this approach is that loop transformations are used to improve temporal locality while data layout optimizations are used to improve spatial locality. This optimization framework was used with sixteen loop nests from several benchmarks and math libraries, and the performance was measured using a cache simulator in addition to using a single node of the SGI Origin 2000 distributed-shared-memory machine for measuring actual execution times. The results demonstrate that this approach is very effective in improving locality and outperforms current solutions that use either loop or data transformations alone. We expect that our solution will also enable better register usage due to increased temporal locality in the innermost loop, and that it will help in eliminating false-sharing on multiprocessors due to exploiting spatial locality in the innermost loop.
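A small sketch of that division of labor (example mine, not one of the sixteen nests): fusing a producer loop with its consumer is a loop transformation that yields temporal locality, since each A[i][j] is consumed while still in cache, while keeping j, the fastest-varying subscript, innermost preserves spatial locality for row-major layouts.

```c
#define N 1024

void fused(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 2.0 * B[i][j];   /* producer */
            C[i][j] = A[i][j] + 1.0;   /* consumer reuses A[i][j] in cache */
        }
}
```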

Collaboration


Dive into J. Ramanujam's collaborations.

Top Co-Authors

Mahmut T. Kandemir

Pennsylvania State University