Publication


Featured research published by P. Sadayappan.


Programming Language Design and Implementation | 2008

A practical automatic polyhedral parallelizer and locality optimizer

Uday Bondhugula; Albert Hartono; J. Ramanujam; P. Sadayappan

We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model -- far beyond what is possible by current production compilers. Unlike previous works, our approach is an end-to-end fully automatic one driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. The framework has been implemented into a tool to automatically generate OpenMP parallel code from C program sections. Experimental results from the tool show very high speedups for local and parallel execution on multi-cores over state-of-the-art compiler frameworks from the research community as well as the best native production compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
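To make the transformation concrete, here is a minimal sketch (our illustration, not actual tool output) of the shape of tiled, OpenMP-parallel C code that such a source-to-source framework emits for a simple loop nest; the tile size of 32 and all identifiers are assumptions:

    #include <omp.h>
    #define N 1024
    #define T 32   /* hypothetical tile size */

    /* Tiled matrix multiply: the outer tile loops run in parallel, while
       the kt loop (a reduction dimension) stays sequential inside each
       tile. Each (it, jt) pair owns a disjoint block of C, so the
       parallel loops are race-free. */
    void matmul_tiled(double A[N][N], double B[N][N], double C[N][N]) {
        #pragma omp parallel for collapse(2)
        for (int it = 0; it < N; it += T)
            for (int jt = 0; jt < N; jt += T)
                for (int kt = 0; kt < N; kt += T)
                    for (int i = it; i < it + T; i++)
                        for (int j = jt; j < jt + T; j++)
                            for (int k = kt; k < kt + T; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }

Tiling gives each thread a block that fits in cache (locality) while the block decomposition itself supplies the coarse-grained parallelism; finding such tilings automatically for imperfectly nested loops is the harder problem the framework solves.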


High-Performance Computer Architecture | 2008

Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems

Jiang Lin; Qingda Lu; Xiaoning Ding; Zhao Zhang; Xiaodong Zhang; P. Sadayappan

Cache partitioning and sharing is critical to the effective utilization of multicore processors. However, almost all existing studies have been evaluated by simulation that often has several limitations, such as excessive simulation time, absence of OS activities and proneness to simulation inaccuracy. To address these issues, we have taken an efficient software approach to supporting both static and dynamic cache partitioning in OS through memory address mapping. We have comprehensively evaluated several representative cache partitioning schemes with different optimization objectives, including performance, fairness, and quality of service (QoS). Our software approach makes it possible to run the SPEC CPU2006 benchmark suite to completion. Besides confirming important conclusions from previous work, we are able to gain several insights from whole-program executions, which are infeasible from simulation. For example, giving up some cache space in one program to help another one may improve the performance of both programs for certain workloads due to reduced contention for memory bandwidth. Our evaluation of previously proposed fairness metrics is also significantly different from a simulation-based study. The contributions of this study are threefold. (1) To the best of our knowledge, this is a highly comprehensive execution- and measurement-based study on multicore cache partitioning. This paper not only confirms important conclusions from simulation-based studies, but also provides new insights into dynamic behaviors and interaction effects. (2) Our approach provides a unique and efficient option for evaluating multicore cache partitioning. The implemented software layer can be used as a tool in multicore performance evaluation and hardware design. (3) The proposed schemes can be further refined for OS kernels to improve performance.
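The memory-address-mapping idea is page coloring: physical pages whose cache set-index bits differ map to disjoint regions of a physically indexed cache, so the OS can partition the cache simply by controlling which pages it hands to each program. A minimal sketch, with an assumed cache geometry (4 MiB, 16-way, 4 KiB pages) rather than the paper's platform:

    #include <stdint.h>

    #define PAGE_SHIFT 12              /* 4 KiB pages (assumed) */
    #define CACHE_SIZE (4u << 20)      /* 4 MiB shared cache (assumed) */
    #define ASSOC      16              /* 16-way set-associative (assumed) */

    /* Number of page colors = (cache size / associativity) / page size;
       pages of different colors can never conflict in the cache. */
    #define NUM_COLORS ((CACHE_SIZE / ASSOC) >> PAGE_SHIFT)

    static unsigned page_color(uint64_t phys_addr) {
        /* The color is given by the set-index bits above the page offset. */
        return (unsigned)((phys_addr >> PAGE_SHIFT) % NUM_COLORS);
    }

A static partition then reserves colors 0..k-1 for one program and the rest for another; dynamic partitioning adjusts k at run time by remapping (copying) pages between colors.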


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Scalable work stealing

James Dinan; D. Brian Larkins; P. Sadayappan; Sriram Krishnamoorthy; Jarek Nieplocha

Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing in order to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem that can be addressed with distributed dynamic load balancing systems. Work stealing is a popular approach to distributed dynamic load balancing; however, its performance on large-scale clusters is not well understood. Prior work on work stealing has largely focused on shared-memory machines. In this work, we investigate the design and scalability of work stealing on modern distributed-memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
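As a shared-memory illustration of the core protocol (the paper targets distributed memory using one-sided communication, which this sketch does not show): each worker pops tasks from the tail of its own deque, and an idle worker steals from the head of a victim's deque. All names and the lock-based simplification are ours:

    #include <pthread.h>

    typedef struct {
        int *tasks;            /* task descriptors, simplified to ints */
        int head, tail;        /* valid tasks occupy [head, tail) */
        pthread_mutex_t lock;  /* real deques are typically lock-free */
    } deque_t;

    /* Owner takes work from its own tail (LIFO: good locality). */
    int pop_local(deque_t *self, int *task) {
        pthread_mutex_lock(&self->lock);
        int ok = (self->tail > self->head);
        if (ok) *task = self->tasks[--self->tail];
        pthread_mutex_unlock(&self->lock);
        return ok;
    }

    /* Thief takes work from the victim's head (FIFO: steals old, large work). */
    int steal(deque_t *victim, int *task) {
        pthread_mutex_lock(&victim->lock);
        int ok = (victim->head < victim->tail);
        if (ok) *task = victim->tasks[victim->head++];
        pthread_mutex_unlock(&victim->lock);
        return ok;
    }

The scalability question the paper studies is what happens to this protocol when "victim" lives on another node and the steal must be done with remote memory operations rather than a local lock.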


IEEE Transactions on Parallel and Distributed Systems | 1991

Compile-time techniques for data distribution in distributed memory machines

J. Ramanujam; P. Sadayappan

A solution to the problem of partitioning data for distributed memory machines is discussed. The solution uses a matrix notation to describe array accesses in fully parallel loops, which allows the derivation of sufficient conditions for communication-free partitioning (decomposition) of arrays. The paper presents a series of examples that illustrate the effectiveness of the technique for linear references, the use of loop transformations in deriving the necessary data decompositions, and a formulation that aids in deriving heuristics for minimizing communication when communication-free partitions are not feasible.
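A minimal worked instance of the matrix notation (our illustration, not the paper's exact formalism): for the fully parallel loop nest computing B[i][j] from A[i][j-1], write each array access as $F\vec{x} + \vec{f}$ with $\vec{x} = (i, j)^T$:

    \[
    F_B = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},\;
    \vec{f}_B = \begin{pmatrix} 0 \\ 0 \end{pmatrix},
    \qquad
    F_A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},\;
    \vec{f}_A = \begin{pmatrix} 0 \\ -1 \end{pmatrix}.
    \]

A row partition (hyperplane normal (1, 0)) is communication-free: both accesses leave the row index i unchanged, so iteration (i, j) and every element it touches can live on the same processor. A column partition (normal (0, 1)) is not, since B[i][j] and A[i][j-1] fall in adjacent columns; that is the situation where the communication-minimizing heuristics apply.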


Journal of Parallel and Distributed Computing | 1992

Tiling multidimensional iteration spaces for multicomputers

J. Ramanujam; P. Sadayappan

This paper addresses the problem of compiling perfectly nested loops for multicomputers (distributed-memory machines). The relatively high communication start-up costs in these machines render frequent communication very expensive. Motivated by this concern, we present a method of aggregating a number of loop iterations into tiles, where the tiles execute atomically: a processor executing the iterations belonging to a tile receives all the data it needs before executing any one of the iterations in the tile, executes all the iterations in the tile, and then sends the data needed by other processors. Since synchronization is not allowed during the execution of a tile, partitioning the iteration space into tiles must not result in deadlock. We first show the equivalence between the problem of finding partitions and the problem of determining the cone for a given set of dependence vectors. We then present an approach to partitioning the iteration space into deadlock-free tiles so that communication volume is minimized. In addition, we discuss a method for optimizing the size of tiles for nested loops on multicomputers. This work differs from other approaches to tiling in that we present a method of optimizing the grain size of tiles for multicomputers.
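For constant dependence vectors, the cone condition reduces to a simple test: a tiling with hyperplane normals h_1..h_m is deadlock-free if every dependence vector d satisfies h_k . d >= 0 for all k, so dependences cross each family of tile boundaries in one direction only. A hedged sketch of that test (ours, not the paper's algorithm), for loop nests up to depth 4:

    /* Returns 1 if the tiling defined by the nh hyperplane normals h is
       deadlock-free with respect to the nd dependence vectors d; dim is
       the loop nest depth (<= 4 in this sketch). */
    int tiles_deadlock_free(int nh, int nd, int dim,
                            const int h[][4], const int d[][4]) {
        for (int k = 0; k < nh; k++)
            for (int v = 0; v < nd; v++) {
                int dot = 0;
                for (int x = 0; x < dim; x++)
                    dot += h[k][x] * d[v][x];
                if (dot < 0)
                    return 0;  /* a dependence points backwards across a tile boundary */
            }
        return 1;
    }

For dependences (1,0) and (0,1), the rectangular normals (1,0) and (0,1) pass the test; adding the dependence (1,-1) makes the normal (0,1) fail, which is exactly when the iteration space must be skewed before tiling.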


Programming Language Design and Implementation | 2007

Effective automatic parallelization of stencil computations

Sriram Krishnamoorthy; Muthu Manikandan Baskaran; Uday Bondhugula; J. Ramanujam; Atanas Rountev; P. Sadayappan

Performance optimization of stencil computations has been widely studied in the literature, since they occur in many computationally intensive scientific and engineering applications. Compiler frameworks have also been developed that can transform sequential stencil codes for optimization of data locality and parallelism. However, loop skewing is typically required in order to tile stencil codes along the time dimension, resulting in load imbalance in pipelined parallel execution of the tiles. In this paper, we develop an approach for automatic parallelization of stencil codes that explicitly addresses the issue of load-balanced execution of tiles. Experimental results are provided that demonstrate the effectiveness of the approach.
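The skewing requirement comes from the dependence pattern of time-iterated stencils. In the 1-D Jacobi sketch below (our illustration, not the paper's benchmark), each point at time t reads its neighbors at time t-1, giving dependence vectors (1,-1), (1,0), (1,1); rectangular time tiles would violate (1,-1), so the space loop must be skewed by time before tiling, and the resulting wavefront start-up and drain cause the pipelined load imbalance the paper targets:

    #define N 1000
    #define T 100

    /* 1-D Jacobi: the three reads at time t-1 induce dependence vectors
       (1,-1), (1,0), (1,1) on the (t, i) iteration space. */
    void jacobi(double A[T][N]) {
        for (int t = 1; t < T; t++)
            for (int i = 1; i < N - 1; i++)
                A[t][i] = 0.33 * (A[t-1][i-1] + A[t-1][i] + A[t-1][i+1]);
    }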


Compiler Construction | 2008

Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model

Uday Bondhugula; Muthu Manikandan Baskaran; Sriram Krishnamoorthy; J. Ramanujam; Atanas Rountev; P. Sadayappan

The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant body of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization together with locality optimization for the general case of arbitrarily nested loop sequences remains a challenging problem. We propose an automatic transformation framework to optimize arbitrarily nested loop sequences with affine dependences for parallelism and locality simultaneously. The approach finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation. These tiling hyperplanes are used for communication-minimized coarse-grained parallelization as well as for locality optimization. The approach enables the minimization of inter-tile communication volume in the processor space, and minimization of reuse distances for local execution at each node. Programs requiring one-dimensional versus multi-dimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, permutable loops or pipelined parallelism at various levels can be detected. Preliminary studies of the framework show promising results.
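In outline (our hedged paraphrase of the formulation, not the paper verbatim): for each dependence from source iteration $\vec{s}$ to target iteration $\vec{t}$, a legal tiling hyperplane $\phi$ must not carry the dependence backwards, and the cost function bounds how far apart $\phi$ places the two ends:

    \[
    \phi(\vec{t}) - \phi(\vec{s}) \;\ge\; 0
    \qquad \text{and} \qquad
    \phi(\vec{t}) - \phi(\vec{s}) \;\le\; \vec{u} \cdot \vec{n} + w,
    \]

where $\vec{n}$ holds the program parameters (loop bounds). Minimizing $(\vec{u}, w)$ with an ILP solver yields hyperplanes for which dependences travel a short distance, which is the sense in which the resulting tiles have low inter-tile communication volume and short reuse distances.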


International Conference on Supercomputing | 2012

High-performance code generation for stencil computations on GPU architectures

Justin Holewinski; Louis-Noël Pouchet; P. Sadayappan

Stencil computations arise in many scientific computing domains, and often represent time-critical portions of applications. There is significant interest in offloading these computations to high-performance devices such as GPU accelerators, but these architectures offer challenges for developers and compilers alike. Stencil computations in particular require careful attention to off-chip memory access and the balancing of work among compute units in GPU devices. In this paper, we present a code generation scheme for stencil computations on GPU accelerators, which optimizes the code by trading an increase in the computational workload for a decrease in the required global memory bandwidth. We develop compiler algorithms for automatic generation of efficient, time-tiled stencil code for GPU accelerators from a high-level description of the stencil operation. We show that the code generation scheme can achieve high performance on a range of GPU architectures, including both NVIDIA and AMD devices.
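The compute-for-bandwidth trade is the idea behind overlapped time tiling: each tile loads its points plus a halo once, redundantly recomputes the shrinking halo across several fused time steps, and writes back only the points it owns. A minimal CPU-side C sketch of the idea (ours; the paper generates GPU code with the working set held in on-chip memory, and domain boundaries are omitted here for brevity):

    #include <string.h>

    #define N      1024
    #define TILE   64
    #define TSTEPS 4   /* fused time steps; halo width equals TSTEPS */

    void stencil_overlapped(const double *in, double *out) {
        for (int lo = TSTEPS; lo + TILE <= N - TSTEPS; lo += TILE) {
            double buf[TILE + 2 * TSTEPS], tmp[TILE + 2 * TSTEPS];
            memcpy(buf, in + lo - TSTEPS, sizeof buf);     /* one bulk load */
            for (int t = 0; t < TSTEPS; t++) {
                memcpy(tmp, buf, sizeof tmp);
                /* the valid region shrinks by one point per side per step */
                for (int i = t + 1; i < TILE + 2 * TSTEPS - 1 - t; i++)
                    tmp[i] = 0.33 * (buf[i - 1] + buf[i] + buf[i + 1]);
                memcpy(buf, tmp, sizeof buf);
            }
            memcpy(out + lo, buf + TSTEPS, TILE * sizeof *out); /* owned points only */
        }
    }

Each tile performs extra halo arithmetic but touches main memory only once on the way in and once on the way out per TSTEPS steps, which is the bandwidth saving the scheme exploits on GPUs.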


International Conference on Cluster Computing | 2009

Scalable I/O forwarding framework for high-performance computing systems

Nawab Ali; Philip H. Carns; Kamil Iskra; Dries Kimpe; Samuel Lang; Robert Latham; Robert B. Ross; Lee Ward; P. Sadayappan

Current leadership-class machines suffer from a significant imbalance between their computational power and their I/O bandwidth. While Moore's law ensures that the computational power of high-performance computing systems increases with every generation, the same is not true for their I/O subsystems. The scalability challenges faced by existing parallel file systems with respect to the increasing number of clients, coupled with the minimalistic compute node kernels running on these machines, call for a new I/O paradigm to meet the requirements of data-intensive scientific applications. I/O forwarding is a technique that attempts to bridge the increasing performance and scalability gap between the compute and I/O components of leadership-class machines by shipping I/O calls from compute nodes to dedicated I/O nodes. The I/O nodes perform operations on behalf of the compute nodes and can reduce file system traffic by aggregating, rescheduling, and caching I/O requests. This paper presents an open, scalable I/O forwarding framework for high-performance computing systems. We describe an I/O protocol and API for shipping function calls from compute nodes to I/O nodes, and we present a quantitative analysis of the overhead associated with I/O forwarding.
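The essence of function shipping fits in a few lines (our illustration; the paper's protocol and API are more complete, with aggregation, rescheduling, and caching on the I/O node): the compute node packs an I/O call into a request header plus payload, and the I/O node replays it against the file system on the compute node's behalf:

    #include <stdint.h>
    #include <unistd.h>

    enum { IOFW_WRITE = 1 };   /* one op shown; reads, opens, etc. follow suit */

    typedef struct {
        uint32_t op;           /* which I/O call is being shipped */
        int32_t  fd;           /* file handle, resolved on the I/O node */
        uint64_t offset;
        uint64_t count;        /* 'count' payload bytes follow this header */
    } iofw_request_t;

    /* Runs on the I/O node for each request received from a compute node. */
    int64_t iofw_serve(const iofw_request_t *req, const void *payload) {
        switch (req->op) {
        case IOFW_WRITE:
            /* A real server would queue, aggregate, and reschedule here. */
            return pwrite(req->fd, payload, req->count, (off_t)req->offset);
        default:
            return -1;
        }
    }

Because many compute nodes funnel through one I/O node, the server sees enough concurrent requests to merge and reorder them, which is where the file-system traffic reduction comes from.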


Proceedings of the IEEE | 2005

Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models

Gerald Baumgartner; Alexander A. Auer; David E. Bernholdt; Alina Bibireata; Venkatesh Choppella; Daniel Cociorva; Xiaoyang Gao; Robert J. Harrison; So Hirata; Sriram Krishnamoorthy; Sandhya Krishnan; Chi-Chung Lam; Qingda Lu; Marcel Nooijen; Russell M. Pitzer; J. Ramanujam; P. Sadayappan; Alexander Sibiryakov

This paper provides an overview of a program synthesis system for a class of quantum chemistry computations. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. The input to the system is a high-level specification of the computation, from which the system can synthesize high-performance parallel code tailored to the characteristics of the target architecture. Several components of the synthesis system are described, focusing on performance optimization issues that they address.
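For a flavor of the algebraic optimization involved (our toy example, not the system's specification language): evaluating the contraction S(a,b,i,j) = sum over c,d of A(a,c) B(c,d) T(d,b,i,j) directly costs O(V^6) operations, while factoring through the intermediate I(a,d) = sum over c of A(a,c) B(c,d) brings it to O(V^3 + V^5); choosing such factorizations is one of the decisions a synthesis system of this kind makes before generating parallel code:

    #include <string.h>

    #define V 16   /* toy index range; real models use large, unequal ranges */

    void contract(const double A[V][V], const double B[V][V],
                  const double T[V][V][V][V], double S[V][V][V][V]) {
        double I[V][V];                       /* intermediate I(a,d) */
        memset(I, 0, sizeof I);
        for (int a = 0; a < V; a++)           /* I(a,d) = sum_c A(a,c) B(c,d) */
            for (int d = 0; d < V; d++)
                for (int c = 0; c < V; c++)
                    I[a][d] += A[a][c] * B[c][d];
        for (int a = 0; a < V; a++)           /* S(a,b,i,j) = sum_d I(a,d) T(d,b,i,j) */
            for (int b = 0; b < V; b++)
                for (int i = 0; i < V; i++)
                    for (int j = 0; j < V; j++) {
                        double s = 0.0;
                        for (int d = 0; d < V; d++)
                            s += I[a][d] * T[d][b][i][j];
                        S[a][b][i][j] = s;
                    }
    }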

Collaboration


Dive into P. Sadayappan's collaboration.

Top Co-Authors

J. Ramanujam

Louisiana State University

David E. Bernholdt

Oak Ridge National Laboratory
