Publication


Featured research published by Muthu Manikandan Baskaran.


Programming Language Design and Implementation | 2007

Effective automatic parallelization of stencil computations

Sriram Krishnamoorthy; Muthu Manikandan Baskaran; Uday Bondhugula; J. Ramanujam; Atanas Rountev; P. Sadayappan

Performance optimization of stencil computations has been widely studied in the literature, since they occur in many computationally intensive scientific and engineering applications. Compiler frameworks have also been developed that can transform sequential stencil codes to optimize data locality and parallelism. However, loop skewing is typically required in order to tile stencil codes along the time dimension, resulting in load imbalance in pipelined parallel execution of the tiles. In this paper, we develop an approach for automatic parallelization of stencil codes that explicitly addresses the issue of load-balanced execution of tiles. Experimental results are provided that demonstrate the effectiveness of the approach.
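
To make the transformation involved concrete, here is a minimal sketch of time skewing and tiling for a one-dimensional Jacobi-style stencil. It is a textbook illustration with assumed tile sizes, not the load-balanced tiling scheme the paper develops:

```c
#define T 100   /* time steps */
#define N 1000  /* grid points */

double a[2][N];

/* Original form: each time step reads the previous step's values,
 * so rectangular tiling of the time dimension is illegal as written. */
void stencil(void) {
    for (int t = 0; t < T; t++)
        for (int i = 1; i < N - 1; i++)
            a[(t + 1) % 2][i] =
                (a[t % 2][i - 1] + a[t % 2][i] + a[t % 2][i + 1]) / 3.0;
}

/* Skewed and tiled form: the space dimension is skewed by t
 * (c = s + t), which makes rectangular tiles legal but produces the
 * pipelined wavefront whose load imbalance the paper targets. */
void stencil_skewed_tiled(void) {
    const int TT = 10, TS = 64;                  /* illustrative tile sizes */
    for (int tt = 0; tt < T; tt += TT)
        for (int ss = 1; ss < N - 1 + T; ss += TS)   /* skewed space */
            for (int t = tt; t < tt + TT && t < T; t++)
                for (int c = ss; c < ss + TS; c++) {
                    int s = c - t;               /* un-skew the space index */
                    if (s >= 1 && s < N - 1)
                        a[(t + 1) % 2][s] =
                            (a[t % 2][s - 1] + a[t % 2][s] + a[t % 2][s + 1]) / 3.0;
                }
}
```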


Compiler Construction | 2008

Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model

Uday Bondhugula; Muthu Manikandan Baskaran; Sriram Krishnamoorthy; J. Ramanujam; Atanas Rountev; P. Sadayappan

The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant body of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization together with locality optimization for the general case of arbitrarily-nested loop sequences remains a challenging problem. We propose an automatic transformation framework to optimize arbitrarily-nested loop sequences with affine dependences for parallelism and locality simultaneously. The approach finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation. These tiling hyperplanes are used for communication-minimized coarse-grained parallelization as well as for locality optimization. The approach enables the minimization of inter-tile communication volume in the processor space, and minimization of reuse distances for local execution at each node. Programs requiring one-dimensional versus multi-dimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, permutable loops or pipelined parallelism at various levels can be detected. Preliminary studies of the framework show promising results.
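
As a concrete textbook instance of tiling hyperplanes (not the paper's ILP machinery): for matrix multiplication, the canonical hyperplanes i, j, and k are all valid tiling directions, and tiling all three yields the familiar blocked form, with synchronization-free parallelism on the outer tile loops and reuse distances shrunk to the tile footprint:

```c
#define N 1024
double A[N][N], B[N][N], C[N][N];

/* Original: the only dependence is the reduction on C[i][j] along k,
 * so all three loop hyperplanes can be tiled. */
void matmul(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Tiled along all three hyperplanes: the it/jt tile loops are
 * parallel, kt carries the reduction. Tile size is illustrative. */
void matmul_tiled(void) {
    const int BS = 32;
    for (int it = 0; it < N; it += BS)
        for (int jt = 0; jt < N; jt += BS)
            for (int kt = 0; kt < N; kt += BS)
                for (int i = it; i < it + BS; i++)
                    for (int j = jt; j < jt + BS; j++)
                        for (int k = kt; k < kt + BS; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```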


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Muthu Manikandan Baskaran; Uday Bondhugula; Sriram Krishnamoorthy; J. Ramanujam; Atanas Rountev; P. Sadayappan

Several parallel architectures such as GPUs and the Cell processor have fast explicitly managed on-chip memories, in addition to slow off-chip memory. They also have very high computational power with multiple levels of parallelism. A significant challenge in programming these architectures is to effectively exploit the parallelism available in the architecture and manage the fast memories to maximize performance. In this paper we develop an approach to effective automatic data management for on-chip memories, including creation of buffers in on-chip (local) memories for holding portions of data accessed in a computational block, automatic determination of array access functions of local buffer references, and generation of code that moves data between slow off-chip memory and fast local memories. We also address the problem of mapping computation in regular programs to multi-level parallel architectures using a multi-level tiling approach, and study the impact of on-chip memory availability on the selection of tile sizes at various levels. Experimental results on a GPU demonstrate the effectiveness of the proposed approach.
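
A minimal C sketch of the data-movement pattern described, with an ordinary array standing in for fast on-chip memory and all names hypothetical: a tile plus its halo is staged into a local buffer, the buffer is addressed through a rebased (tile-local) access function, and results are written back out:

```c
#define N    4096
#define TILE 64

double global_in[N], global_out[N];

/* Stage one tile (plus a one-element halo) into a local buffer,
 * compute entirely from the buffer, then write results back. The
 * local access function is the global one rebased by the tile origin:
 * global index g maps to local index g - base + 1. */
void process_tile(int base) {
    double local[TILE + 2];   /* stands in for fast on-chip memory */

    /* Data-in: copy tile plus halo from slow (global) memory. */
    for (int li = 0; li < TILE + 2; li++) {
        int g = base + li - 1;
        local[li] = (g >= 0 && g < N) ? global_in[g] : 0.0;
    }

    /* Compute out of the fast buffer only; data-out to global memory. */
    for (int li = 1; li <= TILE; li++)
        global_out[base + li - 1] =
            (local[li - 1] + local[li] + local[li + 1]) / 3.0;
}
```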


General Purpose Processing on Graphics Processing Units | 2010

A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction

Allen K. Leung; Nicolas Vasilache; Benoît Meister; Muthu Manikandan Baskaran; David E. Wohlford; Cédric Bastoul; Richard Lethin

GPGPU programmers face a rapidly changing substrate of programming abstractions, execution models, and hardware implementations. It has been established, through numerous demonstrations for particular combinations of application kernel, programming language, and GPU hardware, that significant improvements in price/performance and energy/performance over general-purpose processors are achievable. But each of these demonstrations is the result of significant dedicated programmer labor, which is likely to be duplicated for each new GPU hardware architecture to achieve performance portability. This paper discusses the implementation, in the R-Stream compiler, of a source-to-source mapping pathway from a high-level, textbook-style algorithm expression in ANSI C to multi-GPGPU accelerated computers. The compiler performs hierarchical decomposition and parallelization of the algorithm across the host, multiple GPGPUs, and within each GPU. The semantic transformations are expressed within the polyhedral model, including optimization of integrated parallelization, locality, and contiguity tradeoffs. Hierarchical tiling is performed. Communication and synchronization operations at multiple levels are generated automatically. The resulting mapping is currently emitted in the CUDA programming language. The GPU backend adds to the range of hardware and accelerator targets for R-Stream and indicates the potential for performance portability of single sources across multiple hardware targets.
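
The shape of such a hierarchical decomposition can be suggested in plain C. This is a hand-written sketch with assumed constants; R-Stream derives such mappings automatically and emits CUDA, so the loop levels below only mimic the host/GPU/thread hierarchy:

```c
#define N    8192
#define NGPU 4      /* assumed number of accelerators */
#define TILE 256    /* assumed per-GPU tile size */

/* Level 1: partition work across GPUs; level 2: tiles within one GPU;
 * level 3: the per-thread loop body. In generated CUDA each level
 * becomes a different execution construct rather than a C loop. */
void saxpy_hierarchical(float alpha, const float *x, float *y) {
    for (int g = 0; g < NGPU; g++)                        /* across GPUs */
        for (int t = g * (N / NGPU); t < (g + 1) * (N / NGPU); t += TILE)
            for (int i = t; i < t + TILE; i++)            /* within a tile */
                y[i] = alpha * x[i] + y[i];
}
```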


International Conference on Supercomputing | 2009

Parametric multi-level tiling of imperfectly nested loops

Albert Hartono; Muthu Manikandan Baskaran; Cédric Bastoul; Albert Cohen; Sriram Krishnamoorthy; Boyana Norris; J. Ramanujam; P. Sadayappan

Tiling is a crucial loop transformation for generating high performance code on modern architectures. Efficient generation of multi-level tiled code is essential for maximizing data reuse in systems with deep memory hierarchies. Tiled loops with parametric tile sizes (not compile-time constants) facilitate runtime feedback and dynamic optimizations used in iterative compilation and automatic tuning. Previous parametric multi-level tiling approaches have been restricted to perfectly nested loops, where all assignment statements are contained inside the innermost loop of a loop nest. Previous solutions to tiling for imperfect loop nests have only handled fixed tile sizes. In this paper, we present an approach to parametric multi-level tiling of imperfectly nested loops. The tiling technique generates loops that iterate over full rectangular tiles, making them amenable to compiler optimizations such as register tiling. Experimental results using a number of computational benchmarks demonstrate the effectiveness of the developed tiling approach.
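
A single-level illustration of the full-tile separation the paper relies on, using a hypothetical kernel: because the tile size ts is a runtime parameter, iterations are split into full rectangular tiles plus a remainder, so the full-tile loop has fixed rectangular bounds:

```c
/* Parametric tiling sketch: ts arrives at run time. Full tiles are
 * separated from the partial remainder so that the full-tile inner
 * loop is rectangular and amenable to further optimization such as
 * register tiling or unrolling. */
void scale(double *a, int n, int ts) {
    int full_end = n - n % ts;

    for (int it = 0; it < full_end; it += ts)
        for (int i = it; i < it + ts; i++)   /* full rectangular tile */
            a[i] *= 2.0;

    for (int i = full_end; i < n; i++)       /* partial (cleanup) tile */
        a[i] *= 2.0;
}
```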


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2009

Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors

Muthu Manikandan Baskaran; Nagavijayalakshmi Vydyanathan; Uday Bondhugula; J. Ramanujam; Atanas Rountev; P. Sadayappan

Recent advances in polyhedral compilation technology have made it feasible to automatically transform affine sequential loop nests for tiled parallel execution on multi-core processors. However, for multi-statement input programs with statements of different dimensionalities, such as Cholesky or LU decomposition, the parallel tiled code generated by existing automatic parallelization approaches may suffer from significant load imbalance, resulting in poor scalability on multi-core systems. In this paper, we develop a completely automatic parallelization approach for transforming input affine sequential codes into efficient parallel codes that can be executed on a multi-core system in a load-balanced manner. In our approach, we employ a compile-time technique that enables dynamic extraction of inter-tile dependences at run-time, and dynamic scheduling of the parallel tiles on the processor cores for improved scalable execution. Our approach obviates the need for programmer intervention and re-writing of existing algorithms for efficient parallel execution on multi-cores. We demonstrate the usefulness of our approach through comparisons using linear algebra computations: LU and Cholesky decomposition.
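
The runtime side of such a scheme can be sketched as a dependence-counting worklist. The data structures here are hypothetical and the loop is sequential for clarity; a real multicore runtime would pop ready tiles on multiple threads and update the counters atomically:

```c
typedef struct {
    int  ndeps;   /* inter-tile dependences not yet satisfied */
    int  nsucc;   /* number of dependent (successor) tiles */
    int *succ;    /* indices of the successor tiles */
} Tile;

/* Execute tiles as their dependences are satisfied: finishing a tile
 * decrements each successor's counter and enqueues any that reach
 * zero. This lets tiles from statements of different dimensionalities
 * interleave instead of waiting on a static schedule. The ready array
 * must have capacity for all tiles. */
void run_tiles(Tile *tiles, int *ready, int nready, void (*execute)(int)) {
    while (nready > 0) {
        int t = ready[--nready];          /* pop a ready tile */
        execute(t);                       /* run the tile's loop body */
        for (int s = 0; s < tiles[t].nsucc; s++) {
            int u = tiles[t].succ[s];
            if (--tiles[u].ndeps == 0)    /* last predecessor finished */
                ready[nready++] = u;
        }
    }
}
```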


Symposium on Code Generation and Optimization | 2010

Parameterized tiling revisited

Muthu Manikandan Baskaran; Albert Hartono; Sanket Tavarageri; Thomas Henretty; J. Ramanujam; P. Sadayappan

Tiling, a key transformation for optimizing programs, has been widely studied in the literature. Parameterized tiled code is important for auto-tuning systems since they often execute a large number of runs with dynamically varied tile sizes. Previous work on tiled code generation has addressed parameterized tiling for the sequential context, and the parallel case with fixed compile-time constants for tile sizes. In this paper, we revisit the problem of generating tiled code using parametric tile sizes. We develop a systematic approach to formulate tiling transformations through manipulation of linear inequalities and develop a novel approach that overcomes the fundamental obstacle previous approaches faced in generating parallel parameterized tiled code. To the best of our knowledge, the approach proposed in this paper is the first compile-time solution to the problem of parallel parameterized code generation for affine imperfectly nested loops. Experimental results demonstrate the effectiveness of the implemented system.
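
One standard way parallelism survives parametric tile sizes is wavefront execution over the tile space. A sketch follows, with an illustrative dependence pattern and OpenMP for the parallel loop, neither taken from the paper:

```c
#define N 2048
static double a[N][N];

/* Wavefront execution over parametrically sized tiles: with a
 * dependence on the left and upper neighbors, all tiles on the same
 * anti-diagonal ti + tj = w are independent, so each wavefront runs in
 * parallel while tile sizes tsi and tsj remain runtime parameters. */
void sweep(int tsi, int tsj) {
    int nti = (N + tsi - 1) / tsi, ntj = (N + tsj - 1) / tsj;
    for (int w = 0; w < nti + ntj - 1; w++) {    /* wavefronts in order */
        #pragma omp parallel for
        for (int ti = 0; ti < nti; ti++) {       /* tiles on one wavefront */
            int tj = w - ti;
            if (tj < 0 || tj >= ntj) continue;
            for (int i = ti * tsi; i < (ti + 1) * tsi && i < N; i++)
                for (int j = tj * tsj; j < (tj + 1) * tsj && j < N; j++)
                    if (i > 0 && j > 0)
                        a[i][j] += a[i - 1][j] + a[i][j - 1];
        }
    }
}
```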


International Parallel and Distributed Processing Symposium | 2010

Optimal loop unrolling for GPGPU programs

Giridhar Sreenivasa Murthy; Mahesh Ravishankar; Muthu Manikandan Baskaran; P. Sadayappan

Graphics Processing Units (GPUs) are massively parallel, many-core processors with tremendous computational power and very high memory bandwidth. With the advent of general purpose programming models such as NVIDIA's CUDA and the new standard OpenCL, general purpose programming using GPUs (GPGPU) has become very popular. However, the GPU architecture and programming model have brought with them many new challenges and opportunities for compiler optimizations. One such classical optimization is loop unrolling. Current GPU compilers perform limited loop unrolling. In this paper, we attempt to understand the impact of loop unrolling on GPGPU programs. We develop a semi-automatic, compile-time approach for identifying optimal unroll factors for suitable loops in GPGPU programs. In addition, we propose techniques for reducing the number of unroll factors evaluated, based on the characteristics of the program being compiled and the device being compiled to. We use these techniques to evaluate the effect of loop unrolling on a range of GPGPU programs and show that we correctly identify the optimal unroll factors. The optimized versions run up to 70% faster than the unoptimized versions.
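
What an unroll factor means in source terms (hand-unrolled C for illustration; the paper's approach selects the factor per loop, per program, and per device):

```c
/* Unroll factor 4 applied to a simple loop. Unrolling exposes
 * instruction-level parallelism and cuts branch overhead, but raises
 * register pressure, which is why the best factor is bounded. */
void axpy_unrolled(float a, const float *x, float *y, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* unrolled body, factor 4 */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)                 /* remainder iterations */
        y[i] += a * x[i];
}
```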


International Parallel and Distributed Processing Symposium | 2010

DynTile: Parametric tiled loop generation for parallel execution on multicore processors

Albert Hartono; Muthu Manikandan Baskaran; J. Ramanujam; P. Sadayappan

Loop tiling is an important compiler transformation used for enhancing data locality and exploiting coarse-grained parallelism. Tiled codes in which tile sizes are runtime parameters — called parametrically-tiled codes — are important for empirical tuning systems like ATLAS. Some recent work has addressed the problem of generating sequential parametric tiled code. In this paper we describe DynTile, a system for transforming untiled sequential input C code containing affine imperfectly nested loops to parametrically tiled code for parallel execution on multicore processors. The effectiveness of the system is demonstrated using a number of benchmarks on an eight-core system.
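
The practical payoff of parametric tiles is that one compiled binary can be timed across many tile sizes. A hypothetical auto-tuning driver in that spirit follows; the kernel and timing harness are illustrative, not DynTile output:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2048
static double a[N][N];

/* Tile sizes are function arguments, not compile-time constants. */
static void kernel_tiled(int tsi, int tsj) {
    for (int it = 0; it < N; it += tsi)
        for (int jt = 0; jt < N; jt += tsj)
            for (int i = it; i < it + tsi && i < N; i++)
                for (int j = jt; j < jt + tsj && j < N; j++)
                    a[i][j] = 0.5 * (a[i][j] + 1.0);
}

/* Re-running this binary with different arguments is the tuning loop
 * that parametric tiling enables (no recompilation per tile size). */
int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s tile_i tile_j\n", argv[0]);
        return 1;
    }
    int tsi = atoi(argv[1]), tsj = atoi(argv[2]);
    if (tsi <= 0 || tsj <= 0) return 1;
    clock_t t0 = clock();
    kernel_tiled(tsi, tsj);
    printf("tile %dx%d: %.3f s\n", tsi, tsj,
           (double)(clock() - t0) / CLOCKS_PER_SEC);
    return 0;
}
```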


2012 IEEE Conference on High Performance Extreme Computing | 2012

Efficient and scalable computations with sparse tensors

Muthu Manikandan Baskaran; Benoît Meister; Nicolas Vasilache; Richard A. Lethin

For applications that deal with large amounts of high dimensional multi-aspect data, it becomes natural to represent such data as tensors or multi-way arrays. Multi-linear algebraic computations such as tensor decompositions are performed for summarization and analysis of such data. Their use in real-world applications can span across domains such as signal processing, data mining, computer vision, and graph analysis. The major challenges with applying tensor decompositions in real-world applications are (1) dealing with large-scale high dimensional data and (2) dealing with sparse data. In this paper, we address these challenges in applying tensor decompositions in real data analytic applications. We describe new sparse tensor storage formats that provide storage benefits and are flexible and efficient for performing tensor computations. Further, we propose an optimization that improves data reuse and reduces redundant or unnecessary computations in tensor decomposition algorithms. Furthermore, we couple our data reuse optimization and the benefits of our sparse tensor storage formats to provide a memory-efficient scalable solution for handling large-scale sparse tensor computations. We demonstrate improved performance and address memory scalability using our techniques on both synthetic small data sets and large-scale sparse real data sets.
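
For orientation, the simplest sparse tensor layout such work starts from is coordinate (COO) format; the paper's storage formats are more elaborate, but the essential point, that computation streams over stored nonzeros rather than the dense index space, already shows here:

```c
#include <stddef.h>

/* Coordinate (COO) storage for a 3-way sparse tensor: each nonzero
 * stores its index tuple plus its value. */
typedef struct {
    int    *i, *j, *k;   /* index arrays, one per mode */
    double *val;         /* nonzero values */
    size_t  nnz;         /* number of stored nonzeros */
} SparseTensor3;

/* Mode-1 tensor-times-vector-times-vector, a building block of
 * decompositions like CP: out[i] += val * u[j] * v[k], touching only
 * the nnz stored entries. out must be sized to the mode-1 dimension
 * and zero-initialized by the caller. */
void ttv_mode1(const SparseTensor3 *t, const double *u, const double *v,
               double *out) {
    for (size_t n = 0; n < t->nnz; n++)
        out[t->i[n]] += t->val[n] * u[t->j[n]] * v[t->k[n]];
}
```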

Collaboration


Dive into Muthu Manikandan Baskaran's collaboration.

Top Co-Authors

J. Ramanujam
Louisiana State University

Janice O. McMahon
University of Southern California

Uday Bondhugula
Indian Institute of Science