
Publication


Featured research published by Mahesh Ravishankar.


International Parallel and Distributed Processing Symposium | 2010

Optimal loop unrolling for GPGPU programs

Giridhar Sreenivasa Murthy; Mahesh Ravishankar; Muthu Manikandan Baskaran; P. Sadayappan

Graphics Processing Units (GPUs) are massively parallel, many-core processors with tremendous computational power and very high memory bandwidth. With the advent of general-purpose programming models such as NVIDIA's CUDA and the new standard OpenCL, general-purpose programming using GPUs (GPGPU) has become very popular. However, the GPU architecture and programming model have brought with them many new challenges and opportunities for compiler optimizations. One such classical optimization is loop unrolling. Current GPU compilers perform limited loop unrolling. In this paper, we attempt to understand the impact of loop unrolling on GPGPU programs. We develop a semi-automatic, compile-time approach for identifying optimal unroll factors for suitable loops in GPGPU programs. In addition, we propose techniques for reducing the number of unroll factors evaluated, based on the characteristics of the program being compiled and the device being compiled to. We use these techniques to evaluate the effect of loop unrolling on a range of GPGPU programs and show that we correctly identify the optimal unroll factors. The optimized versions run up to 70% faster than the unoptimized versions.
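The unroll-factor search described above can be sketched in miniature. The sketch below is a hypothetical Python analogue, not the paper's compiler pass: it prunes candidate unroll factors to divisors of the trip count (one simple pruning heuristic), emulates unrolling by replicating the loop body, and picks the factor with the lowest measured time.

```python
import timeit

def candidate_unroll_factors(trip_count, max_factor=8):
    """Prune the search space: keep only factors that evenly divide the
    trip count, so no remainder loop is needed (a simplifying assumption)."""
    return [f for f in range(1, max_factor + 1) if trip_count % f == 0]

def unrolled_sum(xs, factor):
    """Sum a list with the body replicated `factor` times per iteration,
    mimicking what an unrolling compiler would emit."""
    total = 0
    i = 0
    n = len(xs)
    while i + factor <= n:
        for k in range(factor):          # replicated loop body
            total += xs[i + k]
        i += factor
    return total

def pick_best_factor(xs):
    """Empirically pick the unroll factor with the lowest measured time."""
    best, best_time = 1, float("inf")
    for f in candidate_unroll_factors(len(xs)):
        t = timeit.timeit(lambda: unrolled_sum(xs, f), number=5)
        if t < best_time:
            best, best_time = f, t
    return best
```

A real GPU compiler would weigh register pressure and occupancy rather than wall-clock time alone; the divisor-based pruning here only stands in for the paper's search-space reduction.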


Programming Language Design and Implementation | 2012

Dynamic trace-based analysis of vectorization potential of applications

Justin Holewinski; Ragavendar Ramamurthi; Mahesh Ravishankar; Naznin Fauzia; Louis-Noël Pouchet; Atanas Rountev; P. Sadayappan

Recent hardware trends with GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. A vast majority of existing applications were developed without attention to the effective vectorizability of the code. While developers of production compilers such as GNU gcc, Intel icc, PGI pgcc, and IBM xlc have invested considerable effort and made significant advances in enhancing automatic vectorization capabilities, these compilers still cannot effectively vectorize many existing scientific and engineering codes. It is therefore of considerable interest to analyze existing applications to assess the inherent latent potential for SIMD parallelism, exploitable through further compiler advances and/or via manual code changes. In this paper we develop an approach to infer a program's SIMD parallelization potential by analyzing the dynamic data-dependence graph derived from a sequential execution trace. By considering only the observed run-time data dependences for the trace, and by relaxing the execution order of operations to allow any dependence-preserving reordering, we can detect potential SIMD parallelism that may otherwise be missed by more conservative compile-time analyses. We show that for several benchmarks our tool discovers regions of code within computationally intensive loops that exhibit high potential for SIMD parallelism but are not vectorized by state-of-the-art compilers. We present several case studies of the use of the tool, both in identifying opportunities to enhance the transformation capabilities of vectorizing compilers and in pointing to code regions to modify manually in order to enable auto-vectorization and performance improvement by existing compilers.
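The core of the analysis, building a dynamic data-dependence graph from a trace and measuring how many operations could run in lockstep, can be sketched as follows. The trace format and function name are hypothetical, not the paper's tool: flow dependences are tracked through a last-writer table, and the average number of operations per dependence level serves as a rough upper bound on SIMD parallelism.

```python
def simd_potential(trace):
    """Each trace entry is (op_id, reads, writes) over variable names.
    Build run-time flow dependences via a last-writer table, assign each
    op the depth of its longest dependence chain, and report the average
    number of ops per depth level: an upper bound on the SIMD lanes
    usable under any dependence-preserving reordering."""
    last_writer = {}   # variable -> op that last wrote it
    depth = {}         # op_id -> dependence depth
    for op, reads, writes in trace:
        preds = [last_writer[v] for v in reads if v in last_writer]
        depth[op] = 1 + max((depth[p] for p in preds), default=0)
        for v in writes:
            last_writer[v] = op
    levels = {}        # depth -> number of ops at that depth
    for d in depth.values():
        levels[d] = levels.get(d, 0) + 1
    return len(depth) / len(levels)   # average parallelism per level
```

Four independent operations yield a potential of 4.0, while a serial chain of dependent updates yields 1.0; the real analysis additionally handles anti/output dependences and memory addresses rather than variable names.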


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Code generation for parallel execution of a class of irregular loops on distributed memory systems

Mahesh Ravishankar; John Eisenlohr; Louis-Noël Pouchet; J. Ramanujam; Atanas Rountev; P. Sadayappan

Parallelization and locality optimization of affine loop nests have been successfully addressed for shared-memory machines. However, many large-scale simulation applications must be executed in a distributed-memory environment, and use irregular/sparse computations where the control flow and array-access patterns are data-dependent. In this paper, we propose an approach for effective parallel execution of a class of irregular loop computations in a distributed-memory environment, using a combination of static and runtime analysis. We discuss algorithms that analyze sequential code to generate an inspector and an executor. The inspector captures the data-dependent behavior of the computation in parallel and without requiring complete replication of any of the data structures used in the original computation. The executor performs the computation in parallel. The effectiveness of the framework is demonstrated on several benchmarks and a climate modeling application.
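A minimal sketch of the inspector/executor idea for an irregular loop of the form y[i] = x[col[i]] follows. All names are hypothetical and the parallel phases are simulated sequentially; the point is that the inspector discovers the data-dependent access pattern without replicating x, and the executor then runs each partition's iterations.

```python
def inspector(col, n_parts):
    """Inspector phase: walk the indirection array once to record which
    iterations each partition runs and which elements of x it will need,
    without replicating x. Iterations are block-partitioned here, a
    simplifying assumption."""
    n = len(col)
    iters = {p: [] for p in range(n_parts)}
    needed = {p: set() for p in range(n_parts)}
    for i in range(n):
        p = i * n_parts // n          # block-partition the iteration space
        iters[p].append(i)
        needed[p].add(col[i])         # indirect access x[col[i]]
    return iters, needed

def executor(x, col, iters):
    """Executor phase: each partition performs its iterations of the
    irregular loop y[i] = x[col[i]]; simulated sequentially here."""
    y = [0] * len(col)
    for its in iters.values():
        for i in its:
            y[i] = x[col[i]]
    return y
```

In a real distributed run, `needed` would drive the communication schedule that ships remote x elements to each process before the executor starts.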


Journal of Heat Transfer-Transactions of the ASME | 2010

Finite-Volume Formulation and Solution of the P3 Equations of Radiative Transfer on Unstructured Meshes

Mahesh Ravishankar; Sandip Mazumder; Ankan Kumar

The method of spherical harmonics (or PN) is a popular method for approximate solution of the radiative transfer equation (RTE) in participating media. A rigorous conservative finite-volume (FV) procedure is presented for discretization of the P3 equations of radiative transfer in two-dimensional geometry: a set of four coupled, second-order partial differential equations. The FV procedure presented here is applicable to any arbitrary unstructured mesh topology. The resulting coupled set of discrete algebraic equations is solved implicitly using a coupled solver that involves decomposition of the computational domain into groups of geometrically contiguous cells using the binary spatial partitioning algorithm, followed by fully implicit coupled solution within each cell group using a preconditioned generalized minimum residual solver. The RTE solver is first verified by comparing predicted results with published Monte Carlo (MC) results for two benchmark problems. For completeness, results using the P1 approximation are also presented. As expected, results agree well with MC results for large/intermediate optical thicknesses, and the discrepancy between MC and P3 results increases as the optical thickness is decreased. The P3 approximation is found to be more accurate than the P1 approximation for optically thick cases. Finally, the new RTE solver is coupled to a reacting flow code and demonstrated for a laminar flame calculation using an unstructured mesh. It is found that the solution of the four P3 equations requires 14.5% additional CPU time, while the solution of the single P1 equation requires 9.3% additional CPU time over the case without radiation.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2015

Forma: a DSL for image processing applications to target GPUs and multi-core CPUs

Mahesh Ravishankar; Justin Holewinski; Vinod Grover

As architectures evolve, optimization techniques to obtain good performance evolve as well. Using low-level programming languages like C/C++ typically results in architecture-specific optimization techniques getting entangled with the application specification. In such situations, moving from one target architecture to another usually requires a reimplementation of the entire application. Further, several compiler transformations are rendered ineffective due to implementation choices. Domain-Specific Languages (DSLs) tackle both these issues by allowing developers to specify the computation at a high level, allowing the compiler to handle many tedious and error-prone tasks, while generating efficient code for multiple target architectures at the same time. Here we present Forma, a DSL for image processing applications that targets both CPUs and GPUs. The language provides syntax to express several operations like stencils, sampling, etc., which are commonly used in this domain. These can be chained together to specify complex pipelines in a concise manner. The Forma compiler is in charge of tedious tasks like memory management, data transfers from host to device, handling boundary conditions, etc. The high-level description allows the compiler to generate efficient code through use of compile-time analysis and by taking advantage of hardware resources, like texture memory on GPUs. The ease with which complex pipelines can be specified in Forma is demonstrated through several examples. The efficiency of the generated code is evaluated through comparison with a state-of-the-art DSL that targets the same domain, Halide. Our experimental results show that using Forma allows developers to obtain comparable performance on both CPU and GPU with less programmer effort. We also show how Forma could be easily integrated with widely used productivity tools like Python and OpenCV. Such an integration would allow users of such tools to develop efficient implementations easily.
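Forma's concrete syntax is not reproduced here; as a rough, hypothetical analogue, chaining pipeline stages in a DSL can be modeled in Python as composable functions, with the "compiler" reduced to sequential application.

```python
class Pipeline:
    """A toy analogue of chaining image-processing stages in a DSL.
    Stages are recorded and applied in order; a real DSL compiler would
    instead fuse, tile, and schedule them for the target architecture."""
    def __init__(self):
        self.stages = []

    def stage(self, fn):
        self.stages.append(fn)
        return self                  # allow fluent chaining

    def run(self, image):
        for fn in self.stages:
            image = fn(image)
        return image

# Hypothetical stages on a 1D "image": a brighten step and a 2x downsample.
brighten = lambda img: [x + 1 for x in img]
downsample = lambda img: img[::2]
```

For example, `Pipeline().stage(brighten).stage(downsample).run(img)` applies both stages in order; the value of a DSL like Forma is that the equivalent high-level chain also carries enough structure for the compiler to manage memory, transfers, and boundaries.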


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2015

Distributed memory code generation for mixed irregular/regular computations

Mahesh Ravishankar; Roshan Dathathri; Venmugil Elango; Louis-Noël Pouchet; J. Ramanujam; Atanas Rountev; P. Sadayappan

Many applications feature a mix of irregular and regular computational structures. For example, codes using adaptive mesh refinement (AMR) typically use a collection of regular blocks, where the number of blocks and the relationship between blocks is irregular. The computational structure in such applications generally involves regular (affine) loop computations within some number of innermost loops, while outer loops exhibit irregularity due to data-dependent control flow and indirect array access patterns. Prior approaches to distributed memory parallelization do not handle such computations effectively. They either target loop nests that are completely affine using polyhedral frameworks, or treat all loops as irregular. Consequently, the generated distributed memory code contains artifacts that disrupt the regular nature of previously affine innermost loops of the computation. This hampers subsequent optimizations to improve on-node performance. We propose a code generation framework that can effectively transform such applications for execution on distributed memory systems. Our approach generates distributed memory code which preserves program properties that enable subsequent polyhedral optimizations. Simultaneously, it addresses a major memory bottleneck of prior techniques that limits the scalability of the generated code. The effectiveness of the proposed framework is demonstrated on computations that are mixed regular/irregular, completely regular, and completely irregular.
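The mixed structure the paper targets can be illustrated with a hedged sketch (all names hypothetical, uniform block sizes assumed): an irregular outer loop over AMR-style blocks with indirect neighbor accesses, and regular inner loops over each block's dense data that stay free of irregular control flow, which is what keeps them amenable to affine-loop optimization.

```python
def smooth_blocks(blocks, neighbors):
    """Blend each block's values with the average of its neighbor blocks.
    Outer loop: irregular, data-dependent iteration over blocks and their
    neighbor lists (indirect accesses). Inner loops: regular (affine)
    sweeps over each block's dense data. Assumes all blocks have the
    same length."""
    out = []
    for b, nbrs in zip(blocks, neighbors):          # irregular outer loop
        n = len(b)
        avg_nbr = [sum(blocks[j][i] for j in nbrs) / max(len(nbrs), 1)
                   for i in range(n)]               # indirect block accesses
        new_b = [0.0] * n
        for i in range(n):                          # regular inner loop
            new_b[i] = 0.5 * b[i] + 0.5 * avg_nbr[i]
        out.append(new_b)
    return out
```

The framework's goal, in these terms, is to distribute the outer loop while emitting inner loops that still look affine to a downstream polyhedral optimizer.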


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2016

Effective resource management for enhancing performance of 2D and 3D stencils on GPUs

Prashant Singh Rawat; Changwan Hong; Mahesh Ravishankar; Vinod Grover; Louis-Noël Pouchet; P. Sadayappan

GPUs are an attractive target for data parallel stencil computations prevalent in scientific computing and image processing applications. Many tiling schemes, such as overlapped tiling and split tiling, have been proposed in the past to improve the performance of stencil computations. While effective for 2D stencils, these techniques do not achieve the desired improvements for 3D stencils due to the hardware constraints of the GPU. A major challenge in optimizing stencil computations is to effectively utilize all resources available on the GPU. In this paper we develop a tiling strategy that makes better use of resources like shared memory and the register file available on the hardware. We present a systematic methodology to reason about which strategy should be employed for a given stencil and also discuss implementation choices that have a significant effect on the achieved performance. Applying these techniques to various 2D and 3D stencils gives a performance improvement of 200-400% over existing tools that target such computations.
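Staging a tile plus its halo into fast local memory, the resource the paper manages carefully, can be illustrated with a 1D analogue. This is a hypothetical sketch in which a Python list stands in for GPU shared memory; the 3D case is harder precisely because the halo volume grows faster than shared memory capacity.

```python
def stencil_overlapped_tiles(a, tile=4):
    """Apply a 3-point average stencil using overlapped tiles: each tile
    copies its cells plus a 1-cell halo on either side into `local`
    (standing in for GPU shared memory), then computes only from that
    local copy. Boundary cells are left unchanged."""
    n = len(a)
    out = list(a)
    for start in range(0, n, tile):
        lo = max(start - 1, 0)
        hi = min(start + tile + 1, n)
        local = a[lo:hi]                        # tile + halo staged locally
        for i in range(max(start, 1), min(start + tile, n - 1)):
            j = i - lo                          # index into the local buffer
            out[i] = (local[j - 1] + local[j] + local[j + 1]) / 3.0
    return out
```

Because adjacent tiles both load their shared halo cells, overlapped tiling trades a small amount of redundant loading for the ability to compute each tile entirely out of fast memory.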


ACM Transactions on Architecture and Code Optimization | 2013

Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential

Naznin Fauzia; Venmugil Elango; Mahesh Ravishankar; J. Ramanujam; Fabrice Rastello; Atanas Rountev; Louis-Noël Pouchet; P. Sadayappan

Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth), as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper, while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool to perform data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates on the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate potential improvement in data locality through dependence-preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence-preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific order of execution of the operations of a computation in characterization of its data locality properties. It can serve a valuable role in identifying promising code regions for manual transformation, as well as assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance.
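Reuse distance analysis itself is easy to sketch. The following is an illustrative stack simulation, not the paper's tool: each access's reuse distance is the number of distinct addresses touched since the previous access to the same address, and a fully associative LRU cache of capacity C misses exactly on cold accesses and on reuses with distance at least C.

```python
def reuse_distances(trace):
    """For each access, return the number of distinct addresses touched
    since the previous access to the same address (None for a cold first
    access). O(n*d) stack simulation, enough for a sketch; production
    tools use tree-based structures."""
    stack = []     # LRU stack: most recently used address at the end
    dists = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            dists.append(len(stack) - pos - 1)   # distinct addrs in between
            stack.pop(pos)
        else:
            dists.append(None)                   # cold miss
        stack.append(addr)
    return dists

def lru_misses(trace, capacity):
    """A fully associative LRU cache of the given capacity misses exactly
    on cold accesses and on reuses with distance >= capacity."""
    return sum(1 for d in reuse_distances(trace)
               if d is None or d >= capacity)
```

The paper's contribution starts where this sketch stops: the sketch characterizes only the given execution order, whereas the CDAG-based analysis reorders operations (within dependences) to expose the locality a better schedule could achieve.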


International Conference on Parallel Architectures and Compilation Techniques | 2016

Resource Conscious Reuse-Driven Tiling for GPUs

Prashant Singh Rawat; Changwan Hong; Mahesh Ravishankar; Vinod Grover; Louis-Noël Pouchet; Atanas Rountev; P. Sadayappan

Computations involving successive application of 3D stencil operators are widely used in many application domains, such as image processing, computational electromagnetics, seismic processing, and climate modeling. Enhancement of temporal and spatial locality via tiling is generally required in order to overcome performance bottlenecks due to limited bandwidth to global memory on GPUs. However, the low shared memory capacity on current GPU architectures makes effective tiling for 3D stencils very challenging: several previous domain-specific compilers for stencils have demonstrated very high performance for 2D stencils, but much lower performance on 3D stencils. In this paper, we develop an effective resource-constraint-driven approach for automated GPU code generation for stencils. We present a fusion technique that judiciously fuses stencil computations to minimize data movement, while controlling computational redundancy and maximizing resource usage. The fusion model subsumes time tiling of iterated stencils, and can be easily adapted to different GPU architectures. We integrate the fusion model into a code generator that makes effective use of scarce shared memory and registers to achieve high performance. The effectiveness of the automated model-driven code generator is demonstrated through experimental results on a number of benchmarks, comparing against various previously developed GPU code generators.


Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming | 2015

Fusing convolution kernels through tiling

Mahesh Ravishankar; Paulius Micikevicius; Vinod Grover

Image processing pipelines are continuously being developed to deduce more information about objects captured in images. To facilitate the development of such pipelines, several Domain-Specific Languages (DSLs) have been proposed that provide constructs for easy specification of such computations. It is then up to the DSL compiler to generate code to execute the pipeline efficiently on multiple hardware architectures. While such compilers are getting ever more sophisticated, to achieve large-scale adoption these DSLs have to beat, or at least match, the performance that can be achieved by a skilled programmer. Many of these pipelines use a sequence of convolution kernels that are memory bandwidth bound. One way to address this bottleneck is through the use of tiling. In this paper we describe an approach to tiling within the context of a DSL called Forma. Using the high-level specification of the pipeline in this DSL, we describe a code generation algorithm that fuses multiple stages of the pipeline through the use of tiling to reduce the memory bandwidth requirements on both GPU and CPU. Using this technique improves the performance of pipelines like Canny Edge Detection by 58% on NVIDIA GPUs, and of the Harris Corner Detection pipeline by 71% on CPUs.
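The fusion idea can be sketched with a 1D analogue (hypothetical names, not Forma's generated code): two 3-point convolutions are fused so that each output tile recomputes just the slice of the intermediate it needs, never materializing the full intermediate, with halo recomputation as the usual price of overlapped tiling.

```python
def conv3(a, w):
    """3-point convolution with zero padding; w = (left, center, right)."""
    n = len(a)
    get = lambda i: a[i] if 0 <= i < n else 0.0
    return [w[0] * get(i - 1) + w[1] * get(i) + w[2] * get(i + 1)
            for i in range(n)]

def fused_conv3_tiled(a, w1, w2, tile=4):
    """Fuse two 3-point convolutions: per output tile, compute only the
    slice of the intermediate the tile needs (tile + one halo cell per
    side), so the full intermediate is never written to 'global' memory."""
    n = len(a)

    def mid(i):                  # intermediate value at index i, on demand
        if not (0 <= i < n):
            return 0.0
        get = lambda j: a[j] if 0 <= j < n else 0.0
        return w1[0] * get(i - 1) + w1[1] * get(i) + w1[2] * get(i + 1)

    out = [0.0] * n
    for start in range(0, n, tile):
        # Local buffer covering the tile plus its halo.
        halo = [mid(i) for i in range(start - 1, start + tile + 1)]
        for i in range(start, min(start + tile, n)):
            j = i - (start - 1)                  # index into the halo buffer
            out[i] = w2[0] * halo[j - 1] + w2[1] * halo[j] + w2[2] * halo[j + 1]
    return out
```

The fused version produces bitwise-identical results to applying the two convolutions back to back, while touching the intermediate only one small buffer at a time.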

Collaboration


An overview of Mahesh Ravishankar's collaborations.

Top Co-Authors

J. Ramanujam

Louisiana State University
