Aravind Sukumaran-Rajam
Ohio State University
Publications
Featured research published by Aravind Sukumaran-Rajam.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2017
Israt Nisa; Aravind Sukumaran-Rajam; Rakshith Kunchum; P. Sadayappan
Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization for recommender systems, including Cyclic Coordinate Descent (CCD). Recently, a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for GPUs. The key considerations are reducing the volume of data transferred to and from GPU global memory and minimizing intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ using loop fusion and tiling, guided by performance insights from hardware counter data. The resulting algorithm is shown to be faster than both the best reported multicore implementation of CCD++ and the best reported GPU implementation of matrix factorization (using Alternating Least Squares, ALS).
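For intuition, here is a minimal CUDA sketch of the CCD++ rank-one column update with one thread per matrix row. The kernel name, the precomputed-residual convention, and the CSR layout are assumptions for illustration, not the paper's tuned implementation, which fuses and tiles these loops and specifically combats the intra-warp load imbalance this naive mapping suffers when row lengths vary.

// Sketch: update column t of W with H fixed (rank-one CCD++ subproblem).
// residual[k] is assumed to hold A_ij minus all factor contributions
// except the t-th, so each row update has a closed-form solution.
__global__ void ccdpp_update_w(const int *row_ptr, const int *col_idx,
                               const float *residual, const float *h_t,
                               float *w_t, int nrows, float lambda)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (i >= nrows) return;
    float num = 0.f, den = lambda;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
        float h = h_t[col_idx[k]];
        num += residual[k] * h;
        den += h * h;
    }
    w_t[i] = num / den;  // least-squares solution for w_t[i]
}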
International Conference on Supercomputing | 2017
Rakshith Kunchum; Ankur Chaudhry; Aravind Sukumaran-Rajam; Qingpeng Niu; Israt Nisa; P. Sadayappan
Sparse matrix-matrix multiplication (SpGEMM) is an important primitive for many data analytics algorithms, such as Markov clustering. Unlike the dense case, where matrix-matrix multiplication achieves considerably higher performance than matrix-vector multiplication, the opposite is true for sparse matrices on GPUs. A significant challenge is that the sparsity structure of the output matrix is not known a priori, and many additive contributions must be combined to generate its non-zero elements. We use synthetic matrices to characterize the effectiveness of alternative approaches and devise a hybrid approach that is demonstrated to be consistently superior to other available GPU SpGEMM implementations.
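The core difficulty can be seen in Gustavson's row-wise formulation, sketched below as host-side C++ (the container and names are illustrative, not the paper's code): every output row is produced by merging additive contributions whose number and positions are unknown until the merge happens, and GPU SpGEMM variants differ chiefly in how they parallelize this per-row merge.

#include <map>
#include <vector>

struct Csr {                      // minimal CSR container, for illustration
    std::vector<int>   rp, ci;    // row pointers and column indices
    std::vector<float> val;
    int nrows;
};

Csr spgemm_rowwise(const Csr &A, const Csr &B) {
    Csr C; C.nrows = A.nrows; C.rp.push_back(0);
    for (int i = 0; i < A.nrows; ++i) {
        std::map<int, float> acc;            // merges contributions to row i
        for (int ka = A.rp[i]; ka < A.rp[i + 1]; ++ka) {
            int k = A.ci[ka];
            float a = A.val[ka];
            for (int kb = B.rp[k]; kb < B.rp[k + 1]; ++kb)
                acc[B.ci[kb]] += a * B.val[kb];   // additive contribution
        }
        for (const auto &e : acc) {          // row i's structure known only now
            C.ci.push_back(e.first);
            C.val.push_back(e.second);
        }
        C.rp.push_back((int)C.ci.size());
    }
    return C;
}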
High Performance Distributed Computing | 2018
Changwan Hong; Aravind Sukumaran-Rajam; Bortik Bandyopadhyay; Jinsung Kim; Süreyya Emre Kurt; Israt Nisa; Shivani Sabhlok; Srinivasan Parthasarathy; P. Sadayappan
Sparse Matrix-Vector (SpMV) and Sparse Matrix-Multivector (SpMM) products are key kernels for computational science and data science. While GPUs offer significantly higher peak performance and memory bandwidth than multicore CPUs, achieving high performance on sparse computations on GPUs is very challenging. Although a tremendous amount of recent research has focused on various GPU implementations of the SpMV kernel, the multi-vector SpMM kernel has received much less attention. In this paper, we present an in-depth analysis contrasting SpMV and SpMM, and develop a new sparse-matrix representation and computation approach suited to achieving high data-movement efficiency and effective GPU parallelization of SpMM. Experimental evaluation using the entire SuiteSparse matrix collection demonstrates significant performance improvement over existing SpMM implementations from vendor libraries.
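As a reference point, the naive CUDA mapping of CSR SpMM assigns one thread per output element, as sketched below (the kernel name and the row-major layout of the dense matrices are assumptions). Every thread walks the sparse row independently, so the row's indices and values are re-read for each of the k dense columns; recovering that reuse is precisely what a better representation and thread mapping buy.

// Naive CSR SpMM: Y = A * X, where X and Y are row-major with k columns.
// Launch with gridDim.y == k so blockIdx.y selects the dense column.
__global__ void spmm_csr_naive(const int *row_ptr, const int *col_idx,
                               const float *val, const float *X, float *Y,
                               int nrows, int k)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // sparse row of A
    int col = blockIdx.y;                             // dense column, 0..k-1
    if (row >= nrows) return;
    float sum = 0.f;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        sum += val[j] * X[col_idx[j] * k + col];
    Y[row * k + col] = sum;
}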
International Conference on Parallel Architectures and Compilation Techniques | 2017
Changwan Hong; Aravind Sukumaran-Rajam; Jinsung Kim; P. Sadayappan
High-level GPU graph-processing frameworks are an attractive alternative for achieving both high productivity and high performance, and several such frameworks have been developed. In this paper, we develop an approach to graph processing on GPUs that seeks to overcome some of the performance limitations of existing frameworks. It uses multiple data representations and execution strategies for dense versus sparse vertex frontiers, chosen according to the fraction of active graph vertices. A two-phase edge-processing approach trades extra data movement for improved load balancing across GPU threads by using a 2D blocked representation for edge data. Experimental results demonstrate performance improvement over current state-of-the-art GPU graph-processing frameworks for many benchmark programs and data sets.
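A hypothetical sketch of the frontier dispatch idea is shown below; every name here is invented for illustration rather than taken from the paper. The point is that the frontier representation and the kernel that consumes it are chosen per iteration from the fraction of active vertices.

#include <cstdio>

struct Graph    { int num_vertices; /* CSR arrays omitted */ };
struct Frontier { int num_active;   /* bitmap or worklist storage omitted */ };

void process_level(const Graph &g, const Frontier &f) {
    float active_frac = (float)f.num_active / g.num_vertices;
    if (active_frac > 0.05f) {
        // mostly-active frontier: scan a bitmap over all vertices,
        // giving coalesced, branch-light access
        printf("launch dense (bitmap) kernel\n");
    } else {
        // few active vertices: iterate an explicit compacted worklist
        // to avoid touching the whole vertex set
        printf("launch sparse (worklist) kernel\n");
    }
}

The 0.05 cutoff is a placeholder; in practice such a threshold would be tuned per GPU and graph.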
Programming Language Design and Implementation | 2018
Changwan Hong; Aravind Sukumaran-Rajam; Jinsung Kim; Prashant Singh Rawat; Sriram Krishnamoorthy; Louis-Noël Pouchet; Fabrice Rastello; P. Sadayappan
In this paper, we develop an approach to GPU kernel optimization that focuses on identifying bottleneck resources and determining optimization parameters that can alleviate the bottleneck. Performance modeling for GPUs is done by abstract kernel emulation along with latency/gap modeling of resources. Sensitivity analysis with respect to resource latency/gap parameters is used to predict the bottleneck resource for a given kernel's execution. The utility of the bottleneck analysis is demonstrated in two contexts: 1) coupling the new bottleneck-driven optimization strategy with the OpenTuner auto-tuner, where experimental results on all kernels from the Rodinia suite and GPU tensor contraction kernels from the NWChem computational chemistry suite demonstrate its effectiveness; and 2) manual code optimization, where two case studies illustrate the use of the bottleneck analysis to iteratively improve the performance of code from state-of-the-art domain-specific code generators.
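The sensitivity-analysis step can be sketched independently of the emulator, treating the performance model as a black-box predictor (all names below are illustrative): perturb one resource's gap parameter at a time and report the resource whose slowdown moves the predicted execution time the most.

#include <functional>
#include <string>
#include <vector>

struct Resource { std::string name; double gap; };  // min cycles per transaction

std::string find_bottleneck(const std::vector<Resource> &res,
    const std::function<double(const std::vector<Resource>&)> &predict)
{
    double base = predict(res);             // predicted time, unperturbed
    std::string bottleneck;
    double worst = 0.0;
    for (size_t i = 0; i < res.size(); ++i) {
        std::vector<Resource> p = res;
        p[i].gap *= 1.10;                   // slow down this resource by 10%
        double delta = predict(p) - base;   // sensitivity of predicted time
        if (delta > worst) { worst = delta; bottleneck = res[i].name; }
    }
    return bottleneck;                      // most performance-critical resource
}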
International Conference on Supercomputing | 2018
Jinsung Kim; Aravind Sukumaran-Rajam; Changwan Hong; Ajay Panyala; Rohit Kumar Srivastava; Sriram Krishnamoorthy; P. Sadayappan
Tensor contractions are higher-dimensional analogs of matrix multiplications, used in many computational contexts such as high-order models in quantum chemistry, deep learning, and finite element methods. In contrast to the wide availability of high-performance libraries for matrix multiplication on GPUs, the same is not true for tensor contractions. In this paper, we address the optimization of a set of symmetrized tensor contractions that form the computational bottleneck in the CCSD(T) coupled-cluster method in computational chemistry suites like NWChem. The practical challenges in optimizing tensor contractions, which come in a wide variety of dimensionalities and shapes, include effective mapping of the high-dimensional iteration space to threads, the choice of data buffering in shared memory and registers, and tile sizes for multi-level tiling. Furthermore, for the symmetrized tensor contractions in CCSD(T), it is also a challenge to fuse contractions to reduce data-movement cost by exploiting reuse of intermediate tensors. We develop an efficient GPU implementation of the tensor contractions in CCSD(T) using shared-memory buffering, register tiling, loop fusion, and register transpose. Experimental results demonstrate significant improvement over the current state of the art.
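As a baseline for what is being optimized, a naive CUDA mapping of one representative contraction, C[a,b,i,j] += sum_k A[a,k,i] * B[k,b,j], is sketched below (the shapes, layouts, and kernel name are assumptions): index groups are flattened so that each thread owns one output element. The paper's kernels layer shared-memory buffering, register tiling, fusion across contractions, and register-level transposes on top of such a mapping.

// One thread per output element; (a,i) and (b,j) are flattened index groups.
__global__ void contract_naive(const float *A, const float *B, float *C,
                               int Na, int Nb, int Ni, int Nj, int Nk)
{
    int ai = blockIdx.y * blockDim.y + threadIdx.y;  // flattened (a,i)
    int bj = blockIdx.x * blockDim.x + threadIdx.x;  // flattened (b,j)
    if (ai >= Na * Ni || bj >= Nb * Nj) return;
    int a = ai / Ni, i = ai % Ni;
    int b = bj / Nj, j = bj % Nj;
    float sum = 0.f;
    for (int k = 0; k < Nk; ++k)                     // contracted dimension
        sum += A[(a * Nk + k) * Ni + i] * B[(k * Nb + b) * Nj + j];
    C[((a * Nb + b) * Ni + i) * Nj + j] += sum;      // no race: one owner per element
}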
International Conference on Computational Science | 2018
Gordon E. Moon; Israt Nisa; Aravind Sukumaran-Rajam; Bortik Bandyopadhyay; Srinivasan Parthasarathy; P. Sadayappan
Latent Dirichlet Allocation (LDA) is a statistical technique for topic modeling. Since it is very computationally demanding, its parallelization has garnered considerable interest. In this paper, we systematically analyze the data access patterns for LDA and devise suitable algorithmic adaptations and parallelization strategies for GPUs. Experiments on large-scale datasets show the effectiveness of the new parallel implementation on GPUs.
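For context, the data accesses being reorganized are those of the standard collapsed Gibbs sampling update at the heart of LDA training, sketched below in its textbook sequential form (this is not the paper's GPU kernel; the usual count arrays over D documents, W words, and K topics are assumed):

#include <random>
#include <vector>

// Resample the topic of one (document d, word w) token with current topic old_z.
int resample_topic(int d, int w, int old_z,
                   std::vector<int> &ndk,  // ndk[d*K + k]: tokens of topic k in doc d
                   std::vector<int> &nkw,  // nkw[k*W + w]: tokens of word w in topic k
                   std::vector<int> &nk,   // nk[k]: total tokens in topic k
                   int K, int W, float alpha, float beta, std::mt19937 &rng)
{
    ndk[d * K + old_z]--; nkw[old_z * W + w]--; nk[old_z]--;  // remove token
    std::vector<float> cdf(K);
    float cum = 0.f;
    for (int k = 0; k < K; ++k) {                             // unnormalized posterior
        cum += (ndk[d * K + k] + alpha) *
               (nkw[k * W + w] + beta) / (nk[k] + W * beta);
        cdf[k] = cum;
    }
    float r = std::uniform_real_distribution<float>(0.f, cum)(rng);
    int z = 0;
    while (z < K - 1 && cdf[z] < r) ++z;                      // inverse-CDF draw
    ndk[d * K + z]++; nkw[z * W + w]++; nk[z]++;              // add token back
    return z;
}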
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2018
Prashant Singh Rawat; Fabrice Rastello; Aravind Sukumaran-Rajam; Louis-Noël Pouchet; Atanas Rountev; P. Sadayappan
The recent advent of compute-intensive GPU architectures has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse through techniques such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement-reordering framework that models stencil computations as a DAG of trees with shared leaves and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes.
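The tree base case of such a scheduling algorithm is the classical Sethi-Ullman labeling, sketched below for binary expression trees (the paper's contribution lies in extending the ordering to DAGs of trees with shared leaves, which this sketch does not cover): evaluate the child that needs more registers first, and spend an extra register only when both children need the same number.

#include <algorithm>

struct Expr {
    Expr *lhs = nullptr, *rhs = nullptr;  // both null => leaf operand
};

// Minimum registers to evaluate the subtree rooted at e (binary ops assumed).
int regs_needed(const Expr *e) {
    if (!e->lhs && !e->rhs) return 1;     // leaf: load into one register
    int l = regs_needed(e->lhs);
    int r = regs_needed(e->rhs);
    if (l == r) return l + 1;             // equal needs: one extra register
    return std::max(l, r);                // else evaluate the larger side first
}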
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2018
Changwan Hong; Aravind Sukumaran-Rajam; Jinsung Kim; Prashant Singh Rawat; Sriram Krishnamoorthy; Louis-Noël Pouchet; Fabrice Rastello; P. Sadayappan
Performance modeling of GPU kernels is a significant challenge. In this paper, we develop a novel approach to performance modeling for GPUs through abstract kernel emulation along with latency/gap modeling of resources. Experimental results on all benchmarks from the Rodinia suite demonstrate good accuracy in predicting execution time on multiple GPU platforms.
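A toy predictor in the spirit of the gap/latency model is sketched below in C++ (the max-of-bounds composition and the structures here are simplifying assumptions; in the paper, per-resource transaction counts come from abstract emulation of the kernel): each resource imposes a throughput bound of transactions times gap, and the predicted time is the tightest of these bounds and the dependence-chain latency.

#include <algorithm>
#include <vector>

struct ResourceUse {
    double transactions;  // work issued to this resource (from emulation)
    double gap;           // minimum cycles between successive transactions
};

double predict_cycles(const std::vector<ResourceUse> &uses, double dep_latency) {
    double t = dep_latency;                       // latency-limited lower bound
    for (const ResourceUse &u : uses)
        t = std::max(t, u.transactions * u.gap);  // per-resource throughput bound
    return t;
}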
International Conference on Parallel Architectures and Compilation Techniques | 2017
Prashant Singh Rawat; Aravind Sukumaran-Rajam; Atanas Rountev; Fabrice Rastello; Louis-Noël Pouchet; P. Sadayappan
Compute-intensive GPU architectures allow the use of high-order 3D stencils for better computational accuracy. These stencils are usually compute-bound. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. We develop an optimization framework that models stencils as a forest of trees and performs statement reordering to reduce register use. The effectiveness of the approach is demonstrated through experimental results on several high-order stencils.
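To see the pressure source being targeted, consider a toy order-4 one-dimensional stencil in CUDA (the coefficients and names are made up): nine inputs are live per output point even without unrolling, and with 3D neighborhoods and unrolling across outputs the live ranges multiply until the compiler spills.

__global__ void stencil_order4(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 4 || i >= n - 4) return;
    // nine loaded values are live at once for this single output; reordering
    // the statements of larger stencils controls how many stay live together
    out[i] = 0.30f *  in[i]
           + 0.20f * (in[i - 1] + in[i + 1])
           + 0.15f * (in[i - 2] + in[i + 2])
           + 0.10f * (in[i - 3] + in[i + 3])
           + 0.05f * (in[i - 4] + in[i + 4]);
}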