Publication


Featured research published by Mayank Daga.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format

Joseph L. Greathouse; Mayank Daga

The performance of sparse matrix vector multiplication (SpMV) is important to computational scientists. Compressed sparse row (CSR) is the most frequently used format to store sparse matrices. However, CSR-based SpMV on graphics processing units (GPUs) has poor performance due to irregular memory access patterns, load imbalance, and reduced parallelism. This has led researchers to propose new storage formats. Unfortunately, dynamically transforming CSR into these formats has significant runtime and storage overheads. We propose a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs. Our implementation addresses the aforementioned challenges by (i) efficiently accessing DRAM by streaming data into the local scratchpad memory and (ii) dynamically assigning different numbers of rows to each parallel GPU compute unit. CSR-Adaptive achieves an average speedup of 14.7× over existing CSR-based algorithms and 2.3× over clSpMV cocktail, which uses an assortment of matrix formats.
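As a concrete reference for the storage format the paper keeps intact, here is a minimal CPU-side sketch of CSR and the baseline SpMV loop (illustrative Python with made-up example data; the paper's contribution is the adaptive GPU kernel, not this scalar loop):

```python
# Illustrative CSR SpMV. CSR stores a sparse matrix as three arrays:
# nonzero values, their column indices, and per-row offsets into both.

def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        # Row's nonzeros occupy values[row_ptr[row]:row_ptr[row + 1]].
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y

# 3x3 example matrix:  [[4, 0, 1],
#                       [0, 2, 0],
#                       [3, 0, 5]]
values  = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```

The irregularity the paper addresses is visible here: the inner loop's trip count varies per row, which on a GPU translates into load imbalance and scattered memory accesses.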


International Conference on Parallel and Distributed Systems | 2011

Architecture-Aware Mapping and Optimization on a 1600-Core GPU

Mayank Daga; Thomas R. W. Scogland; Wu-chun Feng

The graphics processing unit (GPU) continues to make in-roads as a computational accelerator for high-performance computing (HPC). However, despite its increasing popularity, mapping and optimizing GPU code remains a difficult task; it is a multi-dimensional problem that requires deep technical knowledge of GPU architecture. Although substantial literature exists on how to map and optimize GPU performance on the more mature NVIDIA CUDA architecture, the converse is true for OpenCL on an AMD GPU, such as the 1600-core AMD Radeon HD 5870. Consequently, we present and evaluate architecture-aware mappings and optimizations for the AMD GPU, the most prominent of which include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU mapping and optimizations by applying each in isolation as well as in concert to a large-scale molecular modeling application called GEM. Via these AMD-specific GPU optimizations, our optimized OpenCL implementation on an AMD Radeon HD 5870 delivers more than a four-fold improvement in performance over the basic OpenCL implementation. In addition, it outperforms our optimized CUDA version on an NVIDIA GTX280 by 12%. Overall, we achieve a speedup of 371-fold over a serial but hand-tuned SSE version of our molecular modeling application, and in turn, a 46-fold speedup over an ideal scaling on an 8-core CPU.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Exploiting Coarse-Grained Parallelism in B+ Tree Searches on an APU

Mayank Daga; Mark Nutter

B+ tree structured index searches are one of the fundamental database operations, and hence accelerating them is essential. GPUs provide a compelling mix of performance per watt and performance per dollar, and thus are an attractive platform for accelerating B+ tree searches. However, tree search on discrete GPUs presents significant challenges for acceleration due to (i) the irregular representation of the tree in memory and (ii) the requirement to copy the tree to GPU memory over the PCIe bus. In this paper, we present the acceleration of B+ tree searches on a fused CPU+GPU processor (an accelerated processing unit, or APU). We counter the aforementioned issues by reorganizing the B+ tree in memory and utilizing the novel heterogeneous system architecture, which eliminates (i) the need to copy the tree to the GPU and (ii) the limitation on the size of the tree that can be accelerated. Our approach exploits the coarse-grained parallelism in tree search, wherein we execute multiple searches in parallel to optimize for the SIMD width without modifying the inherent B+ tree data structure. Our results illustrate that the APU implementation can perform up to 70M queries per second and is 4.9x faster in the best case and 2.5x faster on average than a hand-tuned, SSE-optimized, six-core CPU implementation, for varying orders of a B+ tree with 4M keys. We also present an analysis of the effect of caches on performance, and of the efficacy of the APU in eliminating data copies.
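The coarse-grained parallelism described above can be sketched as follows: each query is an independent tree descent, so many queries can be mapped one-per-lane without changing the data structure. The node layout and names below are hypothetical illustrations, not the paper's memory-reorganized implementation:

```python
# Minimal B+-tree-style search over a batch of independent queries.
# A plain loop stands in for the GPU's SIMD lanes.
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted keys (separators in internal nodes)
        self.children = children  # internal node: child pointers
        self.values = values      # leaf node: payloads aligned with keys

def search_one(root, key):
    node = root
    while node.children is not None:
        # Descend to the child whose range covers the key.
        node = node.children[bisect.bisect_right(node.keys, key)]
    i = bisect.bisect_left(node.keys, key)
    return node.values[i] if i < len(node.keys) and node.keys[i] == key else None

def search_batch(root, queries):
    # Coarse-grained parallelism: queries share no state, so each can run
    # on its own SIMD lane / work-item.
    return [search_one(root, q) for q in queries]

leaf1 = Node([10, 20], values=["a", "b"])
leaf2 = Node([30, 40], values=["c", "d"])
root = Node([30], children=[leaf1, leaf2])
print(search_batch(root, [20, 35, 40]))  # ['b', None, 'd']
```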


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Structural Agnostic SpMV: Adapting CSR-Adaptive for Irregular Matrices

Mayank Daga; Joseph L. Greathouse

Sparse matrix vector multiplication (SpMV) is an important linear algebra primitive. Recent research has focused on improving the performance of SpMV on GPUs when using compressed sparse row (CSR), the most frequently used matrix storage format on CPUs. Efficient CSR-based SpMV obviates the need for other GPU-specific storage formats, thereby saving runtime and storage overheads. However, existing CSR-based SpMV algorithms on GPUs perform poorly on irregular sparse matrices, limiting their usefulness. We propose a novel approach for SpMV on GPUs which works well for both regular and irregular matrices while keeping the CSR format intact. We start with CSR-Adaptive, which dynamically chooses between two SpMV algorithms depending on the length of each row. We then add a series of performance improvements, such as a more efficient reduction technique. Finally, we add a third algorithm which uses multiple parallel execution units when operating on irregular matrices with very long rows. Our implementation dynamically assigns the best algorithm to sets of rows in order to ensure that the GPU is efficiently utilized. We effectively double the performance of CSR-Adaptive, which had previously demonstrated better performance than algorithms that use other storage formats. In addition, our implementation is 36% faster than CSR5, the current state of the art for SpMV on GPUs.
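The dynamic assignment described above can be illustrated with a toy dispatcher that inspects row lengths in the CSR row-pointer array and picks a kernel per row. The thresholds and kernel names here are illustrative stand-ins, not the paper's tuned values:

```python
# Sketch of per-row kernel selection by row length, in the spirit of the
# paper's dynamic algorithm assignment. Thresholds are made-up numbers.

def choose_kernels(row_ptr, short_max=8, long_min=1024):
    plan = []
    n_rows = len(row_ptr) - 1
    for row in range(n_rows):
        length = row_ptr[row + 1] - row_ptr[row]
        if length <= short_max:
            plan.append("stream")       # many short rows: stream into scratchpad
        elif length < long_min:
            plan.append("vector")       # one compute unit reduces the row
        else:
            plan.append("vector-long")  # several units cooperate on a long row
    return plan

row_ptr = [0, 3, 11, 2000]   # row lengths: 3, 8, 1989
print(choose_kernels(row_ptr))  # ['stream', 'stream', 'vector-long']
```

The point of the third category is the irregular case: a single very long row would otherwise serialize on one compute unit while the rest of the GPU idles.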


International Conference on Big Data | 2014

Efficient breadth-first search on a heterogeneous processor

Mayank Daga; Mark Nutter; Mitesh R. Meswani

Accelerating breadth-first search (BFS) can be a compelling value-add given its pervasive deployment. The current state-of-the-art hybrid BFS algorithm selects different traversal directions based on graph properties and thereby possesses heterogeneous characteristics. Related work has studied this heterogeneous BFS algorithm on homogeneous processors. In recent years, heterogeneous processors have become mainstream due to their ability to maximize performance under restrictive thermal budgets. However, current software fails to fully leverage the heterogeneous capabilities of the modern processor, lagging behind hardware advancements. We propose a “hybrid++” BFS algorithm for an accelerated processing unit (APU), a heterogeneous processor which fuses the CPU and GPU cores on a single die. Hybrid++ leverages the strengths of CPUs and GPUs for serial and data-parallel execution, respectively, to carefully partition BFS by selecting the appropriate execution core and graph-traversal direction for every search iteration. Our results illustrate that on a variety of graphs, ranging from social networks to road networks, hybrid++ yields a speedup of up to 2× compared to the multithreaded hybrid algorithm. Execution of hybrid++ on the APU is also 2.3× more energy efficient than that on a discrete GPU.
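The direction-switching idea underlying the hybrid algorithm can be sketched compactly: per iteration, expand the frontier top-down when it is small, and let unvisited vertices search for frontier parents bottom-up when it is large. The threshold heuristic below is a simplified illustration, not the paper's per-iteration core/direction selector:

```python
# Sketch of direction-optimizing ("hybrid") BFS with a toy switching rule.
from collections import defaultdict

def hybrid_bfs(adj, n, source, alpha=0.05):
    dist = {source: 0}
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        if len(frontier) < alpha * n:
            # Top-down: expand outgoing edges of the frontier.
            nxt = {v for u in frontier for v in adj[u] if v not in dist}
        else:
            # Bottom-up: each unvisited vertex looks for a frontier parent.
            nxt = {v for v in range(n)
                   if v not in dist and any(u in frontier for u in adj[v])}
        for v in nxt:
            dist[v] = level
        frontier = nxt
    return dist

adj = defaultdict(list)
for u, v in [(0, 1), (1, 2), (0, 3), (3, 4)]:
    adj[u].append(v)
    adj[v].append(u)
print(sorted(hybrid_bfs(adj, 5, 0).items()))  # [(0, 0), (1, 1), (2, 2), (3, 1), (4, 2)]
```

Top-down work scales with frontier edges, bottom-up with unvisited vertices; switching between them is what makes the algorithm's per-iteration behavior heterogeneous, which hybrid++ additionally maps onto the CPU or GPU core as appropriate.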


Computing Frontiers | 2011

Bounding the effect of partition camping in GPU kernels

Ashwin M. Aji; Mayank Daga; Wu-chun Feng

Current GPU tools and performance models provide some common architectural insights that guide programmers to write optimal code. We challenge and complement these performance models and tools by modeling and analyzing a lesser known but very severe performance pitfall, called partition camping, in NVIDIA GPUs. Partition camping is caused by memory accesses that are skewed towards a subset of the available memory partitions, which may degrade the performance of GPU kernels by up to seven-fold. No existing tool can detect the partition-camping effect in GPU kernels. Unlike traditional performance modeling approaches, we predict a performance range that bounds the partition-camping effect in a GPU kernel. Our idea of predicting a performance range, instead of the exact performance, is more realistic due to the large performance variations induced by partition camping. We design and develop the prediction model by first characterizing the effects of partition camping with an in-house suite of micro-benchmarks. We then apply rigorous statistical regression techniques over the micro-benchmark data to predict the performance bounds of real GPU kernels, with and without the partition-camping effect. We test the accuracy of our performance model by analyzing three real applications with known memory access patterns and partition-camping effects. Our results show that the geometric mean of errors in our performance range prediction model is within 12% of the actual execution times. We also develop and present an easy-to-use, spreadsheet-based tool called CampProf, which is a visual front-end to our performance range prediction model and can be used to gain insights into the degree of partition camping in GPU kernels. Lastly, we demonstrate how CampProf can be used to visually monitor performance improvements in kernels as the partition-camping effect is removed.
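The range-prediction idea can be sketched as two regression lines: one fitted to micro-benchmark timings without camping (the lower bound) and one to fully camped timings (the upper bound), between which a real kernel's time should fall. All coefficients and data points below are made-up illustrative numbers, not measurements from the paper:

```python
# Toy performance-range model in the spirit of CampProf: bound a kernel's
# execution time between a "no camping" and a "fully camped" linear fit.

def fit_line(xs, ys):
    # Ordinary least-squares fit y = a*x + b over micro-benchmark points.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical micro-benchmark data: time (ms) vs. memory transactions.
txn = [1e6, 2e6, 4e6]
best = fit_line(txn, [1.0, 2.0, 4.0])       # accesses spread over partitions
worst = fit_line(txn, [7.0, 14.0, 28.0])    # accesses camped on one partition

def predict_range(transactions):
    lo = best[0] * transactions + best[1]
    hi = worst[0] * transactions + worst[1]
    return lo, hi

print(predict_range(3e6))  # lower/upper bound in ms, roughly (3.0, 21.0)
```

Where a measured kernel lands inside this band indicates its degree of partition camping, which is the insight the CampProf front-end visualizes.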


Journal of Chemical Theory and Computation | 2011

An n log n Generalized Born Approximation.

Ramu Anandakrishnan; Mayank Daga; Alexey V. Onufriev

Molecular dynamics (MD) simulations based on the generalized Born (GB) model of implicit solvation offer a number of important advantages over traditional explicit-solvent simulations. Yet in MD simulations the GB model has not been able to reach its full potential, partly due to its computational cost, which scales as ∼n², where n is the number of solute atoms. We present here an ∼n log n approximation for the generalized Born (GB) implicit solvent model. The approximation is based on the hierarchical charge partitioning (HCP) method (Anandakrishnan and Onufriev, J. Comput. Chem. 2010, 31, 691-706), previously developed and tested for electrostatic computations in gas-phase and distance-dependent dielectric models. The HCP uses the natural organization of biomolecular structures to partition them into multiple hierarchical levels of components. The charge distribution of each component is approximated by a much smaller number of charges. The approximate charges are then used for computing electrostatic interactions with distant components, while the full set of atomic charges is used for nearby components. To apply the HCP concept to the GB model, we define the equivalent of the effective Born radius for components. The component effective Born radius is then used in GB computations for points that are distant from the component. This HCP approximation for GB (HCP-GB) is implemented in the open-source MD software NAB in AmberTools and tested on a set of representative biomolecular structures ranging in size from 632 atoms to ∼3 million atoms. For this set of test structures, the HCP-GB method is 1.1-390 times faster than the GB computation without additional approximations (the reference GB computation), depending on the size of the structure. Similar to the spherical cutoff method with GB (cutoff-GB), which also scales as ∼n log n, the HCP-GB is relatively simple. However, for the structures considered here, we show that the HCP-GB method is more accurate than the cutoff-GB method as measured by relative RMS error in electrostatic force compared to the reference (no cutoff) GB computation. MD simulations of four biomolecular structures on 50 ns time scales show that the backbone RMS deviation for the HCP-GB method is in reasonable agreement with the reference GB simulation. A critical difference between the cutoff-GB and HCP-GB methods is that the cutoff-GB method completely ignores interactions due to atoms beyond the cutoff distance, whereas the HCP-GB method uses an approximation for interactions due to distant atoms. Our testing suggests that completely ignoring distant interactions, as the cutoff-GB does, can lead to qualitatively incorrect results. In general, we found that the HCP-GB method reproduces key characteristics of dynamics, such as residue fluctuation, χ1/χ2 flips, and DNA flexibility, more accurately than the cutoff-GB method. As a practical demonstration, an HCP-GB simulation of a 348,000-atom chromatin fiber was used to refine the starting structure. Our findings suggest that the HCP-GB method is preferable to the cutoff-GB method for molecular dynamics based on pairwise implicit solvent GB models.
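The core HCP idea, approximating a distant group of charges by a single aggregate charge while keeping full atomic detail nearby, can be sketched on a plain Coulomb-style potential. The 2-D geometry, charges, and distance threshold below are made-up illustrative values, not the GB formulation of the paper:

```python
# Sketch of hierarchical charge partitioning: far groups collapse to one
# aggregate charge at the group's charge-weighted center; near groups use
# the full set of atomic charges.
import math

def potential_exact(charges, point):
    # Unscreened 1/r potential; stands in for the real electrostatic model.
    return sum(q / math.dist(pos, point) for q, pos in charges)

def group_center(charges):
    qtot = sum(q for q, _ in charges)
    cx = sum(q * pos[0] for q, pos in charges) / qtot
    cy = sum(q * pos[1] for q, pos in charges) / qtot
    return qtot, (cx, cy)

def potential_hcp(groups, point, threshold=10.0):
    total = 0.0
    for group in groups:
        qtot, center = group_center(group)
        if math.dist(center, point) > threshold:
            total += qtot / math.dist(center, point)  # far: one aggregate charge
        else:
            total += potential_exact(group, point)    # near: full atomic detail
    return total

near = [(1.0, (0.0, 1.0)), (1.0, (1.0, 0.0))]
far  = [(1.0, (100.0, 0.0)), (1.0, (101.0, 0.0))]
approx = potential_hcp([near, far], (0.0, 0.0))
exact  = potential_exact(near + far, (0.0, 0.0))
print(approx, exact)  # the two values agree closely
```

Because distant interactions are approximated rather than dropped, this scheme keeps the qualitative long-range behavior that, as the abstract notes, a hard cutoff destroys.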


International Conference on Computational Advances in Bio and Medical Sciences | 2011

Towards accelerating molecular modeling via multi-scale approximation on a GPU

Mayank Daga; Wu-chun Feng; Thomas R. W. Scogland

Research efforts to analyze biomolecular properties contribute towards our understanding of biomolecular function. Calculating non-bonded forces (or in our case, electrostatic surface potential) is often a large portion of the computational complexity in analyzing biomolecular properties. Therefore, reducing the computational complexity of these force calculations, either by improving the computational algorithm or by improving the underlying hardware on which the algorithm runs, can help to accelerate the discovery process. Traditional approaches seek to parallelize the electrostatic calculations to run on large-scale supercomputers, which are expensive and highly contended resources. Leveraging our multi-scale approximation algorithm for calculating electrostatic surface potential, we present a novel mapping and optimization of this algorithm on the graphics processing unit (GPU) of a desktop personal computer (PC). Our mapping and optimization of the algorithm results in a speed-up as high as four orders of magnitude compared to running serially on the same desktop PC, without deteriorating the accuracy of our results.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2016

Implementing directed acyclic graphs with the heterogeneous system architecture

Sooraj Puthoor; Ashwin M. Aji; Shuai Che; Mayank Daga; Wei Wu; Bradford M. Beckmann; Gregory Rodgers

Achieving optimal performance on heterogeneous computing systems requires a programming model that supports the execution of asynchronous, multi-stream, and out-of-order tasks in a shared-memory environment. Asynchronous dependency-driven tasking is one such programming model; it allows the computation to be expressed as a directed acyclic graph (DAG) and exposes fine-grain task management to the programmer. The use of DAGs to extract parallelism also enables runtimes to perform dynamic load balancing, thereby achieving higher throughput than traditional bulk-synchronous execution. However, efficient DAG implementations require features such as user-level task dispatch, hardware signalling, and local barriers to achieve low-overhead task dispatch and dependency resolution. In this paper, we demonstrate that the Heterogeneous System Architecture (HSA) exposes the above capabilities, and we validate their benefits by implementing three well-referenced applications using fine-grain tasks: Cholesky factorization, Lower-Upper Decomposition (LUD), and Needleman-Wunsch (NW). HSA's user-level task dispatch and signalling capabilities allow work to be launched and dependencies to be managed directly by the hardware, avoiding inefficient bulk synchronization. Our results show that the HSA task-based implementations of Cholesky, LUD, and NW are representative of this emerging class of workloads and, using hardware-managed tasks, achieve speedups of 3.8x, 1.6x, and 1.5x, respectively, compared to bulk-synchronous implementations.
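The dependency-resolution scheme described above can be sketched with predecessor counts: each task records how many prerequisites remain, and it is dispatched the moment that count reaches zero, with no barrier between "levels". The task names and runner below are illustrative, not HSA's hardware mechanism:

```python
# Sketch of dependency-driven DAG tasking with per-task pending counts.
from collections import deque

def run_dag(tasks, deps):
    """tasks: {name: fn}; deps: {name: set of prerequisite names}."""
    pending = {t: len(deps.get(t, ())) for t in tasks}
    succ = {t: [] for t in tasks}
    for t, pre in deps.items():
        for p in pre:
            succ[p].append(t)
    ready = deque(t for t, n in pending.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()            # on HSA this would be a user-level dispatch
        order.append(t)
        for s in succ[t]:     # dependency resolution: decrement successors
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)   # eligible immediately, no global barrier
    return order

log = []
tasks = {name: (lambda n=name: log.append(n)) for name in "ABCD"}
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(run_dag(tasks, deps))  # ['A', 'B', 'C', 'D']
```

In a bulk-synchronous version, B and C would have to wait for an explicit barrier after A; here each becomes runnable as soon as its own count hits zero, which is the overhead the HSA features eliminate in hardware.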


International Workshop on OpenCL | 2016

clSPARSE: A Vendor-Optimized Open-Source Sparse BLAS Library

Joseph L. Greathouse; Kent Knox; Jakub Poła; Kiran Varaganti; Mayank Daga

Sparse linear algebra is a cornerstone of modern computational science. These algorithms ignore the zero-valued entries found in many domains in order to work on much larger problems at much faster rates than dense algorithms. Nonetheless, optimizing these algorithms is not straightforward. Highly optimized algorithms for multiplying a sparse matrix by a dense vector, for instance, are the subject of a vast corpus of research and can be hundreds of times longer than naïve implementations. Optimized sparse linear algebra libraries are thus needed so that users can build applications without enormous effort. Hardware vendors release proprietary libraries that are highly optimized for their devices, but they limit interoperability and promote vendor lock-in. Open libraries often work across multiple devices and can quickly take advantage of new innovations, but they may not reach peak performance. The goal of this work is to provide a sparse linear algebra library that offers both of these advantages. We thus describe clSPARSE, a permissively licensed open-source sparse linear algebra library that offers state-of-the-art optimized algorithms implemented in OpenCL™. We test clSPARSE on GPUs from AMD and Nvidia and show performance benefits over both the proprietary cuSPARSE library and the open-source ViennaCL library.

Collaboration


Dive into Mayank Daga's collaborations.

Top Co-Authors
Mark Nutter

Advanced Micro Devices
