Dip Sankar Banerjee
International Institute of Information Technology, Hyderabad
Publication
Featured research published by Dip Sankar Banerjee.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2013
Dip Sankar Banerjee; Shashank Sharma; Kishore Kothapalli
Graph algorithms play a prominent role in several fields of science and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of the above problems on a variety of modern multiprocessor architectures. It can be noticed in recent times that the size of the graphs that correspond to real-world data sets has been increasing. Parallelism offers only limited succor to this situation, as current parallel architectures have severe shortcomings when deployed for most graph algorithms. At the same time, these graphs are also getting very sparse in nature. This calls for particular work-efficient solutions aimed at processing large, sparse graphs on modern parallel architectures. In this paper, we introduce graph pruning as a technique that aims to reduce the size of the graph. Certain elements of the graph can be pruned depending on the nature of the computation. Once a solution is obtained for the pruned graph, the solution is extended to the entire graph. We apply the above technique to three fundamental graph algorithms: breadth-first search (BFS), connected components (CC), and all-pairs shortest paths (APSP). To validate our technique, we implement our algorithms on a heterogeneous platform consisting of a multicore CPU and a GPU. On this platform, we achieve an average of 35% improvement compared to state-of-the-art solutions. Such an improvement has the potential to speed up other applications that rely on these algorithms.
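As an illustration of the pruning idea described above (a minimal sketch, not the paper's implementation): for connected components, degree-1 vertices can be peeled off, the labeling computed on the smaller core graph, and the peeled vertices relabeled from the neighbors they were attached to.

```python
from collections import defaultdict, deque

def connected_components_with_pruning(n, edges):
    """Label connected components of an undirected graph, peeling
    degree-1 vertices first (a toy instance of graph pruning), then
    extending the core solution back to the pruned vertices."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Phase 1: prune degree-1 vertices, remembering who to rejoin later.
    pruned = []  # stack of (vertex, neighbor at removal time)
    alive = [True] * n
    frontier = deque(v for v in range(n) if len(adj[v]) == 1)
    while frontier:
        v = frontier.popleft()
        if not alive[v] or len(adj[v]) != 1:
            continue
        (u,) = adj[v]
        pruned.append((v, u))
        alive[v] = False
        adj[u].discard(v)
        adj[v].clear()
        if alive[u] and len(adj[u]) == 1:
            frontier.append(u)

    # Phase 2: BFS-based components on the pruned core.
    label, comp = [-1] * n, 0
    for s in range(n):
        if not alive[s] or label[s] != -1:
            continue
        label[s] = comp
        q = deque([s])
        while q:
            x = q.popleft()
            for y in adj[x]:
                if label[y] == -1:
                    label[y] = comp
                    q.append(y)
        comp += 1

    # Phase 3: extend the solution to the full graph. Reversed order
    # guarantees each pruned vertex's neighbor is already labeled.
    for v, u in reversed(pruned):
        label[v] = label[u]
    return label
```

The same peel-solve-extend shape carries over to the other prunings the paper considers; only the pruning condition and the extension step change.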
IEEE International Conference on High Performance Computing, Data, and Analytics | 2011
Dip Sankar Banerjee; Kishore Kothapalli
The advent of multicore and many-core architectures saw them being deployed to speed up computations across several disciplines and application areas. Prominent examples include semi-numerical algorithms such as sorting, graph algorithms, image processing, scientific computations, and the like. In particular, using GPUs for general-purpose computations has attracted a lot of attention, given that GPUs can deliver more than one TFLOP of computing power at very low prices. In this work, we use a new model of multicore computing called hybrid multicore computing, where the computation is performed simultaneously on a control device, such as a CPU, and an accelerator, such as a GPU. To this end, we use two case studies to explore the algorithmic and analytical issues in hybrid multicore computing. Our case studies involve two different ways of designing hybrid multicore algorithms. The main contribution of this paper is to address the issues related to the design of hybrid solutions. We show that our hybrid algorithm for list ranking is 50% faster than the best known implementation [Z. Wei, J. JaJa; IPDPS 2010]. Similarly, our hybrid algorithm for graph connected components is 25% faster than the best known GPU implementation [26].
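The list-ranking case study builds on a standard parallel primitive; as a hedged illustration (a sequential emulation, not the paper's hybrid CPU+GPU implementation), pointer jumping computes each node's distance to the tail of a linked list in a logarithmic number of rounds:

```python
def list_rank(succ):
    """Rank each node of a linked list given as a successor array,
    with -1 marking the tail (assumes a proper list, no cycles).

    Sequential emulation of pointer jumping: in each round every node
    doubles the span of its pointer while accumulating ranks, so the
    loop terminates after O(log n) rounds.
    """
    n = len(succ)
    nxt = list(succ)
    rank = [0 if s == -1 else 1 for s in succ]
    changed = True
    while changed:
        changed = False
        new_nxt, new_rank = list(nxt), list(rank)
        for i in range(n):          # conceptually one parallel step
            j = nxt[i]
            if j != -1:
                new_rank[i] = rank[i] + rank[j]   # jump over j
                new_nxt[i] = nxt[j]
                changed = True
        nxt, rank = new_nxt, new_rank
    return rank
```

In the hybrid setting, the body of each round is what gets split between the control device and the accelerator.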
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013
Dip Sankar Banerjee; Parikshit Sakurikar; Kishore Kothapalli
Sorting has been a topic of immense research value since the inception of computer science. Hybrid computing on multicore architectures involves computing simultaneously on a tightly coupled heterogeneous collection of devices. In this work, we consider a multicore CPU along with a many-core GPU as our experimental hybrid platform, and present a hybrid comparison-based sorting algorithm which utilizes both devices to perform sorting. The algorithm is broadly based on splitting the input list according to a large number of splitters, followed by creating independent sublists. Sorting the independent sublists results in sorting the entire original list. On a CPU+GPU platform consisting of an Intel i7 980 and an NVIDIA GTX 580, our algorithm achieves a 20% gain over the current best known comparison sort result, published by Davidson et al. [InPar 2012]. On the above experimental platform, our results are better by 40% on average over a similar GPU-alone algorithm proposed by Leischner et al. [IPDPS 2010]. Our results also show that our algorithm and its implementation scale with the size of the input. We also show that such performance gains can be obtained on other hybrid CPU+GPU platforms.
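The split-then-sort structure is essentially sample sort; a minimal sketch (illustrative only, with bucket counts and oversampling rates chosen arbitrarily rather than tuned as in the paper):

```python
import bisect
import random

def splitter_sort(a, num_parts=4, oversample=8):
    """Sample-sort sketch of the splitter idea: pick splitters from a
    random sample, bucket the input into independent sublists, sort
    each sublist (in the paper, across CPU and GPU concurrently), and
    concatenate the sorted buckets."""
    if len(a) <= num_parts:
        return sorted(a)
    sample = sorted(random.sample(a, min(len(a), num_parts * oversample)))
    splitters = [sample[i * len(sample) // num_parts]
                 for i in range(1, num_parts)]
    buckets = [[] for _ in range(num_parts)]
    for x in a:
        # bucket k holds keys between splitters k-1 and k
        buckets[bisect.bisect_right(splitters, x)].append(x)
    out = []
    for b in buckets:   # each bucket is sorted independently
        out.extend(sorted(b))
    return out
```

Because the buckets are range-disjoint, sorting them independently and concatenating yields the fully sorted list, which is what makes the two devices' work independent.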
Journal of Parallel and Distributed Computing | 2015
Dip Sankar Banerjee; Ashutosh Kumar; Meher Chaitanya; Shashank Sharma; Kishore Kothapalli
Graph algorithms play a prominent role in several fields of science and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of the above problems on a variety of modern multiprocessor architectures. It can be noticed in recent times that the size of the graphs that correspond to real-world data sets has been increasing. Parallelism offers only limited succor to this situation, as current parallel architectures have severe shortcomings when deployed for most graph algorithms. At the same time, these graphs are also getting very sparse in nature. This calls for particular work-efficient solutions aimed at processing large, sparse graphs on modern parallel architectures. In this paper, we introduce graph pruning as a technique that aims to reduce the size of the graph. Certain elements of the graph can be pruned depending on the nature of the computation. Once a solution is obtained for the pruned graph, the solution is extended to the entire graph. We apply the above technique to three fundamental graph algorithms: breadth-first search (BFS), connected components (CC), and all-pairs shortest paths (APSP). To validate our technique, we implement our algorithms on a heterogeneous platform consisting of a multicore CPU and a GPU. On this platform, we achieve an average of 35% improvement compared to state-of-the-art solutions. Such an improvement has the potential to speed up other applications that rely on these algorithms.
International Parallel and Distributed Processing Symposium | 2012
Dip Sankar Banerjee; Aman Kumar Bahl; Kishore Kothapalli
The use of many-core architectures and accelerators, such as GPUs, with good programmability has allowed them to be deployed for vital computational work. The ability to use randomness in computation is known to help in several situations. For such computations to be made possible on a general-purpose computer, a source of randomness, or more generally a pseudo-random number generator (PRNG), is essential. However, most of the PRNGs currently available on GPUs suffer from some basic drawbacks that we highlight in this paper. It is therefore of high interest to develop a parallel, high-quality PRNG that also works in an on-demand model. In this paper we investigate a CPU+GPU hybrid technique to create an efficient PRNG. The basic technique we apply is that of random walks on expander graphs. Unlike existing generators available in the GPU programming environment, our generator can produce random numbers on demand, as opposed to one-time generation. Our approach produces 0.07 giga-numbers per second. The quality of our generator is tested with industry-standard tests. We also demonstrate two applications of our PRNG: a list-ranking algorithm that exercises its on-demand nature, and a Monte Carlo simulation that shows the high quality of our generator.
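To make the expander-walk technique concrete, here is a toy sequential sketch (not the paper's hybrid construction): vertices are Z_p for a prime p, each vertex x neighbors x+1, x-1, and x^-1 (mod p), a standard 3-regular expander family, and a small auxiliary LCG (purely an assumption of this sketch) chooses which edge to follow. The visited vertices form the on-demand output stream.

```python
def expander_prng(p=10007, seed=1, step_seed=12345):
    """Generate pseudo-random numbers on demand by walking the
    3-regular expander on Z_p with edges x -> x+1, x-1, x^-1 (mod p).
    p must be prime so the modular inverse is well defined."""
    x = seed % p
    s = step_seed
    while True:
        # Auxiliary LCG picks one of the three edges (illustrative only).
        s = (6364136223846793005 * s + 1442695040888963407) % (1 << 64)
        choice = (s >> 33) % 3
        if choice == 0:
            x = (x + 1) % p
        elif choice == 1:
            x = (x - 1) % p
        else:
            x = pow(x, p - 2, p) if x else 0   # inverse edge (0 self-loops)
        yield x
```

Being a generator, it hands out one number per request, which mirrors the on-demand property the abstract emphasizes; the walk's mixing rate, not the auxiliary selector, is what drives the output quality.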
IEEE International Conference on High Performance Computing, Data, and Analytics | 2014
Dip Sankar Banerjee; Parikshit Sakurikar; Kishore Kothapalli
Sorting has been a topic of immense research value since the inception of computer science. Hybrid computing on multicore architectures involves computing simultaneously on a tightly coupled heterogeneous collection of devices. In this work, we consider a multicore CPU along with a many-core GPU as our experimental hybrid platform, and present a hybrid comparison-based sorting algorithm which utilizes both devices to perform sorting. The algorithm is broadly based on splitting the input list according to a large number of splitters, followed by creating independent sublists. Sorting the independent sublists results in sorting the entire original list. On a CPU+GPU platform consisting of an Intel i7-980X and an NVIDIA GTX 580, our algorithm achieves a 20% gain over the current best known comparison sort result (Davidson et al., 2012). On the above experimental platform, our results are better by 40% on average over a similar GPU-alone algorithm proposed by Leischner et al. (2010). We also extend our sorting algorithm from fixed-length keys to variable-length keys, using a look-ahead based approach to sort strings, and obtain around a 24% benefit compared to the current best known solution. Our results also show that our algorithm and its implementation scale with the size of the input. We also show that such performance gains can be obtained on other hybrid CPU+GPU platforms.
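One way to read the look-ahead idea for variable-length keys (a hedged interpretation, not the paper's stated algorithm) is to key each string by a fixed-length prefix, sort those fixed-length keys as in the hybrid sort, and re-sort only the groups that tie on the prefix:

```python
def prefix_sort_strings(strs, k=4):
    """Sort variable-length strings via fixed-length k-byte prefix
    keys; only prefix-tied groups need a second, full-string sort."""
    keyed = sorted(strs, key=lambda s: s[:k])
    out, i = [], 0
    while i < len(keyed):
        j = i
        while j < len(keyed) and keyed[j][:k] == keyed[i][:k]:
            j += 1
        out.extend(sorted(keyed[i:j]))   # resolve ties beyond the prefix
        i = j
    return out
```

When prefixes are discriminating, almost all comparison work happens on fixed-length keys, which is the case the fixed-length hybrid sort handles well.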
IEEE International Conference on Cloud Computing Technology and Science | 2016
Dip Sankar Banerjee; Khaled Hamidouche; Dhabaleswar K. Panda
Deep learning frameworks have recently gained widespread popularity due to their highly accurate prediction capabilities and the availability of low-cost processors that can perform training over a large dataset quickly. Given the high core count in modern-generation high performance computing systems, training deep networks over large data has now become practical. In this work, targeting the Computational Network Toolkit (CNTK) framework, we propose new mechanisms and designs to boost the performance of communication between GPU nodes. We perform a thorough analysis of the different phases of the toolkit, such as I/O, communication, and computation, to identify the bottlenecks that can potentially be alleviated using the high performance capabilities provided by CUDA-aware MPI runtimes. Using a CUDA-aware MPI library, we propose CUDA-Aware CNTK (CA-CNTK), which performs low-overhead communication. Experiments with datasets ranging from small to large demonstrate the advantage of our redesign and suggest that similar gains can be obtained on deep learning frameworks with a similar execution pattern. Our designs show an average improvement of 23%, 21% and 15% per epoch for the popular CIFAR10, MNIST and ImageNet datasets, respectively.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2016
Dip Sankar Banerjee; Khaled Hamidouche; Dhabaleswar K. Panda
Graphics Processing Units (GPUs) have gained the position of a mainstream accelerator due to their low power footprint and massive parallelism. From CUDA 6.0 onward, NVIDIA has provided the Managed Memory capability, which unifies host and device memory allocations into a single allocation and removes the requirement for explicit memory transfers between the two. Several applications, particularly irregular ones, can benefit immensely from managed memory because of the high programming productivity that comes from the minimal effort involved in data management and movement. The MVAPICH2 library utilizes runtime designs such as CUDA Inter-Process Communication (IPC) and GPUDirect RDMA (GDR) under the CUDA-aware concept to offer high productivity and programmability with MPI on modern clusters. However, the integration and interaction of managed memory with these features raises challenges for efficient small- and large-message communication. In this study, we present an initial evaluation of the managed memory capability and its interaction with existing high performance designs and features available in the MVAPICH2 library. We propose new designs to enable efficient communication between managed memory buffers, and perform fine tuning to optimize transfers between managed memories residing in GPUs. To the best of our knowledge, this is the first evaluation and study of managed memory and its interaction with MPI runtimes. A detailed evaluation and analysis of the performance of the proposed designs is presented. The Stencil2D communication kernel available in the SHOC suite was redesigned to enable managed memory support; the evaluation shows a 4x improvement in the timings of stencil exchanges on 16 GPU nodes.
International Parallel and Distributed Processing Symposium | 2015
Kiran Raj Ramamoorthy; Dip Sankar Banerjee; Kannan Srinathan; Kishore Kothapalli
Multiplying two sparse matrices, denoted spmm, is a fundamental operation in linear algebra with several applications. Hence, efficient and scalable implementation of spmm has been a topic of immense research. Recent efforts are aimed at implementations on GPUs, multicore architectures, FPGAs, and such emerging computational platforms. Owing to the highly irregular nature of spmm, it is observed that GPUs and CPUs can offer comparable performance. In this paper, we study CPU+GPU heterogeneous algorithms for spmm where the matrices exhibit a scale-free nature. Focusing on such matrices, we propose an algorithm that multiplies two sparse matrices exhibiting scale-free nature on a CPU+GPU heterogeneous platform. Our experiments on a wide variety of real-world matrices from standard datasets show an average of 25% improvement over the best possible algorithm on a CPU+GPU heterogeneous platform. We show that our approach is both architecture-aware, and workload-aware.
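A minimal sketch of the architecture- and workload-aware idea for scale-free matrices (illustrative only; the row-count threshold and dict-of-dicts format are assumptions of this sketch, not the paper's data structures): split the rows of A by nonzero count, since scale-free matrices have a few heavy hub rows and many light ones, and route each class to the device that handles it best.

```python
def spmm_partitioned(A, B, threshold=8):
    """Row-wise sparse product C = A * B with a heavy/light work split.
    Matrices are dicts: row index -> {col index: value}."""
    heavy = {i for i, row in A.items() if len(row) > threshold}
    light = set(A) - heavy

    def multiply_rows(rows):
        C = {}
        for i in rows:
            acc = {}
            # Row i of C is a sparse combination of rows of B.
            for k, a in A[i].items():
                for j, b in B.get(k, {}).items():
                    acc[j] = acc.get(j, 0) + a * b
            if acc:
                C[i] = acc
        return C

    # In the hybrid setting the two calls run concurrently on the two
    # devices; here they simply run back to back.
    C = multiply_rows(heavy)
    C.update(multiply_rows(light))
    return C
```

The partition is what makes the algorithm workload-aware: the same code adapts to any degree distribution by moving the heavy/light boundary.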
International Conference on Parallel Processing | 2017
Mallipeddi Hardhik; Dip Sankar Banerjee; Kiran Raj Ramamoorthy; Kishore Kothapalli; Kannan Srinathan
The architectural trend towards heterogeneity has pushed heterogeneous computing to the fore of parallel computing research. Heterogeneous algorithms, often carefully handcrafted, have been designed for several important problems from parallel computing such as sorting, graph algorithms, matrix computations, and the like. A majority of these algorithms follow a work partitioning approach where the input is divided into appropriately sized parts so that individual devices can process the "right" parts of the input. However, arriving at a good work partitioning is usually non-trivial and may require extensive empirical search. Such a search can potentially offset any gains accrued from heterogeneous algorithms, and other recently proposed approaches are, in general, also inadequate. In this paper, we propose a simple and effective technique for work partitioning in the context of heterogeneous algorithms. Our technique is based on sampling and can therefore adapt to both the algorithm used and the input instance. Our technique is generic in its applicability, as we demonstrate in this paper. We validate our technique on three problems: finding the connected components of a graph (CC), multiplying two unstructured sparse matrices (spmm), and multiplying two scale-free sparse matrices. For these problems, we show that our method finds a threshold that is under 10% away from the best possible thresholds.
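The sampling idea can be sketched as follows (a hedged illustration: `run_on_a` and `run_on_b` are hypothetical stand-ins for the two devices' kernels, and the grid of candidate splits is an assumption of this sketch): instead of an exhaustive empirical search over splits of the full input, time candidate splits on a small sample and keep the best.

```python
import time

def find_threshold(data, run_on_a, run_on_b, sample_frac=0.05, grid=None):
    """Pick a work-partitioning split fraction by timing candidate
    splits of a small sample on both (stand-in) devices."""
    sample = data[: max(1, int(len(data) * sample_frac))]
    if grid is None:
        grid = [i / 10 for i in range(11)]   # candidate split fractions
    best_frac, best_t = None, float("inf")
    for f in grid:
        cut = int(len(sample) * f)
        t0 = time.perf_counter()
        run_on_a(sample[:cut])
        t_a = time.perf_counter() - t0
        t0 = time.perf_counter()
        run_on_b(sample[cut:])
        t_b = time.perf_counter() - t0
        # The devices run concurrently, so the slower one dominates.
        t = max(t_a, t_b)
        if t < best_t:
            best_frac, best_t = f, t
    return best_frac
```

Because only a small sample is timed, the search cost stays negligible relative to the full computation, which is what keeps the technique from offsetting the heterogeneous speedup it is tuning.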