
Kishore Kothapalli

International Institute of Information Technology, Hyderabad



Publication

Featured research published by Kishore Kothapalli.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

A performance prediction model for the CUDA GPGPU platform

Kishore Kothapalli; Rishabh Mukherjee; M. Suhail Rehman; Suryakant Patidar; P. J. Narayanan; Kannan Srinathan

The significant growth in the computational power of modern Graphics Processing Units (GPUs), coupled with the advent of general-purpose programming environments like NVIDIA's CUDA, has seen GPUs emerge as a very popular parallel computing platform. Until recently, there has not been a performance model for GPGPUs. The absence of such a model makes it difficult to definitively assess the suitability of the GPU for solving a particular problem and is a significant impediment to the mainstream adoption of GPUs as a massively parallel (super)computing platform. In this paper, we present a performance prediction model for the CUDA GPGPU platform. This model encompasses the various facets of the GPU architecture, such as scheduling, memory hierarchy, and pipelining, among others. We also perform experiments that demonstrate the effects of various memory access strategies. The proposed model can be used to analyze pseudocode for a CUDA kernel to obtain a performance estimate, in a way that is similar to performing asymptotic analysis. We illustrate the usage of our model and its accuracy with three case studies: matrix multiplication, list ranking, and histogram generation.

ACM Symposium on Parallel Algorithms and Architectures | 2004

Pagoda: a dynamic overlay network for routing, data management, and multicasting

Ankur Bhargava; Kishore Kothapalli; Chris Riley; Christian Scheideler; Mark Thober

The tremendous growth of public interest in peer-to-peer systems in recent years has initiated a lot of research work on how to design efficient and robust overlay networks for these systems. While a large collection of scalable peer-to-peer overlay networks has been proposed in recent years, many fundamental questions have remained open. Some of these are: Is it possible to design deterministic peer-to-peer overlay networks with properties comparable to randomized peer-to-peer systems? How can peers of non-uniform bandwidth be organized in an overlay network? We propose a dynamic overlay network called Pagoda that provides solutions to both of these problems. The Pagoda network has a constant degree, a logarithmic diameter, and a 1/logarithmic expansion, and therefore matches the properties of the best randomized overlay networks known so far. However, in contrast to these networks, the Pagoda is deterministic and therefore guarantees these properties. The Pagoda can be used to organize both nodes with uniform bandwidth and nodes with non-uniform bandwidth. For nodes with uniform bandwidth, any node insertion or deletion can be executed with logarithmic work, and for nodes with non-uniform bandwidth, any node insertion and deletion can be executed with polylogarithmic work. Moreover, the Pagoda overlay network can route arbitrary multicast problems with a congestion that is within a logarithmic factor of what a best possible overlay network of logarithmic degree for that particular multicast problem can achieve, even though the Pagoda is a constant degree network. This holds even for nodes of arbitrary non-uniform bandwidths. We also show that the Pagoda network can be used for efficient data management.

ACM Symposium on Parallel Algorithms and Architectures | 2005

Constant density spanners for wireless ad-hoc networks

Kishore Kothapalli; Christian Scheideler; Melih Onus; Andréa W. Richa

An important problem for wireless ad hoc networks has been to design overlay networks that allow time- and energy-efficient routing. Many local-control strategies for maintaining such overlay networks have already been suggested, but most of them are based on an oversimplified wireless communication model. In this paper, we suggest a model that is much more general than previous models. It allows the path loss of transmissions to significantly deviate from the idealistic unit disk model and does not even require the path loss to form a metric. Also, our model is apparently the first proposed for algorithm design that does not only model transmission and interference issues but also aims at providing a realistic model for physical carrier sensing. Physical carrier sensing is needed so that our protocols do not require any prior information (not even an estimate on the number of nodes) about the wireless network to run efficiently. Based on this model, we propose a local-control protocol for establishing a constant density spanner among a set of mobile stations (or nodes) that are distributed in an arbitrary way in a 2-dimensional Euclidean space. More precisely, we establish a backbone structure by efficiently electing cluster leaders and gateway nodes so that there is only a constant number of cluster leaders and gateway nodes within the transmission range of any node and the backbone structure satisfies the properties of a topological spanner. Our protocol has the advantage that it is locally self-stabilizing, i.e., it can recover from any initial configuration, even if adversarial nodes participate in it, as long as the honest nodes sufficiently far away from adversarial nodes can in principle form a single connected component. Furthermore, we only need constant size messages and a constant amount of storage at the nodes, irrespective of the distribution of the nodes. Hence, our protocols would even work in extreme situations such as very simple wireless devices (like sensors) in a hostile environment.

International Conference on Supercomputing | 2009

Fast and scalable list ranking on the GPU

M. Suhail Rehman; Kishore Kothapalli; P. J. Narayanan

General purpose programming on graphics processing units (GPGPU) has received a lot of attention in the parallel computing community as it promises to offer the highest performance per dollar. GPUs have been used extensively on regular problems that can be easily parallelized. In this paper, we describe two implementations of List Ranking, a traditional irregular algorithm that is difficult to parallelize on such massively multi-threaded hardware. We first present an implementation of Wyllie's algorithm based on pointer jumping. This technique does not scale well to large lists due to the suboptimal work done. We then present a GPU-optimized, Recursive Helman-JaJa (RHJ) algorithm. Our RHJ implementation can rank a random list of 32 million elements in about a second and achieves a speedup of about 8-9 over a CPU implementation as well as a speedup of 3-4 over the best reported implementation on the Cell Broadband Engine. We also discuss the practical issues relating to the implementation of irregular algorithms on massively multi-threaded architectures like that of the GPU. Regular or coalesced memory access patterns and balanced load are critical to achieving good performance on the GPU.
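
To make the pointer-jumping idea concrete, here is a minimal sequential sketch in Python. It is not the paper's GPU implementation: the function name `wyllie_list_rank` and the convention that the tail points to itself are assumptions made for illustration. Each round is written as a whole-array update to mimic one synchronous parallel step.

```python
def wyllie_list_rank(succ):
    """Rank the nodes of a linked list by pointer jumping.

    succ[i] is the index of the node that follows node i; the tail is
    assumed to point to itself (an illustrative convention).
    Returns rank[i] = number of links between node i and the tail.
    """
    n = len(succ)
    nxt = list(succ)
    # Every node except the tail starts one link away from its successor.
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # Each round doubles the distance a pointer spans, so about log2(n)
    # rounds are enough for every pointer to reach the tail.
    rounds = max(1, (n - 1).bit_length())
    for _ in range(rounds):
        # Read only the old arrays and write new ones, as a synchronous
        # parallel step would.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank


# The list 3 -> 0 -> 2 -> 1, where node 1 is the tail.
print(wyllie_list_rank([2, 1, 1, 0]))
```

Every round touches all n nodes, so the total work is on the order of n log n rather than the n of a sequential traversal, which is the suboptimal work the abstract refers to.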

International Symposium on Parallel Architectures, Algorithms and Networks | 2005

Efficient broadcasting and gathering in wireless ad-hoc networks

Melih Onus; Andréa W. Richa; Kishore Kothapalli; Christian Scheideler

This paper considers the problem of broadcasting and information gathering in wireless ad-hoc networks, i.e., in wireless networks without any infrastructure in addition to the mobile hosts. Broadcasting is the problem of sending a packet from a source node in the network to all other nodes in the network. Information gathering is the problem of sending one packet each from a subset of the nodes to a single sink node in the network. Most of the proposed theoretical wireless network models oversimplify wireless communication properties. We use a model that takes into account that nodes have different transmission and interference ranges, and we propose algorithms in this model that achieve high time and work efficiency. We present algorithms for broadcasting a single message or multiple messages, and for information gathering. Our algorithms have the advantage that they are very simple and self-stabilizing, and would therefore even work in a dynamic environment. Also, our algorithms require only a constant amount of storage at any host. Thus, our algorithms can be used in wireless systems with very simple devices, such as sensors.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Work efficient parallel algorithms for large graph exploration

Dip Sankar Banerjee; Shashank Sharma; Kishore Kothapalli

Graph algorithms play a prominent role in several fields of science and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of the above problems on a variety of modern multiprocessor architectures. In recent times, the size of the graphs that correspond to real-world data sets has been increasing. Parallelism offers only limited succor in this situation, as current parallel architectures have severe shortcomings when deployed for most graph algorithms. At the same time, these graphs are also becoming very sparse. This calls for work-efficient solutions aimed at processing large, sparse graphs on modern parallel architectures. In this paper, we introduce graph pruning as a technique that aims to reduce the size of the graph. Certain elements of the graph can be pruned depending on the nature of the computation. Once a solution is obtained for the pruned graph, the solution is extended to the entire graph. We apply the above technique to three fundamental graph algorithms: Breadth First Search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP). To validate our technique, we implement our algorithms on a heterogeneous platform consisting of a multicore CPU and a GPU. On this platform, we achieve an average improvement of 35% compared to state-of-the-art solutions. Such an improvement has the potential to speed up other applications that rely on these algorithms.
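
The prune, solve, extend pattern can be sketched for connected components as follows. The abstract does not say which elements are pruned, so this Python sketch assumes that degree-one vertices are removed repeatedly. That choice, the function name `connected_components_with_pruning`, and the adjacency-dict input are all illustrative assumptions rather than the paper's method.

```python
from collections import deque


def connected_components_with_pruning(adj):
    """adj maps each vertex to the set of its neighbours (undirected graph).

    Returns a dict mapping every vertex to a component label.
    Assumption for illustration: pruning removes degree-one vertices.
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    parent = {}   # pruned vertex -> the neighbour it was attached to
    order = []    # pruned vertices in order of removal
    queue = deque(v for v, d in degree.items() if d == 1)
    # Step 1: prune degree-one vertices until none remain.
    while queue:
        v = queue.popleft()
        if v in parent or degree[v] != 1:
            continue
        p = next(u for u in adj[v] if u not in parent)
        parent[v] = p
        order.append(v)
        degree[v] = 0
        degree[p] -= 1
        if degree[p] == 1:
            queue.append(p)
    # Step 2: solve the problem on the smaller, pruned graph with BFS.
    label = {}
    for s in adj:
        if s in parent or s in label:
            continue
        label[s] = s
        frontier = deque([s])
        while frontier:
            u = frontier.popleft()
            for w in adj[u]:
                if w not in parent and w not in label:
                    label[w] = s
                    frontier.append(w)
    # Step 3: extend the solution. A pruned vertex's parent was removed
    # later or never, so reverse removal order labels parents first.
    for v in reversed(order):
        label[v] = label[parent[v]]
    return label
```

On a graph with many tree-like fringes, step 2 runs on far fewer vertices than the input, and step 3 costs one lookup per pruned vertex.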

Parallel Processing Letters | 2010

Some GPU algorithms for graph connected components and spanning tree

Jyothish Soman; Kishore Kothapalli; P. J. Narayanan

Graphics Processing Units (GPUs) are application-specific accelerators that provide a high performance-to-cost ratio and are widely available and used, which makes them a ubiquitous accelerator. A computing paradigm based on them is the general purpose computing on the GPU (GPGPU) model. Due to its graphics lineage, the GPU is better suited to data-parallel, data-regular algorithms. Its hardware architecture is not well suited to algorithms that are data-parallel but data-irregular, such as graph connected components and list ranking. In this paper, we present results that show how to use GPUs efficiently for graph algorithms that are known to have irregular data access patterns. We consider two fundamental graph problems: finding the connected components and finding a spanning tree. These two problems find applications in several graph-theoretical problems. We arrive at efficient GPU implementations for these two problems. The algorithms focus on minimising irregularity at both the algorithmic and the implementation level. Our implementation achieves a speedup of 11-16 times over a corresponding best sequential implementation.

International Conference of Distributed Computing and Networking | 2013

On the Analysis of a Label Propagation Algorithm for Community Detection

Kishore Kothapalli; Sriram V. Pemmaraju; Vivek B. Sardeshmukh

This paper initiates formal analysis of a simple, distributed algorithm for community detection on networks. We analyze an algorithm that we call Max-LPA, both in terms of its convergence time and in terms of the “quality” of the communities detected. Max-LPA is an instance of a class of community detection algorithms called label propagation algorithms. As far as we know, most analysis of label propagation algorithms thus far has been empirical in nature, and in this paper we seek a theoretical understanding of label propagation algorithms. In our main result, we define a clustered version of Erdős-Rényi random graphs with clusters \(V_1, V_2, \ldots, V_k\) where the probability \(p\) of an edge connecting nodes within a cluster \(V_i\) is higher than \(p'\), the probability of an edge connecting nodes in distinct clusters. We show that even with fairly general restrictions on \(p\) and \(p'\) (\(p = \Omega\left(\frac{1}{n^{1/4-\epsilon}}\right)\) for any \(\epsilon > 0\), \(p' = O(p^2)\), where \(n\) is the number of nodes), Max-LPA detects the clusters \(V_1, V_2, \ldots, V_k\) in just two rounds. Based on this and on empirical results, we conjecture that Max-LPA can correctly and quickly identify communities on clustered Erdős-Rényi graphs even when the clusters are much sparser, i.e., with \(p = \frac{c\log n}{n}\) for some \(c > 1\).
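
The abstract does not spell out the update rule, so the following Python sketch rests on assumptions: every node starts with a unique label, all nodes update synchronously, each node adopts the most frequent label in its closed neighbourhood (its neighbours plus itself), and ties go to the larger label. The name `max_lpa` and the `max_rounds` cap are hypothetical.

```python
from collections import Counter


def max_lpa(adj, max_rounds=100):
    """Synchronous label propagation with ties broken by the larger label.

    adj maps each node to the set of its neighbours. The update rule is
    an assumption for illustration; the abstract does not define it.
    """
    labels = {v: v for v in adj}  # every node starts with a unique label
    for _ in range(max_rounds):
        new_labels = {}
        for v, nbrs in adj.items():
            counts = Counter(labels[u] for u in nbrs)
            counts[labels[v]] += 1  # closed neighbourhood includes v itself
            best = max(counts.values())
            # Among the most frequent labels, keep the largest one.
            new_labels[v] = max(l for l, c in counts.items() if c == best)
        if new_labels == labels:
            break  # no label changed, so the process has converged
        labels = new_labels
    return labels


# Two triangles joined by the single edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(max_lpa(graph))
```

In this small example each triangle ends up sharing one label, which is the behaviour the abstract's two-round result describes for much larger random graphs. The `max_rounds` cap is only a guard, since synchronous label propagation is not guaranteed to settle on every input.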

IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Sparse matrix-matrix multiplication on modern architectures

Kiran Kumar Matam; Siva Rama Krishna Bharadwaj Indarapu; Kishore Kothapalli

Sparse matrix-sparse matrix and sparse matrix-dense matrix multiplications, spgemm and csrmm respectively, find usage in various matrix formulations of graph problems, among other applications. Considering the difficulties in executing graph problems and the duality between graphs and matrices, computations such as spgemm and csrmm have recently caught the attention of the HPC community. These computations pose challenges such as load balancing, the irregular nature of the computation, and difficulty in predicting the output size. They are even more challenging when combined with GPU architectural constraints such as memory accesses, limited shared memory, and strict SIMD thread execution. To address these challenges on a GPU, we evaluate three possible variations of matrix multiplication (Row-Column, Column-Row, Row-Row) and perform suitable optimizations targeted at sparse matrices. Our experiments indicate that the Row-Row formulation, which mostly outperforms the other formulations, is 3.5x faster on average compared to an optimized multi-core implementation in the Intel MKL library. We extend the Row-Row formulation to a CPU+GPU hybrid algorithm that also utilizes the CPU simultaneously. In this direction, we present heuristics to find the right amount of work division between the CPU and the GPU. Our hybrid Row-Row formulation of the spgemm operation performs 5.5x faster on average when compared to the optimized multi-core implementation in the Intel MKL library. Our experience indicates that it is difficult to identify the right amount of work division between the CPU and the GPU. We therefore investigate a subclass of sparse matrices, band matrices, and present an analytical method to identify a good work division when multiplying two band matrices. Our GPU csrmm operation performs 2.5x faster on average when compared to a corresponding implementation in the cusparse library, which outperforms the Intel MKL library implementation.
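
As a rough illustration of what a Row-Row formulation computes, the sketch below forms each row of C = A·B as a weighted sum of rows of B, one term per nonzero of the corresponding row of A. It is plain sequential Python over a dict-of-dicts layout, and both the layout and the name `spgemm_row_row` are assumptions for illustration. It does not reproduce the paper's GPU kernels or optimizations.

```python
def spgemm_row_row(A, B):
    """Multiply two sparse matrices row by row.

    A and B map a row index to a dict of {column index: nonzero value}.
    Row i of the result is the sum over nonzeros A[i][k] of
    A[i][k] * (row k of B).
    """
    C = {}
    for i, a_row in A.items():
        acc = {}  # sparse accumulator for row i of C
        for k, a_ik in a_row.items():
            # Scale row k of B by A[i][k] and add it into the accumulator.
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C


A = {0: {0: 1, 2: 2}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5, 2: 6}, 2: {1: 7}}
print(spgemm_row_row(A, B))
```

The accumulator grows as products arrive, which shows why the output size is hard to predict in advance: the number of nonzeros in a row of C depends on how the column patterns of the selected rows of B overlap.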

IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Hybrid algorithms for list ranking and graph connected components

Dip Sankar Banerjee; Kishore Kothapalli

The advent of multicore and many-core architectures saw them being deployed to speed up computations across several disciplines and application areas. Prominent examples include semi-numerical algorithms such as sorting, graph algorithms, image processing, scientific computations, and the like. In particular, using GPUs for general purpose computations has attracted a lot of attention given that GPUs can deliver more than one TFLOP of computing power at very low prices. In this work, we use a new model of multicore computing called hybrid multicore computing, where the computation is performed simultaneously on a control device, such as a CPU, and an accelerator, such as a GPU. To this end, we use two case studies to explore the algorithmic and analytical issues in hybrid multicore computing. Our case studies involve two different ways of designing hybrid multicore algorithms. The main contribution of this paper is to address the issues related to the design of hybrid solutions. We show that our hybrid algorithm for list ranking is faster by 50% compared to the best known implementation [Z. Wei, J. JaJa; IPDPS 2010]. Similarly, our hybrid algorithm for graph connected components is faster by 25% compared to the best known GPU implementation [26].
