Narayanan Sundaram | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Narayanan Sundaram is active.

Explore More

Publication

Featured researches published by Narayanan Sundaram.

international conference on machine learning | 2008

Fast support vector machine training and classification on graphics processors

Bryan Catanzaro; Narayanan Sundaram; Kurt Keutzer

Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance implementations of machine learning algorithms. We describe a solver for Support Vector Machine training running on a GPU, using the Sequential Minimal Optimization algorithm and an adaptive first and second order working set selection heuristic, which achieves speedups of 9-35x over LIBSVM running on a traditional processor. We also present a GPU-based system for SVM classification which achieves speedups of 81-138x over LIBSVM (5-24x over our own CPU based SVM classifier).

european conference on computer vision | 2010

Dense point trajectories by GPU-accelerated large displacement optical flow

Narayanan Sundaram; Thomas Brox; Kurt Keutzer

Dense and accurate motion tracking is an important requirement for many video feature extraction algorithms. In this paper we provide a method for computing point trajectories based on a fast parallel implementation of a recent optical flow algorithm that tolerates fast motion. The parallel implementation of large displacement optical flow runs about 78× faster than the serial C++ version. This makes it practical to use in a variety of applications, among them point tracking. In the course of obtaining the fast implementation, we also proved that the fixed point matrix obtained in the optical flow technique is positive semi-definite. We compare the point tracking to the most commonly used motion tracker - the KLT tracker - on a number of sequences with ground truth motion. Our resulting technique tracks up to three orders of magnitude more points and is 46% more accurate than the KLT tracker. It also provides a tracking density of 48% and has an occlusion error of 3% compared to a density of 0.1% and occlusion error of 8% for the KLT tracker. Compared to the Particle Video tracker, we achieve 66% better accuracy while retaining the ability to handle large displacements while running an order of magnitude faster.

international conference on computer vision | 2009

Efficient, high-quality image contour detection

Bryan Catanzaro; Bor-Yiing Su; Narayanan Sundaram; Yunsup Lee; Mark Murphy; Kurt Keutzer

Image contour detection is fundamental to many image analysis applications, including image segmentation, object recognition and classification. However, highly accurate image contour detection algorithms are also very computationally intensive, which limits their applicability, even for offline batch processing. In this work, we examine efficient parallel algorithms for performing image contour detection, with particular attention paid to local image analysis as well as the generalized eigensolver used in Normalized Cuts. Combining these algorithms into a contour detector, along with careful implementation on highly parallel, commodity processors from Nvidia, our contour detector provides uncompromised contour accuracy, with an F-metric of 0.70 on the Berkeley Segmentation Dataset. Runtime is reduced from 4 minutes to 1.8 seconds. The efficiency gains we realize enable high-quality image contour detection on much larger images than previously practical, and the algorithms we propose are applicable to several image segmentation approaches. Efficient, scalable, yet highly accurate image contour detection will facilitate increased performance in many computer vision applications.

international conference on computer vision | 2011

Long term video segmentation through pixel level spectral clustering on GPUs

Narayanan Sundaram; Kurt Keutzer

We introduce a new technique for performing video segmentation combining the state-of-the-art image segmentation and optical flow algorithms on GPUs. We avoid pre-clustering into superpixels and probabilistic reasoning, and instead view the problem as a generalization of image segmentation techniques. Utilizing spectral clustering techniques at the pixel level (as opposed to 2D/3D superpixels), we demonstrate video segmentation over hundreds of frames - far beyond what has been achieved through pixel level spectral segmentation techniques before. Our algorithm achieves comparable accuracy as other sparse motion clustering techniques while still maintaining 100% density in segmentation over long time periods. We achieve better accuracy with lower oversegmentation compared to dense video segmentation techniques. We exploit increased computational power made available through parallelism in GPUs and efficient numerical algorithms to achieve these results. We show our results on the motion segmentation dataset [4]. Our technique can also be used to provide good quality 3D superpixels and extended to tasks where the ability to track 3D volumes over time is useful.

international parallel and distributed processing symposium | 2016

GraphPad: Optimized Graph Primitives for Parallel and Distributed Platforms

Michael J. Anderson; Narayanan Sundaram; Nadathur Satish; Md. Mostofa Ali Patwary; Theodore L. Willke; Pradeep Dubey

The duality between graphs and matrices means that many common graph analyses can be expressed with primitives such as generalized sparse matrix-vector multiplication (SpMSpV) and sparse matrix-matrix multiplication (SpGEMM). Achieving high performance on these primitives is challenging due to limited arithmetic intensity, irregular memory accesses, and significant network communication requirements in the distributed setting. In this paper we implement four graph applications using GraphPad, our optimized multinode implementations of generalized linear algebra primitives such as SpMSpV and SpGEMM. GraphPad is highly flexible to accommodate multiple data layouts, partitioning strategies, and incorporates communication optimizations. Our performance at scale can exceed that of CombBLAS by up to 40×. In addition to GraphPads performance in a distributed setting, it is also within 2× the performance of GraphMat, a high performance graph framework on a single node for four out of five benchmarks. We also show our communication optimizations and flexibility are critical for good performance on both HPC clusters and commodity cloud platforms.

ieee international conference on high performance computing, data, and analytics | 2009

Optimizing the use of GPU memory in applications with large data sets

Nadathur Satish; Narayanan Sundaram; Kurt Keutzer

With General Purpose programmable GPUs becoming more and more popular, automated tools are needed to bridge the gap between achievable performance from highly parallel architectures and the performance required in applications. In this paper, we concentrate on improving GPU memory management for applications with large and intermediate data sets that do not completely fit in GPU memory. For such applications, the movement of the extra data to CPU memory must be carefully managed. In particular, we focus on solving the joint task scheduling and data transfer scheduling problem posed in [1], and propose an algorithm that gives close to optimal results (as measured by running simulated annealing overnight) in terms of the amount of data transferred for image processing benchmarks such as edge detection and Convolutional Neural Networks. Our results enable a reduction of up to 30× in the amount of data transfers compared to an unoptimized implementation. They are up to 2× better than the methods previously proposed in [1] and less than 16% away from the optimal solution.

very large data bases | 2017

Bridging the gap between HPC and big data frameworks

Michael R. Anderson; Shaden Smith; Narayanan Sundaram; Mihai Capotă; Zheguang Zhao; Subramanya R. Dulloor; Nadathur Satish; Theodore L. Willke

Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower compared to native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1−17.7× speedups on the four sparse applications, including all of the overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.

international parallel and distributed processing symposium | 2016

PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures

Md. Mostofa Ali Patwary; Nadathur Satish; Narayanan Sundaram; Jialin Liu; Peter J. Sadowski; Evan Racah; Surendra Byna; Craig Tull; Wahid Bhimji; Prabhat; Pradeep Dubey

Computing k-Nearest Neighbors (KNN) is one of the core kernels used in many machine learning, data mining and scientific computing applications. Although kd-tree based O(log n) algorithms have been proposed for computing KNN, due to its inherent sequentiality, linear algorithms are being used in practice. This limits the applicability of such methods to millions of data points, with limited scalability for Big Data analytics challenges in the scientific domain. In this paper, we present parallel and highly optimized kd*tree based KNN algorithms (both construction and querying) suitable for distributed architectures. Our algorithm includes novel approaches for pruning search space and improving load balancing and partitioning among nodes and threads. Using TB-sized datasets from three science applications: astrophysics, plasma physics, and particle physics, we show that our implementation can construct kd-tree of 189 billion particles in 48 seconds on utilizing ~50,000 cores. We also demonstrate computation of KNN of 19 billion queries in 12 seconds. We demonstrate almost linear speedup both for shared and distributed memory computers. Our algorithms outperforms earlier implementations by more than order of magnitude, thereby radically improving the applicability of our implementation to state-of-the-art Big Data analytics problems.

Multiprocessor System-on-Chip | 2011

PALLAS: Mapping Applications onto Manycore

Michael J. Anderson; Bryan Catanzaro; Jike Chong; Ekaterina Gonina; Kurt Keutzer; Chao-Yue Lai; Mark Murphy; Bor-Yiing Su; Narayanan Sundaram

Parallel programming using the current state-of-the-art in software engineering techniques is hard. Expertise in parallel programming is necessary to deliver good performance in applications; however, it is very common that domain experts lack the requisite expertise in parallel programming. In order to drive the computer science research toward effectively using the available parallel hardware platforms, it is very important to make parallel programming systematical and productive. We believe that the key to designing parallel programs in a systematical way is software architecture, and the key to improve the productivity of developing parallel programs is software frameworks. The basis of both is design patterns and a pattern language.

Archive | 2010