Mahantesh Halappanavar

Publication

Featured research published by Mahantesh Halappanavar.

parallel computing | 2012

Graph coloring algorithms for multi-core and massively multithreaded architectures

Ümit V. Çatalyürek; John Feo; Assefaw Hadish Gebremedhin; Mahantesh Halappanavar; Alex Pothen

We explore the interplay between architectures and algorithm design in the context of shared-memory platforms and a specific graph problem of central importance in scientific and high-performance computing, distance-1 graph coloring. We introduce two different kinds of multithreaded heuristic algorithms for the stated, NP-hard, problem. The first algorithm relies on speculation and iteration, and is suitable for any shared-memory system. The second algorithm uses dataflow principles, and is targeted at the non-conventional, massively multithreaded Cray XMT system. We study the performance of the algorithms on the Cray XMT and two multi-core systems, Sun Niagara 2 and Intel Nehalem. Together, the three systems represent a spectrum of multithreading capabilities and memory structure. As a testbed, we use synthetically generated large-scale graphs carefully chosen to cover a wide range of input types. The results show that the algorithms have scalable runtime performance and use nearly the same number of colors as the underlying serial algorithm, which in turn is effective in practice. The study provides insight into the design of high performance algorithms for irregular problems on many-core architectures.
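The speculation-and-iteration idea can be illustrated with a small serial simulation. This is a sketch under stated assumptions, not the paper's implementation: real threads are replaced by chunks of a worklist, and each chunk sees only the colors fixed before the round plus its own writes, which mimics threads that cannot observe each other's concurrent updates. After each round, conflicting edges are detected and one endpoint (here, the larger id, an assumed tie-break) is recolored in the next round. The names `first_fit`, `speculative_coloring`, and `num_threads` are hypothetical.

```python
from typing import Dict, List


def first_fit(v: int, adj: Dict[int, List[int]], colors: Dict[int, int]) -> int:
    """Return the smallest color not used by an already-colored neighbor of v."""
    used = {colors[u] for u in adj[v] if u in colors}
    c = 0
    while c in used:
        c += 1
    return c


def speculative_coloring(adj: Dict[int, List[int]], num_threads: int = 4) -> Dict[int, int]:
    """Serial simulation of speculative, iterative distance-1 coloring.

    adj must be symmetric and list every vertex as a key (no self-loops).
    Each round splits the worklist into num_threads chunks; a chunk sees the
    colors fixed before the round plus its own writes.
    """
    colors: Dict[int, int] = {}
    worklist = sorted(adj)
    while worklist:
        snapshot = dict(colors)  # state visible to every "thread" this round
        chunk = -(-len(worklist) // num_threads)  # ceiling division
        for start in range(0, len(worklist), chunk):
            local = dict(snapshot)  # a thread does not see other chunks' writes
            for v in worklist[start:start + chunk]:
                local[v] = first_fit(v, adj, local)  # speculative color
                colors[v] = local[v]
        # Conflict detection: on a monochromatic edge, the larger id is recolored.
        conflicts = [v for v in worklist
                     if any(u < v and colors[u] == colors[v] for u in adj[v])]
        for v in conflicts:
            del colors[v]
        worklist = conflicts  # the smallest-id vertex always survives, so this shrinks
    return colors
```

With `num_threads=1` there is a single chunk and no conflicts, so the result equals serial first-fit greedy coloring; larger values trade extra recoloring rounds for concurrency.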

international parallel and distributed processing symposium | 2012

Multithreaded Algorithms for Maximum Matching in Bipartite Graphs

Ariful Azad; Mahantesh Halappanavar; Sivasankaran Rajamanickam; Erik G. Boman; Arif M. Khan; Alex Pothen

We design, implement, and evaluate algorithms for computing a matching of maximum cardinality in a bipartite graph on multicore and massively multithreaded computers. As computers with larger numbers of slower cores dominate the commodity processor market, the design of multithreaded algorithms to solve large matching problems becomes a necessity. Recent work on serial algorithms for the matching problem has shown that their performance is sensitive to the order in which the vertices are processed for matching. In a multithreaded environment, imposing a serial order in which vertices are considered for matching would lead to loss of concurrency and performance. But this raises the question: Would parallel matching algorithms on multithreaded machines improve performance over a serial algorithm? We answer this question in the affirmative. We report efficient multithreaded implementations of three classes of algorithms based on their manner of searching for augmenting paths: breadth-first-search, depth-first-search, and a combination of both. The Karp-Sipser initialization algorithm is used to make the parallel algorithms practical. We report extensive results and insights using three shared-memory platforms (a 48-core AMD Opteron, a 32-core Intel Nehalem, and a 128-processor Cray XMT) on a representative set of real-world and synthetic graphs. To the best of our knowledge, this is the first study of augmentation-based parallel algorithms for bipartite cardinality matching that demonstrates good speedups on multithreaded shared memory multiprocessors.
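The two ingredients named above, Karp-Sipser initialization followed by augmenting-path search, can be sketched serially. This is only an illustration of the techniques, not the multithreaded implementations the paper reports: it uses a simple quadratic Karp-Sipser (match a free vertex that has exactly one free neighbor; otherwise match any remaining free edge) and then a depth-first augmenting-path search from each still-unmatched left vertex. The function names and the list-of-lists graph layout are assumptions.

```python
from typing import List, Set, Tuple


def karp_sipser(left_adj, right_adj, mate_l, mate_r) -> None:
    """Greedy initialization that prefers vertices with exactly one free neighbor."""
    while True:
        pick = None
        # Rule 1: a free vertex with a single free neighbor is matched to it.
        for u, nbrs in enumerate(left_adj):
            free = [v for v in nbrs if mate_r[v] == -1]
            if mate_l[u] == -1 and len(free) == 1:
                pick = (u, free[0])
                break
        if pick is None:
            for v, nbrs in enumerate(right_adj):
                free = [u for u in nbrs if mate_l[u] == -1]
                if mate_r[v] == -1 and len(free) == 1:
                    pick = (free[0], v)
                    break
        # Rule 2: otherwise match any remaining edge between two free vertices.
        if pick is None:
            pick = next(((u, v) for u, nbrs in enumerate(left_adj) if mate_l[u] == -1
                         for v in nbrs if mate_r[v] == -1), None)
        if pick is None:
            return
        u, v = pick
        mate_l[u], mate_r[v] = v, u


def augment(u: int, left_adj, mate_l, mate_r, visited: Set[int]) -> bool:
    """DFS for an augmenting path from free left vertex u; flips it if found."""
    for v in left_adj[u]:
        if v in visited:
            continue
        visited.add(v)
        if mate_r[v] == -1 or augment(mate_r[v], left_adj, mate_l, mate_r, visited):
            mate_l[u], mate_r[v] = v, u
            return True
    return False


def maximum_matching(left_adj: List[List[int]], n_right: int) -> Tuple[int, List[int], List[int]]:
    right_adj: List[List[int]] = [[] for _ in range(n_right)]
    for u, nbrs in enumerate(left_adj):
        for v in nbrs:
            right_adj[v].append(u)
    mate_l = [-1] * len(left_adj)
    mate_r = [-1] * n_right
    karp_sipser(left_adj, right_adj, mate_l, mate_r)
    # One DFS per unmatched left vertex suffices (Berge / Kuhn argument).
    for u in range(len(left_adj)):
        if mate_l[u] == -1:
            augment(u, left_adj, mate_l, mate_r, set())
    return sum(m != -1 for m in mate_l), mate_l, mate_r
```

The initialization often matches most vertices cheaply, so the more expensive augmenting-path phase only has to repair what is left.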

ieee international conference on high performance computing data and analytics | 2012

A multithreaded algorithm for network alignment via approximate matching

Arif M. Khan; David F. Gleich; Alex Pothen; Mahantesh Halappanavar

Network alignment is an optimization problem to find the best one-to-one map between the vertices of a pair of graphs that overlaps as many edges as possible. It is a relaxation of the graph isomorphism problem and is closely related to the subgraph isomorphism problem. The best current approaches are entirely heuristic and iterative in nature. They generate real-valued heuristic weights that must be rounded to find integer solutions. This rounding requires solving a bipartite maximum weight matching problem at each iteration in order to avoid missing high quality solutions. We investigate substituting a parallel, half-approximation for maximum weight matching instead of an exact computation. Our experiments show that the resulting difference in solution quality is negligible. We demonstrate almost a 20-fold speedup using 40 threads on an 8-processor Intel Xeon E7-8870 system and now solve real-world problems in 36 seconds instead of 10 minutes.
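To make "half-approximation for maximum weight matching" concrete, here is a serial greedy variant: scan the bipartite edges (vertex of graph A, vertex of graph B, heuristic weight) by decreasing weight and keep an edge whenever both endpoints are still free. Its weight is at least half the optimum. The abstract does not say which parallel half-approximation the authors used, so this greedy scan is a stand-in chosen for clarity; `greedy_half_matching` and the edge-tuple layout are hypothetical.

```python
from typing import Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (vertex in graph A, vertex in graph B, weight)


def greedy_half_matching(edges: List[Edge]) -> List[Edge]:
    """Keep the heaviest remaining edge whose endpoints are both free.

    The resulting matching weighs at least half of a maximum weight matching.
    """
    used = set()  # vertices are tagged by side so the two graphs' ids cannot clash
    matching: List[Edge] = []
    for a, b, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if w <= 0:
            break  # non-positive edges cannot increase the total weight
        if ("A", a) in used or ("B", b) in used:
            continue
        used.update({("A", a), ("B", b)})
        matching.append((a, b, w))
    return matching
```

For the edges `a1-b1` (3), `a1-b2` (2), and `a2-b1` (2), the exact optimum is 4 (`a1-b2` plus `a2-b1`) while the greedy scan keeps only `a1-b1` with weight 3, within the factor of two. Per the abstract, the resulting difference in alignment quality was negligible in the authors' experiments.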

ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011

Distributed-Memory Parallel Algorithms for Matching and Coloring

Florin Dobrian; Assefaw Hadish Gebremedhin; Mahantesh Halappanavar; Alex Pothen

We discuss the design and implementation of new highly-scalable distributed-memory parallel algorithms for two prototypical graph problems, edge-weighted matching and distance-1 vertex coloring. Graph algorithms in general have low concurrency, poor data locality, and a high ratio of data access to computation costs, making it challenging to achieve scalability on massively parallel machines. We overcome this challenge by employing a variety of techniques, including speculation and iteration, optimized communication, and randomization. We present preliminary results on weak and strong scalability studies conducted on an IBM Blue Gene/P machine employing up to tens of thousands of processors. The results show that the algorithms hold strong potential for computing at petascale.

international parallel and distributed processing symposium | 2014

Parallel Heuristics for Scalable Community Detection

Hao Lu; Mahantesh Halappanavar; Ananth Kalyanaraman; Sutanay Choudhury

Community detection has become a fundamental operation in numerous graph-theoretic applications. It is used to reveal natural divisions that exist within real world networks without imposing prior size or cardinality constraints on the set of communities. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is an iterative heuristic for modularity optimization. Originally developed by Blondel et al. in 2008, the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. However, the method is also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them over real world graphs derived from multiple application domains (e.g., internet, citation, biological). Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in a comparable number of iterations, while providing real speedups of up to 8× using 32 threads. In addition, our parallel implementation was able to exhibit weak scaling properties on up to 32 threads.
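The serial template mentioned above can be sketched for an undirected, unweighted graph without self-loops. Modularity is $Q = \sum_c \left[ L_c/m - (D_c/2m)^2 \right]$, where $m$ is the number of edges, $L_c$ the number of edges inside community $c$, and $D_c$ the degree sum of $c$. Moving an isolated vertex $i$ of degree $k_i$ into community $c$ changes $Q$ by $k_{i,c}/m - D_c k_i/(2m^2)$, where $k_{i,c}$ counts edges from $i$ into $c$. The sketch below implements only the local-moving phase of Louvain with that gain; it omits the graph-coarsening phase and does not reproduce the paper's parallel heuristics. Function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def modularity(adj: Dict[int, List[int]], comm: Dict[int, int]) -> float:
    """Q = sum over communities c of L_c / m - (D_c / 2m)^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    intra: Dict[int, float] = defaultdict(float)
    degree: Dict[int, float] = defaultdict(float)
    for u, nbrs in adj.items():
        degree[comm[u]] += len(nbrs)
        for v in nbrs:
            if comm[u] == comm[v]:
                intra[comm[u]] += 0.5  # each internal edge is seen from both ends
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def louvain_local_moving(adj: Dict[int, List[int]], max_sweeps: int = 100) -> Dict[int, int]:
    """Louvain phase 1: sequentially move each vertex to its best neighboring community."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    comm = {u: u for u in adj}                        # start from singletons
    tot = {u: float(len(adj[u])) for u in adj}        # D_c for each community
    for _ in range(max_sweeps):
        moved = False
        for i in adj:
            k_i = len(adj[i])
            old = comm[i]
            tot[old] -= k_i                           # take i out of its community
            links: Dict[int, int] = defaultdict(int)  # k_{i,c} for neighboring c
            for j in adj[i]:
                links[comm[j]] += 1
            best = old
            best_gain = links[old] / m - tot[old] * k_i / (2 * m * m)
            for c, k_ic in links.items():
                gain = k_ic / m - tot[c] * k_i / (2 * m * m)
                if gain > best_gain:                  # strict: ties keep i where it was
                    best, best_gain = c, gain
            comm[i] = best
            tot[best] += k_i
            moved = moved or best != old
        if not moved:
            break
    return comm
```

Each move uses the community state left by the previous move, which is the sequential dependence the abstract identifies as the barrier to parallelization.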

international parallel and distributed processing symposium | 2014

New Effective Multithreaded Matching Algorithms

Fredrik Manne; Mahantesh Halappanavar

Matching is an important combinatorial problem with a number of applications in areas such as community detection, sparse linear algebra, and network alignment. Since computing optimal matchings can be very time consuming, several fast approximation algorithms, both sequential and parallel, have been suggested. Common to the algorithms giving the best solutions is that they tend to be sequential by nature, while algorithms more suitable for parallel computation give solutions of lower quality. We present a new simple 1/2-approximation algorithm for the weighted matching problem. This algorithm is both faster than any other suggested sequential 1/2-approximation algorithm on almost all inputs and when parallelized also scales better than previous multithreaded algorithms. We further extend this to a general scalable multithreaded algorithm that computes matchings of weight comparable with the best sequential deterministic algorithms. The performance of the suggested algorithms is documented through extensive experiments on different multithreaded architectures.
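The abstract does not spell out the new 1/2-approximation, so the following is an assumption for illustration: a proposal-based ("suitor") scheme of the kind associated with this line of work. Each vertex proposes to its heaviest neighbor that values it more than its current suitor; a displaced suitor then searches again, and mutual proposals form the matching. Treat it as a serial sketch of the proposal idea, not the paper's exact pseudocode or its multithreaded variant. The names `suitor`, `ws`, and `suitor_matching` are hypothetical; weights are assumed positive and the graph undirected.

```python
from typing import List, Tuple


def suitor_matching(n: int, adj: List[List[Tuple[int, float]]]) -> List[Tuple[int, int]]:
    """Proposal-based approximate maximum weight matching (serial sketch).

    adj[u] lists (v, w) pairs; each undirected edge appears in both lists.
    """
    suitor = [-1] * n      # suitor[v]: vertex currently proposing to v
    ws = [0.0] * n         # ws[v]: weight of the edge to suitor[v]
    for u in range(n):
        current = u
        while current != -1:
            # Heaviest neighbor that would prefer `current` over its present suitor.
            partner, heaviest = -1, 0.0
            for v, w in adj[current]:
                if w > heaviest and w > ws[v]:
                    partner, heaviest = v, w
            displaced = -1
            if partner != -1:
                displaced = suitor[partner]
                suitor[partner] = current
                ws[partner] = heaviest
            current = displaced  # a displaced suitor must look for a new partner
    # Mutual proposals form the matching.
    return [(u, suitor[u]) for u in range(n) if suitor[u] > u and suitor[suitor[u]] == u]
```

Because `ws[partner]` strictly increases with every accepted proposal, the loop terminates, and no global sort of the edges is needed, which is what makes this style of algorithm attractive to parallelize.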

ieee international conference on technologies for homeland security | 2013

Towards a theory of autonomous reconstitution of compromised cyber-systems

Pradeep Ramuhalli; Mahantesh Halappanavar; Jamie B. Coble; Mukul Dixit

Effective reconstitution approaches for cyber systems are needed to keep critical infrastructure operational in the face of an intelligent adversary. The reconstitution response, including recovery and adaptation, may require significant reconfiguration of the system at all levels to render the cyber-system resilient to ongoing and future attacks or faults while maintaining continuity of operations. A theoretical basis for optimal dynamic reconstitution is needed to address the challenge of ensuring that dynamic reconstitution is optimal with respect to resilience metrics, and is being developed and evaluated in this project. Such a framework provides the technical basis for evaluating cyber-defense and reconstitution approaches. This paper describes a preliminary framework that may be used to develop and evaluate concepts for effective autonomous reconstitution of compromised cyber systems.

ieee international conference on high performance computing, data, and analytics | 2014

Scaling graph community detection on the Tilera many-core architecture

Daniel G. Chavarría-Miranda; Mahantesh Halappanavar; Anantharaman Kalyanaraman

In an era when power constraints and data movement are proving to be significant barriers for the application of high-end computing, the Tilera many-core architecture offers a low-power platform exhibiting many important characteristics of future systems, including a large number of simple cores, a sophisticated network-on-chip, and fine-grained control over memory and caching policies. While this emerging architecture has been previously studied for structured compute-intensive kernels, benchmarking the platform for data-bound, irregular applications presents significant challenges that have remained unexplored. Community detection is an advanced prototypical graph-theoretic operation with applications in numerous scientific domains including life sciences, cyber security, and power systems. In this work, we explore multiple design strategies toward developing a scalable tool for community detection on the Tilera platform. Using several memory layout and work scheduling techniques we demonstrate speedups of up to 47× on 36 cores of the Tilera TileGX36 platform over the best serial implementation, and also show results that have comparable quality and performance to mainstream x86 platforms. To the best of our knowledge this is the first work addressing graph algorithms on the Tilera platform. This study demonstrates that through careful design space exploration, low-power many-core platforms like Tilera can be effectively exploited for graph algorithms that embody all the essential characteristics of an irregular application.

international parallel and distributed processing symposium | 2015

Balanced Coloring for Parallel Computing Applications

Hao Lu; Mahantesh Halappanavar; Daniel G. Chavarría-Miranda; Assefaw Hadish Gebremedhin; Ananth Kalyanaraman

Graph coloring is used to identify subsets of independent tasks in parallel scientific computing applications. Traditional coloring heuristics aim to reduce the number of colors used, as that number also corresponds to the number of parallel steps in the application. However, if the color classes produced have a skew in their sizes, utilization of hardware resources becomes inefficient, especially for the smaller color classes. Equitable coloring is a theoretical formulation of coloring that guarantees a perfect balance among color classes, and its practical relaxation is referred to as balanced coloring. In this paper, we revisit the problem of balanced coloring in the context of parallel computing. The goal is to achieve a balanced coloring of an input graph without increasing the number of colors that an algorithm oblivious to balance would have used. We propose and study multiple heuristics that aim to achieve such a balanced coloring, present parallelization approaches for multi-core and manycore architectures, and cross-evaluate their effectiveness with respect to the quality of balance achieved and performance. Furthermore, we study the impact of the proposed balanced coloring heuristics on a concrete application, namely parallel community detection, which is an example of an irregular application. The thorough treatment of balanced coloring presented in this paper, from algorithms to application, is expected to serve as a valuable resource to parallel application developers who seek to improve parallel performance of their applications using coloring.
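The goal stated above, balancing color classes without exceeding the color count of a balance-oblivious algorithm, can be illustrated with one simple post-processing idea. This is an assumed heuristic for illustration, not one of the paper's specific heuristics: start from a first-fit greedy coloring, then visit vertices in over-full classes (larger than $n/C$ for $n$ vertices and $C$ colors) and move each to the first existing color whose class is under-full and that no neighbor currently uses. Only existing colors are considered, so the number of colors never grows, and the neighbor check keeps the coloring valid. Function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def greedy_coloring(adj: Dict[int, List[int]]) -> Dict[int, int]:
    """Balance-oblivious first-fit coloring in vertex-id order."""
    colors: Dict[int, int] = {}
    for v in sorted(adj):
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def rebalance(adj: Dict[int, List[int]], colors: Dict[int, int]) -> Dict[int, int]:
    """Move vertices out of over-full color classes without adding colors."""
    colors = dict(colors)
    num_colors = max(colors.values()) + 1
    target = len(colors) / num_colors      # perfectly balanced class size
    sizes = Counter(colors.values())
    for v in sorted(adj):
        cv = colors[v]
        if sizes[cv] <= target:
            continue
        forbidden = {colors[u] for u in adj[v]}  # current colors of v's neighbors
        for c in range(num_colors):              # only existing colors are considered
            if c != cv and c not in forbidden and sizes[c] < target:
                sizes[cv] -= 1
                sizes[c] += 1
                colors[v] = c
                break
    return colors
```

On a graph with edges 0-1 and 2-3 plus four isolated vertices, first-fit produces classes of sizes 6 and 2; the rebalancing pass moves two isolated vertices and ends with two classes of size 4, still using two colors.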

computing frontiers | 2011

Tolerating correlated failures for generalized Cartesian distributions via bipartite matching

Nawab Ali; Sriram Krishnamoorthy; Mahantesh Halappanavar; Jeffrey A. Daily

Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance (ABFT) is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra (FTLA) algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. The evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
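The matching view of parity placement can be sketched in a deliberately simplified model, which is an assumption and not the paper's formulation: each parity group needs exactly one parity block; a processor is eligible to host it only if it shares no failure domain (for example, a node whose processors fail together) with any processor holding that group's data; and each processor may host at most `cap` parity blocks, a crude stand-in for bounding memory overhead. Replicating each processor into `cap` slots turns this into bipartite matching, solved here with augmenting paths. The names `place_parity`, `groups`, `domain`, and `cap` are hypothetical.

```python
from typing import Dict, Set, Tuple


def place_parity(groups: Dict[str, Set[int]], domain: Dict[int, str],
                 cap: int = 1) -> Dict[str, int]:
    """Assign one parity block per group to a processor via bipartite matching.

    groups: parity group -> processors holding that group's data blocks.
    domain: processor -> failure domain (processors that may fail together).
    cap:    maximum parity blocks per processor.
    """
    # A processor is eligible if it shares no failure domain with the group's
    # data, so one correlated failure cannot take out both data and parity.
    eligible: Dict[str, Set[int]] = {}
    for g, data in groups.items():
        bad = {domain[q] for q in data}
        eligible[g] = {p for p in domain if domain[p] not in bad}
    slots = [(p, k) for p in sorted(domain) for k in range(cap)]
    owner: Dict[Tuple[int, int], str] = {}  # slot -> parity group

    def augment(g: str, seen: Set[Tuple[int, int]]) -> bool:
        for s in slots:
            if s[0] in eligible[g] and s not in seen:
                seen.add(s)
                if s not in owner or augment(owner[s], seen):
                    owner[s] = g
                    return True
        return False

    for g in groups:
        if not augment(g, set()):
            raise ValueError(f"no valid placement for parity group {g!r}")
    return {g: s[0] for s, g in owner.items()}
```

Parity blocks may land on processors that hold other groups' data, which mirrors the colocation point in the abstract; balancing memory more carefully than a fixed `cap` would require a weighted or load-aware matching.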
