
Publication


Featured research published by Rupesh Nasre.


Asian Symposium on Programming Languages and Systems | 2009

Scalable Context-Sensitive Points-to Analysis Using Multi-dimensional Bloom Filters

Rupesh Nasre; Kaushik Rajan; R. Govindarajan; Uday P. Khedker

Context-sensitive points-to analysis is critical for several program optimizations. However, as the number of contexts grows exponentially, storage requirements for the analysis increase tremendously for large programs, making the analysis non-scalable. We propose a scalable flow-insensitive, context-sensitive, inclusion-based points-to analysis that uses a specially designed multi-dimensional Bloom filter to store the points-to information. Two key observations motivate our proposal: (i) points-to information (between pointer and object, and between pointer and pointer) is sparse, and (ii) moving from an exact to an approximate representation of points-to information only reduces precision without affecting the correctness of the (may-points-to) analysis. By using an approximate representation, a multi-dimensional Bloom filter can significantly reduce the memory requirements with a probabilistic bound on the loss in precision. Experimental evaluation on SPEC 2000 benchmarks and two large open-source programs reveals that, with an average storage requirement of 4 MB, our approach achieves almost the same precision (98.6%) as the exact implementation. By increasing the average memory to 27 MB, it achieves precision up to 99.7% for these benchmarks. Using Mod/Ref analysis as the client, we find that the client analysis is often unaffected even when there is some loss of precision in the points-to representation. The NoModRef percentage is within 2% of the exact analysis, while requiring 4 MB (maximum 15 MB) of memory and less than 4 minutes on average for the points-to analysis. Another major advantage of our technique is that it allows trading off precision for memory usage of the analysis.
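As a concrete illustration of the approximate representation, here is a toy Bloom filter over (pointer, context, object) triples; a single flat bit array with k hash functions stands in for the paper's multi-dimensional layout, and all names here are illustrative:

```python
import hashlib

class PointsToBloom:
    """Toy Bloom filter for may-points-to facts.

    A hit may be a false positive (lost precision); a miss is definite,
    so correctness of the may-points-to analysis is preserved.
    """
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, pointer, context, obj):
        # k independent bit positions derived from the triple
        for i in range(self.k):
            digest = hashlib.sha256(
                f"{i}|{pointer}|{context}|{obj}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, pointer, context, obj):
        for p in self._positions(pointer, context, obj):
            self.bits[p >> 3] |= 1 << (p & 7)

    def may_point_to(self, pointer, context, obj):
        return all((self.bits[p >> 3] >> (p & 7)) & 1
                   for p in self._positions(pointer, context, obj))

bf = PointsToBloom()
bf.add("p", "main", "heap_obj1")
```

A miss is a definite "no", so only precision, never soundness, is at stake, which is why the approximation is safe for a may-points-to client.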


Symposium on Code Generation and Optimization | 2011

Prioritizing constraint evaluation for efficient points-to analysis

Rupesh Nasre; R. Govindarajan

Pervasive use of pointers in large-scale real-world applications continues to make points-to analysis an important optimization enabler. The rapid growth of software systems demands a scalable pointer-analysis algorithm. A typical inclusion-based points-to analysis iteratively evaluates constraints and computes a points-to solution until a fixpoint is reached. In each iteration, (i) points-to information is propagated across directed edges in a constraint graph G, and (ii) more edges are added by processing the points-to constraints. We observe that prioritizing the order in which the information is processed within each of these two steps can lead to efficient execution of the points-to analysis. While earlier work in the literature focuses only on the propagation order, we argue that the other dimension, prioritizing the constraint processing, can lead to even greater improvements in how fast the fixpoint of the points-to algorithm is reached. This becomes especially important as we prove that finding an optimal sequence for processing the points-to constraints is NP-complete. The prioritization scheme proposed in this paper is general enough to be applied to any of the existing points-to analyses. Using the prioritization framework developed in this paper, we implement prioritized versions of Andersen's analysis, Deep Propagation, Hardekopf and Lin's Lazy Cycle Detection, and Bloom-filter-based points-to analysis. In each case, we report significant improvements in analysis times (33%, 47%, 44%, and 20%, respectively) as well as in the memory requirements for a large suite of programs, including SPEC 2000 benchmarks and five large open-source programs.
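The interplay between constraint processing and propagation can be sketched with a minimal Andersen-style solver; the `priority` hook below is a hypothetical stand-in for the paper's prioritization heuristics, and any ordering reaches the same fixpoint, only in more or fewer rounds:

```python
from collections import defaultdict

def andersen(constraints, priority=None):
    """Minimal inclusion-based (Andersen-style) solver.

    constraints: ('addr', p, a) for p = &a, ('copy', p, q) for p = q,
    ('load', p, q) for p = *q, ('store', p, q) for *p = q.
    `priority` orders constraint processing each round, a toy stand-in
    for the paper's prioritization (finding the optimal order is
    NP-complete); the fixpoint itself is order-independent.
    """
    pts = defaultdict(set)      # variable -> abstract objects
    edges = defaultdict(set)    # copy edges: source -> destinations
    for kind, a, b in constraints:
        if kind == 'addr':
            pts[a].add(b)
        elif kind == 'copy':
            edges[b].add(a)
    order = sorted(constraints, key=priority) if priority else constraints
    changed = True
    while changed:
        changed = False
        for kind, a, b in order:
            if kind == 'load':              # a = *b: edge o -> a for o in pts[b]
                for o in list(pts[b]):
                    if a not in edges[o]:
                        edges[o].add(a)
                        changed = True
            elif kind == 'store':           # *a = b: edge b -> o for o in pts[a]
                for o in list(pts[a]):
                    if o not in edges[b]:
                        edges[b].add(o)
                        changed = True
        moved = True                        # propagate along edges to a fixpoint
        while moved:
            moved = False
            for src in list(edges):
                for dst in edges[src]:
                    new = pts[src] - pts[dst]
                    if new:
                        pts[dst] |= new
                        moved = changed = True
    return {v: s for v, s in pts.items() if s}

cons = [('addr', 'p', 'a'), ('copy', 'q', 'p'), ('addr', 's', 'b'),
        ('store', 'q', 's'), ('load', 'r', 'q')]
res = andersen(cons)
```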


Compiler Construction | 2012

Parallel replication-based points-to analysis

Sandeep Putta; Rupesh Nasre

Pointer analysis is one of the most important static analyses during compilation. While several enhancements have been made to scale pointer analysis, work on parallelizing the analysis itself is still in its infancy. In this article, we propose a parallel version of context-sensitive inclusion-based points-to analysis for C programs. Our analysis makes use of replication of points-to sets to improve parallelism. In comparison to prior work on parallel points-to analysis, we extract more parallelism by exploiting a key insight based on the monotonicity and unordered nature of flow-insensitive points-to analysis. By taking advantage of the nature of points-to analysis and the structure of the constraint graph, we devise several novel optimizations to further improve the overall speed-up. We show the effectiveness of our approach using 16 SPEC 2000 benchmarks and five large open-source programs that range from 1.2 KLOC to 0.5 MLOC. Specifically, our context-sensitive analysis achieves an average speed-up of 3.4× on an 8-core machine.
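The monotonicity insight can be sketched as follows: each worker propagates over its share of copy edges against a replicated snapshot, and merging the deltas is a plain set union, which is safe precisely because the analysis is monotone and unordered. An illustrative Python sketch, not the paper's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_propagate(pts, edges, workers=4):
    """Flow-insensitive propagation with replicated points-to sets.

    Each worker reads a snapshot (the replica) of the points-to sets
    and computes a delta over its shard of copy edges; deltas are then
    unioned in. Monotonicity guarantees convergence to the same
    fixpoint regardless of sharding or order.
    """
    edges = list(edges)
    while True:
        snapshot = {v: set(s) for v, s in pts.items()}

        def work(shard):
            delta = {}
            for src, dst in shard:
                new = snapshot.get(src, set()) - snapshot.get(dst, set())
                if new:
                    delta.setdefault(dst, set()).update(new)
            return delta

        shards = [edges[i::workers] for i in range(workers)]
        with ThreadPoolExecutor(workers) as ex:
            deltas = list(ex.map(work, shards))
        changed = False
        for delta in deltas:
            for dst, objs in delta.items():
                before = len(pts.setdefault(dst, set()))
                pts[dst].update(objs)
                if len(pts[dst]) != before:
                    changed = True
        if not changed:
            return pts

result = parallel_propagate({'p': {'a'}}, [('p', 'q'), ('q', 'r')], workers=2)
```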


ACM Transactions on Architecture and Code Optimization | 2016

Falcon: A Graph Manipulation Language for Heterogeneous Systems

Unnikrishnan Cheramangalath; Rupesh Nasre; Y. N. Srikant

Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy, even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time-consuming, and error-prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPUs and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to state-of-the-art hand-tuned implementations.
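Falcon's concrete syntax is not reproduced here; instead, a plain-Python Bellman-Ford-style SSSP shows the kind of local computation algorithm (node attributes change, the graph does not) that such DSLs express, with the edge loop being the part a generated backend would run in parallel:

```python
def sssp(num_nodes, edges, src):
    """Bellman-Ford-style single-source shortest paths.

    A canonical 'local computation' algorithm: it updates node
    attributes (dist) but never mutates the graph. Morph algorithms,
    by contrast, also add or remove nodes and edges. Plain-Python
    stand-in, not Falcon syntax.
    """
    INF = float('inf')
    dist = [INF] * num_nodes
    dist[src] = 0
    changed = True
    while changed:
        changed = False
        for u, v, w in edges:   # in a DSL backend this loop is parallel
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
    return dist

distances = sssp(3, [(0, 1, 4), (0, 2, 1), (2, 1, 2)], 0)
```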


Static Analysis Symposium | 2010

Points-to analysis as a system of linear equations

Rupesh Nasre; R. Govindarajan

We propose a novel formulation of points-to analysis as a system of linear equations. With this, the efficiency of the analysis can be significantly improved by leveraging advances in solution procedures for systems of linear equations. However, such a formulation is non-trivial: it is challenging to handle multiple pointer indirections, address-of operators, and multiple assignments to the same variable, and the problem is exacerbated by the need to keep the transformed equations linear. Despite this, we successfully model all the pointer operations. We propose a novel inclusion-based context-sensitive points-to analysis algorithm based on prime factorization, which can model all the pointer operations. Experimental evaluation on SPEC 2000 benchmarks and two large open-source programs reveals that our approach is competitive with state-of-the-art algorithms. With an average memory requirement of a mere 21 MB, our context-sensitive points-to analysis algorithm analyzes each benchmark in 55 seconds on average.
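The prime-factorization idea can be sketched in isolation: give each abstract object a distinct prime, so that a single integer encodes a whole points-to set. This shows only the encoding layer, not the paper's full linear-equation formulation:

```python
from math import gcd

def encode(objs, prime_of):
    """A points-to set becomes a product of distinct primes."""
    code = 1
    for o in objs:
        if code % prime_of[o]:        # include each factor at most once
            code *= prime_of[o]
    return code

def union(c1, c2):
    """Set union is the lcm of the two codes."""
    return c1 * c2 // gcd(c1, c2)

def contains(code, obj, prime_of):
    """Set membership is a divisibility test."""
    return code % prime_of[obj] == 0

prime_of = {'a': 2, 'b': 3, 'c': 5}   # one prime per abstract object
e = encode({'a', 'b'}, prime_of)      # 2 * 3 = 6
```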


European Symposium on Programming | 2011

Dataflow analysis for datarace-free programs

Arnab De; Deepak D'Souza; Rupesh Nasre

Memory models for shared-memory concurrent programming languages typically guarantee sequential consistency (SC) semantics for datarace-free (DRF) programs, while providing very weak or no guarantees for non-DRF programs. In effect, programmers are expected to write only DRF programs, which are then executed with SC semantics. With this in mind, we propose a novel scalable solution for dataflow analysis of concurrent programs, which is proved sound for DRF programs under SC semantics. We use the synchronization structure of the program to propagate dataflow information among threads without having to consider all interleavings explicitly. Given a dataflow analysis that is sound for sequential programs and meets certain criteria, our technique automatically converts it into an analysis for concurrent programs.
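A deliberately simplified sketch of the idea, using constant propagation: facts at an acquire point are weakened until they agree with the facts at every matching release point, so information flows along synchronization edges rather than over all interleavings. The point names and fact representation are illustrative, not the paper's formalism:

```python
def join(env1, env2):
    """Meet for constant propagation: keep only agreeing constants."""
    return {v: c for v, c in env1.items() if env2.get(v) == c}

def sync_fixpoint(envs, sync_edges):
    """Propagate facts along release -> acquire edges to a fixpoint.

    envs: program-point -> {variable: constant} from the per-thread
    sequential analysis. sync_edges: (release_point, acquire_point)
    pairs from the program's synchronization structure.
    """
    changed = True
    while changed:
        changed = False
        for release, acquire in sync_edges:
            new = join(envs[acquire], envs[release])
            if new != envs[acquire]:
                envs[acquire] = new
                changed = True
    return envs

envs = {'t1_release': {'x': 1, 'y': 2}, 't2_acquire': {'x': 1, 'y': 3}}
out = sync_fixpoint(envs, [('t1_release', 't2_acquire')])
```

Here `y` is dropped at the acquire because the two threads disagree on its value, while the agreed-upon `x = 1` survives.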


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2016

DomLock: a new multi-granularity locking technique for hierarchies

Saurabh Kalikar; Rupesh Nasre

We present efficient locking mechanisms for hierarchical data structures. Several applications work on an abstract hierarchy of objects, and a parallel execution on this hierarchy necessitates synchronization across workers operating on different parts of the hierarchy. Existing synchronization mechanisms are either too coarse, too inefficient, or too ad hoc, resulting in a reduced or unpredictable amount of concurrency. We propose a new locking approach based on the structural properties of the underlying hierarchy. We show that the developed techniques are efficient even when the hierarchy is an arbitrary graph, and are applicable even when the hierarchy involves mutation. Theoretically, we present our approach as a locking-cost-minimizing instance of a generic algebraic model of synchronization for hierarchical data structures. Using STMBench7, we illustrate a considerable reduction in locking cost, resulting in an average throughput improvement of 42%.
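The structural idea behind interval-based hierarchical locking can be sketched for the tree case: post-order numbering gives each node an interval covering its whole sub-hierarchy, and two lock requests conflict exactly when their intervals overlap. A simplified sketch under that tree assumption; DomLock itself handles general DAGs:

```python
def assign_intervals(children, root):
    """Post-order numbering: each node's interval [lo, hi] spans its
    entire sub-hierarchy, so locking a node implicitly covers all of
    its descendants."""
    intervals, counter = {}, [0]
    def dfs(node):
        lo = counter[0] + 1
        for child in children.get(node, []):
            dfs(child)
        counter[0] += 1
        intervals[node] = (lo, counter[0])
    dfs(root)
    return intervals

def conflicts(i1, i2):
    """Two lock requests conflict iff their intervals overlap."""
    return not (i1[1] < i2[0] or i2[1] < i1[0])

tree = {'A': ['B', 'C'], 'B': ['D']}   # A is the root
iv = assign_intervals(tree, 'A')
```

Siblings like B and C get disjoint intervals, so workers may lock them concurrently, while an ancestor's interval always overlaps its descendants'.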


ACM SIGPLAN Workshop on Memory Systems Performance and Correctness | 2011

Approximating inclusion-based points-to analysis

Rupesh Nasre

It is well established that achieving a points-to analysis that is scalable in terms of analysis time typically involves trading off precision and/or memory. In this paper, we propose a novel technique to approximate the solution of an inclusion-based points-to analysis. The technique is based on intelligently approximating pointer equivalence and location equivalence across variables in the program. We develop a simple approximation algorithm based on this technique. By exploiting various behavioral properties of the solution, we develop an improved algorithm that implements several optimizations related to merging order, proximity search, lazy merging, and identification frequency. The improved algorithm gives the client strong control to trade off analysis time and precision as per its requirements. Using a large suite of programs including SPEC 2000 benchmarks and five large open-source programs, we show how our algorithm helps achieve a scalable solution.
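Exact pointer equivalence, the starting point that the paper then approximates, can be sketched as collapsing variables whose points-to sets are identical:

```python
def merge_equivalent(pts):
    """Collapse variables with identical points-to sets.

    This is exact pointer equivalence; the paper's approximation goes
    further and also merges merely *similar* sets, trading precision
    for analysis time and memory.
    """
    rep = {}       # frozenset of objects -> representative variable
    alias = {}     # variable -> its representative
    merged = {}    # representative -> points-to set
    for v, s in pts.items():
        key = frozenset(s)
        if key in rep:
            alias[v] = rep[key]
        else:
            rep[key] = v
            alias[v] = v
            merged[v] = set(s)
    return merged, alias

pts = {'p': {'a', 'b'}, 'q': {'a', 'b'}, 'r': {'c'}}
merged, alias = merge_equivalent(pts)
```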


Languages and Compilers for Parallel Computing | 2016

LightHouse: An Automatic Code Generator for Graph Algorithms on GPUs

G. Shashidhar; Rupesh Nasre

We propose LightHouse, a GPU code generator for the graph language Green-Marl, for which a multicore CPU backend already exists. This allows a user to seamlessly generate both the multicore and GPU backends from the same specification of a graph algorithm. The restriction of not modifying the language poses several challenges, as we work with an existing abstract syntax tree of the language that is not tailored to GPUs. LightHouse overcomes these challenges with various optimizations, such as reducing the number of atomics and collapsing loops. We illustrate its effectiveness by generating efficient CUDA code for four graph analytic algorithms and comparing performance against their multicore OpenMP versions generated by Green-Marl. In particular, our generated CUDA code performs comparably to the 4- to 64-threaded OpenMP versions, depending on the algorithm.
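Loop collapsing, one of the optimizations mentioned above, can be illustrated in plain Python: a nested node-then-edge loop becomes one flat loop over edges, which on a GPU lets each thread take one edge instead of one (possibly high-degree) node. The per-edge work here is an arbitrary illustrative placeholder:

```python
def sum_nested(adj):
    """Nested form: outer loop over nodes, inner over each node's
    edges. Mapping GPU threads to the outer loop load-imbalances when
    node degrees vary widely."""
    total = 0
    for u in range(len(adj)):
        for v in adj[u]:
            total += u * 31 + v     # arbitrary per-edge work
    return total

def sum_collapsed(adj):
    """Collapsed form: one flat loop over all edges, so each GPU
    thread can take exactly one edge. Same result, better balance."""
    flat = [(u, v) for u in range(len(adj)) for v in adj[u]]
    return sum(u * 31 + v for u, v in flat)

adj = [[1, 2], [2], []]   # adjacency list: edges (0,1), (0,2), (1,2)
```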


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2016

GPU centric extensions for parallel strongly connected components computation

Shrinivas Devshatwar; Madhur Amilkanthwar; Rupesh Nasre

Finding the Strongly Connected Components (SCCs) of a directed graph is a fundamental graph problem. Many of the state-of-the-art sequential algorithms use depth-first search (DFS) to find SCCs. Since DFS is, in general, hard to parallelize, researchers rely on other approaches such as Forward-Backward-Trim and Coloring. In this work, we extend two state-of-the-art multicore parallel algorithms to take advantage of highly parallel GPU devices. Unfortunately, state-of-the-art parallel implementations have three performance limitations: (i) selection of ineffective pivots for running multiple forward-backward passes, (ii) load imbalance across threads while computing reachability closures, and (iii) a serialization bottleneck while computing forward-backward sets. We address the first limitation using an improved pivot-selection scheme, which improves the chances of finding the largest SCC in real-world graphs. We handle the other two limitations by adding parallelism during closure computation, which improves load balance across warp threads and reduces thread serialization. In effect, the resultant codes perform much better than the state-of-the-art. The evaluation results show that our methods achieve up to 8× speedup over Tarjan's sequential algorithm and up to 2× speedup over a previous CUDA implementation.
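The Forward-Backward scheme the paper builds on can be sketched sequentially: the SCC containing a pivot is the intersection of the pivot's forward and backward reachable sets, and the three remaining partitions are processed recursively. A plain-Python sketch of the scheme, not the GPU implementation:

```python
def reach(adj, start, allowed):
    """Iterative reachability restricted to the `allowed` node set."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n in seen or n not in allowed:
            continue
        seen.add(n)
        stack.extend(adj.get(n, []))
    return seen

def fb_scc(adj, radj, nodes):
    """Forward-Backward SCC decomposition.

    adj/radj: forward and reverse adjacency. The pivot's SCC is
    fwd & bwd; the three leftover partitions cannot share an SCC,
    so each recurses independently (the source of GPU parallelism;
    pivot selection quality drives performance in practice).
    """
    if not nodes:
        return []
    pivot = next(iter(nodes))
    fwd = reach(adj, pivot, nodes)
    bwd = reach(radj, pivot, nodes)
    scc = fwd & bwd
    return ([scc]
            + fb_scc(adj, radj, fwd - scc)
            + fb_scc(adj, radj, bwd - scc)
            + fb_scc(adj, radj, nodes - fwd - bwd))

adj = {0: [1], 1: [2], 2: [0, 3], 3: [4], 4: [3]}
radj = {1: [0], 2: [1], 0: [2], 3: [2, 4], 4: [3]}
sccs = fb_scc(adj, radj, {0, 1, 2, 3, 4})
```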

Collaboration


Dive into Rupesh Nasre's collaborations.

Top Co-Authors

Saurabh Kalikar, Indian Institute of Technology Madras
R. Govindarajan, Indian Institute of Science
Shankar Balachandran, Indian Institute of Technology Madras
Y. N. Srikant, Indian Institute of Science
Abhinav, Indian Institute of Technology (BHU) Varanasi
Arnab De, Indian Institute of Science
Balaraman Ravindran, Indian Institute of Technology Madras
Bollu Ratnakar, Indian Institute of Technology Madras