Milind Kulkarni | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Milind Kulkarni is active.

Explore More

Publication

Featured researches published by Milind Kulkarni.

acm sigplan symposium on principles and practice of parallel programming | 2009

How much parallelism is there in irregular applications

Milind Kulkarni; Martin Burtscher; Rajasekhar Inkulu; Keshav Pingali; Calin Cascaval

Irregular programs are programs organized around pointer-based data structures such as trees and graphs. Recent investigations by the Galois project have shown that many irregular programs have a generalized form of data-parallelism called amorphous data-parallelism. However, in many programs, amorphous data-parallelism cannot be uncovered using static techniques, and its exploitation requires runtime strategies such as optimistic parallel execution. This raises a natural question: how much amorphous data-parallelism actually exists in irregular programs? In this paper, we describe the design and implementation of a tool called ParaMeter that produces parallelism profiles for irregular programs. Parallelism profiles are an abstract measure of the amount of amorphous data-parallelism at different points in the execution of an algorithm, independent of implementation-dependent details such as the number of cores, cache sizes, load-balancing, etc. ParaMeter can also generate constrained parallelism profiles for a fixed number of cores. We show parallelism profiles for seven irregular applications, and explain how these profiles provide insight into the behavior of these applications.

international conference on parallel architectures and compilation techniques | 2010

Accelerating multicore reuse distance analysis with sampling and parallelization

Derek L. Schuff; Milind Kulkarni; Vijay S. Pai

Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe performance penalties. This paper presents a sampled, parallelized method of measuring reuse distance proiles for multithreaded programs, modeling private and shared cache configurations. The sampling technique allows it to spend much of its execution in a fast low-overhead mode, and allows the use of a new measurement method since sampled analysis does not need to consider the full state of the reuse stack. This measurement method uses O(1) data structures that may be made thread-private, allowing parallelization to reduce overhead in analysis mode. The performance of the resulting system is analyzed for a diverse set of parallel benchmarks and shown to generate accurate output compared to non-sampled full analysis as well as good results for the common application of locating low-locality code in the benchmarks, all with a performance overhead comparable to the best single-threaded analysis techniques.

2008 IEEE Symposium on Interactive Ray Tracing | 2008

Fast agglomerative clustering for rendering

Bruce Walter; Kavita Bala; Milind Kulkarni; Keshav Pingali

Hierarchical representations of large data sets, such as binary cluster trees, are a crucial component in many scalable algorithms used in various fields. Two major approaches for building these trees are agglomerative, or bottom-up, clustering and divisive, or top-down, clustering. The agglomerative approach offers some real advantages such as more flexible clustering and often produces higher quality trees, but has been little used in graphics because it is frequently assumed to be prohibitively expensive (O(N2) or worse). In this paper we show that agglomerative clustering can be done efficiently even for very large data sets. We introduce a novel locally-ordered algorithm that is faster than traditional heap-based agglomerative clustering and show that the complexity of the tree build time is much closer to linear than quadratic. We also evaluate the quality of the agglomerative clustering trees compared to the best known divisive clustering strategies in two sample applications: bounding volume hierarchies for ray tracing and light trees in the Lightcuts rendering algorithm. Tree quality is highly application, data set, and dissimilarity function specific. In our experiments the agglomerative-built tree quality is consistently higher by margins ranging from slight to significant, with up to 35% reduction in tree query times.

architectural support for programming languages and operating systems | 2008

Optimistic parallelism benefits from data partitioning

Milind Kulkarni; Keshav Pingali; Ganesh Ramanarayanan; Bruce Walter; Kavita Bala; L. Paul Chew

Recent studies of irregular applications such as finite-element mesh generators and data-clustering codes have shown that these applications have a generalized data parallelism arising from the use of iterative algorithms that perform computations on elements of worklists. In some irregular applications, the computations on different elements are independent. In other applications, there may be complex patterns of dependences between these computations. The Galois system was designed to exploit this kind of irregular data parallelism on multicore processors. Its main features are (i) two kinds of set iterators for expressing worklist-based data parallelism, and (ii) a runtime system that performs optimistic parallelization of these iterators, detecting conflicts and rolling back computations as needed. Detection of conflicts and rolling back iterations requires information from class implementors. In this paper, we introduce mechanisms to improve the execution efficiency of Galois programs: data partitioning, data-centric work assignment, lock coarsening, and over-decomposition. These mechanisms can be used to exploit locality of reference, reduce mis-speculation, and lower synchronization overhead. We also argue that the design of the Galois system permits these mechanisms to be used with relatively little modification to the user code. Finally, we present experimental results that demonstrate the utility of these mechanisms.

acm symposium on parallel algorithms and architectures | 2008

Scheduling strategies for optimistic parallel execution of irregular programs

Milind Kulkarni; Patrick Carribault; Keshav Pingali; Ganesh Ramanarayanan; Bruce Walter; Kavita Bala; L. Paul Chew

Recent application studies have shown that many irregular applications have a generalized data parallelism that manifests itself as iterative computations over worklists of different kinds. In general, there are complex dependencies between iterations. These dependencies cannot be elucidated statically because they depend on the inputs to the program; thus, optimistic parallel execution is the only tractable approach to parallelizing these applications. We have built a system called Galois that supports this style of parallel execution. Its main features are (i) set iterators for expressing worklist-based data parallelism, and (ii) a runtime system that performs optimistic parallelization of these iterators, detecting conflicts and rolling back computations as needed. Our work builds on the Galois system, and it addresses the problem of scheduling iterations of set iterators on multiple cores. The policy used by the base Galois system is to assign an iteration to a core whenever it needs work to do, but we show in this paper that this policy is not optimal for many applications. We also argue that OpenMP-style DO-ALL loop scheduling directives such as chunked and guided self-scheduling are too simplistic for irregular programs. These difficulties led us to develop a general scheduling framework for irregular problems; OpenMP-style scheduling strategies are special cases of this general approach. We also provide hooks into our framework, allowing the programmer to leverage application knowledge to further tune a schedule for a particular application. To evaluate this framework, we implemented it as an extension of the Galois system. We then tested the system using five real-world, irregular, data-parallel applications. Our results show that (i) the optimal scheduling policy can be different for different applications and often leverages application-specific knowledge and (ii) implementing these schedules in the Galois system is relatively straightforward.

acm sigplan symposium on principles and practice of parallel programming | 2010

Structure-driven optimizations for amorphous data-parallel programs

Mario Méndez-Lojo; Donald Nguyen; Dimitrios Prountzos; Xin Sui; M. Amber Hassaan; Milind Kulkarni; Martin Burtscher; Keshav Pingali

Irregular algorithms are organized around pointer-based data structures such as graphs and trees, and they are ubiquitous in applications. Recent work by the Galois project has provided a systematic approach for parallelizing irregular applications based on the idea of optimistic or speculative execution of programs. However, the overhead of optimistic parallel execution can be substantial. In this paper, we show that many irregular algorithms have structure that can be exploited and present three key optimizations that take advantage of algorithmic structure to reduce speculative overheads. We describe the implementation of these optimizations in the Galois system and present experimental results to demonstrate their benefits. To the best of our knowledge, this is the first system to exploit algorithmic structure to optimize the execution of irregular programs.

ieee/acm international symposium cluster, cloud and grid computing | 2011

Techniques for Fine-Grained, Multi-site Computation Offloading

Kanad Sinha; Milind Kulkarni

Increasingly, mobile devices are becoming the preferred platform for computation for many users. Unfortunately, the resource limitations, in battery life, computation power and storage, restricts the richness of applications that can be run on such devices. To alleviate these concerns, a popular approach that has gained currency in recent years is {\em computation offloading}, where a portion of an application is run off-site, leveraging the far greater resources of the cloud. Prior work in this area has focused on a constrained form of the problem: a single mobile device offloading computation to a single server. However, with the increased popularity of cloud computing and storage, it is more common for the data that an application accesses to be distributed among several servers. This paper describes algorithmic approaches for performing fine-grained, multi-site offloading. This allows portions of an application to be offloaded in a data-centric manner, even if that data exists at multiple sites. Our approach is based on a novel partitioning algorithm, and a novel data representation. We demonstrate that our partitioning algorithm outperforms existing multi-site offloading algorithms, and that our data representation provides for more efficient, fine-grained offloading than prior approaches.

conference on object oriented programming systems languages and applications | 2013

OCTET: capturing and controlling cross-thread dependences efficiently

Michael D. Bond; Milind Kulkarni; Man Cao; Minjia Zhang; Meisam Fathi Salmi; Swarnendu Biswas; Aritra Sengupta; Jipeng Huang

Parallel programming is essential for reaping the benefits of parallel hardware, but it is notoriously difficult to develop and debug reliable, scalable software systems. One key challenge is that modern languages and systems provide poor support for ensuring concurrency correctness properties - atomicity, sequential consistency, and multithreaded determinism - because all existing approaches are impractical. Dynamic, software-based approaches slow programs by up to an order of magnitude because capturing and controlling cross-thread dependences (i.e., conflicting accesses to shared memory) requires synchronization at virtually every access to potentially shared memory. This paper introduces a new software-based concurrency control mechanism called OCTET that soundly captures cross-thread dependences and can be used to build dynamic analyses for concurrency correctness. OCTET achieves low overheads by tracking the locality state of each potentially shared object. Non-conflicting accesses conform to the locality state and require no synchronization; only conflicting accesses require a state change and heavyweight synchronization. This optimistic tradeoff leads to significant efficiency gains in capturing cross-thread dependences: a prototype implementation of OCTET in a high-performance Java virtual machine slows real-world concurrent programs by only 26% on average. A dependence recorder, suitable for record & replay, built on top of OCTET adds an additional 5% overhead on average. These results suggest that OCTET can provide a foundation for developing low-overhead analyses that check and enforce concurrency correctness.

programming language design and implementation | 2011

Exploiting the commutativity lattice

Milind Kulkarni; Donald Nguyen; Dimitrios Prountzos; Xin Sui; Keshav Pingali

Speculative execution is a promising approach for exploiting parallelism in many programs, but it requires efficient schemes for detecting conflicts between concurrently executing threads. Prior work has argued that checking semantic commutativity of method invocations is the right way to detect conflicts for complex data structures such as kd-trees. Several ad hoc ways of checking commutativity have been proposed in the literature, but there is no systematic approach for producing implementations. In this paper, we describe a novel framework for reasoning about commutativity conditions: the commutativity lattice. We show how commutativity specifications from this lattice can be systematically implemented in one of three different schemes: abstract locking, forward gatekeeping and general gatekeeping. We also discuss a disciplined approach to exploiting the lattice to find different implementations that trade off precision in conflict detection for performance. Finally, we show that our novel conflict detection schemes are practical and can deliver speedup on three real-world applications.

conference on object-oriented programming systems, languages, and applications | 2011

Enhancing locality for recursive traversals of recursive structures

Youngjoon Jo; Milind Kulkarni

While there has been decades of work on developing automatic, locality-enhancing transformations for regular programs that operate over dense matrices and arrays, there has been little investigation of such transformations for irregular programs, which operate over pointer-based data structures such as graphs, trees and lists. In this paper, we argue that, for a class of irregular applications we call traversal codes, there exists substantial data reuse and hence opportunity for locality exploitation. We develop a novel optimization called point blocking, inspired by the classic tiling loop transformation, and show that it can substantially enhance temporal locality in traversal codes. We then present a transformation and optimization framework called TreeTiler that automatically detects opportunities for applying point blocking and applies the transformation. TreeTiler uses autotuning techniques to determine appropriate parameters for the transformation. For a series of traversal algorithms drawn from real-world applications, we show that TreeTiler is able to deliver performance improvements of up to 245% over an optimized (but non-transformed) parallel baseline, and in several cases, significantly better scalability.

Explore More