Publication


Featured research published by Rajkishore Barik.


International Parallel and Distributed Processing Symposium | 2009

Work-first and help-first scheduling policies for async-finish task parallelism

Yi Guo; Rajkishore Barik; Raghavan Raman; Vivek Sarkar

Multiple programming models are emerging to address an increased need for dynamic task parallelism in applications for multicore processors and shared-address-space parallel computing. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Threading Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work stealing, as embodied in Cilk's implementation of dynamic spawn-sync parallelism, are gaining in popularity but also have inherent limitations. In this paper, we address the problem of efficient and scalable implementation of X10's async-finish task parallelism, which is more general than Cilk's spawn-sync parallelism. We introduce a new work-stealing scheduler with compiler support for async-finish task parallelism that can accommodate both work-first and help-first scheduling policies. Performance results on two different multicore SMP platforms show significant improvements due to our new work-stealing algorithm compared to the existing work-sharing scheduler for X10, and also provide insights on scenarios in which the help-first policy yields better results than the work-first policy and vice versa.
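The contrast between the two policies can be sketched in a few lines of Java. This is an illustrative model only, not the paper's runtime: Worker and its deque of Runnable tasks are hypothetical names. Under work-first the worker dives into the spawned task and leaves the continuation for thieves; under help-first it publishes the spawned task and keeps executing the parent.

```java
// Illustrative sketch of work-first vs. help-first spawn handling.
// Worker, Task representation, and the deque discipline are assumptions,
// not the paper's implementation.
import java.util.ArrayDeque;
import java.util.Deque;

final class Worker {
    // Double-ended queue of stealable tasks; thieves take from the other end.
    private final Deque<Runnable> deque = new ArrayDeque<>();

    // Work-first: run the child now, make the continuation stealable.
    void spawnWorkFirst(Runnable child, Runnable continuation) {
        deque.push(continuation);  // continuation may be stolen
        child.run();               // dive into the child immediately
    }

    // Help-first: make the child stealable, keep executing the parent.
    void spawnHelpFirst(Runnable child, Runnable continuation) {
        deque.push(child);         // child may be stolen
        continuation.run();        // parent continues past the async
    }
}
```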


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2007

May-happen-in-parallel analysis of X10 programs

Shivali Agarwal; Rajkishore Barik; Vivek Sarkar; Rudrapatna K. Shyamasundar

X10 is a modern object-oriented programming language designed for high-performance, high-productivity programming of parallel and multi-core computer systems. Compared to the lower-level thread-based concurrency model in the Java™ language, X10 has higher-level concurrency constructs such as async, atomic, and finish built into the language to simplify creation, analysis, and optimization of parallel programs. In this paper, we introduce a new algorithm for May-Happen-in-Parallel (MHP) analysis of X10 programs. The analysis algorithm is based on simple path traversals in the Program Structure Tree, and does not rely on pointer alias analysis of thread objects as in MHP analysis for Java programs. We introduce a more precise definition of the MHP relation than in past work by adding condition vectors that identify execution instances for which the MHP relation holds, instead of just returning a single true/false value for all pairs of executing instances. Further, MHP analysis is refined in our approach by using the observation that two statement instances which occur in atomic sections that execute at the same X10 place must have MHP = false. We expect that our MHP analysis algorithm will be applicable to any language that adopts the core concepts of places, async, finish, and atomic sections from the X10 programming model. We also believe that this approach offers the best of two worlds to programmers and parallel programming tools: higher-level abstractions of concurrency coupled with simple and efficient analysis algorithms.
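The path-traversal idea can be approximated in a short sketch. Everything below is a simplification for illustration: the node kinds and the "escaping async" rule are assumptions, and the paper's actual algorithm is more precise (condition vectors, the atomic-at-same-place refinement).

```java
// Toy Program Structure Tree for a simplified MHP check.
// Node kinds and the escaping-async rule are illustrative assumptions.
import java.util.ArrayList;
import java.util.List;

enum Kind { ASYNC, FINISH, SEQ, STMT }

final class PstNode {
    final Kind kind;
    final PstNode parent;
    PstNode(Kind kind, PstNode parent) { this.kind = kind; this.parent = parent; }

    List<PstNode> pathToRoot() {
        List<PstNode> path = new ArrayList<>();
        for (PstNode n = this; n != null; n = n.parent) path.add(n);
        return path;
    }
}

final class Mhp {
    // Two statements may run in parallel if, below their least common
    // ancestor (LCA), one of them sits under an async that no finish
    // below the LCA joins before the other statement can execute.
    static boolean mayHappenInParallel(PstNode a, PstNode b) {
        List<PstNode> pa = a.pathToRoot(), pb = b.pathToRoot();
        PstNode lca = null;
        for (PstNode n : pa) if (pb.contains(n)) { lca = n; break; }
        return escapesToLca(a, lca) || escapesToLca(b, lca);
    }

    // An async on the path escapes if no enclosing finish between it and
    // the LCA forces its completion.
    private static boolean escapesToLca(PstNode leaf, PstNode lca) {
        boolean sawAsync = false;
        for (PstNode n = leaf; n != lca; n = n.parent) {
            if (n.kind == Kind.ASYNC) sawAsync = true;
            if (n.kind == Kind.FINISH) sawAsync = false; // finish joins asyncs below it
        }
        return sawAsync;
    }
}
```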


ACM Symposium on Parallel Algorithms and Architectures | 2007

Deadlock-free scheduling of X10 computations with bounded resources

Shivali Agarwal; Rajkishore Barik; Dan Bonachea; Vivek Sarkar; Rudrapatna K. Shyamasundar; Katherine A. Yelick

In this paper, we address the problem of guaranteeing the absence of physical deadlock in the execution of a parallel program using the async, finish, atomic, and place constructs from the X10 language. First, we extend previous work-stealing memory bound results for fully strict multithreaded computations to terminally strict multithreaded computations, in which one activity may wait for completion of a descendant activity (as in X10's async and finish constructs), not just an immediate child (as in Cilk's spawn and sync constructs). This result establishes physical deadlock freedom for SMP deployments. Second, we introduce a new class of X10 deployments for clusters, which builds on an underlying Active Message network and the new concept of Doppelgänger mode execution of X10 activities. Third, we use this new class of deployments to establish physical deadlock freedom for deployments on clusters of uniprocessors. Together these results give the user the ability to execute a rich set of programs written with async, finish, atomic, and place constructs without worrying about the possibility of physical deadlock due to computation, memory, and communication resources. A major open topic for future work is to extend these results to deployments on clusters of SMPs.
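The notion of terminal strictness, where a finish scope joins transitively spawned descendants rather than only immediate children, can be demonstrated with standard Java concurrency primitives. In the sketch below a java.util.concurrent.Phaser stands in for X10's finish construct; this is an analogy, not the paper's deployment machinery.

```java
// Terminally strict joining, sketched with a Phaser as a stand-in for
// X10's finish: the scope is passed down, so it can await a transitively
// spawned grandchild -- something Cilk's frame-local spawn/sync cannot express.
import java.util.concurrent.Phaser;

final class TerminallyStrict {
    static void async(Phaser finishScope, Runnable body) {
        finishScope.register();                    // activity joins the finish scope
        new Thread(() -> {
            try { body.run(); }
            finally { finishScope.arriveAndDeregister(); }
        }).start();
    }

    public static void main(String[] args) {
        Phaser finish = new Phaser(1);             // the enclosing "finish" (main is party 1)
        async(finish, () ->                        // child activity
            async(finish, () ->                    // grandchild: also joined by the scope
                System.out.println("grandchild work")));
        finish.arriveAndAwaitAdvance();            // waits for ALL transitive descendants
        System.out.println("finish scope complete");
    }
}
```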


International Symposium on Microarchitecture | 2010

Efficient Selection of Vector Instructions Using Dynamic Programming

Rajkishore Barik; Jisheng Zhao; Vivek Sarkar

Accelerating program performance via SIMD vector units is very common in modern processors, as evidenced by the use of SSE, MMX, VSE, and VSX SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of the vector capabilities, a compiler needs to generate efficient vector code automatically. However, most commercial and open-source compilers fall short of using the full potential of vector units, and only generate vector code for simple innermost loops. In this paper, we present the design and implementation of an auto-vectorization framework in the back-end of a dynamic compiler that not only generates optimized vector code but is also well integrated with the instruction scheduler and register allocator. The framework includes a novel compile-time efficient, dynamic programming-based vector instruction selection algorithm for straight-line code that expands opportunities for vectorization in the following ways: (1) scalar packing explores opportunities for packing multiple scalar variables into short vectors, (2) judicious use of shuffle and horizontal vector operations, when possible, and (3) algebraic reassociation expands opportunities for vectorization by algebraic simplification. We report performance results on the impact of auto-vectorization on a set of standard numerical benchmarks using the Jikes RVM dynamic compilation environment. Our results show performance improvement of up to 57.71% on an Intel Xeon processor, compared to non-vectorized execution, with a modest increase in compile time in the range from 0.87% to 9.992%. An investigation of the SIMD parallelization performed by v11.1 of the Intel Fortran Compiler (IFC) on three benchmarks shows that our system achieves speedup with vectorization in all three cases and IFC does not. Finally, a comparison of our approach with an implementation of the Superword Level Parallelism (SLP) algorithm from [larsen00] shows that our approach yields a performance improvement of up to 13.78% relative to SLP.
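The flavor of the dynamic program can be shown with a toy cost model. The two-mode formulation and all costs below are made-up assumptions for illustration; the paper's algorithm additionally handles packing of distinct scalars, shuffles, horizontal operations, and reassociation.

```java
// Toy bottom-up DP for scalar-vs-vector instruction selection.
// Costs and the two-mode model are illustrative assumptions only.
final class Expr {
    final Expr left, right;          // both null => leaf; otherwise a binary op
    final int scalarOpCost = 1;      // cost of executing this op on scalars
    final int vectorOpCost = 1;      // cost of this op inside a vector lane
    static final int PACK_COST = 2;  // penalty for moving scalar <-> vector

    Expr(Expr left, Expr right) { this.left = left; this.right = right; }

    // best[0] = cheapest cost with this value in a scalar register,
    // best[1] = cheapest cost with this value in a vector register.
    int[] bestCosts() {
        if (left == null) return new int[] { 0, PACK_COST }; // leaf lives in a scalar
        int[] l = left.bestCosts(), r = right.bestCosts();
        int scalar = scalarOpCost
                + Math.min(l[0], l[1] + PACK_COST)   // unpack if a child is vector
                + Math.min(r[0], r[1] + PACK_COST);
        int vector = vectorOpCost
                + Math.min(l[1], l[0] + PACK_COST)   // pack if a child is scalar
                + Math.min(r[1], r[0] + PACK_COST);
        return new int[] { scalar, vector };
    }
}
```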


Conference on Object-Oriented Programming, Systems, Languages, and Applications | 2009

The Habanero multicore software research project

Rajkishore Barik; Zoran Budimlic; Vincent Cavé; Sanjay Chatterjee; Yi Guo; David M. Peixotto; Raghavan Raman; Jun Shirako; Sagnak Tasirlar; Yonghong Yan; Yisheng Zhao; Vivek Sarkar

Multiple programming models are emerging to address an increased need for dynamic task parallelism in multicore shared-memory multiprocessors. This poster describes the main components of Rice University's Habanero Multicore Software Research Project, which proposes a new approach to multicore software enablement based on a two-level programming model consisting of a higher-level coordination language for domain experts and a lower-level parallel language for programming experts.


Compiler Construction | 2007

Extended linear scan: an alternate foundation for global register allocation

Vivek Sarkar; Rajkishore Barik

In this paper, we extend past work on Linear Scan register allocation, and propose two Extended Linear Scan (ELS) algorithms that retain the compile-time efficiency of past Linear Scan algorithms while delivering performance that can match or surpass that of Graph Coloring. Specifically, this paper makes the following contributions:

- We highlight three fundamental theoretical limitations in using Graph Coloring as a foundation for global register allocation, and introduce a basic Extended Linear Scan algorithm, ELS0, which addresses all three limitations for the problem of Spill-Free Register Allocation.
- We introduce the ELS1 algorithm, which extends ELS0 to obtain a greedy algorithm for the problem of Register Allocation with Total Spills.
- Finally, we present experimental results to compare the Graph Coloring and Extended Linear Scan algorithms. Our results show that the compile-time speedups for ELS1 relative to GC were significant, and varied from 15× to 68×. In addition, the resulting execution time improved by up to 5.8%, with an average improvement of 2.3%.

Together, these results show that Extended Linear Scan is promising as an alternate foundation for global register allocation, compared to Graph Coloring, due to its compile-time scalability without loss of execution time performance.
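For context, here is a compressed sketch of the classic Linear Scan loop that ELS builds on, in the style of Poletto and Sarkar's original algorithm. The Interval fields and spill heuristic are the textbook versions, not the ELS0/ELS1 extensions; the point is the single in-order pass that gives Linear Scan its compile-time advantage over Graph Coloring.

```java
// Classic Linear Scan baseline (not ELS itself): allocate registers by
// walking live intervals in order of increasing start point.
// Assumes numRegs >= 1; ties on (end, start) are merged in this sketch.
import java.util.*;

final class Interval { int start, end, reg = -1; boolean spilled = false; }

final class LinearScan {
    static void allocate(List<Interval> intervals, int numRegs) {
        intervals.sort(Comparator.comparingInt(i -> i.start));
        // Active intervals, ordered by increasing end point.
        TreeSet<Interval> active = new TreeSet<>(
                Comparator.comparingInt((Interval i) -> i.end).thenComparingInt(i -> i.start));
        Deque<Integer> freeRegs = new ArrayDeque<>();
        for (int r = 0; r < numRegs; r++) freeRegs.push(r);

        for (Interval cur : intervals) {
            // Expire intervals that ended before cur starts; reclaim their registers.
            for (Iterator<Interval> it = active.iterator(); it.hasNext(); ) {
                Interval old = it.next();
                if (old.end >= cur.start) break;
                it.remove();
                freeRegs.push(old.reg);
            }
            if (!freeRegs.isEmpty()) {
                cur.reg = freeRegs.pop();
                active.add(cur);
            } else {
                // Spill whichever interval ends last: cur or the longest active one.
                Interval victim = active.last();
                if (victim.end > cur.end) {
                    cur.reg = victim.reg;
                    victim.spilled = true;
                    active.remove(victim);
                    active.add(cur);
                } else {
                    cur.spilled = true;
                }
            }
        }
    }
}
```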


International Conference on Parallel Architectures and Compilation Techniques | 2009

Interprocedural Load Elimination for Dynamic Optimization of Parallel Programs

Rajkishore Barik; Vivek Sarkar

Load elimination is a classical compiler transformation that is increasing in importance for multi-core and many-core architectures. The effect of the transformation is to replace a memory access, such as a read of an object field or an array element, by a read of a compiler-generated temporary that can be allocated in faster and more energy-efficient storage structures such as registers and local memories (scratchpads). Unfortunately, current just-in-time and dynamic compilers perform load elimination only in limited situations. In particular, they usually make worst-case assumptions about potential side effects arising from parallel constructs and method calls. These two constraints interact with each other since parallel constructs are usually translated to low-level runtime library calls. In this paper, we introduce an interprocedural load elimination algorithm suitable for use in dynamic optimization of parallel programs. The main contributions of the paper include: a) an algorithm for load elimination in the presence of three core parallel constructs -- async, finish, and isolated, b) efficient side-effect analysis for method calls, c) extended side-effect analysis for parallel constructs using an Isolation Consistency memory model, and d) performance results to study the impact of load elimination on a set of standard benchmarks using an implementation of the algorithm in Jikes RVM for optimizing programs written in a subset of the X10 v1.5 language. Our performance results show decreases in dynamic counts for getfield operations of up to 99.99%, and performance improvements of up to 1.76x on 1 core, and 1.39x on 16 cores, when comparing the algorithm in this paper with the load elimination algorithm available in Jikes RVM.
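The transformation itself is easy to see in a before/after pair. The class and loop below are made up for illustration; the paper's contribution is proving such rewrites safe interprocedurally and across the async, finish, and isolated constructs.

```java
// Load elimination in before/after form. Point and the loop are
// illustrative, not from the paper.
final class LoadElim {
    static final class Point { int x; }

    // Before: every iteration performs a getfield of p.x.
    static int sumBefore(Point p, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += p.x;   // repeated load
        return sum;
    }

    // After: the load happens once into a compiler temporary, which is
    // legal only if no other task or callee can write p.x in between.
    static int sumAfter(Point p, int n) {
        int t = p.x;                               // single load
        int sum = 0;
        for (int i = 0; i < n; i++) sum += t;      // temporary reused
        return sum;
    }
}
```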


Languages and Compilers for Parallel Computing | 2005

Efficient computation of may-happen-in-parallel information for concurrent Java programs

Rajkishore Barik

Modeling of runtime threads in static analysis of concurrent programs plays an important role in both reducing the complexity and improving the precision of the analysis. Modeling based on type-based techniques merges all runtime instances of a particular type and thereby introduces inaccuracy into the analysis. Other approaches model individual runtime threads explicitly in the analysis and are of high complexity. In this paper we introduce a thread model that is both context- and flow-sensitive. Individual thread abstractions are identified based on the context and multiplicity of the creation site. The interaction among these abstract threads is depicted in a tree structure known as the Thread Creation Tree (TCT). The TCT structure is subsequently exploited to efficiently compute May-Happen-in-Parallel (MHP) information for the analysis of multi-threaded programs. For concurrent Java programs, our MHP computation algorithm runs 1.77x faster on average than the previously reported MHP computation algorithm.
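A minimal sketch of the data structure might look as follows. The fields are illustrative guesses at what a context- and multiplicity-sensitive thread abstraction needs, not the paper's exact representation.

```java
// Toy Thread Creation Tree (TCT): abstract threads identified by the
// context and multiplicity of their creation site. Field names are
// illustrative assumptions.
import java.util.ArrayList;
import java.util.List;

final class TctNode {
    final String creationSite;      // e.g. "Main.run:42" (hypothetical format)
    final String callingContext;    // context string making the node context-sensitive
    final boolean multiple;         // can this site create more than one runtime thread?
    final TctNode parent;
    final List<TctNode> children = new ArrayList<>();

    TctNode(String site, String ctx, boolean multiple, TctNode parent) {
        this.creationSite = site;
        this.callingContext = ctx;
        this.multiple = multiple;
        this.parent = parent;
        if (parent != null) parent.children.add(this);
    }

    // Example MHP-style query on the tree: a thread created in a loop
    // (multiple == true) may run in parallel with itself.
    boolean mayHappenInParallelWithSelf() { return multiple; }
}
```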


Symposium on Code Generation and Optimization | 2014

Efficient Mapping of Irregular C++ Applications to Integrated GPUs

Rajkishore Barik; Rashid Kaleem; Deepak Majeti; Brian T. Lewis; Tatiana Shpeisman; Chunling Hu; Yang Ni; Ali-Reza Adl-Tabatabai

There is growing interest in using GPUs to accelerate general-purpose computation, since they offer the potential of massive parallelism with reduced energy consumption. This interest has been encouraged by the ubiquity of integrated processors that combine a GPU and CPU on the same die, lowering the cost of offloading work to the GPU. However, while the majority of effort has focused on GPU acceleration of regular applications, relatively little is known about the behavior of irregular applications on GPUs. These applications are expected to perform poorly on GPUs without major software engineering effort. We present a compiler framework with support for C++ features that enables GPU acceleration of a wide range of C++ applications with minimal changes. This framework, Concord, includes a low-cost software shared virtual memory (SVM) implementation that permits seamless sharing of pointer-containing data structures between the CPU and GPU. It also includes compiler optimizations to improve irregular application performance on GPUs. Using Concord, we ran nine irregular C++ programs on two computer systems containing Intel 4th Generation Core processors. One system is an Ultrabook with an integrated HD Graphics 5000 GPU, and the other is a desktop with an integrated HD Graphics 4600 GPU. The nine applications are pointer-intensive and operate on irregular data structures such as trees and graphs; they include face detection, BTree, single-source shortest path, soft-body physics simulation, and breadth-first search. Our results show that Concord acceleration using the GPU improves energy efficiency by up to 6.04× on the Ultrabook and 3.52× on the desktop over multicore-CPU execution.


International Parallel and Distributed Processing Symposium | 2011

Communication Optimizations for Distributed-Memory X10 Programs

Rajkishore Barik; Jisheng Zhao; David Grove; Igor Peshansky; Zoran Budimlic; Vivek Sarkar

X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation and communication structures with higher productivity than other distributed-memory programming models. However, this productivity often comes at the cost of high performance overhead when the language is used in its full generality. This paper introduces high-level compiler optimizations and transformations to reduce communication and synchronization overheads in distributed-memory implementations of X10 programs. Specifically, we focus on locality optimizations such as scalar replacement and task localization, combined with supporting transformations such as loop distribution, scalar expansion, loop tiling, and loop splitting. We have completed a prototype implementation of these high-level optimizations, and performed a performance evaluation that shows significant improvements in performance, scalability, communication volume, and number of tasks. We evaluated the communication optimizations on three platforms: a 128-node Blue Gene/P cluster, a 32-node Nehalem cluster, and a 16-node Power7 cluster. On the Blue Gene/P cluster, we observed a maximum performance improvement of 31.46x relative to the unoptimized case (for the MolDyn benchmark). On the Nehalem cluster, we observed a maximum performance improvement of 3.01x (for the NQueens benchmark), and on the Power7 cluster, we observed a maximum performance improvement of 2.73x (for the MolDyn benchmark). In addition, there was no case in which the optimized code was slower than the unoptimized case. We also believe that the optimizations presented in this paper will be necessary for any high-productivity PGAS language based on modern object-oriented principles that is designed for execution on future Extreme Scale systems, which place a high premium on locality improvement for performance and energy efficiency.
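Scalar replacement, one of the locality optimizations mentioned above, can be illustrated as a before/after pair. The RemoteRef class and remoteRead stub below are hypothetical stand-ins for the place-shifting communication an unoptimized X10 program would perform on every access.

```java
// Scalar replacement of a remote field access (X10-style idea, shown as
// plain Java). remoteRead is a stub standing in for the cross-place
// communication a PGAS runtime would otherwise emit; names are hypothetical.
final class CommOpt {
    static final class RemoteRef { int field; }

    // Stub: in a real PGAS runtime this would be a message to another place.
    static int remoteRead(RemoteRef r) { return r.field; }

    static int reduceBefore(RemoteRef r, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += remoteRead(r);        // one communication per iteration
        return sum;
    }

    static int reduceAfter(RemoteRef r, int n) {
        int t = remoteRead(r);           // hoisted: a single communication
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += t;                    // local reuse, no messages
        return sum;
    }
}
```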
