Zoran Budimlic | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Zoran Budimlic is active.

Explore More

Publication

Featured researches published by Zoran Budimlic.

Scientific Programming - Exploring Languages for Expressing Medium to Massive On-Chip Parallelism archive | 2010

Concurrent Collections

Zoran Budimlic; Michael G. Burke; Vincent Cavé; Kathleen Knobe; Geoff Lowney; Ryan R. Newton; Jens Palsberg; David M. Peixotto; Vivek Sarkar; Frank Schlimbach; Sagnak Tasirlar

We introduce the Concurrent Collections (CnC) programming model. CnC supports flexible combinations of task and data parallelism while retaining determinism. CnC is implicitly parallel, with the user providing high-level operations along with semantic ordering constraints that together form a CnC graph. n nWe formally describe the execution semantics of CnC and prove that the model guarantees deterministic computation. We evaluate the performance of CnC implementations on several applications and show that CnC offers performance and scalability equivalent to or better than that offered by lower-level parallel programming models.

programming language design and implementation | 2002

Fast copy coalescing and live-range identification

Zoran Budimlic; Keith D. Cooper; Timothy J. Harvey; Ken Kennedy; Timothy S. Oberg; Steven W. Reeves

This paper presents a fast new algorithm for modeling and reasoning about interferences for variables in a program without constructing an interference graph. It then describes how to use this information to minimize copy insertion for &fgr;-node instantiation during the conversion of the static single assignment (SSA) form into the control-flow graph (CFG), effectively yielding a new, very fast copy coalescing and live-range identification algorithm.This paper proves some properties of the SSA form that enable construction of data structures to compute interference information for variables that are considered for folding. The asymptotic complexity of our SSA-to-CFG conversion algorithm is where-is the number of instructions in the program.Performing copy folding during the SSA-to-CFG conversion eliminates the need for a separate coalescing phase while simplifying the intermediate code. This may make graph-coloring register allocation more practical in just in time (JIT) and other time-critical compilers For example, Suns Hotspot Server Compiler already employs a graph-coloring register allocator[10].This paper also presents an improvement to the classical interference-graph based coalescing optimization that shows adecrease in memory usage of up to three orders of magnitude and a decrease of a factor of two in compilation time, while providing the exact same results.We present experimental results that demonstrate that our algorithm is almost as precise (within one percent on average) as the improved interference-graph-based coalescing algorithm, while requiring three times less compilation time.

international parallel and distributed processing symposium | 2013

Integrating Asynchronous Task Parallelism with MPI

Sanjay Chatterjee; Sagnak Tasirlar; Zoran Budimlic; Vincent Cavé; Milind Chabbi; Max Grossman; Vivek Sarkar; Yonghong Yan

Effective combination of inter-node and intra-node parallelism is recognized to be a major challenge for future extreme-scale systems. Many researchers have demonstrated the potential benefits of combining both levels of parallelism, including increased communication-computation overlap, improved memory utilization, and effective use of accelerators. However, current “hybrid programming” approaches often require significant rewrites of application code and assume a high level of programmer expertise. Dynamic task parallelism has been widely regarded as a programming model that combines the best of performance and programmability for shared-memory programs. For distributed-memory programs, most users rely on efficient implementations of MPI. In this paper, we propose HCMPI (Habanero-C MPI), an integration of the Habanero-C dynamic task-parallel programming model with the widely used MPI message-passing interface. All MPI calls are treated as asynchronous tasks in this model, thereby enabling unified handling of messages and tasking constructs. For programmers unfamiliar with MPI, we introduce distributed data-driven futures (DDDFs), a new data-flow programming model that seamlessly integrates intra-node and inter-node data-flow parallelism without requiring any knowledge of MPI. Our novel runtime design for HCMPI and DDDFs uses a combination of dedicated communication and computation specific worker threads. We evaluate our approach on a set of micro-benchmarks as well as larger applications and demonstrate better scalability compared to the most efficient MPI implementations, while offering a unified programming model to integrate asynchronous task parallelism with distributed-memory parallelism.

conference on object-oriented programming systems, languages, and applications | 2009

The habanero multicore software research project

Rajkishore Barik; Zoran Budimlic; Vincent Cavé; Sanjay Chatterjee; Yi Guo; David M. Peixotto; Raghavan Raman; Jun Shirako; Sagnak Tasirlar; Yonghong Yan; Yisheng Zhao; Vivek Sarkar

Multiple programming models are emerging to address an increased need for dynamic task parallelism in multicore shared-memory multiprocessors. This poster describes the main components of Rice Universitys Habanero Multicore Software Research Project, which proposes a new approach to multicore software enablement based on a two-level programming model consisting of a higher-level coordination language for domain experts and a lower-level parallel language for programming experts.

languages, compilers, and tools for embedded systems | 2012

Mapping a data-flow programming model onto heterogeneous platforms

Alina Simion Sbîrlea; Yi Zou; Zoran Budimlic; Jason Cong; Vivek Sarkar

In this paper we explore mapping of a high-level macro data-flow programming model called Concurrent Collections (CnC) onto heterogeneous platforms in order to achieve high performance and low energy consumption while preserving the ease of use of data-flow programming. Modern computing platforms are becoming increasingly heterogeneous in order to improve energy efficiency. This trend is clearly seen across a diverse spectrum of platforms, from small-scale embedded SOCs to large-scale super-computers. However, programming these heterogeneous platforms poses a serious challenge for application developers. We have designed a software flow for converting high-level CnC programs to the Habanero-C language. CnC programs have a clear separation between the application description, the implementation of each of the application components and the abstraction of hardware platform, making it an excellent programming model for domain experts. Domain experts can later employ the help of a tuning expert (either a compiler or a person) to tune their applications with minimal effort. We also extend the Habanero-C runtime system to support work-stealing across heterogeneous computing devices and introduce task affinity for these heterogeneous components to allow users to fine tune the runtime scheduling decisions. We demonstrate a working example that maps a pipeline of medical image-processing algorithms onto a prototype heterogeneous platform that includes CPUs, GPUs and FPGAs. For the medical imaging domain, where obtaining fast and accurate results is a critical step in diagnosis and treatment of patients, we show that our model offers up to 17.72X speedup and an estimated usage of 0.52X of the power used by CPUs alone, when using accelerators (GPUs and FPGAs) and CPUs.

conference on object-oriented programming systems, languages, and applications | 2011

Delegated isolation

Roberto Lublinerman; Jisheng Zhao; Zoran Budimlic; Swarat Chaudhuri; Vivek Sarkar

Isolation---the property that a task can access shared data without interference from other tasks---is one of the most basic concerns in parallel programming. In this paper, we present Aida, a new model of isolated execution for parallel programs that perform frequent, irregular accesses to pointer-based shared data structures. The three primary benefits of Aida are dynamism, safety and liveness guarantees, and programmability. First, Aida allows tasks to dynamically select and modify, in an isolated manner, arbitrary fine-grained regions in shared data structures, all the while maintaining a high level of concurrency. Consequently, the model can achieve scalable parallelization of regular as well as irregular shared-memory applications. Second, the model offers freedom from data races, deadlocks, and livelocks. Third, no extra burden is imposed on programmers, who access the model via a simple, declarative isolation construct that is similar to that for transactional memory. The key new insight in Aida is a notion of delegation among concurrent isolated tasks (known in Aida as assemblies). Each assembly A is equipped with a region in the shared heap that it owns---the only objects accessed by A are those it owns, guaranteeing race-freedom. The region owned by A can grow or shrink flexibly---however, when A needs to own a datum owned by B, A delegates itself, as well as its owned region, to B. From now on, B has the responsibility of re-executing the task A set out to complete. Delegation as above is the only inter-assembly communication primitive in Aida. In addition to reducing contention in a local, data-driven manner, it guarantees freedom from deadlocks and livelocks.n We offer an implementation of Aida on top of the Habanero Java parallel programming language. The implementation employs several novel ideas, including the use of a union-find data structure to represent tasks and the regions that they own. A thorough evaluation using several irregular data-parallel benchmarks demonstrates the low overhead and excellent scalability of Aida, as well as its benefits over existing approaches to declarative isolation. Our results show that Aida performs on par with the state-of-the-art customized implementations of irregular applications and much better than coarse-grained locking and transactional memory approaches.

european conference on object oriented programming | 2012

Practical permissions for race-free parallelism

Edwin M. Westbrook; Jisheng Zhao; Zoran Budimlic; Vivek Sarkar

Type systems that prevent data races are a powerful tool for parallel programming, eliminating whole classes of bugs that are both hard to find and hard to fix. Unfortunately, it is difficult to apply previous such type systems to real programs, as each of them are designed around a specific synchronization primitive or parallel pattern, such as locks or disjoint heaps; real programs often have to combine multiple synchronization primitives and parallel patterns. In this work, we present a new permissions-based type system, which we demonstrate is practical by showing that it supports multiple patterns (e.g., task parallelism, object isolation, array-based parallelism), and by applying it to a suite of non-trivial parallel programs. Our system also has a number of theoretical advances over previous work on permissions-based type systems, including aliased write permissions and a simpler way to store permissions in objects than previous approaches.

languages and compilers for parallel computing | 2010

CnC-CUDA: declarative programming for GPUs

Max Grossman; Alina Simion Sbîrlea; Zoran Budimlic; Vivek Sarkar

The computer industry is at a major inflection point in its hardware roadmap due to the end of a decades-long trend of exponentially increasing clock frequencies. Instead, future computer systems are expected to be built using homogeneous and heterogeneous many-core processors with 10s to 100s of cores per chip, and complex hardware designs to address the challenges of concurrency, energy efficiency and resiliency. Unlike previous generations of hardware evolution, this shift towards many-core computing will have a profound impact on software. These software challenges are further compounded by the need to enable parallelism in workloads and application domains that traditionally did not have to worry about multiprocessor parallelism in the past. A recent trend in mainstream desktop systems is the use of graphics processor units (GPUs) to obtain order-of-magnitude performance improvements relative to general-purpose CPUs. Unfortunately, hybrid programming models that support multithreaded execution on CPUs in parallel with CUDA execution on GPUs prove to be too complex for use by mainstream programmers and domain experts, especially when targeting platforms with multiple CPU cores and multiple GPU devices. n nIn this paper, we extend past work on Intels Concurrent Collections (CnC) programming model to address the hybrid programming challenge using a model called CnC-CUDA. CnC is a declarative and implicitly parallel coordination language that supports flexible combinations of task and data parallelism while retaining determinism. CnC computations are built using steps that are related by data and control dependence edges, which are represented by a CnC graph. The CnC-CUDA extensions in this paper include the definition of multithreaded steps for execution on GPUs, and automatic generation of data and control flow between CPU steps and GPU steps. Experimental results show that this approach can yield significant performance benefits with both GPU execution and hybrid CPU/GPU execution.

international parallel and distributed processing symposium | 2011

Communication Optimizations for Distributed-Memory X10 Programs

Rajkishore Barik; Jisheng Zhao; David Grove; Igor Peshansky; Zoran Budimlic; Vivek Sarkar

X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation and communication structures with higher productivity than other distributed-memory programming models. However, this productivity often comes at the cost of high performance overhead when the language is used in its full generality. This paper introduces high-level compiler optimizations and transformations to reduce communication and synchronization overheads in distributed-memory implementations of X10 programs. Specifically, we focus on locality optimizations such as scalar replacement and task localization, combined with supporting transformations such as loop distribution, scalar expansion, loop tiling, and loop splitting. We have completed a prototype implementation of these high-level optimizations, and performed a performance evaluation that shows significant improvements in performance, scalability, communication volume and number of tasks. We evaluated the communication optimizations on three platforms: a 128-node Blue Gene/P cluster, a 32-node Nehalem cluster, and a 16-node Power7 cluster. On the Blue Gene/P cluster, we observed a maximum performance improvement of 31.46x relative to the unoptimized case (for the MolDyn benchmark). On the Nehalem cluster, we observed a maximum performance improvement of 3.01x (for the NQueens benchmark) and on the Power7 cluster, we observed a maximum performance improvement of 2.73x (for the MolDyn benchmark). In addition, there was no case in which the optimized code was slower than the unoptimized case. We also believe that the optimizations presented in this paper will be necessary for any high-productivity PGAS language based on modern object-oriented principles, that is designed for execution on future Extreme Scale systems that place a high premium on locality improvement for performance and energy efficiency.

ieee high performance extreme computing conference | 2016

The Open Community Runtime: A runtime system for extreme scale computing

Timothy G. Mattson; Romain Cledat; Vincent Cavé; Vivek Sarkar; Zoran Budimlic; Sanjay Chatterjee; Joshua B. Fryman; Ivan Ganev; Robin Knauerhase; Min Lee; Benoît Meister; Brian R. Nickerson; Nick Pepperling; Bala Seshasayee; Sagnak Tasirlar; Justin Teller; Nick Vrvilo

The Open Community Runtime (OCR) is a new runtime system designed to meet the needs of extreme-scale computing. While there is growing support for the idea that future execution models will be based on dynamic tasks, there is little agreement on what else should be included. OCR minimally adds events for synchronization and relocatable data-blocks for data management to form a complete system that supports a wide range of higher-level programming models. This paper lays out the fundamental concepts behind OCR and compares OCR performance to that from MPI for two simple benchmarks. OCR has been developed within an open community model with features supporting flexible algorithm expression weighed against the expected realities of extreme-scale computing: power-constrained execution, aggressive growth in the number of compute resources, deepening memory hierarchies and a low mean-time between failures.

Explore More