
Publications


Featured research published by Cosmin E. Oancea.


Proceedings of the 1st International Workshop on Multicore Software Engineering | 2008

Software thread-level speculation: an optimistic library implementation

Cosmin E. Oancea; Alan Mycroft

Software thread-level speculation (TLS) solutions tend to mirror the hardware ones, in the sense that they employ one exact dependency-tracking mechanism. Our perspective is that software flexibility is better exploited by a family of lighter, if less precise, speculative models that can be combined into an effective configuration that takes advantage of the application's code patterns. This paper reports two main contributions. First, it introduces SpLSC: a software TLS model that trades the potential for false-positive violations for a small memory overhead and an efficient implementation. Second, it presents PolyLibTLS: a library that encapsulates several lightweight models and enables their composition. In this context, we report on the template meta-programming techniques that we used to achieve performance and safety while preserving the library's modularity, extensibility, and usability. Furthermore, we demonstrate that the user-framework interaction is straightforward, and we present parallelization timing results that validate our high-level perspective.
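
As a rough illustration of the trade-off described above, the sketch below shows a lightweight speculative-memory model in miniature: dependence tracking through a small hashed table of write marks, so distinct addresses may collide and raise false-positive violations in exchange for a fixed, small metadata footprint. All names are illustrative; this is not the PolyLibTLS API.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical lightweight speculative-memory model: a fixed-size, hashed
// table of "last writer" marks. Hash collisions can flag violations between
// iterations that are in fact independent (false positives), which is the
// price paid for the small, constant metadata overhead.
struct SpecMem {
    static constexpr std::size_t kEntries = 1 << 12;
    std::atomic<std::uint32_t> lastWriter[kEntries];

    SpecMem() { for (auto& w : lastWriter) w.store(0); }

    std::size_t slot(const void* p) const {        // lossy hash: collisions
        return (reinterpret_cast<std::uintptr_t>(p) >> 3) & (kEntries - 1);
    }
    void specStore(int* p, int v, std::uint32_t tid) {
        lastWriter[slot(p)].store(tid + 1);        // remember (hashed) writer
        *p = v;
    }
    int specLoad(const int* p, std::uint32_t tid, bool& violation) {
        std::uint32_t w = lastWriter[slot(p)].load();
        if (w != 0 && w != tid + 1) violation = true;   // cross-thread flow
        return *p;
    }
};

int main() {
    SpecMem mem;
    int a = 0;
    bool violation = false;
    mem.specStore(&a, 1, 0);            // thread 0 writes a
    mem.specLoad(&a, 1, violation);     // thread 1 reads a speculatively
    std::printf("violation flagged: %s\n", violation ? "yes" : "no");
}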


Programming Language Design and Implementation | 2012

Logical inference techniques for loop parallelization

Cosmin E. Oancea; Lawrence Rauchwerger

This paper presents a fully automatic approach to loop parallelization that integrates static and run-time analysis and thus overcomes many known difficulties, such as nonlinear and indirect array indexing and complex control flow. Our hybrid analysis framework validates the parallelization transformation by verifying the independence of the loop's memory references. To this end, it represents array references in the USR (uniform set representation) language and expresses the independence condition as an equation, S = ∅, where S is a set expression representing array indexes. Using a language instead of an array-abstraction representation for S results in fewer conservative approximations, but exhibits a potentially high runtime cost. To alleviate this cost we introduce a language translation F from the USR set-expression language to an equally rich language of predicates, such that F(S) implies S = ∅. Loop parallelization is then validated using a novel logic inference algorithm that factorizes the obtained complex predicates F(S) into a sequence of sufficient independence conditions that are evaluated first statically and, when needed, dynamically, in increasing order of their estimated complexities. We evaluate our automated solution on 26 benchmarks from the PERFECT Club and SPEC suites and show that our approach is effective in parallelizing large, complex loops and obtains much better full-program speedups than the Intel and IBM Fortran compilers.
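
A hedged sketch of the cascade idea may help: the independence condition S = ∅ is weakened into sufficient conditions ordered by estimated cost, and evaluation stops at the first one that succeeds. The tests below are illustrative stand-ins, not the paper's USR machinery.

#include <algorithm>
#include <cstdio>
#include <unordered_set>
#include <vector>

// Independence of a loop whose iterations write a[writes[i]] and read
// a[reads[i]]: cheap sufficient conditions run before exact ones.
bool independent(const std::vector<int>& writes, const std::vector<int>& reads) {
    // Condition 1 (cheap, O(n)): the write and read index ranges do not
    // overlap, so their intersection S is trivially empty.
    auto [wlo, whi] = std::minmax_element(writes.begin(), writes.end());
    auto [rlo, rhi] = std::minmax_element(reads.begin(), reads.end());
    if (*whi < *rlo || *rhi < *wlo) return true;

    // Condition 2 (heavier fallback): test S = ∅ by exact intersection.
    std::unordered_set<int> w(writes.begin(), writes.end());
    for (int r : reads)
        if (w.count(r)) return false;
    return true;
}

int main() {
    std::printf("%d\n", independent({0, 2, 4}, {5, 7, 9}));  // cheap test: 1
    std::printf("%d\n", independent({0, 2, 4}, {3, 5, 2}));  // exact test: 0
}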


International Symposium on Memory Management | 2009

A new approach to parallelising tracing algorithms

Cosmin E. Oancea; Alan Mycroft; Stephen M. Watt

Tracing algorithms visit reachable nodes in a graph and are central to activities such as garbage collection and marshaling. Traditional sequential algorithms use a worklist, replacing a node with its unvisited children. Previous work on parallel tracing is processor-oriented in associating one worklist per processor: worklist insertion and removal requires no locking, and load balancing requires only occasional locking. However, since multiple queues may contain the same node, significant locking is necessary to avoid concurrent visits by competing processors. This paper presents a memory-oriented solution: memory is partitioned into segments and each segment has its own worklist containing only nodes in that segment. At a given time, at most one processor owns a given worklist. By arranging separate single-reader-single-writer forwarding queues to pass nodes from processor i to processor j, we can process objects in an order that gives lock-free mainline code and improved locality of reference. This refactoring is analogous to the way in which a compiler changes an iteration space to eliminate data dependencies. While it is clear that our solution can be more effective on NUMA systems, and even necessary when processor-local memory may not be addressed from other processors, slightly surprisingly, it often gives significantly better speed-up on modern multi-core architectures too: using caches to hide memory latency loses much of its effectiveness when there is significant cross-processor memory contention or when locking is necessary.
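
The ownership discipline can be sketched sequentially (a minimal sketch; the paper's design runs one owner per worklist concurrently, with single-reader-single-writer queues doing the forwarding):

#include <cstdio>
#include <deque>
#include <vector>

// Toy heap: node indices, each with child indices.
struct Node { std::vector<int> children; bool marked = false; };

int main() {
    std::vector<Node> heap(8);
    heap[0].children = {3, 5};
    heap[3].children = {5, 7};
    heap[5].children = {0};

    const int kSegs = 2, kSegSize = 4;          // memory split into segments
    auto segOf = [&](int n) { return n / kSegSize; };
    std::deque<int> worklist[kSegs];            // one worklist per segment
    worklist[segOf(0)].push_back(0);            // root

    bool pending = true;
    while (pending) {
        // Each segment's worklist is drained only by its current owner, so
        // marking needs no locks; cross-segment edges are forwarded to the
        // owning segment's worklist (standing in for the SRSW queues).
        for (int s = 0; s < kSegs; ++s) {
            while (!worklist[s].empty()) {
                int n = worklist[s].front();
                worklist[s].pop_front();
                if (heap[n].marked) continue;
                heap[n].marked = true;
                for (int c : heap[n].children)
                    worklist[segOf(c)].push_back(c);
            }
        }
        pending = false;
        for (int s = 0; s < kSegs; ++s)
            pending = pending || !worklist[s].empty();
    }
    for (int n = 0; n < 8; ++n)
        if (heap[n].marked) std::printf("reachable: %d\n", n);
}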


Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming | 2014

Bounds Checking: An Instance of Hybrid Analysis

Troels Henriksen; Cosmin E. Oancea

This paper presents an analysis for bounds checking of array subscripts that lifts checking assertions to program level in the form of an arbitrarily complex predicate (inspector), whose runtime evaluation guards the execution of the code of interest. Separating the predicate from the computation makes it more amenable to optimization, and allows it to be split into a cascade of sufficient conditions of increasing complexity that optimizes the common inspection path. While synthesizing the bounds-checking invariant resembles type-checking techniques, we rely on compiler simplification and runtime evaluation rather than on complex inference and annotation systems that might discourage the non-specialist user. We integrate the analysis into the repertoire of the compiler for Futhark, a purely functional core language supporting map-reduce nested parallelism on regular arrays, and show how the high-level language invariants enable a relatively straightforward analysis. Finally, we report a qualitative evaluation of our technique on three real-world applications from the financial domain, which indicates that the runtime overhead of the predicates is negligible.
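
To make the inspector idea concrete, here is a minimal sketch; the predicates in the paper are synthesized by the compiler and can be arbitrarily complex, whereas this hand-written one assumes a single affine subscript with a positive stride:

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> a(100);
    const int lo = 2, hi = 97, stride = 1;   // loop bounds and affine stride

    // Hypothetical inspector: all per-iteration bounds checks of the loop
    // below, hoisted into one predicate. For the subscript i * stride with
    // stride > 0, checking the first and last iterations is a sufficient
    // condition covering every iteration in between.
    auto inspector = [&] {
        long first = static_cast<long>(lo) * stride;
        long last = static_cast<long>(hi - 1) * stride;
        return 0 <= first && last < static_cast<long>(a.size());
    };

    if (inspector()) {
        for (int i = lo; i < hi; ++i) a[i * stride] += 1;   // check-free body
    } else {
        std::printf("inspector failed: fall back to the checked version\n");
    }
}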


Functional High Performance Computing | 2012

Financial software on GPUs: between Haskell and Fortran

Cosmin E. Oancea; Christian Andreetta; Jost Berthold; Alain Frisch; Fritz Henglein

This paper presents a real-world pricing kernel for financial derivatives and evaluates the language and compiler tool chain that would allow expressive, hardware-neutral algorithm implementation and efficient execution on graphics processing units (GPUs). The language issues concern preserving algorithmic invariants, e.g., inherent parallelism made explicit by map-reduce-scan functional combinators. Efficient execution is achieved by manually applying a series of generally applicable compiler transformations that allow the generated OpenCL code to yield speedups as high as 70x and 540x on a commodity mobile and desktop GPU, respectively. Apart from the concrete speedups attained, our contributions are twofold. First, from a language perspective, we illustrate that even state-of-the-art auto-parallelization techniques are incapable of discovering all the requisite data parallelism when the functional code is rendered in Fortran-style imperative array-processing form. Second, from a performance perspective, we study which compiler transformations are necessary to map the high-level functional code to hand-optimized OpenCL code for GPU execution. We discover a rich optimization space with nontrivial trade-offs and cost models: memory reuse in map-reduce patterns, strength reduction, branch-divergence optimization, and memory-access coalescing each exhibit significant impact individually; combined, they enable essentially full utilization of all GPU cores. Functional programming has played a crucial double role in our case study: capturing the naturally data-parallel structure of the pricing algorithm in a transparent, reusable, and entirely hardware-independent fashion; and supporting the correctness of the subsequent compiler transformations to a hardware-oriented target language with a rich class of universally valid equational properties. Given the observed difficulty of automatically parallelizing imperative sequential code and the inherent labor of porting hardware-oriented and -optimized programs, our case study suggests that functional programming technology can facilitate the high-level expression of performant, portable systems for massively parallel hardware architectures.
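
One of the transformations named above, memory reuse in map-reduce patterns, can be sketched with standard C++ combinators (the payoff function is a toy stand-in, not the paper's pricing kernel):

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Toy payoff standing in for the per-path work of the pricing kernel.
double payoff(double path) { return path > 100.0 ? path - 100.0 : 0.0; }

int main() {
    std::vector<double> paths = {90.0, 105.0, 120.0, 99.5};

    // Unfused map-then-reduce: the map materializes an intermediate array.
    std::vector<double> tmp(paths.size());
    std::transform(paths.begin(), paths.end(), tmp.begin(), payoff);
    double unfused = std::accumulate(tmp.begin(), tmp.end(), 0.0) / paths.size();

    // Fused: the map runs inside the reduction, so no temporary is written.
    // On a GPU this kind of reuse saves a kernel launch and a full round
    // trip through global memory.
    double fused = std::transform_reduce(paths.begin(), paths.end(), 0.0,
                                         std::plus<>{}, payoff) / paths.size();

    std::printf("unfused = %f, fused = %f\n", unfused, fused);
}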


Functional High Performance Computing | 2014

Size slicing: a hybrid approach to size inference in Futhark

Troels Henriksen; Martin Elsman; Cosmin E. Oancea

We present a shape inference analysis for a purely functional language, named Futhark, that supports nested parallelism via array combinators such as map, reduce, filter, and scan. Our approach is to infer code for computing precise shape information at run time, which in the most common cases can be effectively optimized by standard compiler optimizations. Instead of restricting the language or sacrificing ease of use, the language allows the occasional shape-dynamic, and even shape-misbehaving, construct. Inherently shape-dynamic code is treated with a fall-back technique that preserves, asymptotically, the number of operations of the program and that computes and returns the array's shape alongside its value. This approach leads to a shape-dependent system with existentially quantified types, where static shape inference corresponds to eliminating existential quantifications from the types of program expressions. We optimize the common case to negligible overhead via size slicing: a technique that separates the computation of an array's shape from that of its values. This allows the shape to be calculated in advance and used to instantiate the previously existentially quantified shape of the value slice. We report negligible overhead on several mini-benchmarks and three real-world applications.
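
A minimal sketch of the distinction, written in C++ rather than Futhark (f and p are arbitrary stand-ins):

#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> xs = {1, 2, 3}, ys = {10, 20};

    // Shape slice of `concat (map f xs) ys`: the result size is
    // |xs| + |ys|, computable without evaluating f at all.
    std::size_t shape = xs.size() + ys.size();
    std::vector<int> out;
    out.reserve(shape);                        // exact allocation, up front

    for (int x : xs) out.push_back(x * x);     // value slice: map f
    for (int y : ys) out.push_back(y);         // value slice: concat

    // Shape-dynamic case: |filter p xs| depends on the data, so the
    // fall-back computes shape and values together.
    std::size_t kept = 0;
    for (int x : xs) kept += (x % 2 == 0);

    std::printf("sliced shape: %zu, dynamic shape: %zu\n", shape, kept);
}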


Languages and Compilers for Parallel Computing | 2008

Set-Congruence Dynamic Analysis for Thread-Level Speculation (TLS)

Cosmin E. Oancea; Alan Mycroft

The move to multi-core has increased interest in parallelizing sequential programs. Classical dependency-based techniques, although successful for some classes of programs, often fail due to the one-sided (conservative) approximation of program behavior. Thread-level speculation enables increased parallelism by allowing out-of-order execution: correct dependences are ensured by run-time monitoring and possible rollbacks. Two-sided approximations of program behavior are now acceptable if the rollback ratio is kept small. We describe dynamic analyses, based on representing dependencies as modular congruences, which can be employed on-line or off-line. One, thread partitioning, efficiently enables loop iterations to be allocated to threads (and calculates the maximum effective concurrency); the other, fine-grain memory partitioning, calculates a hash function that reduces space overhead and performance loss due to TLS-metadata-based and cache-based task interference.
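
In miniature, the thread-partitioning idea looks as follows (a hypothetical access pattern, not one of the paper's benchmarks):

#include <cstdio>

int main() {
    // Suppose a dynamic analysis has established that iteration i only
    // writes a[(c * i) % m]; the congruence class of the subscript then
    // gives a partition of iterations onto threads in which conflicting
    // iterations (equal subscripts) always land on the same thread.
    const int kThreads = 4, c = 6, m = 8;
    for (int i = 0; i < 8; ++i) {
        int subscript = (c * i) % m;
        int thread = subscript % kThreads;   // congruence-based partition
        std::printf("iteration %d writes a[%d] on thread %d\n",
                    i, subscript, thread);
    }
}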


Conference on Object-Oriented Programming Systems, Languages, and Applications | 2005

Parametric polymorphism for software component architectures

Cosmin E. Oancea; Stephen M. Watt

Parametric polymorphism has become a common feature of mainstream programming languages, but software component architectures have lagged behind and do not support it. We examine the problem of providing parametric polymorphism with components combined from different programming languages. We have investigated how to resolve different binding times and parametrization semantics in a range of representative languages and have identified a common ground that can be suitably mapped to different language bindings. We present a generic component architecture extension that provides support for parameterized components and that can be easily adapted to work on top of various software component architectures in use today (e.g., CORBA, DCOM, JNI). We have implemented and tested this architecture on top of CORBA. We also present the Generic Interface Definition Language (GIDL), an extension of CORBA IDL supporting generic types, and we describe language bindings for C++, Java, and Aldor. We explain our implementation of GIDL, consisting of a GIDL-to-IDL compiler and tools for generating linkage code under the language bindings. We demonstrate how this architecture can be used to access C++'s STL and Aldor's BasicMath libraries in a multi-language environment, and we discuss our mappings in the context of automatic library interface generation.
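
The essence of such a mapping can be sketched in a few lines of C++, where ErasedStack and Stack are invented stand-ins for an IDL-level interface and a generated language binding, respectively: a component layer that understands only one erased value type still supports parametric polymorphism if generated, typed wrapper stubs do the boxing and unboxing at the boundary.

#include <any>
#include <cstdio>
#include <utility>
#include <vector>

// What a non-generic component layer (think IDL) can express: erased values.
struct ErasedStack {
    std::vector<std::any> items;
    void push(std::any v) { items.push_back(std::move(v)); }
    std::any pop() { std::any v = items.back(); items.pop_back(); return v; }
};

// What the language binding generates: a typed facade per instantiation.
template <typename T>
struct Stack {
    ErasedStack impl;                       // calls cross the component layer
    void push(T v) { impl.push(std::move(v)); }
    T pop() { return std::any_cast<T>(impl.pop()); }
};

int main() {
    Stack<int> s;                           // type safety restored client-side
    s.push(1);
    s.push(2);
    std::printf("%d\n", s.pop());           // prints 2
}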


International Symposium on Symbolic and Algebraic Computation | 2005

Domains and expressions: an interface between two approaches to computer algebra

Cosmin E. Oancea; Stephen M. Watt

This paper describes a method to use compiled, strongly typed Aldor domains in the interpreted, expression-oriented Maple environment. This represents a non-traditional approach to structuring computer algebra software: using an efficient, compiled language, designed for writing large complex mathematical libraries, together with a top-level system based on user-interface priorities and ease of scripting. We examine what is required to use Aldor libraries to extend Maple in an effective and natural way. Since the computational models of Maple and Aldor differ significantly, new run-time code must implement a non-trivial semantic correspondence. Our solution allows Aldor functions to run tightly coupled to the Maple environment, able to directly and efficiently manipulate Maple data objects. We call the overall system Alma.
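
As a loose illustration of the kind of semantic correspondence involved (purely hypothetical code: a two-case variant stands in for Maple objects and a plain C++ function for a compiled Aldor routine), a boundary stub checks, unwraps, calls, and re-wraps:

#include <cstdio>
#include <stdexcept>
#include <variant>

using HostValue = std::variant<long, double>;   // toy "interpreter object"

long compiledGcd(long a, long b) {              // stands in for an Aldor routine
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a;
}

// Generated stub: validates and unwraps host values, calls the strongly
// typed routine, and re-wraps the result for the interpreter.
HostValue gcdStub(HostValue x, HostValue y) {
    if (!std::holds_alternative<long>(x) || !std::holds_alternative<long>(y))
        throw std::runtime_error("gcd expects integers");
    return compiledGcd(std::get<long>(x), std::get<long>(y));
}

int main() {
    HostValue r = gcdStub(HostValue{54L}, HostValue{24L});
    std::printf("gcd = %ld\n", std::get<long>(r));   // prints 6
}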


Languages and Compilers for Parallel Computing | 2011

A Hybrid Approach to Proving Memory Reference Monotonicity

Cosmin E. Oancea; Lawrence Rauchwerger

Array references indexed by non-linear expressions or subscript arrays represent a major obstacle to compiler analysis and to automatic parallelization. Most previously proposed solutions either enhance the static analysis repertoire to recognize more patterns, to infer array-value properties, and to refine the mathematical support, or apply expensive run-time analysis of memory reference traces to disambiguate these accesses. This paper presents an automated solution based on static construction of access summaries, in which the reference non-linearity problem can be solved for a large number of reference patterns by extracting arbitrarily shaped predicates that can (in)validate the reference monotonicity property and thus (dis)prove loop independence. Experiments on six benchmarks show that our general technique for dynamic validation of the monotonicity property covers a large class of codes, incurs minimal run-time overhead, and obtains good speedups.
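
The cheapest instance of such a predicate is easy to state (a minimal sketch; the paper extracts arbitrarily shaped predicates from access summaries, not just this linear scan):

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// The monotonicity property, checked dynamically: strictly monotonic
// subscripts imply pairwise-distinct write addresses, hence independence.
bool strictlyIncreasing(const std::vector<int>& v) {
    return std::adjacent_find(v.begin(), v.end(),
                              [](int a, int b) { return a >= b; }) == v.end();
}
bool strictlyDecreasing(const std::vector<int>& v) {
    return std::adjacent_find(v.begin(), v.end(),
                              [](int a, int b) { return a <= b; }) == v.end();
}

int main() {
    std::vector<int> idx = {2, 3, 5, 8, 13};    // subscript array of a[idx[i]]
    std::vector<double> a(16, 0.0);

    if (strictlyIncreasing(idx) || strictlyDecreasing(idx)) {
        // Validated independent: safe to distribute across threads
        // (written serially here for brevity).
        for (std::size_t i = 0; i < idx.size(); ++i) a[idx[i]] += 1.0;
        std::printf("loop validated as independent\n");
    } else {
        std::printf("monotonicity fails: execute sequentially\n");
    }
}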

Collaboration


Dive into Cosmin E. Oancea's collaborations.

Top Co-Authors

Stephen M. Watt (University of Western Ontario)
Alan Mycroft (University of Cambridge)
Martin Elsman (University of Copenhagen)
Alexandru Belu (University of Western Ontario)
Michael Lloyd (University of Western Ontario)
Yannis Chicha (University of Western Ontario)