
Publications


Featured research published by Christoph von Praun.


conference on object-oriented programming systems, languages, and applications | 2005

X10: an object-oriented approach to non-uniform cluster computing

Philippe Charles; Christian Grothoff; Vijay A. Saraswat; Christopher Michael Donawa; Allan Kielstra; Kemal Ebcioglu; Christoph von Praun; Vivek Sarkar

It is now well established that the device scaling predicted by Moore's Law is no longer a viable option for increasing the clock frequency of future uniprocessor systems at the rate that had been sustained during the last two decades. As a result, future systems are rapidly moving from uniprocessor to multiprocessor configurations, so as to use parallelism instead of frequency scaling as the foundation for increased compute capacity. The dominant emerging multiprocessor structure for the future is a Non-Uniform Cluster Computing (NUCC) system with nodes that are built out of multi-core SMP chips with non-uniform memory hierarchies, and interconnected in horizontally scalable cluster configurations such as blade servers. Unlike previous generations of hardware evolution, this shift will have a major impact on existing software. Current OO language facilities for concurrent and distributed programming are inadequate for addressing the needs of NUCC systems because they do not support the notions of non-uniform data access within a node, or of tight coupling of distributed nodes. We have designed a modern object-oriented programming language, X10, for high performance, high productivity programming of NUCC systems. A member of the partitioned global address space family of languages, X10 highlights the explicit reification of locality in the form of places; lightweight activities embodied in async, future, foreach, and ateach constructs; a construct for termination detection (finish); the use of lock-free synchronization (atomic blocks); and the manipulation of cluster-wide global data structures. We present an overview of the X10 programming model and language, experience with our reference implementation, and results from some initial productivity comparisons between the X10 and Java™ languages.
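
The finish/async vocabulary in the abstract maps onto conventional thread-based languages in a straightforward way. The following is a minimal C++ sketch (an approximation, not X10 itself) of the finish/async pattern, with a mutex standing in for X10's atomic blocks and places omitted entirely:

    // Minimal C++ approximation of X10's finish/async idiom (illustrative only).
    // X10's "finish { async S }" waits for all transitively spawned activities;
    // this sketch joins only directly spawned tasks and ignores places.
    #include <mutex>
    #include <thread>
    #include <vector>

    class Finish {
        std::vector<std::thread> tasks_;
    public:
        template <typename F>
        void async(F f) { tasks_.emplace_back(std::move(f)); }  // spawn an activity
        ~Finish() {                                             // termination detection
            for (auto& t : tasks_) t.join();
        }
    };

    int main() {
        long sum = 0;
        std::mutex m;                       // stands in for X10's atomic block
        {
            Finish finish;                  // scope plays the role of "finish { ... }"
            for (int i = 1; i <= 4; ++i)
                finish.async([&, i] {
                    std::lock_guard<std::mutex> g(m);  // "atomic { sum += i; }"
                    sum += i;
                });
        }                                   // all activities joined here
        return sum == 10 ? 0 : 1;
    }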


acm symposium on parallel algorithms and architectures | 2008

RingSTM: scalable transactions with a single atomic instruction

Michael F. Spear; Maged M. Michael; Christoph von Praun

Existing Software Transactional Memory (STM) designs attach metadata to ranges of shared memory; subsequent runtime instructions read and update this metadata in order to ensure that an in-flight transaction's reads and writes remain correct. The overhead of metadata manipulation and inspection is linear in the number of reads and writes performed by a transaction, and involves expensive read-modify-write instructions, resulting in substantial overheads. We consider a novel approach to STM, in which transactions represent their read and write sets as Bloom filters, and transactions commit by enqueuing a Bloom filter onto a global list. Using this approach, our RingSTM system requires at most one read-modify-write operation for any transaction, and incurs validation overhead linear not in transaction size, but in the number of concurrent writers who commit. Furthermore, RingSTM is the first STM that is inherently livelock-free and privatization-safe while at the same time permitting parallel writeback by concurrent disjoint transactions. We evaluate three variants of the RingSTM algorithm, and find that it offers superior performance and/or stronger semantics than the state-of-the-art TL2 algorithm under a number of workloads.
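
The central data structure is easy to sketch: each transaction summarizes its reads and writes in a fixed-size Bloom filter, and conflict detection reduces to a bitwise intersection test against the filters of concurrently committed writers. A minimal C++ sketch follows; the hash choice and filter size are arbitrary assumptions, not the paper's parameters:

    // Sketch of Bloom-filter access-set summaries as used conceptually in
    // RingSTM: membership is approximate, so an intersection test may report
    // false positives (spurious aborts) but never false negatives (missed
    // conflicts).
    #include <bitset>
    #include <cstddef>
    #include <functional>

    class Filter {
        static const std::size_t kBits = 1024;   // assumed size, not from the paper
        std::bitset<kBits> bits_;
    public:
        void add(const void* addr) {             // record an accessed address
            bits_.set(std::hash<const void*>{}(addr) % kBits);
        }
        bool intersects(const Filter& other) const {  // possible conflict?
            return (bits_ & other.bits_).any();
        }
    };

    // Commit-time validation (simplified): a transaction must abort if any
    // writer that committed during its execution wrote something it read.
    bool mustAbort(const Filter& readSet, const Filter& committedWrites) {
        return readSet.intersects(committedWrites);
    }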


acm sigplan symposium on principles and practice of parallel programming | 2006

Programming for parallelism and locality with hierarchically tiled arrays

Ganesh Bikshandi; Jia Guo; Daniel Hoeflinger; Gheorghe Almasi; Basilio B. Fraguela; María Jesús Garzarán; David A. Padua; Christoph von Praun

Tiling has proven to be an effective mechanism to develop high performance implementations of algorithms. Tiling can be used to organize computations so that communication costs in parallel programs are reduced and locality in sequential codes or sequential components of parallel programs is enhanced. In this paper, a data type - Hierarchically Tiled Arrays, or HTAs - that facilitates the direct manipulation of tiles is introduced. HTA operations are overloaded array operations. We argue that the implementation of HTAs in sequential OO languages transforms these languages into powerful tools for the development of high-performance parallel codes and codes with a high degree of locality. To support this claim, we discuss our experiences with the implementation of HTAs for MATLAB and C++ and the rewriting of the NAS benchmarks and a few other programs into HTA-based parallel form.
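
To make the tile-as-first-class-value idea concrete, here is a hypothetical C++ sketch of a two-level HTA: a matrix partitioned into tiles that can be addressed and operated on directly. The class name and interface are illustrative, not the paper's actual API:

    // Hypothetical sketch of a two-level hierarchically tiled array: an N x N
    // matrix stored as a grid of T x T tiles. Indexing a tile first and an
    // element second mirrors the HTA idea of making tiles addressable values.
    #include <vector>

    class TiledMatrix {
        int nTiles_, tileSize_;
        std::vector<std::vector<double>> tiles_;  // one flat buffer per tile
    public:
        TiledMatrix(int nTiles, int tileSize)
            : nTiles_(nTiles), tileSize_(tileSize),
              tiles_(nTiles * nTiles,
                     std::vector<double>(tileSize * tileSize, 0.0)) {}

        // Select a tile, as in HTA's A(ti, tj) tile indexing.
        std::vector<double>& tile(int ti, int tj) {
            return tiles_[ti * nTiles_ + tj];
        }
        // Element access within a tile: A(ti, tj)(i, j) in HTA notation.
        double& at(int ti, int tj, int i, int j) {
            return tile(ti, tj)[i * tileSize_ + j];
        }
    };

    // A tile-wise operation: each tile could be assigned to a different
    // processor, which is how HTAs express both locality and parallelism.
    void scaleTile(TiledMatrix& a, int ti, int tj, double s) {
        for (double& x : a.tile(ti, tj)) x *= s;
    }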


acm sigplan symposium on principles and practice of parallel programming | 2007

Implicit parallelism with ordered transactions

Christoph von Praun; Luis Ceze; Calin Cascaval

Implicit Parallelism with Ordered Transactions (IPOT) is an extension of sequential or explicitly parallel programming models to support speculative parallelization. The key idea is to specify opportunities for parallelization in a sequential program using annotations similar to transactions. Unlike explicit parallelism, IPOT annotations do not require the absence of data dependence, since the parallelization relies on runtime support for speculative execution. IPOT as a parallel programming model is determinate, i.e., program semantics are independent of the thread scheduling. For optimization, non-determinism can be introduced selectively. We describe the programming model of IPOT and an online tool that recommends boundaries of ordered transactions by observing a sequential execution. On three example HPC workloads we demonstrate that our method is effective in identifying opportunities for fine-grain parallelization. Using the automated task recommendation tool, we were able to perform the parallelization of each program within a few hours.
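
The ordered-commit discipline behind ordered transactions can be sketched with a commit token: iterations may compute speculatively and out of order, but publish their effects strictly in program order. A minimal C++ sketch follows, under the simplifying assumption that each iteration's work stays private until commit; real IPOT runtime support would also detect conflicts and roll back misspeculations:

    // Sketch of in-order commit for speculative loop iterations: iteration i
    // may compute early, but must wait for the commit token before publishing.
    #include <atomic>
    #include <thread>
    #include <vector>

    std::atomic<int> commitToken{0};

    void iteration(int i, std::vector<int>& out) {
        int result = i * i;                         // speculative work, private
        while (commitToken.load(std::memory_order_acquire) != i)
            std::this_thread::yield();              // wait for commit order
        out.push_back(result);                      // publish in program order
        commitToken.store(i + 1, std::memory_order_release);
    }

    int main() {
        std::vector<int> out;
        std::vector<std::thread> ts;
        for (int i = 0; i < 4; ++i)
            ts.emplace_back(iteration, i, std::ref(out));
        for (auto& t : ts) t.join();
        return out.size() == 4 ? 0 : 1;             // out holds 0, 1, 4, 9
    }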


acm sigplan symposium on principles and practice of parallel programming | 2007

X10: concurrent programming for modern architectures

Vijay A. Saraswat; Vivek Sarkar; Christoph von Praun

Two major trends are converging to reshape the landscape of concurrent object-oriented programming languages. First, trends in modern architectures (multi-core, accelerators, high performance clusters such as Blue Gene) are making concurrency and distribution inescapable for large classes of OO programmers. Second, experience with first-generation concurrent OO languages (e.g. Java threads and synchronization) has revealed several drawbacks of unstructured threads with lock-based synchronization.

X10 is a second-generation OO language designed to address both programmer productivity and parallel performance for modern architectures. It extends sequential Java with a handful of constructs for concurrency and distribution. It introduces a clustered address space to deal with distribution. A computation is thought of as running at multiple places, with many simultaneous activities operating in each place. Objects and activities, once created in a particular place, stay confined to that place. However, a data structure (object) allocated in one place may contain a reference to an object allocated in another place. (Thus X10 supports a partitioned global address space.)

X10 is an explicitly concurrent language. It provides constructs for lightweight asynchrony, making it easy for programmers to write code for target architectures that provide massive parallelism. It provides recursive fork-join parallelism for structured concurrency, termination detection so that collections of activities may be reliably sequenced (even if they run across multiple places), and a very simple form of atomic blocks in lieu of locks for mutual exclusion. These constructs can be used to define more sophisticated synchronization constructs such as futures and clocks.

X10 supports a rich notion of multi-dimensional index spaces (regions), together with a rich set of operations on regions. Regions are first-class data structures -- they can be produced dynamically, stored in data structures, passed around in method invocations, etc. A distributed version of regions (distributions) is also defined; it specifies a mapping of every point in the underlying region to a place. An array is simply a mapping from a distribution to a backing store of the given type, partitioned across various places in the manner described by the distribution. Rank-generic programming is supported through generic points. Parallel iteration constructs are also provided.

X10 provides a rich framework for constraint-based value-dependent types. The programmer may specify types -- such as the type of square arrays of doubles of rank 2 -- which reference run-time constant (final) values. Classes and interfaces can be parametrized with properties, which are to be thought of as final instance fields. A dependent type is merely a constraint over these properties. Types are checked statically; this requires the compiler to use a constraint solver. The design of the type system and the implementation is modular, so that a new constraint system can be defined and plugged into the language in a fairly routine fashion. Dynamic casts are also provided -- this permits an object to be checked at runtime for conformance to a dependent type. The compiler takes care of generating run-time code for performing such tests.

The tutorial illustrates how common design patterns for concurrency and distribution can be naturally expressed in X10 (wait-free algorithms, data-flow synchronization, streaming parallelism, co-processor parallelism, hierarchical task parallelism, and phased computations). It shows design patterns for establishing that programs are determinate and/or deadlock-free. Examples are drawn from high-performance computing and middleware (transactions, event-driven computing). Participants will be encouraged to download the X10 implementation from SourceForge: http://x10.sf.net. The source code for the implementation is released under the Eclipse Public License. The implementation consists of a translator from X10 to Java and a multi-threaded runtime system in Java; resulting programs may be run on any SMP that supports a Java Virtual Machine.
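
The region/distribution layer can be approximated in a few lines. Below is an illustrative C++ sketch (not X10 syntax) of a rank-2 region and a block distribution mapping each point to a place; X10's actual regions are far richer (rank-generic, first-class, with algebraic operations):

    // Illustrative sketch of X10-style regions and distributions in C++:
    // a region is a set of points, a distribution maps each point to a place.
    #include <cstdio>

    struct Point2 { int i, j; };

    struct Region2 {                 // dense rectangular region [0..n) x [0..m)
        int n, m;
        template <typename F>
        void forEach(F f) const {    // "for (p in region)" iteration
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < m; ++j)
                    f(Point2{i, j});
        }
    };

    struct BlockDist {               // block distribution over `places` by row
        Region2 region;
        int places;
        int placeOf(Point2 p) const { return p.i * places / region.n; }
    };

    int main() {
        BlockDist d{{4, 4}, 2};
        d.region.forEach([&](Point2 p) {     // each place owns a block of rows
            std::printf("(%d,%d) -> place %d\n", p.i, p.j, d.placeOf(p));
        });
    }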


acm sigplan symposium on principles and practice of parallel programming | 2008

Modeling optimistic concurrency using quantitative dependence analysis

Christoph von Praun; Rajesh Bordawekar; Calin Cascaval

This work presents a quantitative approach to analyze parallelization opportunities in programs with irregular memory access where potential data dependencies mask available parallelism. The model captures data and causal dependencies among critical sections as algorithmic properties and quantifies them as a density computed over the number of executed instructions. The model abstracts from runtime aspects such as scheduling, the number of threads, and concurrency control used in a particular parallelization. We illustrate the model on several applications requiring ordered and unordered execution of critical sections. We describe a run-time tool that computes the dependence densities from a deterministic single-threaded program execution. This density metric provides insights into the potential for optimistic parallelization, opportunities for algorithmic scheduling, and performance defects due to synchronization bottlenecks. Based on the results of our analysis, we classify applications into three categories with low, medium, and high dependence densities. Applications with low dependence density are naturally good candidates for optimistic concurrency, applications with medium density may require a scheduler that is aware of the algorithmic dependencies for optimistic concurrency to be effective, and applications with high dependence density may not be suitable for parallelization.
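
The density computation itself is simple once a trace is available. Here is a hypothetical C++ sketch: given the read and write sets of each executed critical section, count dependent pairs and normalize. The data layout and the pair-based normalization are simplifying assumptions; the paper defines the density precisely over executed instructions:

    // Hypothetical sketch of a dependence-density computation over a trace of
    // critical sections: two sections are dependent if one writes an address
    // the other reads or writes.
    #include <cstddef>
    #include <set>
    #include <vector>

    struct CriticalSection {
        std::set<long> reads, writes;   // addresses touched inside the section
    };

    static bool overlaps(const std::set<long>& a, const std::set<long>& b) {
        for (long x : a) if (b.count(x)) return true;
        return false;
    }

    static bool dependent(const CriticalSection& a, const CriticalSection& b) {
        return overlaps(a.writes, b.reads) || overlaps(a.writes, b.writes) ||
               overlaps(a.reads, b.writes);
    }

    // Fraction of dependent section pairs: low values suggest good candidates
    // for optimistic concurrency, high values suggest serialization.
    double dependenceDensity(const std::vector<CriticalSection>& trace) {
        long pairs = 0, dep = 0;
        for (std::size_t i = 0; i < trace.size(); ++i)
            for (std::size_t j = i + 1; j < trace.size(); ++j) {
                ++pairs;
                if (dependent(trace[i], trace[j])) ++dep;
            }
        return pairs ? static_cast<double>(dep) / pairs : 0.0;
    }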


international symposium on computer architecture | 2006

Conditional Memory Ordering

Christoph von Praun; Harold W. Cain; Jong-Deok Choi; Kyung Dong Ryu

Conventional relaxed memory ordering techniques follow a proactive model: at a synchronization point, a processor makes its own updates to memory available to other processors by executing a memory barrier instruction, ensuring that recent writes have been ordered with respect to other processors in the system. We show that this model leads to superfluous memory barriers in programs with acquire-release style synchronization, and present a combined hardware/software synchronization mechanism called conditional memory ordering (CMO) that reduces memory ordering overhead. CMO is demonstrated on a lock algorithm that identifies those dynamic lock/unlock operations for which memory ordering is unnecessary, and speculatively omits the associated memory ordering instructions. When ordering is required, this algorithm relies on a hardware mechanism for initiating a memory ordering operation on another processor. Based on evaluation using a software-only CMO prototype, we show that CMO avoids memory ordering operations for the vast majority of dynamic acquire and release operations across a set of multithreaded Java workloads, leading to significant speedups for many. However, performance improvements in the software prototype are hindered by the high cost of remote memory ordering. Using empirical data, we construct an analytical model demonstrating the benefits of a combined hardware-software implementation.
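
The "proactive" baseline the paper starts from is the conventional fence placement in acquire-release locking, sketched below in C++ (std::atomic_thread_fence stands in for the hardware memory barrier instruction). CMO's contribution is to speculatively omit these fences when no other processor will observe the protected data, which requires the combined hardware/software mechanism described in the paper and is not shown here:

    // Conventional proactive memory ordering in a spinlock: every release
    // issues a barrier so the critical section's writes are ordered before
    // the lock becomes visible as free. CMO observes that when the same
    // processor reacquires the lock next, this barrier was superfluous.
    #include <atomic>

    std::atomic<bool> locked{false};

    void acquire() {
        while (locked.exchange(true, std::memory_order_relaxed))
            ;                                            // spin until acquired
        std::atomic_thread_fence(std::memory_order_acquire);  // order later reads
    }

    void release() {
        std::atomic_thread_fence(std::memory_order_release);  // order earlier writes
        locked.store(false, std::memory_order_relaxed);
    }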


languages and compilers for parallel computing | 2006

Design and use of htalib: a library for hierarchically tiled arrays

Ganesh Bikshandi; Jia Guo; Christoph von Praun; Gabriel Tanase; Basilio B. Fraguela; María Jesús Garzarán; David A. Padua; Lawrence Rauchwerger

Hierarchically Tiled Arrays (HTAs) are data structures that facilitate locality and parallelism in array-intensive computations with a block-recursive nature. The model underlying HTAs provides programmers with a global view of distributed data as well as a single-threaded view of the execution. In this paper we present htalib, a C++ implementation of HTAs. This library provides several novel constructs: (i) a map-reduce operator framework that facilitates the implementation of distributed operations with HTAs; (ii) overlapped tiling in support of tiling in stencil codes; (iii) data layering, facilitating the use of HTAs in adaptive mesh refinement applications. We describe the interface and design of htalib and our experience with the new programming constructs.
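
The map-reduce operator framework can be illustrated with a hypothetical signature; the names below are illustrative, not htalib's actual interface. A function is applied to every tile independently, and the per-tile results are then folded:

    // Hypothetical sketch of a tile-level map-reduce in the spirit of htalib's
    // operator framework: map runs independently per tile (and could run on
    // the tile's owning node), reduce folds the per-tile partial results.
    #include <numeric>
    #include <vector>

    using Tile = std::vector<double>;

    template <typename Map, typename Reduce>
    double mapReduce(const std::vector<Tile>& tiles, Map m, Reduce r, double init) {
        std::vector<double> partial;
        partial.reserve(tiles.size());
        for (const Tile& t : tiles)          // per-tile, independently parallelizable
            partial.push_back(m(t));
        double acc = init;                   // sequential fold of partial results
        for (double p : partial) acc = r(acc, p);
        return acc;
    }

    // Example: a global sum expressed as a sum of tile sums.
    double globalSum(const std::vector<Tile>& tiles) {
        return mapReduce(
            tiles,
            [](const Tile& t) { return std::accumulate(t.begin(), t.end(), 0.0); },
            [](double a, double b) { return a + b; }, 0.0);
    }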


parallel computing | 2012

Optimization techniques for efficient HTA programs

Basilio B. Fraguela; Ganesh Bikshandi; Jia Guo; María Jesús Garzarán; David A. Padua; Christoph von Praun

Object-oriented languages can be easily extended with new data types, which facilitates prototyping new language extensions. A very challenging problem is the development of data types encapsulating data-parallel operations, which could improve parallel programming productivity. However, the use of class libraries to implement data types, particularly when they encapsulate parallelism, comes at the cost of performance overhead. This paper describes our experience with the implementation of a C++ data type called hierarchically tiled array (HTA). This object includes data-parallel operations and allows the manipulation of tiles to facilitate developing efficient parallel codes and codes with a high degree of locality. The initial performance of the HTA programs we wrote was lower than that of their conventional MPI-based counterparts. The overhead was due to factors such as the creation of temporary HTAs and the inability of the compiler to properly inline index computations, among others. We describe the performance problems and the optimizations applied to overcome them, as well as their impact on programmability. After the optimization process, our HTA-based implementations run only slightly slower than the MPI-based codes while having much better programmability metrics.
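
The temporary-creation overhead is easy to see in overloaded array operators. A generic C++ illustration follows (not HTA code): each binary operator+ allocates a full-size temporary, whereas a fused loop avoids the allocations entirely; expression templates are one standard way to automate this fusion:

    // Generic illustration of the temporary problem with overloaded array
    // operators (not HTA code). Evaluating r = a + b + c below allocates two
    // full-size temporaries and makes three passes; the fused loop does one
    // pass with no allocations.
    #include <cstddef>
    #include <vector>

    using Array = std::vector<double>;

    Array operator+(const Array& x, const Array& y) {
        Array t(x.size());                       // temporary allocated per operator
        for (std::size_t i = 0; i < x.size(); ++i) t[i] = x[i] + y[i];
        return t;
    }

    void naive(Array& r, const Array& a, const Array& b, const Array& c) {
        r = a + b + c;                           // two temporaries, three passes
    }

    void fused(Array& r, const Array& a, const Array& b, const Array& c) {
        for (std::size_t i = 0; i < r.size(); ++i)  // no temporaries, one pass
            r[i] = a[i] + b[i] + c[i];
    }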


architectural support for programming languages and operating systems | 2008

Concurrency control with data coloring

Luis Ceze; Christoph von Praun; Călin Caşcaval; Pablo Montesinos; Josep Torrellas

Concurrency control is one of the main sources of error and complexity in shared memory parallel programming. While there are several techniques to handle concurrency control such as locks and transactional memory, simplifying concurrency control has proved elusive. In this paper we introduce the Data Coloring programming model, based on the principles of our previous work on architecture support for data-centric synchronization. The main idea is to group data structures into consistency domains and mark places in the control flow where data should be consistent. Based on these annotations, the system dynamically infers transaction boundaries. An important aspect of data coloring is that the occurrence of a synchronization defect is typically determinate and leads to a violation of liveness rather than to a safety violation. Finally, this paper includes empirical data that shows that most of the critical sections in large applications are used in a data-centric manner.
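
A hypothetical sketch of the annotation style follows; the names Domain, color, and consistent are illustrative, not the paper's syntax. Data structures are grouped ("colored") into a consistency domain, and the programmer marks control-flow points where the domain must be consistent; the system infers transaction boundaries between such points:

    // Hypothetical sketch of data-coloring annotations: objects are grouped
    // into a consistency domain, and consistent() marks points where the
    // domain's invariants must hold. A real system would infer and enforce
    // transaction boundaries between such points; this sketch only records
    // the annotations.
    #include <string>
    #include <utility>

    class Domain {
        std::string name_;
    public:
        explicit Domain(std::string name) : name_(std::move(name)) {}
        void consistent() { /* boundary: an inferred transaction may end here */ }
    };

    struct Account {
        Domain& color;          // this object's data belongs to color's domain
        double balance = 0.0;
        explicit Account(Domain& d) : color(d) {}
    };

    void transfer(Account& from, Account& to, double amt, Domain& bank) {
        from.balance -= amt;    // intermediate state: domain may be inconsistent
        to.balance += amt;
        bank.consistent();      // annotation: domain must be consistent here
    }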

Collaboration


Dive into Christoph von Praun's collaboration.

Top co-author: Luis Ceze (University of Washington)