Allan Kielstra
IBM
Publications
Featured research published by Allan Kielstra.
Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) | 2005
Philippe Charles; Christian Grothoff; Vijay A. Saraswat; Christopher Michael Donawa; Allan Kielstra; Kemal Ebcioglu; Christoph von Praun; Vivek Sarkar
It is now well established that the device scaling predicted by Moore's Law is no longer a viable option for increasing the clock frequency of future uniprocessor systems at the rate that had been sustained during the last two decades. As a result, future systems are rapidly moving from uniprocessor to multiprocessor configurations, so as to use parallelism instead of frequency scaling as the foundation for increased compute capacity. The dominant emerging multiprocessor structure for the future is a Non-Uniform Cluster Computing (NUCC) system with nodes that are built out of multi-core SMP chips with non-uniform memory hierarchies, and interconnected in horizontally scalable cluster configurations such as blade servers. Unlike previous generations of hardware evolution, this shift will have a major impact on existing software. Current OO language facilities for concurrent and distributed programming are inadequate for addressing the needs of NUCC systems because they do not support the notions of non-uniform data access within a node, or of tight coupling of distributed nodes. We have designed a modern object-oriented programming language, X10, for high-performance, high-productivity programming of NUCC systems. A member of the partitioned global address space family of languages, X10 highlights the explicit reification of locality in the form of places; lightweight activities embodied in async, future, foreach, and ateach constructs; a construct for termination detection (finish); the use of lock-free synchronization (atomic blocks); and the manipulation of cluster-wide global data structures. We present an overview of the X10 programming model and language, experience with our reference implementation, and results from some initial productivity comparisons between the X10 and Java™ languages.
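The finish/async pattern the abstract describes can be approximated in plain Java for illustration. The sketch below is not X10 and not the paper's implementation; it is a hypothetical analogue using java.util.concurrent, where `finish` spawns every task and blocks until all have terminated (X10's termination detection), and an AtomicInteger stands in for an X10 atomic block:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class FinishSketch {

    // Analogue of X10's finish { async S1; async S2; ... }:
    // spawn every task, then block until all of them have terminated.
    static void finish(ExecutorService pool, Runnable... asyncs) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(asyncs.length);
        for (Runnable a : asyncs) {
            pool.execute(() -> { try { a.run(); } finally { done.countDown(); } });
        }
        done.await(); // termination detection, as finish provides in X10
    }

    // Sum the elements of xs with one async per element.
    static int parSum(int[] xs) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger sum = new AtomicInteger(); // stands in for an X10 atomic block
        Runnable[] tasks = new Runnable[xs.length];
        for (int i = 0; i < xs.length; i++) {
            final int v = xs[i];
            tasks[i] = () -> sum.addAndGet(v);
        }
        finish(pool, tasks);
        pool.shutdown();
        return sum.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parSum(new int[]{1, 2, 3})); // prints 6
    }
}
```

What this analogue cannot express is exactly what the paper adds: places (locality) and distribution across cluster nodes, which have no counterpart in a single-JVM thread pool.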
Virtual Execution Environments (VEE) | 2005
Levon Stepanian; Angela Demke Brown; Allan Kielstra; Gita Koblents; Kevin A. Stoodley
We introduce a strategy for inlining native functions into Java™ applications using a JIT compiler. We perform further optimizations to transform inlined callbacks into semantically equivalent lightweight operations. We show that this strategy can substantially reduce the overhead of performing JNI calls, while preserving the key safety and portability properties of the JNI. Our work leverages the ability to store statically-generated IL alongside native binaries, to facilitate native inlining at Java callsites at JIT compilation time. Preliminary results with our prototype implementation show speedups of up to 93X when inlining and callback transformation are combined.
Compiler Construction (CC) | 2012
Chuck (Chengyan) Zhao; J. Gregory Steffan; Cristiana Amza; Allan Kielstra
Checkpointing support allows program execution to roll back to an earlier program point, discarding any modifications made since that point. Existing software-based checkpointing methods are mainly libraries that snapshot all of working memory, and hence have prohibitive overhead for many potential applications. In this paper we present a lightweight, fine-grain checkpointing framework implemented entirely in software through compiler transformations and optimizations. A programmer can specify arbitrary checkpoint regions via a simple API, and the compiler automatically transforms the code to implement the checkpoint at the granularity of individual stores, optimizing to remove redundancy. We explore two application areas for this support. First, we investigate its application to debugging, in particular by providing the ability to rewind to an arbitrarily-placed point in a buggy program's execution. A study using BugBench applications shows that our compiler-based approach incurs more than 100x less overhead than full-process checkpointing. Second, we demonstrate that compiler-based checkpointing support can be leveraged to free the programmer from manually implementing and maintaining software rollback mechanisms when coding a back-tracking algorithm, with runtime overhead of only 15% compared to the manual implementation.
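The core mechanism behind store-granularity checkpointing can be sketched with an undo log: before each tracked store, the old value is recorded so the region can be rolled back. The class below is a hypothetical hand-written illustration of what the paper's compiler pass generates automatically, not the paper's actual framework:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Fine-grain checkpointing via an undo log: each instrumented store first
// saves the value it overwrites, so rollback can replay the log in reverse.
public class UndoLog {
    private final int[] data;
    private final Deque<long[]> log = new ArrayDeque<>(); // entries: {index, oldValue}

    public UndoLog(int[] data) { this.data = data; }

    public void begin() { log.clear(); }              // open a checkpoint region

    public void store(int i, int v) {                 // instrumented store
        log.push(new long[]{i, data[i]});             // record the overwritten value
        data[i] = v;
    }

    public void rollback() {                          // undo stores, newest first
        while (!log.isEmpty()) {
            long[] e = log.pop();
            data[(int) e[0]] = (int) e[1];
        }
    }

    public int get(int i) { return data[i]; }
}
```

A compiler-based implementation can go further than this sketch, e.g. by eliding log entries for locations already saved in the current region, which is the kind of redundancy removal the abstract refers to.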
Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness | 2011
Clark Verbrugge; Allan Kielstra; Yi Zhang
Memory models are used in concurrent systems to specify visibility properties of shared data. A practical memory model, however, must permit code optimization as well as provide a useful semantics for programmers. Here we extend recent observations that the current Java memory model imposes significant restrictions on the ability to optimize code. Beyond the known and potentially correctable proof concerns illustrated by others, we show that major constraints on code generation and optimization can in fact be derived from fundamental properties and guarantees provided by the memory model. To address this, and to accommodate a better balance between programmability and optimization, we present ideas for a simple concurrency semantics for Java that avoids basic problems at the cost of backward compatibility.
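A concrete instance of the guarantees an optimizer must preserve is the Java memory model's happens-before edge from a volatile write to a subsequent volatile read. The example below shows standard JMM behavior (not the paper's proposed alternative semantics): once the reader observes `ready == true`, it is guaranteed to see the plain write to `data`, so the compiler may not reorder the publication across the flag:

```java
public class Visibility {
    static int data;                 // plain field, published before the flag
    static volatile boolean ready;   // volatile flag establishes happens-before

    // The writer stores data then sets ready; the JMM's happens-before edge
    // from the volatile write to the reader's volatile read makes the value
    // of data visible once ready is observed true.
    static int publishAndRead() throws InterruptedException {
        Thread writer = new Thread(() -> { data = 42; ready = true; });
        writer.start();
        while (!ready) { /* spin on the volatile read */ }
        int seen = data;             // guaranteed to be 42 here
        writer.join();
        return seen;
    }
}
```

It is exactly edges like this, derived from fundamental properties of the model, that the paper argues translate into broad constraints on code generation and optimization.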
Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems | 2016
Clark Verbrugge; Christopher J. F. Pickett; Alexander Krolik; Allan Kielstra
Thread-level Speculation (TLS) is a technique for automatic parallelization. The complexity of even prototype implementations, however, limits the ability to explore and compare the wide variety of possible design choices, and also makes understanding performance characteristics difficult. In this work we build a general analytical model of the method-level variant of TLS which we can use for determining program speedup under a wide range of TLS designs. Our approach is exhaustive, and using either simple brute force or more efficient dynamic programming implementations we are able to show how performance is strongly limited by program structure, as well as core choices in speculation design, irrespective of and complementary to the impact of data-dependencies. These results provide new, high-level insight into where and how thread-level speculation can and should be applied in order to produce practical speedup.
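To give a flavor of what an analytical model of method-level TLS computes, the toy estimate below treats one speculation decision in isolation: the caller's continuation runs speculatively in parallel with the callee, so the region's time drops from their sum to their maximum plus a fixed fork/commit overhead. This is a simplified illustration under assumed parameters (`callee`, `continuation`, `overhead` as abstract time costs), not the paper's exhaustive or dynamic-programming model:

```java
// Toy analytical estimate for a single method-level speculation point.
public class TlsModel {
    // Time for the region when the continuation is speculated in parallel
    // with the callee, plus a fixed fork/validate/commit overhead.
    static double speculatedTime(double callee, double continuation, double overhead) {
        return Math.max(callee, continuation) + overhead;
    }

    // Speedup over sequential execution of callee followed by continuation.
    static double speedup(double callee, double continuation, double overhead) {
        double sequential = callee + continuation;
        return sequential / speculatedTime(callee, continuation, overhead);
    }
}
```

Even this toy form shows the structural limit the paper quantifies: speedup is capped by the imbalance between callee and continuation (at most 2x for one speculation point, and only when the two halves are equal), independent of data dependencies.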
VM'04: Proceedings of the 3rd Virtual Machine Research and Technology Symposium - Volume 3 | 2004
Nikola Grcevski; Allan Kielstra; Kevin A. Stoodley; Mark G. Stoodley; Vijay Sundaresan
Archive | 2006
Gheorghe C. Cascaval; Siddhartha Chatterjee; Evelyn Duesterwald; Allan Kielstra; Kevin A. Stoodley
Archive | 2009
Allan Kielstra; Levon Sassoon Stepanian; Kevin A. Stoodley
Archive | 2005
Allan Kielstra; Levon Sassoon Stepanian; Kevin A. Stoodley
Archive | 2006
Yaoqing Gao; Gheorghe C. Cascaval; Allan Kielstra; Robert B. Tremaine; Michael E. Wazlowski; Lixin Zhang