Lukas Stadler
Oracle Corporation
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Lukas Stadler.
programming language design and implementation | 2017
Thomas Würthinger; Christian Wimmer; Christian Humer; Andreas Wöß; Lukas Stadler; Chris Seaton; Gilles Duboscq; Doug Simon; Matthias Grimmer
Most high-performance dynamic language virtual machines duplicate language semantics in the interpreter, compiler, and runtime system. This violates the principle to not repeat yourself. In contrast, we define languages solely by writing an interpreter. The interpreter performs specializations, e.g., augments the interpreted program with type information and profiling information. Compiled code is derived automatically using partial evaluation while incorporating these specializations. This makes partial evaluation practical in the context of dynamic languages: It reduces the size of the compiled code while still compiling all parts of an operation that are relevant for a particular program. When a speculation fails, execution transfers back to the interpreter, the program re-specializes in the interpreter, and later partial evaluation again transforms the new state of the interpreter to compiled code. We evaluate our approach by comparing our implementations of JavaScript, Ruby, and R with best-in-class specialized production implementations. Our general-purpose compilation system is competitive with production systems even when they have been heavily optimized for the one language they support. For our set of benchmarks, our speedup relative to the V8 JavaScript VM is 0.83x, relative to JRuby is 3.8x, and relative to GNU R is 5x.
dynamic languages symposium | 2016
Lukas Stadler; Adam Welc; Christian Humer; Mick Jordan
The R language, from the point of view of language design and implementation, is a unique combination of various programming language concepts. It has functional characteristics like lazy evaluation of arguments, but also allows expressions to have arbitrary side effects. Many runtime data structures, for example variable scopes and functions, are accessible and can be modified while a program executes. Several different object models allow for structured programming, but the object models can interact in surprising ways with each other and with the base operations of R. R works well in practice, but it is complex, and it is a challenge for language developers trying to improve on the current state-of-the-art, which is the reference implementation -- GNU R. The goal of this work is to demonstrate that, given the right approach and the right set of tools, it is possible to create an implementation of the R language that provides significantly better performance while keeping compatibility with the original implementation. In this paper we describe novel optimizations backed up by aggressive speculation techniques and implemented within FastR, an alternative R language implementation, utilizing Truffle -- a JVM-based language development framework developed at Oracle Labs. We also provide experimental evidence demonstrating effectiveness of these optimizations in comparison with GNU R, as well as Renjin and TERR implementations of the R language.
ACM Transactions on Architecture and Code Optimization | 2015
Doug Simon; Christian Wimmer; Bernhard Urban; Gilles Duboscq; Lukas Stadler; Thomas Würthinger
When building a compiler for a high-level language, certain intrinsic features of the language must be expressed in terms of the resulting low-level operations. Complex features are often expressed by explicitly weaving together bits of low-level IR, a process that is tedious, error prone, difficult to read, difficult to reason about, and machine dependent. In the Graal compiler for Java, we take a different approach: we use snippets of Java code to express semantics in a high-level, architecture-independent way. Two important restrictions make snippets feasible in practice: they are compiler specific, and they are explicitly prepared and specialized. Snippets make Graal simpler and more portable while still capable of generating machine code that can compete with other compilers of the Java HotSpot VM.
virtual execution environments | 2017
Juan José Fumero; Michel Steuwer; Lukas Stadler; Christophe Dubach
Computer systems are increasingly featuring powerful parallel devices with the advent of many-core CPUs and GPUs. This offers the opportunity to solve computationally-intensive problems at a fraction of the time traditional CPUs need. However, exploiting heterogeneous hardware requires the use of low-level programming language approaches such as OpenCL, which is incredibly challenging, even for advanced programmers. On the application side, interpreted dynamic languages are increasingly becoming popular in many domains due to their simplicity, expressiveness and flexibility. However, this creates a wide gap between the high-level abstractions offered to programmers and the low-level hardware-specific interface. Currently, programmers must rely on high performance libraries or they are forced to write parts of their application in a low-level language like OpenCL. Ideally, nonexpert programmers should be able to exploit heterogeneous hardware directly from their interpreted dynamic languages. In this paper, we present a technique to transparently and automatically offload computations from interpreted dynamic languages to heterogeneous devices. Using just-in-time compilation, we automatically generate OpenCL code at runtime which is specialized to the actual observed data types using profiling information. We demonstrate our technique using R, which is a popular interpreted dynamic language predominately used in big data analytic. Our experimental results show the execution on a GPU yields speedups of over 150x compared to the sequential FastR implementation and the obtained performance is competitive with manually written GPU code. We also show that when taking into account start-up time, large speedups are achievable, even when the applications run for as little as a few seconds.
symposium on code generation and optimization | 2018
David Leopoldseder; Lukas Stadler; Thomas Würthinger; Josef Eisl; Doug Simon
Compilers perform a variety of advanced optimizations to improve the quality of the generated machine code. However, optimizations that depend on the data flow of a program are often limited by control-flow merges. Code duplication can solve this problem by hoisting, i.e. duplicating, instructions from merge blocks to their predecessors. However, finding optimization opportunities enabled by duplication is a non-trivial task that requires compile-time intensive analysis. This imposes a challenge on modern (just-in-time) compilers: Duplicating instructions tentatively at every control flow merge is not feasible because excessive duplication leads to uncontrolled code growth and compile time increases. Therefore, compilers need to find out whether a duplication is beneficial enough to be performed. This paper proposes a novel approach to determine which duplication operations should be performed to increase performance. The approach is based on a duplication simulation that enables a compiler to evaluate different success metrics per potential duplication. Using this information, the compiler can then select the most promising candidates for optimization. We show how to map duplication candidates into an optimization cost model that allows us to trade-off between different success metrics including peak performance, code size and compile time. We implemented the approach on top of the GraalVM and evaluated it with the benchmarks Java DaCapo, Scala DaCapo, JavaScript Octane and a micro-benchmark suite, in terms of performance, compilation time and code size increase. We show that our optimization can reach peak performance improvements of up to 40% with a mean peak performance increase of 5.89%, while it generates a mean code size increase of 9.93% and mean compile time increase of 18.44%.
Proceedings of the 15th International Conference on Managed Languages & Runtimes | 2018
David Leopoldseder; Roland Schatz; Lukas Stadler; Manuel Rigger; Thomas Würthinger
Java programs can contain non-counted loops, that is, loops for which the iteration count can neither be determined at compile time nor at run time. State-of-the-art compilers do not aggressively optimize them, since unrolling non-counted loops often involves duplicating also a loops exit condition, which thus only improves run-time performance if subsequent compiler optimizations can optimize the unrolled code. This paper presents an unrolling approach for non-counted loops that uses simulation at run time to determine whether unrolling such loops enables subsequent compiler optimizations. Simulating loop unrolling allows the compiler to determine performance and code size effects for each potential transformation prior to performing it. We implemented our approach on top of the GraalVM, a high-performance virtual machine for Java, and evaluated it with a set of Java and JavaScript benchmarks in terms of peak performance, compilation time and code size increase. We show that our approach can improve performance by up to 150% while generating a median code size and compile-time increase of not more than 25%. Our results indicate that fast-path unrolling of non-counted loops can be used in practice to increase the performance of Java applications.
dynamic languages symposium | 2015
David Leopoldseder; Lukas Stadler; Christian Wimmer
We present an approach to cross-compile Java bytecodes to Java-Script, building on existing Java optimizing compiler technology. Static analysis determines which Java classes and methods are reachable. These are then translated to JavaScript using a re-configured Java just-in-time compiler with a new back end that generates JavaScript instead of machine code. Standard compiler optimizations such as method inlining and global value numbering, as well as advanced optimizations such as escape analysis, lead to compact and optimized JavaScript code. Compiler IR is unstructured, so structured control flow needs to be reconstructed before code generation is possible. We present details of our control flow reconstruction algorithm. Our system is based on Graal, an open-source optimizing compiler for the Java HotSpot VM and other VMs. The modular and VM-independent architecture of Graal allows us to reuse the intermediate representation, the bytecode parser, and the high-level optimizations. Our custom back end first performs control flow reconstruction and then JavaScript code generation. The generated JavaScript undergoes a set of optimizations to increase readability and performance. Static analysis is performed on the Graal intermediate representation as well. Benchmark results for medium-sized Java benchmarks such as SPECjbb2005 run with acceptable performance on the V8 JavaScript VM.
symposium on cloud computing | 2018
Andreas Kunft; Lukas Stadler; Daniele Bonetta; Cosmin Basca; Jens Meiners; Sebastian Breß; Tilmann Rabl; Juan Fumero; Volker Markl
To cope with todays large scale of data, parallel dataflow engines such as Hadoop, and more recently Spark and Flink, have been proposed. They offer scalability and performance, but require data scientists to develop analysis pipelines in unfamiliar programming languages and abstractions. To overcome this hurdle, dataflow engines have introduced some forms of multi-language integrations, e.g., for Python and R. However, this results in data exchange between the dataflow engine and the integrated language runtime, which requires inter-process communication and causes high runtime overheads. In this paper, we present ScootR, a novel approach to execute R in dataflow systems. ScootR tightly integrates the dataflow and R language runtime by using the Truffle framework and the Graal compiler. As a result, ScootR executes R scripts directly in the Flink data processing engine, without serialization and inter-process communication. Our experimental study reveals that ScootR outperforms state-of-the-art systems by up to an order of magnitude.
Proceedings of the 10th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages - VMIL 2018 | 2018
David Leopoldseder; Lukas Stadler; Manuel Rigger; Thomas Würthinger
Compilers provide many architecture-agnostic, high-level optimizations trading off peak performance for code size. High-level optimizations typically cannot precisely reason about their impact, as they are applied before the final shape of the generated machine code can be determined. However, they still need a way to estimate their transformation’s impact on the performance of a compilation unit. Therefore, compilers typically resort to modelling these estimations as trade-off functions that heuristically guide optimization decisions. Compilers such as Graal implement many such handcrafted heuristic trade-off functions, which are tuned for one particular high-level optimization. Heuristic trade-off functions base their reasoning on limited knowledge of the compilation unit, often causing transformations that heavily increase code size or even decrease performance. To address this problem, we propose a cost model for Graal’s high-level intermediate representation that models relative operation latencies and operation sizes in order to be used in trade-off functions of compiler optimizations. We implemented the cost model in Graal and used it in two code-duplication-based optimizations. This allowed us to perform a more fine-grained code size trade-off in existing compiler optimizations, reducing the code size increase of our optimizations by up to 50% compared to not using the proposed cost model in these optimizations, without sacrificing performance. Our evaluation demonstrates that the cost model allows optimizations to perform fine-grained code size and performance trade-offs outperforming hard-coded heuristics.
principles and practice of programming in java | 2014
Matthias Grimmer; Manuel Rigger; Roland Schatz; Lukas Stadler