
Publication


Featured research published by Olav Beckmann.


Lecture Notes in Computer Science | 2004

Runtime Code Generation in C++ as a Foundation for Domain-Specific Optimisation

Olav Beckmann; Alastair Houghton; Michael R. Mellor; Paul H. J. Kelly

The TaskGraph Library is a C++ library for dynamic code generation, which combines specialisation with dependence analysis and loop restructuring. A TaskGraph represents a fragment of code which is constructed and manipulated at runtime, then compiled, dynamically linked and executed. TaskGraphs are initialised using macros and overloading, which forms a simplified, C-like sub-language with first-class arrays and no pointers. Once a TaskGraph has been constructed, we can analyse its dependence structure and perform optimisations. In this Chapter, we present the design of the TaskGraph library, and two sample applications to demonstrate its use for runtime code specialisation and restructuring optimisation.


Concurrency and Computation: Practice and Experience | 2006

Is Morton layout competitive for large two-dimensional arrays yet?

Jeyarajan Thiyagalingam; Olav Beckmann; Paul H. J. Kelly

Two‐dimensional arrays are generally arranged in memory in row‐major order or column‐major order. Traversing a row‐major array in column‐major order, or vice versa, leads to poor spatial locality. With large arrays the performance loss can be a factor of 10 or more. This paper explores the Morton storage layout, which has substantial spatial locality whether traversed in row‐major or column‐major order. Using a small suite of dense kernels working on two‐dimensional arrays, we have carried out an extensive study of the impact of poor array layout and of whether Morton layout can offer an attractive compromise. We show that Morton layout can lead to better performance than the worse of the two canonical layouts; however, the performance of Morton layout compared to the better choice of canonical layout is often disappointing. We further study one simple improvement of the basic Morton scheme: we show that choosing the correct alignment for the base address of an array in Morton layout can sometimes significantly improve the competitiveness of this layout.


Field-Programmable Logic and Applications | 2006

Comparing FPGAs to Graphics Accelerators and the Playstation 2 Using a Unified Source Description

Lee W. Howes; Paul Price; Oskar Mencer; Olav Beckmann; Oliver Pell

Field programmable gate arrays (FPGAs), graphics processing units (GPUs) and Sony's PlayStation 2 vector units offer scope for hardware acceleration of applications. We compare the performance of these architectures using a unified description based on A Stream Compiler (ASC) for FPGAs, which has been extended to target GPUs and PS2 vector units. Programming these architectures from a single description enables us to reason about optimizations for the different architectures. Using the ASC description we implement a Monte Carlo simulation, a fast Fourier transform (FFT) and a weighted sum algorithm. Our results show that without much optimization the GPU is suited to the Monte Carlo simulation, while the weighted sum is better suited to PS2 vector units. FPGA implementations benefit particularly from architecture-specific optimizations, which ASC allows us to implement easily by adding simple annotations to the shared code.


Parallel Processing Letters | 2001

THEMIS: Component Dependence Metadata in Adaptive Parallel Applications

Paul H. J. Kelly; Olav Beckmann; Tony Field; Scott B. Baden

There is a conflict between the goals of improving the quality of scientific software and improving its performance. A key issue is to support reuse and re-assembly of sophisticated software components without compromising performance. This paper describes THEMIS, a programming model and run time library being designed to support cross-component performance optimisation through explicit manipulation of the computation’s iteration space at run-time. Each component is augmented with “component dependence metadata”, which characterises the constraints on its execution order, data distribution and memory access order. We show how this supports dynamic adaptation of each component to exploit the available resources, the context in which its operands are generated, and results are used, and the evolution of the problem instance. Using a computational fluid dynamics visualisation example as motivation, we show how component dependence metadata provides a framework in which a number of interesting optimisations become possible. Examples include data placement optimisation, loop fusion, tiling, memoisation, checkpointing and incrementalisation.


European Conference on Parallel Processing | 2002

Delayed Evaluation, Self-optimising Software Components as a Programming Model

Peter Liniker; Olav Beckmann; Paul H. J. Kelly

We argue that delayed-evaluation, self-optimising scientific software components, which dynamically change their behaviour according to their calling context at runtime offer a possible way of bridging the apparent conflict between the quality of scientific software and its performance. Rather than equipping scientific software components with a performance interface which allows the caller to supply the context information that is lost when building abstract software components, we propose to recapture this lost context information at runtime. This paper is accompanied by a public release of a parallel linear algebra library with both C and C++ language interfaces which implements this proposal. We demonstrate the usability of this library by showing that it can be used to supply linear algebra component functionality to an existing external software package. We give preliminary performance figures and discuss avenues for future work.


Lecture Notes in Computer Science | 1998

Efficient Interprocedural Data Placement Optimisation in a Parallel Library

Olav Beckmann; Paul H. J. Kelly

This paper describes a combination of methods which make interprocedural data placement optimisation available to parallel libraries. We propose a delayed-evaluation, self-optimising (DESO) numerical library for a distributed-memory multicomputer. Delayed evaluation allows us to capture the control-flow of a user program from within the library at runtime, and to construct an optimised execution plan by propagating data placement constraints backwards through the DAG representing the computation to be performed. Our strategy for optimising data placements at runtime consists of an efficient representation for data distributions, a greedy optimisation algorithm, which because of delayed evaluation can take account of the full context of operations, and of re-using the results of previous runtime optimisations on contexts we have encountered before. We show performance figures for our library on a cluster of Pentium II Linux workstations, which demonstrate that the overhead of our delayed evaluation method is very small, and which show both the parallel speedup we obtain and the benefit of the optimisations we describe.


Science of Computer Programming | 2011

DESOLA: An active linear algebra library using delayed evaluation and runtime code generation

Francis P. Russell; Michael R. Mellor; Paul H. J. Kelly; Olav Beckmann

Active libraries can be defined as libraries which play an active part in the compilation, in particular, the optimisation of their client code. This paper explores the implementation of an active dense linear algebra library by delaying evaluation of expressions built using library calls, then generating code at runtime for the compositions that occur. The key optimisations in this context are loop fusion and array contraction. Our prototype C++ implementation, DESOLA, automatically fuses loops arising from different client calls, identifies unnecessary intermediate temporaries, and contracts temporary arrays to scalars. Performance is evaluated using a benchmark suite of linear solvers from ITL (Iterative Template Library), and is compared with MTL (Matrix Template Library), ATLAS (Automatically Tuned Linear Algebra) and IMKL (Intel Math Kernel Library). Excluding runtime compilation overheads (caching means they occur only on the first iteration), for larger matrix sizes, performance matches or exceeds MTL; when fusion of matrix operations occurs, performance exceeds that of ATLAS and IMKL.


International Parallel and Distributed Processing Symposium | 2006

Automatically translating a general purpose C++ image processing library for GPUs

Jay L. T. Cornwall; Olav Beckmann; Paul H. J. Kelly

This paper presents work-in-progress towards a C++ source-to-source translator that automatically seeks parallelizable code fragments and replaces them with code for a graphics co-processor. We report on our experience with accelerating an industrial image processing library. To increase the effectiveness of our approach, we exploit some domain-specific knowledge of the library's semantics. We outline the architecture of our translator and how it uses the ROSE source-to-source transformation library to overcome complexities in the C++ language. Techniques for parallel analysis and source transformation are presented in light of their uses in GPU code generation. We conclude with results from a performance evaluation of two examples, image blending and an erosion filter, hand-translated with our parallelization techniques. We show that our approach has potential and explain some of the remaining challenges in building an effective tool.


Parallel Processing Letters | 2005

Generative and Adaptive Methods in Performance Programming

Paul H. J. Kelly; Olav Beckmann

Performance programming is characterized by the need to structure software components to exploit the context of use. Relevant context includes the target processor architecture, the available resources (number of processors, network capacity), prevailing resource contention, the values and shapes of input and intermediate data structures, the schedule and distribution of input data delivery, and the way the results are to be used. This paper concerns adapting to dynamic context: adaptive algorithms, malleable and migrating tasks, and application structures based on dynamic component composition. Adaptive computations use metadata associated with software components — performance models, dependence information, data size and shape. Computation itself is interwoven with planning and optimizing the computation process, using this metadata. This reflective nature motivates metaprogramming techniques. We present a research agenda aimed at developing a modelling framework which allows us to characterize both computation and dynamic adaptation in a way that allows systematic optimization.


European Conference on Parallel Processing | 1997

Runtime Interprocedural Data Placement Optimisation for Lazy Parallel Libraries (Extended Abstract)

Olav Beckmann; Paul H. J. Kelly

We are developing a lazy, self-optimising parallel library of vector-matrix routines. The aim is to allow users to parallelise certain computationally expensive parts of numerical programs by simply linking with a parallel rather than sequential library of subroutines. The library performs interprocedural data placement optimisation at runtime, which requires the optimiser itself to be very efficient. We achieve this firstly by working from aggregate loop nests which have been optimised in isolation, and secondly by using a carefully constructed mathematical formulation for data distributions and the distribution requirements of library operators, which allows us largely to replace searching with calculation in our algorithm.

Collaboration

Top co-authors of Olav Beckmann:

Lee W. Howes, Imperial College London
A. J. Field, Imperial College London
Marc Hull, Imperial College London
Oliver Pell, Imperial College London
Paul Price, Imperial College London