Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Björn Franke is active.

Publication


Featured research published by Björn Franke.


languages and compilers for parallel computing | 2009

Reducing training time in a one-shot machine learning-based compiler

John Thomson; Michael F. P. O'Boyle; Grigori Fursin; Björn Franke

Iterative compilation of applications has proved a popular and successful approach to achieving high performance. This, however, comes at the cost of many runs of the application. Machine-learning-based approaches overcome this at the expense of a large off-line training cost. This paper presents a new approach that dramatically reduces the training time of a machine-learning-based compiler by focusing on the programs which best characterize the optimization space. Using unsupervised clustering in the program feature space, we are able to dramatically reduce the amount of time required to train the compiler. Furthermore, we are able to learn a model which dispenses with iterative search completely, allowing integration within the normal program development cycle. We evaluated our clustering approach on the EEMBCv2 benchmark suite and show that we can reduce the number of training runs by more than a factor of 7. This translates into an average speedup of 1.14 across the benchmark suite compared to the default highest optimization level.
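
A minimal sketch of the clustering step, assuming programs are already summarized as numeric feature vectors; the k-means choice, feature layout and medoid selection below are illustrative, not the paper's exact method:

```python
# Sketch: pick representative training programs by clustering their
# feature vectors, so only the cluster medoids need the expensive
# iterative-compilation runs. (Illustrative only.)
import numpy as np
from sklearn.cluster import KMeans

def select_training_programs(features: np.ndarray, n_clusters: int = 8):
    """features: one row of static program features per benchmark."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    representatives = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Choose the member closest to the cluster centre as its medoid.
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        representatives.append(members[np.argmin(dists)])
    return representatives  # indices of programs worth training on
```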


symposium on code generation and optimization | 2006

Using Machine Learning to Focus Iterative Optimization

Felix Agakov; Edwin V. Bonilla; John Cavazos; Björn Franke; Grigori Fursin; Michael F. P. O'Boyle; John Thomson; Marc Toussaint; Christopher K. I. Williams

Iterative compiler optimization has been shown to outperform static approaches. This, however, comes at the cost of large numbers of evaluations of the program. This paper develops a new methodology to reduce this number and hence speed up iterative optimization. It uses predictive modelling from the domain of machine learning to automatically focus the search on those areas likely to give the greatest performance. This approach is independent of search algorithm, search space or compiler infrastructure and scales gracefully with the size of the compiler optimization space. Off-line, a training set of programs is iteratively evaluated and the shape of the spaces and program features are modelled. These models are learnt and used to focus the iterative optimization of a new program. We evaluate two learnt models, an independent and a Markov model, and evaluate their worth on two embedded platforms, the Texas Instruments C6713 and the AMD Au1500. We show that such learnt models can speed up iterative search on large spaces by an order of magnitude. This translates into an average speedup of 1.22 on the TI C6713 and 1.27 on the AMD Au1500 in just 2 evaluations.
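
As an illustration of how a learnt model can focus search, here is a hypothetical simplification of the independent model: estimate how often each transformation appears in good sequences across the training programs, then bias sampling for a new program accordingly (the names and smoothing are assumptions):

```python
# Sketch of the "independent model" idea: learn, per transformation,
# how often it occurs in good optimization sequences, then sample new
# candidate sequences biased toward likely transformations.
import random
from collections import Counter

def learn_independent_model(good_sequences, transformations):
    counts = Counter(t for seq in good_sequences for t in set(seq))
    total = len(good_sequences)
    # Smoothed probability that a transformation is part of a good sequence.
    return {t: (counts[t] + 1) / (total + 2) for t in transformations}

def sample_candidate(model, length=5):
    # Draw a transformation sequence, preferring high-probability passes.
    ts, ws = zip(*model.items())
    return random.choices(ts, weights=ws, k=length)

# model = learn_independent_model(training_good_seqs, ALL_PASSES)
# candidates = [sample_candidate(model) for _ in range(10)]  # evaluate these
```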


programming language design and implementation | 2009

Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping

Georgios Tournavitis; Zheng Wang; Björn Franke; Michael F. P. O'Boyle

Compiler-based auto-parallelization is a much-studied area, yet has still not found widespread application. This is largely due to the poor exploitation of application parallelism, resulting in performance levels far below those which a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach, resulting in significant performance improvements of the generated parallel code. Using profile-driven parallelism detection, we overcome the limitations of static analysis, enabling us to identify more application parallelism and only rely on the user for final approval. In addition, we replace the traditional target-specific and inflexible mapping heuristics with a machine-learning-based prediction mechanism, resulting in better mapping decisions while providing more scope for adaptation to different target architectures. We have evaluated our parallelization strategy on the NAS and SPEC OMP benchmarks and two different multi-core platforms (a dual quad-core Intel Xeon SMP and a dual-socket QS20 Cell blade). We demonstrate that our approach not only yields significant improvements when compared with state-of-the-art parallelizing compilers, but also comes close to and sometimes exceeds the performance of manually parallelized codes. On average, our methodology achieves 96% of the performance of the hand-tuned OpenMP NAS and SPEC parallel benchmarks on the Intel Xeon platform and gains a significant speedup on the IBM Cell platform, demonstrating the potential of profile-guided and machine-learning-based parallelization for complex multi-core platforms.
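
The two-stage flow could be sketched as follows, with profiling proposing loop candidates and a learnt classifier predicting whether mapping them to threads is profitable; the feature set and classifier below are hypothetical stand-ins for the paper's actual mechanism:

```python
# Sketch of the two-stage decision: profiling proposes parallel loop
# candidates, a learnt classifier predicts mapping profitability.
from sklearn.tree import DecisionTreeClassifier

# Stage 1: profiling marks loops with no observed cross-iteration
# dependences as candidates (final approval stays with the user,
# per the paper, since profiling is not a proof of independence).
def candidates(loops):
    return [l for l in loops if l["observed_cross_iter_deps"] == 0]

# Stage 2: predict profitability from profile-derived loop features.
clf = DecisionTreeClassifier(max_depth=4)
# X rows: [iteration_count, body_instructions, memory_ops_ratio];
# y: 1 if the parallel version beat sequential on the training platform.
# clf.fit(X_train, y_train)
# parallelize = clf.predict([[loop_iters, body_size, mem_ratio]])[0] == 1
```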


computing frontiers | 2007

Fast compiler optimisation evaluation using code-feature based performance prediction

Christophe Dubach; John Cavazos; Björn Franke; Grigori Fursin; Michael F. P. O'Boyle; Olivier Temam

Performance tuning is an important and time-consuming task which may have to be repeated for each new application and platform. Although iterative optimisation can automate this process, it still requires many executions of different versions of the program. As execution time is frequently the limiting factor in the number of versions or transformed programs that can be considered, what is needed is a mechanism that can automatically predict the performance of a modified program without actually having to run it. This paper presents a new machine-learning-based technique to automatically predict the speedup of a modified program using a performance model based on the code features of the tuned programs. Unlike previous approaches, it does not require any prior learning over a benchmark suite. Furthermore, it can be used to predict the performance of any tuning and is not restricted to a previously seen transformation space. We show that it can deliver predictions with a high correlation coefficient and can be used to dramatically reduce the cost of search.
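
A minimal sketch of the idea, assuming each transformed version of the program is described by a code-feature vector: measure a small sample of versions, fit a model from features to speedup, and rank the rest without running them (the linear model is an illustrative choice, not the paper's exact one):

```python
# Sketch: run only a sample of transformed versions, fit a model from
# their code features to measured speedup, then rank the remaining
# versions by predicted speedup without executing them.
import numpy as np
from sklearn.linear_model import LinearRegression

def rank_versions(features, sample_idx, measured_speedups):
    """features: one code-feature row per transformed version of a program."""
    model = LinearRegression().fit(features[sample_idx], measured_speedups)
    predicted = model.predict(features)
    return np.argsort(predicted)[::-1]  # best predicted versions first
```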


ACM Transactions on Embedded Computing Systems | 2003

Array recovery and high-level transformations for DSP applications

Björn Franke; Michael F. P. O'Boyle

Efficient implementation of DSP applications is critical for many embedded systems. Optimizing compilers for application programs written in C largely focus on code generation and scheduling, which, with their growing maturity, are providing diminishing returns. As DSP applications typically make extensive use of pointer arithmetic, the alternative use of high-level, source-to-source transformations has been largely ignored. This article develops an array recovery technique that automatically converts pointers to arrays, enabling the empirical evaluation of high-level transformations. High-level techniques were applied to the DSPstone benchmarks on three platforms: the TriMedia TM-1000, the Texas Instruments TMS320C6201, and the Analog Devices SHARC ADSP-21160. On average, the best transformation gave a factor of 2.43 improvement across the platforms. In certain cases, a speedup of 5.48 was found for the SHARC, 7.38 for the TM-1, and 2.3 for the C6201. These preliminary results justify pointer-to-array conversion and further investigation into the use of high-level techniques for embedded compilers.
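
To illustrate the index-recovery idea, here is a toy sketch that abstractly interprets pointer increments in a loop body and emits explicit, affine array indices; the paper's analysis is considerably more general than this simplification:

```python
# Sketch of index recovery: track a pointer's offset from its base
# array symbolically through a straight-line loop body, rewriting
# each dereference as an explicit array access. Assumes the pointer
# advances by a constant stride per loop iteration.
def recover_indices(base, stmts, step_per_iter):
    offset = 0          # pointer offset from &base[0] at loop entry
    rewritten = []
    for s in stmts:
        if s == "deref":            # *p  ->  base[c0 + c1*i]
            rewritten.append(f"{base}[{offset} + {step_per_iter}*i]")
        elif s == "incr":           # p++ advances the constant offset
            offset += 1
    return rewritten

# Loop body "*p; p++; *p; p++", pointer advancing 2 elements per iteration:
print(recover_indices("a", ["deref", "incr", "deref", "incr"], 2))
# -> ['a[0 + 2*i]', 'a[1 + 2*i]']
```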


programming language design and implementation | 2011

Generalized just-in-time trace compilation using a parallel task farm in a dynamic binary translator

Igor Böhm; Tobias J. K. Edler von Koch; Stephen C. Kyle; Björn Franke; Nigel P. Topham

Dynamic Binary Translation (DBT) is the key technology behind cross-platform virtualization and allows software compiled for one Instruction Set Architecture (ISA) to be executed on a processor supporting a different ISA. Under the hood, DBT is typically implemented using Just-In-Time (JIT) compilation of frequently executed program regions, also called traces. The main challenge is translating frequently executed program regions as fast as possible into highly efficient native code. As time spent on JIT compilation adds to the overall execution time, the JIT compiler is often decoupled and operates in a separate thread, independent of the main simulation loop, to reduce the overhead of JIT compilation. In this paper we present two innovative contributions. The first is a generalized trace compilation approach that considers all frequently executed paths in a program for JIT compilation, as opposed to previous approaches where trace compilation is restricted to paths through loops. The second reduces JIT compilation cost by compiling several hot traces in a concurrent task farm. Altogether, we combine generalized light-weight tracing, large translation units, parallel JIT compilation and dynamic work scheduling to ensure timely and efficient processing of hot traces. We have evaluated our industry-strength, LLVM-based parallel DBT implementing the ARCompact ISA against three benchmark suites (EEMBC, BioPerf and SPEC CPU2006) and demonstrate speedups of up to 2.08 on a standard quad-core Intel Xeon machine. Across short- and long-running benchmarks our scheme is robust and never results in a slowdown. In fact, using four processors, total execution time can be reduced by an average of 11.5% over state-of-the-art decoupled, parallel (or asynchronous) JIT compilation.
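
A toy sketch of the task-farm idea: the interpreter keeps running while traces that cross a hotness threshold are handed to a pool of JIT worker threads. The threshold, names and callback wiring below are assumptions, not the paper's implementation:

```python
# Sketch of a parallel trace-compilation task farm: the main loop keeps
# interpreting, and hot traces are compiled asynchronously by workers.
from concurrent.futures import ThreadPoolExecutor

HOT_THRESHOLD = 50
jit_pool = ThreadPoolExecutor(max_workers=4)   # the task farm
native_code = {}                                # trace id -> compiled code
exec_counts = {}

def compile_trace(trace_id):
    return f"<native code for {trace_id}>"      # stands in for real JIT work

def on_trace_executed(trace_id):
    if trace_id in native_code:
        return native_code[trace_id]            # fast path: run native code
    exec_counts[trace_id] = exec_counts.get(trace_id, 0) + 1
    if exec_counts[trace_id] == HOT_THRESHOLD:  # dispatch once, asynchronously
        fut = jit_pool.submit(compile_trace, trace_id)
        fut.add_done_callback(
            lambda f, t=trace_id: native_code.__setitem__(t, f.result()))
    return None                                 # keep interpreting meanwhile
```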


international conference on parallel architectures and compilation techniques | 2010

Semi-automatic extraction and exploitation of hierarchical pipeline parallelism using profiling information

Georgios Tournavitis; Björn Franke

In recent years multi-core computer systems have left the realm of high-performance computing, and virtually all of today's desktop computers and embedded computing systems are equipped with several processing cores. Still, no single parallel programming model has found widespread support, and parallel programming remains an art for the majority of application programmers. In addition, there exists a plethora of sequential legacy applications for which automatic parallelization is the only hope of benefiting from the increased processing power of modern multi-core systems. In the past, automatic parallelization largely focused on data parallelism. In this paper we present a novel approach to extracting and exploiting pipeline parallelism from sequential applications. We use profiling to overcome the limitations of static data and control flow analysis, enabling more aggressive parallelization. Our approach is orthogonal to existing automatic parallelization approaches, and additional data parallelism may be exploited in the individual pipeline stages. The key contribution of this paper is a whole-program representation that supports profiling, parallelism extraction and exploitation. We demonstrate how this enhances conventional pipeline parallelization by incorporating support for multi-level loops and pipeline stage replication in a uniform and automatic way. We have evaluated our methodology on a set of multimedia and stream processing benchmarks and demonstrate speedups of up to 4.7 on an eight-core Intel Xeon machine.
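
A toy model of pipeline-parallel execution: each stage runs in its own thread and stages communicate through queues; replicating a stateless stage would simply add more threads reading the same input queue. This is an illustration, not the paper's runtime:

```python
# Sketch of a two-stage pipeline: stages run concurrently in threads,
# connected by queues, with None used as an end-of-stream marker.
import threading, queue

def stage(fn, inq, outq):
    while (item := inq.get()) is not None:
        outq.put(fn(item))
    outq.put(None)                       # propagate end-of-stream

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2)).start()
threading.Thread(target=stage, args=(lambda x: x * 2, q2, q3)).start()

for i in range(5):
    q1.put(i)                            # feed the pipeline
q1.put(None)
print([q3.get() for _ in range(5)])      # [2, 4, 6, 8, 10]
```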


high-performance computer architecture | 2012

Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs

Karthik T. Sundararajan; Vasileios Porpodas; Timothy M. Jones; Nigel P. Topham; Björn Franke

Intelligently partitioning the last-level cache within a chip multiprocessor can bring significant performance improvements. Resources are given to the applications that can benefit most from them, restricting each core to a number of logical cache ways. However, although overall performance is increased, existing schemes fail to consider energy saving when making their partitioning decisions. This paper presents Cooperative Partitioning, a runtime partitioning scheme that reduces both dynamic and static energy while maintaining high performance. It works by ensuring that cached data is way-aligned, so that a way is owned by a single core at any time. Cores cooperate with each other to migrate ways between themselves after partitioning decisions have been made. Upon access to the cache, a core needs to consult only the ways that it owns to find its data, saving dynamic energy. Unused ways can be power-gated for static energy saving. We evaluate our approach on two-core and four-core systems, showing average dynamic and static energy savings of 35% and 25% compared to a fixed partitioning scheme. In addition, Cooperative Partitioning maintains high performance while transferring ways five times faster than an existing state-of-the-art technique.
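
A behavioural sketch of way-aligned partitioning, not the hardware mechanism: each core owns a subset of ways, lookups probe only the owned ways, and repartitioning migrates whole ways between cores:

```python
# Toy model of one way-partitioned cache set: ownership is tracked per
# way, lookups probe only owned ways (dynamic-energy saving), and a
# way can be flushed and handed over when the partitioning changes.
class PartitionedSet:
    def __init__(self, n_ways=8):
        self.ways = [None] * n_ways              # one cached tag per way
        self.owner = [0] * n_ways                # which core owns each way

    def lookup(self, core, tag):
        # Probe only the ways this core owns; unowned ways are never
        # consulted and unused ways could be power-gated entirely.
        return any(self.ways[w] == tag
                   for w in range(len(self.ways)) if self.owner[w] == core)

    def migrate_way(self, way, new_core):
        self.ways[way] = None                    # flush, then hand over
        self.owner[way] = new_core
```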


software and compilers for embedded systems | 2008

Fast cycle-approximate instruction set simulation

Björn Franke

Instruction set simulators are indispensable tools in both ASIP design space exploration and the software development and optimisation process for existing platforms. Despite recent progress in improving the speed of functional instruction set simulators, cycle-accurate simulation is still prohibitively slow for all but the most simple programs. This severely limits the applicability of cycle-accurate simulators in the performance evaluation of complex embedded applications. In this paper we present a novel approach, namely the prediction of cycle counts based on information gathered during fast functional simulation and prior training. We have evaluated our approach against a cycle-accurate ARM v5 architecture simulator and a large set of benchmarks, and demonstrate its capability to provide highly accurate performance predictions, with an average error of less than 5.8%, at a fraction of the time required for cycle-accurate simulation.
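
One plausible reading of the prediction step, sketched with a linear model: map instruction-class counts gathered during fast functional simulation to cycle counts measured on the cycle-accurate simulator during training. The counters and numbers below are illustrative, not data from the paper:

```python
# Sketch: learn a mapping from functional-simulation event counts to
# cycle counts, so new programs only need the fast simulator.
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows: counts per instruction class from functional simulation,
# e.g. [alu_ops, loads, stores, branches]; y: cycles measured once
# on the cycle-accurate simulator. (Made-up illustrative data.)
X_train = np.array([[9000, 2000, 1000, 500],
                    [4000, 3500,  800, 900],
                    [6000, 1500, 1200, 700]])
y_train = np.array([15200, 11800, 12900])

model = LinearRegression().fit(X_train, y_train)
predicted_cycles = model.predict([[7000, 2500, 900, 600]])
```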


compiler construction | 2001

Compiler Transformation of Pointers to Explicit Array Accesses in DSP Applications

Björn Franke; Michael F. P. O'Boyle

Efficient implementation of DSP applications is critical for embedded systems. However, current applications written in C make extensive use of pointer arithmetic, making compiler analysis and optimisation difficult. This paper presents a method for converting a restricted class of pointer-based memory accesses typically found in DSP codes into array accesses with explicit index functions. C programs with pointer accesses to array elements, data-independent pointer arithmetic and structured loops can be converted into semantically equivalent representations with explicit array accesses. This technique has been applied to several DSPstone benchmarks on three different processors, where initial results show that it can give on average an 11.95% reduction in execution time after transforming pointer-based array accesses into explicit array accesses.
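
As a toy illustration of the "restricted class", an eligibility check might require every pointer update in the loop to be a compile-time constant stride (data-independent arithmetic); this is a hypothetical simplification of the paper's conditions:

```python
# Sketch of the eligibility test: only pointers whose updates are
# loop-invariant constant strides qualify for conversion to explicit
# array accesses; any data-dependent update disqualifies the pointer.
def convertible(pointer_updates):
    # Updates are modelled as either a constant int stride or a string
    # standing in for a data-dependent expression such as "p += x[i]".
    return all(isinstance(u, int) for u in pointer_updates)

print(convertible([1, 1, 2]))       # True: constant strides only
print(convertible([1, "x[i]"]))     # False: data-dependent update
```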

Collaboration


Dive into Björn Franke's collaborations.

Top Co-Authors

Igor Böhm, University of Edinburgh
Tom Spink, University of Edinburgh
John Thomson, University of Edinburgh
Oscar Almer, University of Edinburgh