Michael F. P. O'Boyle

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Michael F. P. O'Boyle is active.

Explore More

Publication

Featured researches published by Michael F. P. O'Boyle.

languages and compilers for parallel computing | 2009

Reducing training time in a one-shot machine learning-based compiler

John Thomson; Michael F. P. O'Boyle; Grigori Fursin; Björn Franke

Iterative compilation of applications has proved a popular and successful approach to achieving high performance. This however, is at the cost of many runs of the application. Machine learning based approaches overcome this at the expense of a large off-line training cost. This paper presents a new approach to dramatically reduce the training time of a machine learning based compiler. This is achieved by focusing on the programs which best characterize the optimization space. By using unsupervised clustering in the program feature space we are able to dramatically reduce the amount of time required to train a compiler. Furthermore, we are able to learn a model which dispenses with iterative search completely allowing integration within the normal program development cycle. We evaluated our clustering approach on the EEMBCv2 benchmark suite and show that we can reduce the number of training runs by more than a factor of 7. This translates into an average 1.14 speedup across the benchmark suite compared to the default highest optimization level.

symposium on code generation and optimization | 2006

Using Machine Learning to Focus Iterative Optimization

Felix Agakov; Edwin V. Bonilla; John Cavazos; Björn Franke; Grigori Fursin; Michael F. P. O'Boyle; John Thomson; Marc Toussaint; Christopher K. I. Williams

Iterative compiler optimization has been shown to outperform static approaches. This, however, is at the cost of large numbers of evaluations of the program. This paper develops a new methodology to reduce this number and hence speed up iterative optimization. It uses predictive modelling from the domain of machine learning to automatically focus search on those areas likely to give greatest performance. This approach is independent of search algorithm, search space or compiler infrastructure and scales gracefully with the compiler optimization space size. Off-line, a training set of programs is iteratively evaluated and the shape of the spaces and program features are modelled. These models are learnt and used to focus the iterative optimization of a new program. We evaluate two learnt models, an independent and Markov model, and evaluate their worth on two embedded platforms, the Texas Instrument C67I3 and the AMD Au1500. We show that such learnt models can speed up iterative search on large spaces by an order of magnitude. This translates into an average speedup of 1.22 on the TI C6713 and 1.27 on the AMD Au1500 in just 2 evaluations.

programming language design and implementation | 2009

Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping

Georgios Tournavitis; Zheng Wang; Björn Franke; Michael F. P. O'Boyle

Compiler-based auto-parallelization is a much studied area, yet has still not found wide-spread application. This is largely due to the poor exploitation of application parallelism, subsequently resulting in performance levels far below those which a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach, resulting in significant performance improvements of the generated parallel code. Using profile-driven parallelism detection we overcome the limitations of static analysis, enabling us to identify more application parallelism and only rely on the user for final approval. In addition, we replace the traditional target-specific and inflexible mapping heuristics with a machine-learning based prediction mechanism, resulting in better mapping decisions while providing more scope for adaptation to different target architectures. We have evaluated our parallelization strategy against the NAS and SPEC OMP benchmarks and two different multi-core platforms (dual quad-core Intel Xeon SMP and dual-socket QS20 Cell blade). We demonstrate that our approach not only yields significant improvements when compared with state-of-the-art parallelizing compilers, but comes close to and sometimes exceeds the performance of manually parallelized codes. On average, our methodology achieves 96% of the performance of the hand-tuned OpenMP NAS and SPEC parallel benchmarks on the Intel Xeon platform and gains a significant speedup for the IBM Cell platform, demonstrating the potential of profile-guided and machine-learning based parallelization for complex multi-core platforms.

international conference on parallel architectures and compilation techniques | 2000

Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation

Toru Kisuki; Peter M. W. Knijnenburg; Michael F. P. O'Boyle

Loop tiling and unrolling are two important program transformations to exploit locality and expose instruction level parallelism, respectively. In this paper, we address the problem of how to select tile sizes and unroll factors simultaneously. We approach this problem in an architecturally adaptive manner by means of iterative compilation, where we generate many versions of a program and decide upon the best by actually executing them and measuring their execution time. We evaluate several iterative strategies. We compare the levels of optimization obtained by iterative compilation to several well-known static techniques and show that we outperform each of them on a range of benchmarks across a variety of architectures. Finally, we show how to quantitatively trade-off the number of profiles needed and the level of optimization that can be reached.

symposium on code generation and optimization | 2007

Rapidly Selecting Good Compiler Optimizations using Performance Counters

John Cavazos; Grigori Fursin; Felix Agakov; Edwin V. Bonilla; Michael F. P. O'Boyle; Olivier Temam

Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is non-trivial. There have been several proposed techniques that search the space of compiler options to find good solutions; however such approaches can be expensive. This paper proposes a different approach using performance counters as a means of determining good compiler optimization settings. This is achieved by learning a model off-line which can then be used to determine good settings for any new program. We show that such an approach outperforms the state-of-the-art and is two orders of magnitude faster on average. Furthermore, we show that our performance counter-based approach outperforms techniques based on static code features. Using our technique we achieve a 17% improvement over the highest optimization setting of the commercial PathScale EKOPath 2.3.1 optimizing compiler on the SPEC benchmark suite on a AMD Athlon 64 3700+ platform

high performance embedded architectures and compilers | 2007

MiDataSets: creating the conditions for a more realistic evaluation of Iterative optimization

Grigori Fursin; John Cavazos; Michael F. P. O'Boyle; Olivier Temam

Iterative optimization has become a popular technique to obtain improvements over the default settings in a compiler for performance-critical applications, such as embedded applications. An implicit assumption, however, is that the best configuration found for any arbitrary data set will work well with other data sets that a program uses. In this article, we evaluate that assumption based on 20 data sets per benchmark of the MiBench suite. We find that, though a majority of programs exhibit stable performance across data sets, the variability can significantly increase with many optimizations. However, for the best optimization configurations, we find that this variability is in fact small. Furthermore, we show that it is possible to find a compromise optimization configuration across data sets which is often within 5% of the best possible configuration for most data sets, and that the iterative process can converge in less than 20 iterations (for a population of 200 optimization configurations). All these conclusions have significant and positive implications for the practical utilization of iterative optimization.

acm sigplan symposium on principles and practice of parallel programming | 2009

Mapping parallelism to multi-cores: a machine learning based approach

Zheng Wang; Michael F. P. O'Boyle

The efficient mapping of program parallelism to multi-core processors is highly dependent on the underlying architecture. This paper proposes a portable and automatic compiler-based approach to mapping such parallelism using machine learning. It develops two predictors: a data sensitive and a data insensitive predictor to select the best mapping for parallel programs. They predict the number of threads and the scheduling policy for any given program using a model learnt off-line. By using low-cost profiling runs, they predict the mapping for a new unseen program across multiple input data sets. We evaluate our approach by selecting parallelism mapping configurations for OpenMP programs on two representative but different multi-core platforms (the Intel Xeon and the Cell processors). Performance of our technique is stable across programs and architectures. On average, it delivers above 96% performance of the maximum available on both platforms. It achieve, on average, a 37% (up to 17.5 times) performance improvement over the OpenMP runtime default scheme on the Cell platform. Compared to two recent prediction models, our predictors achieve better performance with a significant lower profiling cost.

compiler construction | 2011

A static task partitioning approach for heterogeneous systems using OpenCL

Dominik Grewe; Michael F. P. O'Boyle

Heterogeneous multi-core platforms are increasingly prevalent due to their perceived superior performance over homogeneous systems. The best performance, however, can only be achieved if tasks are accurately mapped to the right processors. OpenCL programs can be partitioned to take advantage of all the available processors in a system. However, finding the best partitioning for any heterogeneous system is difficult and depends on the hardware and software implementation. We propose a portable partitioning scheme for OpenCL programs on heterogeneous CPU-GPU systems. We develop a purely static approach based on predictive modelling and program features. When evaluated over a suite of 47 benchmarks, our model achieves a speedup of 1.57 over a state-of-the-art dynamic run-time approach, a speedup of 3.02 over a purely multi-core approach and 1.55 over the performance achieved by using just the GPU.

european conference on parallel processing | 2004

Cross Component Optimisation in a High Level Category-Based Language

Thomas J. Ashby; Anthony D. Kennedy; Michael F. P. O'Boyle

High level programming languages offer many benefits in terms of ease of use, encapsulation etc. However, they historically suffer from poor performance. In this paper we investigate improving the performance of a numerical code written in a high–level language by using cross–component optimisation. We compare the results with traditional approaches such as the use of high performance libraries or Fortran. We demonstrate that our cross–component optimisation is highly effective, with a speed–up of up to 1.43 over a program augmented with calls to the ATLAS BLAS library, and 1.5 over a pure Fortran equivalent.

international symposium on microarchitecture | 2009

Portable compiler optimisation across embedded programs and microarchitectures using machine learning

Christophe Dubach; Timothy M. Jones; Edwin V. Bonilla; Grigori Fursin; Michael F. P. O'Boyle

Building an optimising compiler is a difficult and time consuming task which must be repeated for each generation of a microprocessor. As the underlying microarchitecture changes from one generation to the next, the compiler must be retuned to optimise specifically for that new system. It may take several releases of the compiler to effectively exploit a processors performance potential, by which time a new generation has appeared and the process starts again. We address this challenge by developing a portable optimising compiler. Our approach employs machine learning to automatically learn the best optimisations to apply for any new program on a new microarchitectural configuration. It achieves this by learning a model off-line which maps a microarchitecture description plus the hardware counters from a single run of the program to the best compiler optimisation passes. Our compiler gains 67% of the maximum speedup obtainable by an iterative compiler search using 1000 evaluations. We obtain, on average, a 1.16x speedup over the highest default optimisation level across an entire microarchitecture configuration space, achieving a 4.3x speedup in the best case. We demonstrate the robustness of this technique by applying it to an extended microarchitectural space where we achieve comparable performance.

Explore More