Matt J Martineau
University of Bristol
Publications
Featured research published by Matt J Martineau.
Programming Models and Applications for Multicores and Manycores | 2016
Matt J Martineau; Simon N McIntosh-Smith; Michael Boulton; Wayne Gaudin
In this work we directly evaluate several emerging parallel programming models: Kokkos, RAJA, OpenACC, and OpenMP 4.0, against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application, or mini-app, that solves the heat conduction equation and belongs to the Mantevo suite of applications. We find that the best performance is achieved with device-tuned implementations but that, in many cases, the performance-portable models are able to solve the same problems to within a 5-20% performance penalty. The models expose varying levels of complexity to the developer, yet all achieve reasonable performance. We believe that complexity will become the major influence on the long-term adoption of such models.
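To illustrate what such a port involves, here is a minimal sketch of a five-point, Jacobi-style heat-conduction update expressed in Kokkos. The kernel and variable names are hypothetical, not TeaLeaf's actual source; across the CUDA, OpenCL, and directive-based ports it is essentially only the dispatch around a loop body like this that changes.

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical 5-point stencil update, similar in shape to a
// heat-conduction kernel; 'u' and 'unew' are illustrative names.
void jacobi_step(int nx, int ny,
                 Kokkos::View<double**> u,
                 Kokkos::View<double**> unew) {
  Kokkos::parallel_for("jacobi",
      Kokkos::MDRangePolicy<Kokkos::Rank<2>>({1, 1}, {nx - 1, ny - 1}),
      KOKKOS_LAMBDA(int i, int j) {
        unew(i, j) = 0.25 * (u(i - 1, j) + u(i + 1, j) +
                             u(i, j - 1) + u(i, j + 1));
      });
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int nx = 256, ny = 256;
    Kokkos::View<double**> u("u", nx, ny), unew("unew", nx, ny);
    jacobi_step(nx, ny, u, unew);
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```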
International Parallel and Distributed Processing Symposium | 2016
Matt J Martineau; Simon N McIntosh-Smith; Wayne Gaudin
Although the OpenMP 4.0 standard has been available since 2013, support for GPUs has been absent until very recently, with only a handful of experimental compilers available. In this work we evaluate the performance of Cray's new NVIDIA GPU-targeting implementation of OpenMP 4.0, with the mini-apps TeaLeaf, CloverLeaf and BUDE. We successfully port each of the applications, using a simple and consistent design throughout, and achieve performance on an NVIDIA K20X that is comparable to Cray's OpenACC in all cases. BUDE, a compute-bound code, required 2.2x the runtime of an equivalently optimised CUDA code, which we believe is caused by an inflated frequency of control flow operations and less efficient arithmetic optimisation. Impressively, both TeaLeaf and CloverLeaf, memory bandwidth bound codes, only required 1.3x the runtime of hand-optimised CUDA implementations. Overall, we find that OpenMP 4.0 is a highly usable open standard capable of performant heterogeneous execution, making it a promising option for scientific application developers.
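The "simple and consistent design" described maps naturally onto OpenMP 4.0's combined target constructs; a minimal sketch of the pattern (illustrative names, not the actual mini-app code):

```cpp
#include <vector>

// Minimal OpenMP 4.0 offload pattern: map the arrays to the device and
// distribute the loop across teams of device threads.
void axpy(int n, double a, const double* x, double* y) {
  #pragma omp target teams distribute parallel for \
              map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}

int main() {
  std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);
  axpy(static_cast<int>(x.size()), 0.5, x.data(), y.data());
  return 0;
}
```

Bandwidth-bound kernels of this shape are what dominate codes like TeaLeaf and CloverLeaf, which is consistent with their ports landing closer to CUDA than the compute-bound BUDE.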
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
Tom J Deakin; James Price; Matt J Martineau; Simon N McIntosh-Smith
Many scientific codes consist of memory bandwidth bound kernels — the dominating factor of the runtime is the speed at which data can be loaded from memory into the Arithmetic Logic Units, before results are written back to memory. One major advantage of many-core devices such as General Purpose Graphics Processing Units (GPGPUs) and the Intel Xeon Phi is their focus on providing increased memory bandwidth over traditional CPU architectures. However, as with CPUs, this peak memory bandwidth is usually unachievable in practice and so benchmarks are required to measure a practical upper bound on expected performance.
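Benchmarks of this kind typically time simple, bandwidth-bound kernels; a minimal sketch of the classic STREAM triad (illustrative, not the benchmark the paper presents):

```cpp
#include <vector>

// STREAM-style triad: three 8-byte accesses per iteration, so the
// achieved bandwidth is 3 * sizeof(double) * n / elapsed_seconds.
void triad(int n, double scalar, const double* a, const double* b,
           double* c) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + scalar * b[i];
}

int main() {
  const int n = 1 << 25;               // large enough to defeat caches
  std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
  triad(n, 3.0, a.data(), b.data(), c.data());
  return 0;
}
```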
International Workshop on OpenMP | 2016
Matt J Martineau; James Price; Simon N McIntosh-Smith; Wayne Gaudin
In this paper we investigate the current compiler technologies supporting OpenMP 4.x features across a range of devices: the Cray compiler 8.5.0 targeting an Intel Xeon Broadwell and an NVIDIA K20X, IBM's OpenMP 4.5 Clang branch (clang-ykt) targeting an NVIDIA K20X, the Intel compiler 16 targeting an Intel Xeon Phi Knights Landing, and GCC 6.1 targeting an AMD APU. We outline the mechanisms each uses to map the OpenMP model onto its target architecture, and conduct performance testing with a number of representative data-parallel kernels. Following this, we discuss the current state of play in performance portability and propose some straightforward guidelines for writing performance-portable code, derived from our observations. At the time of writing, developers will likely have to rely on the pre-processor for certain kernels to achieve functional portability, but we expect that future homogenisation of the required directives between compilers and architectures is feasible.
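The pre-processor fallback described amounts to guarding the directive rather than the loop body, so the kernel stays single-source; a hedged sketch of the pattern (the guard macro is a hypothetical name):

```cpp
#include <vector>

// Illustrative only: USE_TARGET_OFFLOAD is a hypothetical guard macro.
// The directive varies per compiler/target; the loop body does not.
void scale(int n, double* a) {
#if defined(USE_TARGET_OFFLOAD)
  #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
#else
  #pragma omp parallel for simd
#endif
  for (int i = 0; i < n; ++i)
    a[i] *= 2.0;
}

int main() {
  std::vector<double> a(1024, 1.0);
  scale(static_cast<int>(a.size()), a.data());
  return 0;
}
```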
International Journal of High Performance Computing Applications | 2018
Tom J Deakin; Simon N McIntosh-Smith; Matt J Martineau; Wayne Gaudin
In this paper we demonstrate techniques for increasing the node-level parallelism of a deterministic discrete ordinates neutral particle transport algorithm on a structured mesh to exploit many-core technologies. Transport calculations form a large part of the computational workload of physical simulations, so good performance is vital for the simulations to complete in reasonable time. We demonstrate our approach using the SNAP mini-app, which gives a simplified implementation of the full transport algorithm but remains similar enough to the real algorithm to act as a useful proxy for research purposes. We present an OpenCL implementation of our improved algorithm which achieves a speedup of up to 2.5x on a many-core GPGPU device compared to a state-of-the-art multi-core node for the transport sweep, and up to 4x compared to the multi-core CPUs in the largest GPU-enabled supercomputer, the first time this scale of speedup has been achieved for algorithms of this class. We then discuss ways to express our scheme in OpenMP 4.0 and demonstrate the performance on an Intel Xeon Phi (Knights Corner) compared to the original scheme.
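A transport sweep on a structured mesh propagates flux from upstream to downstream cells, so cells on the same anti-diagonal have no mutual dependence and can be updated in parallel; a simplified 2D sketch of that wavefront pattern (illustrative code with stand-in physics, not the paper's OpenCL source):

```cpp
#include <vector>

// 2D wavefront sweep sketch: cell (i, j) depends on (i-1, j) and
// (i, j-1), so all cells on an anti-diagonal d = i + j are independent.
void sweep(int nx, int ny, std::vector<double>& psi) {
  for (int d = 0; d < nx + ny - 1; ++d) {        // diagonals, in order
    const int jlo = (d < nx) ? 0 : d - nx + 1;
    const int jhi = (d < ny) ? d : ny - 1;
    #pragma omp parallel for
    for (int j = jlo; j <= jhi; ++j) {
      const int i = d - j;
      const double west  = (i > 0) ? psi[(i - 1) * ny + j] : 1.0;
      const double south = (j > 0) ? psi[i * ny + (j - 1)] : 1.0;
      psi[i * ny + j] = 0.5 * (west + south);    // stand-in update
    }
  }
}

int main() {
  const int nx = 512, ny = 512;
  std::vector<double> psi(nx * ny, 0.0);
  sweep(nx, ny, psi);
  return 0;
}
```

The available parallelism grows and shrinks with the diagonal length, which is why extracting enough node-level work for a GPU is the crux of the paper's contribution.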
International Workshop on OpenMP | 2017
Matt J Martineau; Simon N McIntosh-Smith
This research considers the productivity, portability, and performance offered by the OpenMP parallel programming model, from the perspective of scientific applications. We discuss important considerations for scientific application developers tackling large software projects with OpenMP, including straightforward code mechanisms to improve productivity and portability. Performance results are presented across multiple modern HPC devices, including Intel Xeon and Xeon Phi CPUs, POWER8 CPUs, and NVIDIA GPUs. The results are collected for three exemplar applications: hydrodynamics, heat conduction, and neutral particle transport, using modern compilers with OpenMP support. The results show that while current OpenMP implementations are able to achieve good performance on the breadth of modern hardware for memory bandwidth bound applications, our memory latency bound application performs less consistently.
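One straightforward mechanism of this kind is keeping a kernel single-source and choosing host or device execution at run time; a sketch under the assumption of OpenMP 4.5's if(target:) modifier, with illustrative names rather than the paper's exact code:

```cpp
#include <vector>

// Single-source kernel: the same loop runs on host threads or on the
// device depending on a runtime flag. The if(target: ...) modifier is
// OpenMP 4.5; 'use_gpu' is an illustrative name.
void copy(int n, const double* in, double* out, bool use_gpu) {
  #pragma omp target teams distribute parallel for if(target: use_gpu) \
              map(to: in[0:n]) map(from: out[0:n])
  for (int i = 0; i < n; ++i)
    out[i] = in[i];
}

int main() {
  std::vector<double> in(1024, 1.0), out(1024, 0.0);
  copy(1024, in.data(), out.data(), /*use_gpu=*/false);  // host fallback
  return 0;
}
```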
International Conference on Cluster Computing | 2017
Matt J Martineau; Simon N McIntosh-Smith
The arch project is a suite of mini-apps that have been developed with consistent coding practices, under a common infrastructural layer. Great emphasis has been placed on making the applications concise and easy to manipulate, while capturing the key performance characteristics of their proxied algorithmic classes. The suite is intended for traditional exploration of performance, portability and productivity on modern HPC architectures, but also introduces the potential for focussing on those characteristics of production application stacks that are not generally exposed with isolated mini-app developments. In this paper we discuss the implementation of each of the mini-apps, and present key findings from the development and optimisation process, alongside details of important future research directions.
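As a rough illustration of what a common infrastructural layer buys, consider this hypothetical sketch (all names invented, not the arch project's real interface): the mesh, profiling, and similar services are written once, and each mini-app supplies only its kernels.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical shared infrastructure; every name is illustrative.
struct Mesh {
  int nx, ny;                 // local structured-mesh dimensions
  std::vector<double> field;  // flattened cell-centred storage
  Mesh(int x, int y) : nx(x), ny(y), field(x * y, 0.0) {}
};

// Infrastructure written once, shared by every mini-app:
void profile_start(const char* name) { std::printf("start %s\n", name); }
void profile_stop(const char* name)  { std::printf("stop  %s\n", name); }

// A mini-app then only supplies its kernels against the common types:
void heat_step(Mesh& m) {
  profile_start("heat_step");
  for (double& v : m.field) v *= 0.99;  // stand-in kernel
  profile_stop("heat_step");
}

int main() {
  Mesh m(64, 64);
  heat_step(m);
  return 0;
}
```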
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
Matt J Martineau; Simon N McIntosh-Smith; Carlo Bertolli; Arpith C. Jacob; Samuel F. Antao; Alexandre E. Eichenberger; Gheorghe-Teodor Bercea; Tong Chen; Tian Jin; Kevin O'Brien; Georgios Rokos; Hyojin Sung; Zehra Sura
The Clang implementation of OpenMP 4.5 now provides full support for the specification, offering the only open source option for targeting NVIDIA GPUs. While using OpenMP allows portability across different architectures, matching native CUDA performance without major code restructuring is an open research issue. In order to analyze the current performance, we port a suite of representative benchmarks, and the mature mini-apps TeaLeaf, CloverLeaf, and SNAP, to the Clang OpenMP 4.5 compiler. We then collect performance results for those ports, and their equivalent CUDA ports, on an NVIDIA Kepler GPU. Through manual analysis of the generated code, we are able to discover the root cause of the performance differences between OpenMP and CUDA. A number of improvements can be made to the existing compiler implementation to enable performance that approaches that of hand-optimized CUDA. Our first observation was that the generated code did not use fused multiply-add instructions, which was resolved using an existing flag. Next we saw that the compiler was not passing any loads through non-coherent cache, and added a new flag to the compiler to assist with this problem. We then observed that the compiler partitioning of threads and teams could be improved upon for the majority of kernels, which guided work to ensure that the compiler can pick more optimal defaults. We uncovered a register allocation issue with the existing implementation that, when fixed alongside the other issues, enables performance that is close to CUDA. Finally, we use some different kernels to emphasize that support for managing memory hierarchies needs to be introduced into the specification, and propose a simple option for programming shared caches.
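For context, a hedged sketch of the kind of kernel and build line involved. The flags shown are real Clang options, but the paper does not name its flags, so treating -ffp-contract=fast as the FMA fix is an assumption on our part:

```cpp
// Build sketch with Clang's OpenMP offloading support:
//
//   clang++ -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
//           -ffp-contract=fast fma_kernel.cpp
//
// -ffp-contract=fast permits the backend to fuse a[i]*b[i] + c[i]
// into fused multiply-add (FMA) instructions.
#include <vector>

void fma_kernel(int n, const double* a, const double* b, double* c) {
  #pragma omp target teams distribute parallel for \
              map(to: a[0:n], b[0:n]) map(tofrom: c[0:n])
  for (int i = 0; i < n; ++i)
    c[i] += a[i] * b[i];     // FMA candidate
}

int main() {
  std::vector<double> a(1024, 1.0), b(1024, 2.0), c(1024, 0.0);
  fma_kernel(1024, a.data(), b.data(), c.data());
  return 0;
}
```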
International Conference on Cluster Computing | 2017
Matt J Martineau; Simon N McIntosh-Smith
In this research we describe the development and optimisation of a new Monte Carlo neutral particle transport mini-app, neutral. In spite of the success of previous research efforts to load balance the algorithm at scale, it is not clear how to take advantage of the diverse architectures being installed in the newest supercomputers. We explore different algorithmic approaches, and perform extensive investigations into the performance of the application on modern hardware including Intel Xeon and Xeon Phi CPUs, POWER8 CPUs, and NVIDIA GPUs. When applied to particle transport, the Monte Carlo method is not the embarrassingly parallel workload one might expect, due to dependencies on the computational mesh that expose random memory access patterns. The algorithm requires the use of atomic operations, and exhibits load imbalance at the node level due to the random branching of particle histories. These algorithmic characteristics make it challenging to exploit the high memory bandwidth and FLOPS of modern HPC architectures. Both of the parallelisation schemes discussed in this paper are dominated by latency issues caused by poor data locality, and are restricted by the use of atomic operations for tallying calculations. We saw a significant improvement in performance through the use of hyperthreading on all CPUs, and the best performance on the NVIDIA P100 GPU. A key observation is that architectures that are tolerant of latencies may be able to hide the negative characteristics of such algorithms.
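The tallying restriction arises because concurrently processed particles can score into the same mesh cell; a simplified sketch of the pattern (hypothetical names, not the mini-app's source):

```cpp
#include <vector>

// Particles are processed in parallel but may score into the same
// cell, so updates must be atomic.
void tally(int num_particles, const int* cell_of_particle,
           const double* score, double* cell_tally) {
  #pragma omp parallel for
  for (int p = 0; p < num_particles; ++p) {
    const int c = cell_of_particle[p];  // effectively random access
    #pragma omp atomic
    cell_tally[c] += score[p];          // serialises colliding updates
  }
}

int main() {
  const int np = 1 << 20, ncells = 4096;
  std::vector<int> cell(np);
  std::vector<double> score(np, 1.0), tallies(ncells, 0.0);
  for (int p = 0; p < np; ++p)
    cell[p] = static_cast<int>((p * 2654435761u) % ncells);  // scatter
  tally(np, cell.data(), score.data(), tallies.data());
  return 0;
}
```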
Concurrency and Computation: Practice and Experience | 2017
Matt J Martineau; Simon N McIntosh-Smith; Wayne Gaudin
In this work, we evaluate several emerging parallel programming models: Kokkos, RAJA, OpenACC, and OpenMP 4.0, against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application, or mini-app, that solves the heat conduction equation and belongs to the Mantevo Project. We find that the best performance is achieved with architecture-specific implementations but that, in many cases, the performance-portable models are able to solve the same problems to within a 5% to 30% performance penalty. While the models expose varying levels of complexity to the developer, they all achieve reasonable performance with this application. As such, if this small performance penalty is permissible for a problem domain, we believe that productivity and development complexity can be considered the major differentiators when choosing a modern parallel programming model to develop applications like TeaLeaf.
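To complement the Kokkos sketch given earlier in this list, the same style of port expressed with RAJA (again a hypothetical kernel, not TeaLeaf's actual code); only the execution policy changes between CPU and GPU builds:

```cpp
#include <RAJA/RAJA.hpp>
#include <vector>

// Illustrative RAJA version of a simple bandwidth-bound update; swap in
// RAJA::cuda_exec<256> (and mark the lambda RAJA_DEVICE) to retarget
// the kernel without touching its body.
using exec_policy = RAJA::omp_parallel_for_exec;

void scale(int n, double* a, double s) {
  RAJA::forall<exec_policy>(RAJA::RangeSegment(0, n), [=](int i) {
    a[i] *= s;
  });
}

int main() {
  std::vector<double> a(1 << 20, 1.0);
  scale(static_cast<int>(a.size()), a.data(), 2.0);
  return 0;
}
```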