
Publication


Featured research published by Wayne Gaudin.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Accelerating Hydrocodes with OpenACC, OpenCL and CUDA

J. A. Herdman; Wayne Gaudin; Simon N McIntosh-Smith; Michael Boulton; David A. Beckingsale; Andrew C. Mallinson; Stephen A. Jarvis

Hardware accelerators such as GPGPUs are becoming increasingly common in HPC platforms and their use is widely recognised as being one of the most promising approaches for reaching exascale levels of performance. Large HPC centres, such as AWE, have made huge investments in maintaining their existing scientific software codebases, the vast majority of which were not designed to effectively utilise accelerator devices. Consequently, HPC centres will have to decide how to develop their existing applications to take best advantage of future HPC system architectures. Given limited development and financial resources, it is unlikely that all potential approaches will be evaluated for each application. We are interested in how this decision making can be improved, and this work seeks to directly evaluate three candidate technologies (OpenACC, OpenCL and CUDA) in terms of performance, programmer productivity, and portability using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. We find that OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA.
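
As a rough illustration of the productivity gap the paper measures, the sketch below offloads a simple cell-centred update with a single OpenACC directive; the kernel and variable names are hypothetical, not taken from the paper's hydrocode.

```cpp
// Hypothetical sketch: accelerating a cell-centred update loop with OpenACC.
// The physics here is a placeholder, not the paper's Lagrangian-Eulerian code.
#include <vector>

void advance_energy(std::vector<double>& energy,
                    const std::vector<double>& density,
                    const std::vector<double>& pressure,
                    double dt)
{
    const int n = static_cast<int>(energy.size());
    double*       e = energy.data();
    const double* d = density.data();
    const double* p = pressure.data();

    // One directive offloads the loop and manages data movement; the
    // equivalent CUDA or OpenCL version needs explicit kernels and buffers.
    #pragma acc parallel loop copy(e[0:n]) copyin(d[0:n], p[0:n])
    for (int i = 0; i < n; ++i) {
        e[i] -= dt * p[i] / d[i];  // simplified energy update
    }
}
```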


Programming Models and Applications for Multicores and Manycores | 2016

An Evaluation of Emerging Many-Core Parallel Programming Models

Matt J Martineau; Simon N McIntosh-Smith; Michael Boulton; Wayne Gaudin

In this work we directly evaluate several emerging parallel programming models (Kokkos, RAJA, OpenACC, and OpenMP 4.0) against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application, or mini-app, that solves the heat conduction equation and belongs to the Mantevo suite of applications. We find that the best performance is achieved with device-tuned implementations but that, in many cases, the performance portable models are able to solve the same problems to within a 5-20% performance penalty. The models expose varying levels of complexity to the developer, and they all present reasonable performance. We believe that complexity will become the major influencer in the long-term adoption of such models.
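
For context, here is a minimal Kokkos sketch of a TeaLeaf-style heat conduction (Jacobi) step, written once and compiled for whichever backend (CUDA, OpenMP, ...) is enabled; the kernel and names are illustrative, not drawn from TeaLeaf itself.

```cpp
// Hypothetical sketch: a 5-point Jacobi step for the heat equation written
// in Kokkos; the execution space is selected at compile time.
#include <Kokkos_Core.hpp>

using Field = Kokkos::View<double**>;

void jacobi_step(Field u_new, Field u_old, double rx, double ry)
{
    const int nx = static_cast<int>(u_old.extent(0));
    const int ny = static_cast<int>(u_old.extent(1));

    Kokkos::parallel_for("jacobi",
        Kokkos::MDRangePolicy<Kokkos::Rank<2>>({1, 1}, {nx - 1, ny - 1}),
        KOKKOS_LAMBDA(const int i, const int j) {
            u_new(i, j) = (u_old(i, j)
                           + rx * (u_old(i + 1, j) + u_old(i - 1, j))
                           + ry * (u_old(i, j + 1) + u_old(i, j - 1)))
                          / (1.0 + 2.0 * rx + 2.0 * ry);
        });
}
```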


International Parallel and Distributed Processing Symposium | 2016

Evaluating OpenMP 4.0's Effectiveness as a Heterogeneous Parallel Programming Model

Matt J Martineau; Simon N McIntosh-Smith; Wayne Gaudin

Although the OpenMP 4.0 standard has been available since 2013, support for GPUs has been absent up until very recently, with only a handful of experimental compilers available. In this work we evaluate the performance of Cray's new NVIDIA GPU-targeting implementation of OpenMP 4.0, with the mini-apps TeaLeaf, CloverLeaf and BUDE. We successfully port each of the applications, using a simple and consistent design throughout, and achieve performance on an NVIDIA K20X that is comparable to Cray's OpenACC in all cases. BUDE, a compute-bound code, required 2.2x the runtime of an equivalently optimised CUDA code, which we believe is caused by an inflated frequency of control flow operations and less efficient arithmetic optimisation. Impressively, both TeaLeaf and CloverLeaf, memory-bandwidth-bound codes, only required 1.3x the runtime of hand-optimised CUDA implementations. Overall, we find that OpenMP 4.0 is a highly usable open standard capable of performant heterogeneous execution, making it a promising option for scientific application developers.
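
A minimal sketch of the OpenMP 4.0 offload style evaluated here, assuming a hypothetical bandwidth-bound kernel; the directives are standard OpenMP 4.0, but the function and names are illustrative.

```cpp
// Hypothetical sketch: offloading a bandwidth-bound loop with OpenMP 4.0.
// 'target' moves execution to the device; 'teams distribute parallel for'
// spreads iterations across the GPU's thread hierarchy.
void axpy(double* y, const double* x, double alpha, int n)
{
    #pragma omp target teams distribute parallel for \
        map(tofrom: y[0:n]) map(to: x[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```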


International Conference on Parallel Processing | 2015

Resident Block-Structured Adaptive Mesh Refinement on Thousands of Graphics Processing Units

David A. Beckingsale; Wayne Gaudin; Andrew Herdman; Stephen A. Jarvis

Block-structured adaptive mesh refinement (AMR) is a technique that can be used when solving partial differential equations to reduce the number of cells necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a resident GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an 8-node cluster, and 4,196 nodes of Oak Ridge National Laboratory's Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and is scalable on 4,196 K20x GPUs using a combination of MPI and CUDA.
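
The sketch below suggests one plausible shape for the patch classes and a refine operator like those described above; it is a simplified, host-side illustration under assumed names, not the paper's GPU-resident implementation.

```cpp
// Hypothetical sketch of a block-structured AMR hierarchy: each level holds
// rectangular patches, and a refine (prolongation) operator interpolates a
// coarse patch onto a finer one. All names are illustrative.
#include <vector>

struct Patch {
    int ix, iy;                // patch origin in level-global cell coordinates
    int nx, ny;                // patch extent in cells
    std::vector<double> data;  // cell-centred field, size nx * ny

    double  at(int i, int j) const { return data[j * nx + i]; }
    double& at(int i, int j)       { return data[j * nx + i]; }
};

struct Level {
    int refinement_ratio;      // cell-size ratio relative to the parent level
    std::vector<Patch> patches;
};

// Piecewise-constant prolongation (patch offsets ignored for brevity); a
// production library would use a conservative linear reconstruction.
void refine(const Patch& coarse, Patch& fine, int ratio)
{
    for (int j = 0; j < fine.ny; ++j)
        for (int i = 0; i < fine.nx; ++i)
            fine.at(i, j) = coarse.at(i / ratio, j / ratio);
}
```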


Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models | 2014

Experiences at scale with PGAS versions of a Hydrodynamics application

Andrew C. Mallinson; Stephen A. Jarvis; Wayne Gaudin; J. A. Herdman

In this work we directly evaluate two PGAS programming models, CAF and OpenSHMEM, as candidate technologies for improving the performance and scalability of scientific applications on future exascale HPC platforms. PGAS approaches are considered by many to represent a promising research direction with the potential to solve some of the existing problems preventing codebases from scaling to exascale levels of performance. The aim of this work is to better inform the exascale planning at large HPC centres such as AWE. Such organisations invest significant resources maintaining and updating existing scientific codebases, many of which were not designed to run at the scales required to reach exascale levels of computational performance on future system architectures. We document our approach for implementing a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application in each of these PGAS languages. Furthermore, we also present our results and experiences from scaling these different approaches to high node counts on two state-of-the-art, large scale system architectures from Cray (XC30) and SGI (ICE-X), and compare their utility against an equivalent existing MPI implementation.
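
To make the PGAS style concrete, here is a minimal OpenSHMEM sketch of a one-sided halo exchange, the neighbour communication pattern a hydrodynamics code relies on; the buffer names and sizes are hypothetical, not from the paper's ports.

```cpp
// Hypothetical sketch: a one-sided halo exchange with OpenSHMEM. Statically
// allocated globals are "symmetric", so remote PEs can write to them directly.
#include <shmem.h>

enum { HALO = 1024 };

static double halo_send[HALO];  // our boundary strip
static double halo_recv[HALO];  // written remotely by our left neighbour

int main()
{
    shmem_init();
    const int me    = shmem_my_pe();
    const int npes  = shmem_n_pes();
    const int right = (me + 1) % npes;

    for (int i = 0; i < HALO; ++i) halo_send[i] = me;  // fill boundary data

    // One-sided put: write our halo straight into the neighbour's memory,
    // with no matching receive call on the remote side.
    shmem_double_put(halo_recv, halo_send, HALO, right);
    shmem_barrier_all();  // ensure all puts are globally complete

    shmem_finalize();
    return 0;
}
```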


International Workshop on OpenMP | 2016

Pragmatic Performance Portability with OpenMP 4.x

Matt J Martineau; James Price; Simon N McIntosh-Smith; Wayne Gaudin

In this paper we investigate the current compiler technologies supporting OpenMP 4.x features targeting a range of devices, in particular, the Cray compiler 8.5.0 targeting an Intel Xeon Broadwell and NVIDIA K20x, IBM’s OpenMP 4.5 Clang branch (clang-ykt) targeting an NVIDIA K20x, the Intel compiler 16 targeting an Intel Xeon Phi Knights Landing, and GCC 6.1 targeting an AMD APU. We outline the mechanisms that they use to map the OpenMP model onto their target architectures, and conduct performance testing with a number of representative data parallel kernels. Following this we present a discussion about the current state of play in terms of performance portability and propose some straightforward guidelines for writing performance portable code, derived from our observations. At the time of writing, developers will likely have to rely on the pre-processor for certain kernels to achieve functional portability, but we expect that future homogenisation of required directives between compilers and architectures is feasible.
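
A sketch of the preprocessor-based functional portability the final sentence alludes to, with hypothetical guard macros; the directive variants are standard OpenMP 4.x, but the per-compiler split shown is illustrative.

```cpp
// Hypothetical sketch: guarding OpenMP 4.x directives per target until
// compilers homogenise the required clauses. USE_TARGET_GPU is a made-up
// build flag, not from the paper.
void daxpy(double* y, const double* x, double a, int n)
{
#if defined(USE_TARGET_GPU)
    // GPU-targeting compilers (e.g. Cray, clang-ykt) want the full hierarchy.
    #pragma omp target teams distribute parallel for map(tofrom: y[0:n]) map(to: x[0:n])
#else
    // Self-hosted many-core devices (e.g. Knights Landing) do best with
    // plain threading plus vectorisation.
    #pragma omp parallel for simd
#endif
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```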


International Journal of High Performance Computing Applications | 2018

An Improved Parallelism Scheme for Deterministic Discrete Ordinates Transport

Tom J Deakin; Simon N McIntosh-Smith; Matt J Martineau; Wayne Gaudin

In this paper we demonstrate techniques for increasing the node-level parallelism of a deterministic discrete ordinates neutral particle transport algorithm on a structured mesh to exploit many-core technologies. Transport calculations form a large part of the computational workload of physical simulations and so good performance is vital for the simulations to complete in reasonable time. We will demonstrate our approach utilizing the SNAP mini-app, which gives a simplified implementation of the full transport algorithm but remains similar enough to the real algorithm to act as a useful proxy for research purposes. We present an OpenCL implementation of our improved algorithm which achieves a speedup of up to 2.5x for the transport sweep on a many-core GPGPU device compared to a state-of-the-art multi-core node, and up to 4x compared to the multi-core CPUs in the largest GPU-enabled supercomputer, the first time this scale of speedup has been achieved for algorithms of this class. We then discuss ways to express our scheme in OpenMP 4.0 and demonstrate the performance on an Intel Knights Corner Xeon Phi compared to the original scheme.
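
A rough sketch of the expanded node-level parallelism described above: within one diagonal wavefront of the sweep, every (cell, angle, group) task is independent and can be exposed as a single flat iteration space. The loop structure and names are illustrative, not SNAP's.

```cpp
// Hypothetical sketch: within a wavefront, collapse cell x angle x group
// into one parallel iteration space. The per-cell solve is a placeholder.
struct Cell { int i, j, k; };

void sweep_wavefront(const Cell* cells, int ncells,
                     int nangles, int ngroups,
                     double* flux)  // [ncells][nangles][ngroups]
{
    #pragma omp parallel for collapse(3)
    for (int c = 0; c < ncells; ++c)
        for (int a = 0; a < nangles; ++a)
            for (int g = 0; g < ngroups; ++g) {
                const long idx = ((long)c * nangles + a) * ngroups + g;
                // Placeholder for the upwinded discrete ordinates solve,
                // whose inputs were fixed by earlier wavefronts.
                flux[idx] = 0.0;
            }
}
```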


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Many-Core Acceleration of a Discrete Ordinates Transport Mini-App at Extreme Scale

Tom J Deakin; Simon N McIntosh-Smith; Wayne Gaudin

Time-dependent deterministic discrete ordinates transport codes are an important class of application which provide significant challenges for large, many-core systems. One such challenge is the large memory capacity needed by the solve step, which requires us to have a scalable solution in order to have enough node-level memory to store all the data. In our previous work, we demonstrated the first implementation which showed a significant performance benefit for single node solves using GPUs. In this paper we extend our work to large problems and demonstrate the scalability of our solution on two Petascale GPU-based supercomputers: Titan at Oak Ridge and Piz Daint at CSCS. Our results show that our improved node-level parallelism scheme scales just as well across large systems as previous approaches when using the tried and tested KBA domain decomposition technique. We validate our results against an improved performance model which predicts the runtime of the main ‘sweep’ routine when running on different hardware, including CPUs or GPUs.
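
For background, performance models of sweep runtime typically build on the classic KBA pipeline argument; the toy function below computes that idealised stage count. This is a textbook sketch under simplifying assumptions, not the paper's actual model.

```cpp
// Textbook KBA sketch: on a px x py processor grid, a sweep from one corner
// must fill a pipeline of px + py - 2 stages before each rank's nchunks work
// chunks can drain. Per-octant and idealised; communication costs omitted.
double kba_sweep_time(int px, int py, int nchunks, double t_chunk)
{
    const int stages = px + py - 2 + nchunks;
    return stages * t_chunk;
}
```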


Measurement and Modeling of Computer Systems | 2011

Benchmarking and modelling of POWER7, Westmere, BG/P, and GPUs: an industry case study

J. A. Herdman; Wayne Gaudin; David Turland; Simon D. Hammond

This paper introduces an industry-strength, multi-purpose benchmark: Shamrock. Developed at the Atomic Weapons Establishment (AWE), Shamrock is a two-dimensional (2D) structured hydrocode; one of its aims is to assess the impacts of a change in hardware, and (in conjunction with a larger HPC Benchmark Suite) to provide guidance in procurement of future systems. A suitable test problem is described and executed on a local, high-end workstation for a range of compilers and MPI implementations. Based on these observations, specific configurations are subsequently built and executed on a selection of HPC architectures, including Intel's Nehalem and Westmere microarchitectures, IBM's POWER5, POWER6, POWER7, BlueGene/L and BlueGene/P, and AMD's Opteron chipset. Comparisons are made between these architectures for the Shamrock benchmark, and relative compute resources are specified that deliver similar time to solution, along with their associated power budgets. Additionally, performance comparisons are made for a port of the benchmark to a Nehalem-based cluster accelerated with Tesla C1060 GPUs, with details of the port, and extrapolations to possible performance of the GPU.


International Conference on Cluster Computing | 2015

Expressing Parallelism on Many-Core for Deterministic Discrete Ordinates Transport

Tom J Deakin; Simon N McIntosh-Smith; Wayne Gaudin

In this paper we demonstrate techniques for increasing the node-level parallelism of a deterministic discrete ordinates neutral particle transport algorithm on a structured mesh to exploit many-core technologies. Transport calculations form a large part of the computational workload of physical simulations and so good performance is vital for the simulations to complete in reasonable time. We will demonstrate our approach utilizing the SNAP mini-app, which gives a simplified implementation of the full transport algorithm but remains similar enough to the real algorithm to act as a useful proxy for research purposes. We present an OpenCL implementation of our improved algorithm which demonstrates a speedup of up to 2.5x for the transport sweep on a many-core GPGPU device compared to a state-of-the-art multi-core node, the first time this scale of speedup has been achieved for algorithms of this class.
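
To suggest the shape such an OpenCL implementation might take, here is a hypothetical kernel for one sweep wavefront, with one work-item per (cell, angle, group) task; the host-side setup is omitted and nothing here is taken from the paper's code.

```cpp
// Hypothetical sketch: an OpenCL kernel for one sweep wavefront, kept as a
// source string for clCreateProgramWithSource. All names are illustrative.
static const char* kSweepKernel = R"CLC(
__kernel void sweep_plane(const int nangles,
                          const int ngroups,
                          __global const int* plane_cells,  // cells in this wavefront
                          __global double* flux)
{
    // One work-item per (cell-in-plane, angle, group) task.
    const int c = get_global_id(0);
    const int a = get_global_id(1);
    const int g = get_global_id(2);

    const int cell = plane_cells[c];
    const long idx = ((long)cell * nangles + a) * ngroups + g;

    // Placeholder for the upwinded per-cell transport solve.
    flux[idx] = 0.0;
}
)CLC";
```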

Collaboration


Dive into Wayne Gaudin's collaboration.

Top Co-Authors


J. A. Herdman

Atomic Weapons Establishment


Paul Garrett

Atomic Weapons Establishment


David Beckingsale

Lawrence Livermore National Laboratory
