
Publication


Featured research published by David A. Beckingsale.


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

Accelerating Hydrocodes with OpenACC, OpenCL and CUDA

J. A. Herdman; Wayne Gaudin; Simon N McIntosh-Smith; Michael Boulton; David A. Beckingsale; Andrew C. Mallinson; Stephen A. Jarvis

Hardware accelerators such as GPGPUs are becoming increasingly common in HPC platforms and their use is widely recognised as being one of the most promising approaches for reaching exascale levels of performance. Large HPC centres, such as AWE, have made huge investments in maintaining their existing scientific software codebases, the vast majority of which were not designed to effectively utilise accelerator devices. Consequently, HPC centres will have to decide how to develop their existing applications to take best advantage of future HPC system architectures. Given limited development and financial resources, it is unlikely that all potential approaches will be evaluated for each application. We are interested in how this decision making can be improved, and this work seeks to directly evaluate three candidate technologies (OpenACC, OpenCL and CUDA) in terms of performance, programmer productivity, and portability using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. We find that OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA.
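To illustrate the productivity gap the paper measures, the sketch below contrasts a directive-annotated loop (OpenACC) with an equivalent explicit CUDA kernel for a simple hydro-style update. The array names and the update itself are illustrative assumptions, not code from the mini-application.

```cpp
// A minimal sketch contrasting the two programming models; the arrays and
// the update are hypothetical, not taken from the paper's hydrocode.

// OpenACC: the serial loop is annotated and the compiler generates the GPU
// kernel and manages the data movement named in the clauses.
void update_energy_acc(int n, const double* density,
                       const double* pressure, double* energy) {
    #pragma acc parallel loop copyin(density[0:n], pressure[0:n]) \
                              copyout(energy[0:n])
    for (int i = 0; i < n; ++i) {
        energy[i] = pressure[i] / density[i];
    }
}

// CUDA: the same update requires an explicit kernel plus (not shown) device
// allocation, host<->device transfers, and a launch configuration.
__global__ void update_energy_kernel(int n, const double* density,
                                     const double* pressure, double* energy) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        energy[i] = pressure[i] / density[i];
    }
}
```

The boilerplate that CUDA (and OpenCL) require around each kernel is precisely the programmer-productivity cost the study evaluates against the directive-based approach.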


International Conference on Parallel Processing | 2015

Resident Block-Structured Adaptive Mesh Refinement on Thousands of Graphics Processing Units

David A. Beckingsale; Wayne Gaudin; Andrew Herdman; Stephen A. Jarvis

Block-structured adaptive mesh refinement (AMR) is a technique that can be used when solving partial differential equations to reduce the number of cells necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a resident GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an 8 node cluster, and 4,196 nodes of Oak Ridge National Laboratory's Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and is scalable on 4,196 K20x GPUs using a combination of MPI and CUDA.
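As a concrete picture of one of the data-parallel operators the abstract mentions, here is a minimal sketch of 2:1 coarsening on a 2D patch. The row-major layout and function signature are assumptions for illustration, not the library's interface; in the implementation described, this loop nest would run as a data-parallel GPU operator.

```cpp
// A minimal sketch, assuming 2:1 refinement and row-major patch storage.
// Coarsening: each coarse cell takes the average of the 2x2 fine cells
// that it covers.
void coarsen_patch(int coarse_nx, int coarse_ny,
                   const double* fine, double* coarse) {
    const int fine_nx = 2 * coarse_nx;
    for (int j = 0; j < coarse_ny; ++j) {
        for (int i = 0; i < coarse_nx; ++i) {
            const int fi = 2 * i, fj = 2 * j;
            coarse[j * coarse_nx + i] =
                0.25 * (fine[fj * fine_nx + fi] +
                        fine[fj * fine_nx + fi + 1] +
                        fine[(fj + 1) * fine_nx + fi] +
                        fine[(fj + 1) * fine_nx + fi + 1]);
        }
    }
}
```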


EPEW'12: Proceedings of the 9th European Conference on Computer Performance Engineering | 2012

Optimisation of patch distribution strategies for AMR applications

David A. Beckingsale; O. F. J. Perks; W. P. Gaudin; J. A. Herdman; Stephen A. Jarvis

As core counts increase in the world's most powerful supercomputers, applications are becoming limited not only by computational power, but also by data availability. In the race to exascale, efficient and effective communication policies are key to achieving optimal application performance. Applications using adaptive mesh refinement (AMR) trade off communication for computational load balancing, to enable the focused computation of specific areas of interest. This class of application is particularly susceptible to the communication performance of the underlying architectures, and is inherently difficult to scale efficiently. In this paper we present a study of the effect of patch distribution strategies on the scalability of an AMR code. We demonstrate the significance of patch placement on communication overheads, and by balancing the computation and communication costs of patches, we develop a scheme to optimise the performance of a specific, industry-strength benchmark application.
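One simple distribution strategy in the family studied here is a greedy load-balancing assignment of patches to ranks. The sketch below is a hypothetical illustration, assuming patch ids numbered 0..n-1 and a per-patch compute cost; the Patch structure and cost field are assumptions, not the benchmark's data structures.

```cpp
#include <algorithm>
#include <vector>

struct Patch {
    int id;               // assumed to be in [0, patches.size())
    double compute_cost;  // e.g. proportional to the patch's cell count
};

// Greedy longest-processing-time assignment: sort patches by descending
// cost, then give each patch to the currently least-loaded rank. Keeping
// neighbouring patches on the same rank (not shown) would further reduce
// the communication overheads the paper balances against computation.
std::vector<int> distribute(std::vector<Patch> patches, int num_ranks) {
    std::vector<int> owner(patches.size());
    std::vector<double> load(num_ranks, 0.0);
    std::sort(patches.begin(), patches.end(),
              [](const Patch& a, const Patch& b) {
                  return a.compute_cost > b.compute_cost;
              });
    for (const Patch& p : patches) {
        int r = static_cast<int>(
            std::min_element(load.begin(), load.end()) - load.begin());
        owner[p.id] = r;
        load[r] += p.compute_cost;
    }
    return owner;
}
```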


The Computer Journal | 2013

Towards Automated Memory Model Generation Via Event Tracing

O. F. J. Perks; David A. Beckingsale; Simon D. Hammond; I. Miller; J. A. Herdman; A. Vadgama; Abhir Bhalerao; Ligang He; Stephen A. Jarvis

The importance of memory performance and capacity is a growing concern for high performance computing laboratories around the world. It has long been recognized that improvements in processor speed exceed the rate of improvement in dynamic random access memory speed and, as a result, memory access times can be the limiting factor in high performance scientific codes. The use of multi-core processors exacerbates this problem with the rapid growth in the number of cores not being matched by similar improvements in memory capacity, increasing the likelihood of memory contention. In this paper, we present WMTools, a lightweight memory tracing tool and analysis framework for parallel codes, which is able to identify peak memory usage and also analyse per-function memory use over time. An evaluation of WMTools, in terms of its effectiveness and also its overheads, is performed using nine established scientific applications/benchmark codes representing a variety of programming languages and scientific domains. We also show how WMTools can be used to automatically generate a parameterized memory model for one of these applications, a two-dimensional non-linear magnetohydrodynamics application, Lare2D. Through the memory model we are able to identify an unexpected growth term which becomes dominant at scale. With a refined model we are able to predict memory consumption with under 7% error.
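The core bookkeeping behind allocation tracing of this kind can be sketched as follows. The trace_malloc/trace_free wrappers are hypothetical names used only to illustrate tracking live usage and the high-water mark; WMTools itself traces allocation events transparently and attributes per-function memory use over time.

```cpp
#include <cstddef>
#include <cstdlib>
#include <unordered_map>

// Map each live allocation to its size so frees can be accounted for.
static std::unordered_map<void*, std::size_t> live;
static std::size_t current_bytes = 0;
static std::size_t peak_bytes = 0;  // the memory high-water mark

void* trace_malloc(std::size_t size) {
    void* p = std::malloc(size);
    if (p) {
        live[p] = size;
        current_bytes += size;
        if (current_bytes > peak_bytes) peak_bytes = current_bytes;
    }
    return p;
}

void trace_free(void* p) {
    auto it = live.find(p);
    if (it != live.end()) {
        current_bytes -= it->second;
        live.erase(it);
    }
    std::free(p);
}
```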


EPEW'12: Proceedings of the 9th European Conference on Computer Performance Engineering | 2012

Performance modelling of magnetohydrodynamics codes

Robert F. Bird; Steven A. Wright; David A. Beckingsale; Stephen A. Jarvis

Performance modelling is an important tool utilised by the High Performance Computing industry to accurately predict the run-time of science applications on a variety of different architectures. Performance models aid in procurement decisions and help to highlight areas for possible code optimisations. This paper presents a performance model for a magnetohydrodynamics physics application, Lare. We demonstrate that this model is capable of accurately predicting the run-time of Lare across multiple platforms with an accuracy of 90% (for both strong and weak scaled problems). We then utilise this model to evaluate the performance of future optimisations. The model is generated using SST/macro, the machine level component of the Structural Simulation Toolkit (SST) from Sandia National Laboratories, and is validated on both a commodity cluster located at the University of Warwick and a large scale capability resource located at Lawrence Livermore National Laboratory.
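The kind of analytic model such a study produces can be sketched as a sum of compute and communication terms. The parameters and cost terms below are illustrative assumptions, not the published Lare model, which is built and validated with SST/macro.

```cpp
// A minimal sketch of an analytic runtime model for one timestep of a 2D
// stencil-style code: local compute on an nx*ny subdomain plus a halo
// exchange of the four subdomain edges. All parameters are hypothetical.
struct Machine {
    double flops_per_sec;  // sustained compute rate per core
    double latency_sec;    // network latency per message
    double bytes_per_sec;  // network bandwidth
};

double predicted_step_time(const Machine& m, double flops_per_cell,
                           int nx, int ny, double cell_bytes) {
    double t_compute = (static_cast<double>(nx) * ny * flops_per_cell)
                       / m.flops_per_sec;
    double halo_bytes = 2.0 * (nx + ny) * cell_bytes;  // four edges
    double t_comm = 4.0 * m.latency_sec + halo_bytes / m.bytes_per_sec;
    return t_compute + t_comm;
}
```

A model in this form can be evaluated under strong scaling (shrinking nx and ny per core) or weak scaling (fixed nx and ny per core) to predict run-times across machine configurations, which is how such models support procurement and optimisation decisions.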


Archive | 2014

Parallel Block Structured Adaptive Mesh Refinement on Graphics Processing Units

David A. Beckingsale; Wayne Gaudin; Richard D. Hornung; Brian T. N. Gunney; Todd Gamblin; J. A. Herdman; Stephen A. Jarvis

Block-structured adaptive mesh refinement is a technique that can be used when solving partial differential equations to reduce the number of zones necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a native GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an eight-node cluster, and over four thousand nodes of Oak Ridge National Laboratory’s Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and has been scaled to over four thousand GPUs using a combination of MPI and CUDA.
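The inter-node GPU transfer routines mentioned above typically reduce to an MPI halo exchange over device buffers. The sketch below assumes a CUDA-aware MPI implementation that accepts device pointers directly, and the buffer and neighbour names are illustrative; it is not the library's actual interface.

```cpp
#include <mpi.h>

// Exchange halo buffers with left/right neighbour ranks. d_send_*/d_recv_*
// are device pointers; with CUDA-aware MPI they can be passed straight to
// MPI calls, otherwise each side needs an explicit cudaMemcpy staging step.
void exchange_halos(double* d_send_left, double* d_recv_left,
                    double* d_send_right, double* d_recv_right,
                    int count, int left, int right, MPI_Comm comm) {
    MPI_Request reqs[4];
    MPI_Irecv(d_recv_left,  count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(d_recv_right, count, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(d_send_left,  count, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(d_send_right, count, MPI_DOUBLE, right, 0, comm, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}
```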


International Conference on High Performance Computing and Simulation | 2013

Analysing the influence of InfiniBand choice on OpenMPI memory consumption

O. F. J. Perks; David A. Beckingsale; A. S. Dawes; J. A. Herdman; Cyril Mazauric; Stephen A. Jarvis

The ever-increasing scale of modern high performance computing platforms poses challenges for system architects and code developers alike. The increase in core count densities and associated cost of components is having a dramatic effect on the viability of high memory-per-core ratios. Whilst the available memory per core is decreasing, the increased scale of parallel jobs is testing the efficiency of MPI implementations with respect to memory overhead. Scalability issues have always plagued both hardware manufacturers and software developers, and the combined effects can be disabling. In this paper we address the issue of MPI memory consumption with regard to InfiniBand network communications. We reaffirm some widely held beliefs regarding the existence of scalability problems under certain conditions. Additionally, we present results testing memory-optimised runtime configurations and vendor provided optimisation libraries. Using Orthrus, a linear solver benchmark developed by AWE, we demonstrate these memory-centric optimisations and their performance implications. We show the growth of OpenMPI memory consumption (demonstrating poor scalability) on both Mellanox and QLogic InfiniBand platforms. We demonstrate a 616× increase in MPI memory consumption for a 64× increase in core count, with a default OpenMPI configuration on Mellanox. Through the use of the Mellanox MXM and QLogic PSM optimisation libraries we are able to observe a 117× and 115× reduction in MPI memory at the application memory high-water mark. This significantly improves the potential scalability of the code.
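A simple way to observe the per-process memory high-water mark that experiments like this report is to read VmHWM from /proc/self/status on Linux. This sketch is an assumption about one possible measurement approach, not the instrumentation used in the paper.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Returns the process's peak resident set size in kilobytes (VmHWM),
// or 0 if the value cannot be read.
long peak_rss_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmHWM:", 0) == 0) {  // line starts with "VmHWM:"
            std::istringstream iss(line.substr(6));
            long kb = 0;
            iss >> kb;
            return kb;
        }
    }
    return 0;
}
```

Calling such a probe on every rank after the solve, and reducing the values with MPI, is one way to compare the memory footprints of different runtime configurations and transport libraries.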


Archive | 2013

Towards Portable Performance for Explicit Hydrodynamics Codes

Andrew C. Mallinson; David A. Beckingsale; Wayne Gaudin; J. A. Herdman; Stephen A. Jarvis


Archive | 2013

CloverLeaf: Preparing Hydrodynamics Codes for Exascale

Andrew C. Mallinson; David A. Beckingsale; Wayne Gaudin; J. A. Herdman; J. M. Levesque; Stephen A. Jarvis


Proceedings of the First Workshop on Accelerator Programming using Directives | 2014

Achieving portability and performance through OpenACC

J. A. Herdman; Wayne Gaudin; O. F. J. Perks; David A. Beckingsale; Andrew C. Mallinson; Stephen A. Jarvis

Collaboration


Dive into David A. Beckingsale's collaboration.

Top Co-Authors

J. A. Herdman
Atomic Weapons Establishment

Wayne Gaudin
Atomic Weapons Establishment

A. Vadgama
Atomic Weapons Establishment