J. A. Herdman
Atomic Weapons Establishment
Publications
Featured research published by J. A. Herdman.
IEEE International Conference on High Performance Computing, Data and Analytics | 2012
J. A. Herdman; Wayne Gaudin; Simon N McIntosh-Smith; Michael Boulton; David A. Beckingsale; Andrew C. Mallinson; Stephen A. Jarvis
Hardware accelerators such as GPGPUs are becoming increasingly common in HPC platforms, and their use is widely recognised as one of the most promising approaches for reaching exascale levels of performance. Large HPC centres, such as AWE, have made huge investments in maintaining their existing scientific software codebases, the vast majority of which were not designed to utilise accelerator devices effectively. Consequently, HPC centres will have to decide how to develop their existing applications to take best advantage of future HPC system architectures. Given limited development and financial resources, it is unlikely that all potential approaches will be evaluated for each application. We are interested in how this decision making can be improved, and this work directly evaluates three candidate technologies (OpenACC, OpenCL and CUDA) in terms of performance, programmer productivity and portability, using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. We find that OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA.
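For context, directive-based ports of the kind evaluated here typically annotate existing compute loops rather than rewriting them as device kernels. The sketch below is illustrative only, assuming a simplified cell-centred update; the loop body and variable names are assumptions, not code from the mini-application itself.

```c
/* Illustrative OpenACC sketch: a simplified explicit hydrodynamics-style
 * cell update. The loop body and names are assumptions for illustration,
 * not code from the mini-application evaluated in the paper. */
void update_energy(int n, const double *restrict density,
                   const double *restrict pressure,
                   double *restrict energy, double dt)
{
    /* A single directive exposes the loop to the accelerator; the same
     * source still compiles as plain C when OpenACC is disabled. */
    #pragma acc parallel loop copyin(density[0:n], pressure[0:n]) \
                              copy(energy[0:n])
    for (int i = 0; i < n; ++i) {
        energy[i] -= dt * pressure[i] / density[i];
    }
}
```

An equivalent OpenCL or CUDA port of the same loop requires a separate kernel source plus explicit host-side buffer and launch management, which is the productivity gap the comparison quantifies.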
Simulation Tools and Techniques for Communications, Networks and Systems | 2009
Simon D. Hammond; Gihan R. Mudalige; J. A. Smith; Stephen A. Jarvis; J. A. Herdman; A. Vadgama
There are a number of challenges facing the High Performance Computing (HPC) community, including increasing levels of concurrency (threads, cores, nodes), deeper and more complex memory hierarchies (register, cache, disk, network), mixed hardware sets (CPUs and GPUs) and increasing scale (tens or hundreds of thousands of processing elements). Assessing the performance of complex scientific applications on specialised high-performance computing architectures is difficult. In many cases, traditional computer benchmarking is insufficient as it typically requires access to physical machines of equivalent (or similar) specification and rarely relates to the potential capability of an application. A technique known as application performance modelling addresses many of these additional requirements. Modelling allows future architectures and/or applications to be explored in a mathematical or simulated setting, thus enabling hypothetical questions relating to the configuration of a potential future architecture to be assessed in terms of its impact on key scientific codes. This paper describes the Warwick Performance Prediction (WARPP) simulator, which is used to construct application performance models for complex industry-strength parallel scientific codes executing on thousands of processing cores. The capability and accuracy of the simulator are demonstrated through its application to a scientific benchmark developed by the United Kingdom Atomic Weapons Establishment (AWE). The results of the simulations are validated for two different HPC architectures, with each case demonstrating greater than 90% accuracy for run-time prediction. Simulation results, collected from runs on a standard PC, are provided for up to 65,000 processor cores. It is also shown how the addition of operating system jitter to the simulator can improve the quality of the application performance model results.
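As a rough illustration of what such a model computes, the sketch below shows the generic shape of an analytical run-time predictor, decomposing predicted time into compute and network terms. It is a minimal sketch under assumed parameters, not the WARPP model itself.

```c
/* Generic sketch of an analytical run-time predictor of the kind a
 * performance model evaluates. The functional form and coefficients are
 * illustrative assumptions, not the WARPP model. */
#include <stdio.h>

double predict_runtime(double total_work,   /* serial compute time (s)   */
                       int p,               /* processor cores           */
                       double msgs,         /* messages per core         */
                       double bytes,        /* bytes per message         */
                       double latency,      /* network latency (s)       */
                       double bandwidth)    /* bytes per second          */
{
    double t_compute = total_work / p;             /* ideal strong scaling */
    double t_comm    = msgs * (latency + bytes / bandwidth);
    return t_compute + t_comm;   /* contention and OS jitter terms omitted */
}

int main(void)
{
    /* Hypothetical inputs: 65,536 cores, 4 x 8 KiB messages per step. */
    printf("predicted run-time: %.4f s\n",
           predict_runtime(9.8e4, 65536, 4.0, 8192.0, 1.5e-6, 5.0e9));
    return 0;
}
```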
Journal of Parallel and Distributed Computing | 2013
Simon J. Pennycook; Simon D. Hammond; Steven A. Wright; J. A. Herdman; I. Miller; Stephen A. Jarvis
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures, and is shown to be 1.3-1.5x slower than native FORTRAN 77 or CUDA implementations on a single node, and 1.3-3.1x slower on multiple nodes. We also explore the potential performance gains of OpenCL's device fission capability, demonstrating up to a 3x speed-up over our original OpenCL implementation.
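Device fission, standardised in OpenCL 1.2 as clCreateSubDevices, partitions one device into sub-devices that can each be driven through their own context and queue. A minimal host-side sketch, assuming a CPU device that supports equal partitioning (the partition size of 4 compute units is an assumption for illustration):

```c
/* Minimal sketch of OpenCL device fission (OpenCL 1.2): split one CPU
 * device into sub-devices of 4 compute units each. Error handling is
 * abbreviated; the partition size is an illustrative assumption. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device, sub_devices[16];
    cl_uint n_sub = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    /* Partition the device into equal sub-devices of 4 compute units. */
    cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_EQUALLY, 4, 0
    };
    cl_int err = clCreateSubDevices(device, props, 16, sub_devices, &n_sub);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "device fission unsupported (err %d)\n", err);
        return 1;
    }
    printf("created %u sub-devices\n", n_sub);
    /* Each sub-device can now receive its own context and command queue,
     * e.g. to keep work NUMA-local, as the paper's experiments explore. */
    return 0;
}
```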
The Computer Journal | 2013
Steven A. Wright; Simon D. Hammond; Simon J. Pennycook; Robert F. Bird; J. A. Herdman; I. Miller; A. Vadgama; Abhir Bhalerao; Stephen A. Jarvis
Input/Output (I/O) operations can represent a significant proportion of the run-time of parallel scientific computing applications. Although there have been several advances in file format libraries, file system design and I/O hardware, a growing divergence exists between the performance of parallel file systems and the compute clusters that they support. In this paper, we document the design and application of the RIOT I/O toolkit (RIOT) being developed at the University of Warwick with our industrial partners at the Atomic Weapons Establishment and Sandia National Laboratories. We use the toolkit to assess the performance of three industry-standard I/O benchmarks on three contrasting supercomputers, ranging from a mid-sized commodity cluster to a large-scale proprietary IBM BlueGene/P system. RIOT provides a powerful framework in which to analyse I/O and parallel file system behaviour: we demonstrate, for example, the large file-locking overhead of IBM's General Parallel File System, which can consume nearly 30% of the total write time in the FLASH-IO benchmark. Through I/O trace analysis, we also assess the performance of HDF-5 in its default configuration, identifying a bottleneck created by the use of suboptimal Message Passing Interface hints. Furthermore, we investigate the performance gains attributed to the Parallel Log-structured File System (PLFS) being developed by EMC Corporation and the Los Alamos National Laboratory. Our evaluation of PLFS involves two high-performance computing systems with contrasting I/O backplanes and illustrates the varied improvements to I/O that result from the deployment of PLFS (ranging from up to 25× speed-up in I/O performance on a large I/O installation to 2× speed-up on the much smaller installation at the University of Warwick).
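MPI hints of the kind identified as suboptimal above are supplied through an MPI_Info object at file-open time. A hedged sketch using standard ROMIO hint names follows; the chosen values are illustrative assumptions, not the settings analysed in the paper.

```c
/* Sketch: supplying MPI-IO hints when opening a shared file. The hint
 * names are standard ROMIO hints; the values are illustrative
 * assumptions, not the configuration examined with RIOT. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    /* Enable collective buffering and widen the aggregation buffer. */
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_buffer_size", "16777216");

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... collective writes, e.g. MPI_File_write_at_all ... */
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```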
IET Software | 2009
Simon D. Hammond; Gihan R. Mudalige; J. A. Smith; Jim Davis; A. B. Mills; Stephen A. Jarvis; J. Holt; I. Miller; J. A. Herdman; A. Vadgama
The cost of state-of-the-art supercomputing resources makes each individual purchase a lengthy and expensive process. Often, each candidate architecture must be benchmarked using a variety of tools to assess likely performance. However, benchmarking alone provides only a limited insight into the suitability of each architecture for key codes, and can give potentially misleading results when assessing their scalability. In this study the authors present a case study of the application of recently developed performance models of the Chimaera benchmark code written by the United Kingdom Atomic Weapons Establishment (AWE), with a view to analysing how the code will perform and scale on a medium-sized, commodity-based InfiniBand cluster. The models are validated and demonstrate greater than 90% accuracy for an existing InfiniBand machine; the models are then used as the basis for predicting code performance on a variety of alternative hardware configurations, including changes in the underlying network, the use of faster processors and the use of a higher core density per processor. The results demonstrate the compute-bound nature of Chimaera and its sensitivity to network latency at increased processor counts. Using these insights, the authors discuss potential strategies which may be employed during the procurement of future mid-range clusters for wavefront-rich workloads.
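For readers unfamiliar with wavefront codes such as Chimaera: a sweep across a $P_x \times P_y$ processor array must fill and drain a pipeline of dependent tiles, which is why latency sensitivity grows with processor count. A generic form such cost models often take is sketched below; it is a simplification under assumed parameters, not the authors' validated model.

```latex
% Generic pipelined-wavefront cost sketch (not the paper's model):
% a sweep over a P_x x P_y logical processor array.
\[
T_{\text{sweep}} \approx
  \underbrace{(P_x + P_y - 2)}_{\text{pipeline fill/drain}}
  \,\bigl(t_{\text{comp}} + t_{\text{comm}}\bigr)
  \;+\; n_{\text{steps}}\,\bigl(t_{\text{comp}} + t_{\text{comm}}\bigr),
\qquad
t_{\text{comm}} = t_{\text{lat}} + \frac{m}{B},
\]
% t_comp: per-tile compute time; m: message size; B: bandwidth;
% t_lat: network latency. The (P_x + P_y - 2) fill term grows with the
% processor array, so per-message latency dominates at high core counts.
```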
Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models | 2014
Andrew C. Mallinson; Stephen A. Jarvis; Wayne Gaudin; J. A. Herdman
In this work we directly evaluate two PGAS programming models, CAF and OpenSHMEM, as candidate technologies for improving the performance and scalability of scientific applications on future exascale HPC platforms. PGAS approaches are considered by many to represent a promising research direction, with the potential to solve some of the existing problems preventing codebases from scaling to exascale levels of performance. The aim of this work is to better inform the exascale planning at large HPC centres such as AWE. Such organisations invest significant resources maintaining and updating existing scientific codebases, many of which were not designed to run at the scales required to reach exascale levels of computational performance on future system architectures. We document our approach for implementing a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application in each of these PGAS languages. We also present our results and experiences from scaling these different approaches to high node counts on two state-of-the-art, large-scale system architectures from Cray (XC30) and SGI (ICE-X), and compare their utility against an equivalent existing MPI implementation.
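To give a flavour of the PGAS style evaluated, the sketch below uses the OpenSHMEM C API for a one-sided halo exchange: data is written directly into a neighbour's symmetric buffer with no matching receive. The ring pattern and buffer sizes are illustrative assumptions, not the mini-application's communication scheme.

```c
/* Sketch of a one-sided halo exchange in OpenSHMEM. The ring pattern
 * and sizes are illustrative assumptions, not the mini-app's scheme. */
#include <shmem.h>
#include <stdio.h>

#define HALO 64

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int np = shmem_n_pes();

    /* Symmetric allocation: every PE holds the same remotely
     * addressable buffer. */
    double *halo = shmem_malloc(HALO * sizeof(double));
    double local[HALO];
    for (int i = 0; i < HALO; ++i) local[i] = (double)me;

    /* One-sided put: deposit our boundary directly into the right-hand
     * neighbour's halo buffer; no receive call is required. */
    int right = (me + 1) % np;
    shmem_double_put(halo, local, HALO, right);
    shmem_barrier_all();   /* ensure all puts are globally visible */

    printf("PE %d holds halo from PE %d: %g\n",
           me, (me + np - 1) % np, halo[0]);
    shmem_free(halo);
    shmem_finalize();
    return 0;
}
```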
International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2014
Gihan R. Mudalige; I. Z. Reguly; Michael B. Giles; A. C. Mallinson; W. P. Gaudin; J. A. Herdman
In this paper we present research on applying a domain-specific high-level abstraction (HLA) development strategy with the aim of “future-proofing” a key class of high performance computing (HPC) applications that simulate hydrodynamics computations at AWE plc. We build on an existing high-level abstraction framework, OPS, being developed for the solution of multi-block structured mesh-based applications at the University of Oxford. OPS uses an “active library” approach, in which a single application code written using the OPS API can be transformed into different highly optimized parallel implementations, which can then be linked against the appropriate parallel library, enabling execution on different back-end hardware platforms. The target application in this work is the CloverLeaf mini-app from Sandia National Laboratories' Mantevo suite of codes, which consists of algorithms of interest from hydrodynamics workloads. Specifically, we present (1) the lessons learnt in re-engineering an industrially representative hydrodynamics application to utilize the OPS high-level framework and subsequent code generation to obtain a range of parallel implementations, and (2) the performance of the auto-generated OPS versions of CloverLeaf compared with that of the hand-coded original CloverLeaf implementations on a range of platforms. Benchmarked systems include Intel multi-core CPUs and NVIDIA GPUs, the Archer (Cray XC30) CPU cluster and the Titan (Cray XK7) GPU cluster, with different parallelizations (OpenMP, OpenACC, CUDA, OpenCL and MPI). Our results show that developing parallel HPC applications using a high-level framework such as OPS is no more time-consuming or difficult than writing a one-off parallel program targeting only a single parallel implementation. The OPS strategy, however, pays off with a highly maintainable single application source, through which multiple parallelizations can be realized without compromising performance portability across a range of parallel systems.
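To convey the "active library" style, the sketch below is written against the OPS C/C++ API as described in the OPS documentation: the application author supplies a per-point kernel plus an ops_par_loop call, and the framework's code generator produces each backend. The kernel and data names are assumptions, and exact signatures may differ between OPS versions.

```c
/* Sketch of the OPS active-library style (names are illustrative
 * assumptions; exact signatures may differ between OPS versions). */
#include "ops_seq.h"   /* the sequential "developer" backend header */

/* User kernel operating on a single grid point via positional
 * accessor macros. */
void copy_kernel(const double *src, double *dst)
{
    dst[OPS_ACC1(0, 0)] = src[OPS_ACC0(0, 0)];
}

void copy_field(ops_block block, ops_dat src, ops_dat dst,
                int *range, ops_stencil S2D_00)
{
    /* This one call is what the OPS translator rewrites into OpenMP,
     * CUDA, OpenCL or MPI variants; the application source is unchanged. */
    ops_par_loop(copy_kernel, "copy_kernel", block, 2, range,
                 ops_arg_dat(src, 1, S2D_00, "double", OPS_READ),
                 ops_arg_dat(dst, 1, S2D_00, "double", OPS_WRITE));
}
```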
The Computer Journal | 2013
O. F. J. Perks; David A. Beckingsale; Simon D. Hammond; I. Miller; J. A. Herdman; A. Vadgama; Abhir Bhalerao; Ligang He; Stephen A. Jarvis
The importance of memory performance and capacity is a growing concern for high performance computing laboratories around the world. It has long been recognized that improvements in processor speed exceed the rate of improvement in dynamic random access memory speed and, as a result, memory access times can be the limiting factor in high performance scientific codes. The use of multi-core processors exacerbates this problem with the rapid growth in the number of cores not being matched by similar improvements in memory capacity, increasing the likelihood of memory contention. In this paper, we present WMTools, a lightweight memory tracing tool and analysis framework for parallel codes, which is able to identify peak memory usage and also analyse per-function memory use over time. An evaluation of WMTools, in terms of its effectiveness and also its overheads, is performed using nine established scientific applications/benchmark codes representing a variety of programming languages and scientific domains. We also show how WMTools can be used to automatically generate a parameterized memory model for one of these applications, a two-dimensional non-linear magnetohydrodynamics application, Lare2D. Through the memory model we are able to identify an unexpected growth term which becomes dominant at scale. With a refined model we are able to predict memory consumption with under 7% error.
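Lightweight heap tracers of this general kind are often built by interposing on the allocator. The sketch below uses the GNU linker's --wrap mechanism to track peak heap usage; it illustrates the underlying technique only and is not WMTools itself.

```c
/* Sketch of allocator interposition for peak-heap tracking, in the
 * spirit of a lightweight memory tracer (not WMTools itself).
 * Build with:  gcc -Wl,--wrap=malloc -Wl,--wrap=free tracer.c app.c
 * Not thread-safe as written; a real tracer would use atomics. */
#include <stdio.h>
#include <stdlib.h>

void *__real_malloc(size_t size);
void  __real_free(void *ptr);

/* 16-byte header preserves the payload alignment malloc guarantees. */
typedef struct { size_t size; size_t pad; } header_t;

static size_t current_bytes = 0, peak_bytes = 0;

void *__wrap_malloc(size_t size)
{
    header_t *h = __real_malloc(sizeof(header_t) + size);
    if (!h) return NULL;
    h->size = size;                     /* stash size before the payload */
    current_bytes += size;
    if (current_bytes > peak_bytes) peak_bytes = current_bytes;
    return h + 1;
}

void __wrap_free(void *ptr)
{
    if (!ptr) return;
    header_t *h = (header_t *)ptr - 1;
    current_bytes -= h->size;
    __real_free(h);
}

/* Call at exit (or per function, as per-function analysis would). */
void report_peak(void)
{
    fprintf(stderr, "peak heap usage: %zu bytes\n", peak_bytes);
}
```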
Measurement and Modeling of Computer Systems | 2011
J. A. Herdman; Wayne Gaudin; David Turland; Simon D. Hammond
This paper introduces an industry-strength, multi-purpose benchmark: Shamrock. Developed at the Atomic Weapons Establishment (AWE), Shamrock is a two-dimensional (2D) structured hydrocode; one of its aims is to assess the impact of a change in hardware and (in conjunction with a larger HPC Benchmark Suite) to provide guidance in the procurement of future systems. A suitable test problem is described and executed on a local, high-end workstation for a range of compilers and MPI implementations. Based on these observations, specific configurations are subsequently built and executed on a selection of HPC architectures, including Intel's Nehalem and Westmere microarchitectures; IBM's POWER-5, POWER-6, POWER-7, BlueGene/L and BlueGene/P; and AMD's Opteron chipset. Comparisons are made between these architectures for the Shamrock benchmark, and relative compute resources are specified that deliver a similar time to solution, along with their associated power budgets. Additionally, performance comparisons are made for a port of the benchmark to a Nehalem-based cluster accelerated with Tesla C1060 GPUs, with details of the port and extrapolations to the possible performance of the GPU.
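Cross-platform "time to solution" comparisons of this kind hinge on timing the slowest rank consistently. The sketch below shows a minimal harness of that shape; it is illustrative only, with a stand-in workload, and is not Shamrock's own driver.

```c
/* Minimal sketch of a time-to-solution harness: time the solve on every
 * rank and report the maximum, which determines wall-clock time to
 * solution. Illustrative only; the workload is a stand-in. */
#include <mpi.h>
#include <stdio.h>

/* Stand-in for the application's test problem. */
static void run_test_problem(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; ++i) x += 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);          /* synchronised start */
    double t0 = MPI_Wtime();

    run_test_problem();

    double elapsed = MPI_Wtime() - t0, slowest;
    MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("time to solution: %.3f s\n", slowest);
    MPI_Finalize();
    return 0;
}
```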
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2010
Simon D. Hammond; Gihan R. Mudalige; J. A. Smith; Jim Davis; Stephen A. Jarvis; J. Holt; I. Miller; J. A. Herdman; A. Vadgama
Modern supercomputers are growing in diversity and complexity: the arrival of technologies such as multi-core processors, general-purpose GPUs and specialised compute accelerators has increased the potential scientific delivery possible from such machines. This is not, however, without cost, including significant increases in the sophistication and complexity of supporting operating systems and software libraries. This paper documents the development and application of methods to assess the potential performance of one hardware, operating system (OS) and software stack combination against another. This is of particular interest to supercomputing centres, which routinely examine prospective software/architecture combinations and possible machine upgrades. A case study is presented that assesses the potential performance of a particle transport code on AWE's 8,000-core Cray XT3 supercomputer running images of the Catamount and the Cray Linux Environment (CLE) operating systems. This work demonstrates that, by running a number of small benchmarks on a test machine and network and observing factors such as operating system noise, it is possible to speculate as to the performance impact of upgrading from one operating system to another on the system as a whole. This use of performance modelling represents an inexpensive method of examining the likely behaviour of a large supercomputer before and after an operating system upgrade; the method is also attractive when it is desirable to minimise system downtime while exploring software-system upgrades. The results show that benchmark tests run on fewer than 256 cores would suggest that the impact (overhead) of upgrading the operating system to CLE is less than 10%; model projections suggest that this is not the case at scale.
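Operating system noise of the kind observed here is commonly measured with a fixed-work-quantum (FWQ) microbenchmark: repeat an identical unit of work many times and attribute timing variation above the minimum to interference. A hedged sketch of the idea follows; it is not the benchmark suite used in the paper.

```c
/* Fixed-work-quantum (FWQ) sketch for measuring OS noise: time many
 * repetitions of identical work; samples above the minimum indicate
 * interference from daemons, interrupts and the OS scheduler.
 * Illustrative only; not the benchmark suite used in the paper. */
#include <stdio.h>
#include <time.h>

#define REPS 10000
#define WORK 100000

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    static double sample[REPS];
    volatile double x = 1.0;

    for (int r = 0; r < REPS; ++r) {
        double t0 = now_sec();
        for (int i = 0; i < WORK; ++i) x = x * 1.0000001 + 1e-9;
        sample[r] = now_sec() - t0;
    }

    double min = sample[0], max = sample[0], sum = 0.0;
    for (int r = 0; r < REPS; ++r) {
        if (sample[r] < min) min = sample[r];
        if (sample[r] > max) max = sample[r];
        sum += sample[r];
    }
    /* (max - min) and (mean - min) are simple indicators of noise. */
    printf("min %.6f s  mean %.6f s  max %.6f s\n",
           min, sum / REPS, max);
    return 0;
}
```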