Publication


Featured research published by Steven A. Wright.


Journal of Parallel and Distributed Computing | 2013

An investigation of the performance portability of OpenCL

Simon J. Pennycook; Simon D. Hammond; Steven A. Wright; J. A. Herdman; I. Miller; Stephen A. Jarvis

This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures, and is shown to be 1.3-1.5x slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3-3.1x slower on multiple nodes. We also explore the potential performance gains of OpenCL's device fissioning capability, demonstrating up to a 3x speed-up over our original OpenCL implementation.
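Device fission partitions a single OpenCL device into independent sub-devices, each of which can back its own context and command queue. Below is a minimal sketch of the technique, assuming the OpenCL 1.2 clCreateSubDevices API (work of this era may instead have used the cl_ext_device_fission extension); the partition size of four compute units and everything else here are illustrative assumptions, not the benchmark's actual configuration.

/* Minimal sketch of OpenCL device fission via clCreateSubDevices
 * (OpenCL 1.2). Illustrative only; the paper's LU benchmark is not
 * reproduced here. Compile with: gcc fission.c -lOpenCL */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    /* Partition the CPU into sub-devices of 4 compute units each
     * (4 is an arbitrary choice for illustration). */
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_EQUALLY, 4, 0
    };
    cl_uint n = 0;
    clCreateSubDevices(device, props, 0, NULL, &n);  /* query count */

    cl_device_id sub[16];
    if (n > 16) n = 16;
    clCreateSubDevices(device, props, n, sub, NULL);
    printf("created %u sub-devices\n", n);

    /* Each sub-device can now be given its own context and queue,
     * pinning work to a subset of cores. */
    return 0;
}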


The Computer Journal | 2012

On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures

Simon J. Pennycook; Simon D. Hammond; Gihan R. Mudalige; Steven A. Wright; Stephen A. Jarvis

In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used in the solution of a number of scientific and engineering problems. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures.
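A pipelined wavefront sweep has a strict dependency: each block of grid planes can only be computed once the upstream neighbour's boundary data arrives. The skeleton below is a hedged sketch of that pattern with k-blocking, where KB planes are grouped per message; NZ, KB, compute_plane and the halo buffer sizes are all placeholder assumptions rather than the LU solver's real data layout.

/* Skeleton of a pipelined wavefront sweep with k-blocking.
 * Illustrative only: grid sizes, the halo buffer and the compute
 * kernel are placeholder assumptions. */
#include <mpi.h>

#define NZ 64  /* local planes per rank (assumed) */
#define KB  8  /* k-blocking factor: planes per message (assumed) */

static double boundary[KB][1024];           /* hypothetical halo buffer */

static void compute_plane(int k) { (void)k; /* stencil update goes here */ }

void sweep(int rank, int nranks) {
    for (int k0 = 0; k0 < NZ; k0 += KB) {
        /* Wait for the upstream neighbour's block of KB boundary planes. */
        if (rank > 0)
            MPI_Recv(boundary, KB * 1024, MPI_DOUBLE, rank - 1, k0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int k = k0; k < k0 + KB; k++)
            compute_plane(k);               /* dependencies now satisfied */

        /* Forward the block downstream: a larger KB means fewer, larger
         * messages, but a longer pipeline fill delay before the last
         * rank starts computing. */
        if (rank < nranks - 1)
            MPI_Send(boundary, KB * 1024, MPI_DOUBLE, rank + 1, k0,
                     MPI_COMM_WORLD);
    }
}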


The Computer Journal | 2013

Parallel File System Analysis Through Application I/O Tracing

Steven A. Wright; Simon D. Hammond; Simon J. Pennycook; Robert F. Bird; J. A. Herdman; I. Miller; A. Vadgama; Abhir Bhalerao; Stephen A. Jarvis

Input/Output (I/O) operations can represent a significant proportion of the run-time of parallel scientific computing applications. Although there have been several advances in file format libraries, file system design and I/O hardware, a growing divergence exists between the performance of parallel file systems and the compute clusters that they support. In this paper, we document the design and application of the RIOT I/O toolkit (RIOT) being developed at the University of Warwick with our industrial partners at the Atomic Weapons Establishment and Sandia National Laboratories. We use the toolkit to assess the performance of three industry-standard I/O benchmarks on three contrasting supercomputers, ranging from a mid-sized commodity cluster to a large-scale proprietary IBM BlueGene/P system. RIOT provides a powerful framework in which to analyse I/O and parallel file system behaviour—we demonstrate, for example, the large file locking overhead of IBM's General Parallel File System, which can consume nearly 30% of the total write time in the FLASH-IO benchmark. Through I/O trace analysis, we also assess the performance of HDF-5 in its default configuration, identifying a bottleneck created by the use of suboptimal Message Passing Interface hints. Furthermore, we investigate the performance gains attributed to the Parallel Log-structured File System (PLFS) being developed by EMC Corporation and the Los Alamos National Laboratory. Our evaluation of PLFS involves two high-performance computing systems with contrasting I/O backplanes and illustrates the varied improvements to I/O that result from the deployment of PLFS (ranging from up to 25× speed-up in I/O performance on a large I/O installation to 2× speed-up on the much smaller installation at the University of Warwick).
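MPI-IO hints of the kind implicated in the HDF-5 bottleneck are supplied through an MPI_Info object when the file is opened. The sketch below shows the mechanism using standard ROMIO collective-buffering hints; the specific keys and values are illustrative assumptions, not the tuned settings identified by the tracing study.

/* Sketch: passing MPI-IO hints via MPI_Info at file-open time.
 * The hint values are illustrative assumptions. */
#include <mpi.h>

void open_with_hints(MPI_File *fh) {
    MPI_Info info;
    MPI_Info_create(&info);

    /* Standard ROMIO hints controlling collective buffering. */
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MiB */
    MPI_Info_set(info, "cb_nodes", "8");              /* aggregators */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
}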


Archive | 2014

High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation

Stephen A. Jarvis; Steven A. Wright; Simon D. Hammond

As detailed in recent reports, HPC architectures will continue to change over the next decade in an effort to improve energy efficiency, reliability, and performance. At this time of significant disruption, it is critically important to understand specific application requirements, so that these architectural changes can include features that satisfy the requirements of contemporary extreme-scale scientific applications. To address this need, we have developed a methodology supported by a toolkit that allows us to investigate detailed computation, memory, and communication behaviors of applications at varying levels of resolution. Using this methodology, we performed a broad-based, detailed characterization of 12 contemporary scalable scientific applications and benchmarks. Our analysis reveals numerous behaviors that sometimes contradict conventional wisdom about scientific applications. For example, the results reveal that only one of our applications executes more floating-point instructions than other types of instructions. In another example, we found that communication topologies are very regular, even for applications that, at first glance, should be highly irregular. These observations emphasize the necessity of measurement-driven analysis of real applications, and help prioritize features that should be included in future architectures.
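Instruction-mix data of the kind reported here is usually gathered with hardware performance counters. As a hedged illustration (the authors' own toolkit may well work differently), the sketch below counts floating-point versus total instructions around a region of interest using PAPI's low-level API; kernel() is a hypothetical placeholder, and the PAPI_FP_INS preset is not available on every CPU.

/* Sketch: measuring an instruction mix with PAPI hardware counters.
 * One common approach; not the toolkit described in the paper.
 * Link with -lpapi. */
#include <stdio.h>
#include <papi.h>

static void kernel(void) { /* hypothetical region of interest */ }

int main(void) {
    int es = PAPI_NULL;
    long long counts[2];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_FP_INS);   /* floating-point instructions */
    PAPI_add_event(es, PAPI_TOT_INS);  /* total instructions          */

    PAPI_start(es);
    kernel();
    PAPI_stop(es, counts);

    printf("FP instructions:    %lld\n", counts[0]);
    printf("total instructions: %lld\n", counts[1]);
    return 0;
}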


International Parallel and Distributed Processing Symposium | 2012

LDPLFS: Improving I/O Performance without Application Modification

Steven A. Wright; Simon D. Hammond; Simon J. Pennycook; Iain Miller; John A. Herdman; Stephen A. Jarvis

Input/Output (I/O) operations can represent a significant proportion of run-time when large scientific applications are run in parallel and at scale. In order to address the growing divergence between processing speeds and I/O performance, the Parallel Log-structured File System (PLFS) has been developed by EMC Corporation and the Los Alamos National Laboratory (LANL) to improve the performance of parallel file activities. Currently, PLFS requires the use of either (i) the FUSE Linux Kernel module, (ii) a modified MPI library with a customised ROMIO MPI-IO library, or (iii) an application rewrite to utilise the PLFS API directly. In this paper we present an alternative method of utilising PLFS in applications. This method employs a dynamic library to intercept the low-level POSIX operations and retarget them to use the equivalents offered by PLFS. We demonstrate our implementation of this approach, named LDPLFS, on a set of standard UNIX tools, as well as on a set of standard parallel I/O intensive mini-applications. The results demonstrate almost equivalent performance to a modified build of ROMIO and improvements over the FUSE-based approach. Furthermore, through our experiments we demonstrate decreased performance in PLFS when run at scale on the Lustre file system.
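The interception mechanism underpinning LDPLFS is the standard LD_PRELOAD shim: a shared library defines symbols that shadow the POSIX I/O functions and looks up the real implementations with dlsym(RTLD_NEXT, ...). As a minimal sketch of the pattern, the shadow write() below merely counts bytes before delegating to libc; LDPLFS would instead retarget the operation at the PLFS API, which is omitted here.

/* Sketch of the LD_PRELOAD interception pattern.
 * Build: gcc -shared -fPIC shim.c -o shim.so -ldl
 * Run:   LD_PRELOAD=./shim.so ./app */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

static ssize_t (*real_write)(int, const void *, size_t);
static size_t bytes_written;

ssize_t write(int fd, const void *buf, size_t count) {
    if (!real_write)   /* resolve the next (libc) definition once */
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    bytes_written += count;             /* accounting/tracing hook   */
    return real_write(fd, buf, count);  /* LDPLFS calls PLFS instead */
}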


EPEW'11 Proceedings of the 8th European Conference on Computer Performance Engineering | 2011

Light-Weight Parallel I/O Analysis at Scale

Steven A. Wright; Simon D. Hammond; Simon J. Pennycook; Stephen A. Jarvis

Input/output (I/O) operations can represent a significant proportion of the run-time when large scientific applications are run in parallel. Although there have been advances in the form of file-format libraries, file system design and I/O hardware, a growing divergence exists between the performance of parallel file systems and compute processing rates. In this paper we utilise RIOT, an input/output tracing toolkit being developed at the University of Warwick, to assess the performance of three standard industry I/O benchmarks and mini-applications. We present a case study demonstrating the tracing and analysis capabilities of RIOT at scale, using MPI-IO, Parallel HDF-5 and MPI-IO augmented with the Parallel Log-structured File System (PLFS) middleware being developed by the Los Alamos National Laboratory.
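Tracing toolkits in this space work by interposing on I/O functions and recording timings around them. One standard mechanism at the MPI-IO layer is the MPI profiling interface, sketched below; treat this as an illustration of the general technique, since whether RIOT hooks at the PMPI or POSIX level is not detailed in this abstract.

/* Sketch: timing MPI-IO calls via the MPI profiling interface.
 * Our definition shadows the library's; the real implementation
 * stays reachable under the PMPI_ prefix. */
#include <stdio.h>
#include <mpi.h>

int MPI_File_write(MPI_File fh, const void *buf, int count,
                   MPI_Datatype type, MPI_Status *status) {
    double t0 = MPI_Wtime();
    int rc = PMPI_File_write(fh, buf, count, type, status);
    double t1 = MPI_Wtime();

    fprintf(stderr, "MPI_File_write: %d elements in %.6f s\n",
            count, t1 - t0);
    return rc;
}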


International Green and Sustainable Computing Conference | 2015

POSE: A mathematical and visual modelling tool to guide energy aware code optimisation

Stephen Roberts; Steven A. Wright; David Lecomber; Christopher January; Jonathan M. R. Byrd; Xavier Oró; Stephen A. Jarvis

Performance engineers are beginning to explore software-level optimisation as a means to reduce the energy consumed when running their codes. This paper presents POSE, a mathematical and visual modelling tool which highlights the relationship between runtime and power consumption. POSE allows developers to assess whether power optimisation is worth pursuing for their codes. We demonstrate POSE by studying the power optimisation characteristics of applications from the Mantevo and Rodinia benchmark suites. We show that LavaMD has the most scope for CPU power optimisation, with improvements in Energy Delay Squared Product (ED2P) of up to 30.59%. Conversely, MiniMD offers the least scope, with improvements to the same metric limited to 7.60%. We also show that no power optimised version of MiniMD operating below 2.3 GHz can match the ED2P performance of the original code running at 3.2 GHz. For LavaMD this limit is marginally less restrictive at 2.2 GHz.
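ED2P weights delay more heavily than energy: ED2P = E × t², and since E = P × t this equals P × t³, so a modest slowdown requires a large power reduction to break even. The sketch below makes that arithmetic concrete; the wattages and runtimes are invented for illustration and are not measurements from the paper.

/* Sketch: comparing two code variants on Energy Delay Squared
 * Product, ED2P = E * t^2. All numbers are invented. */
#include <stdio.h>

static double ed2p(double power_w, double runtime_s) {
    double energy_j = power_w * runtime_s;    /* E = P * t      */
    return energy_j * runtime_s * runtime_s;  /* ED2P = E * t^2 */
}

int main(void) {
    /* A baseline at a high clock versus a power-optimised variant
     * at a lower clock: power drops, runtime grows. */
    double base = ed2p(95.0, 10.0);
    double opt  = ed2p(60.0, 13.5);

    printf("baseline  ED2P: %.0f J s^2\n", base);
    printf("optimised ED2P: %.0f J s^2\n", opt);
    /* Because ED2P = P * t^3, a 35% slowdown needs roughly a 2.5x
     * power reduction just to break even. */
    return 0;
}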


EPEW'12 Proceedings of the 9th European Conference on Computer Performance Engineering | 2012

Performance modelling of magnetohydrodynamics codes

Robert F. Bird; Steven A. Wright; David A. Beckingsale; Stephen A. Jarvis

Performance modelling is an important tool utilised by the High Performance Computing industry to accurately predict the run-time of science applications on a variety of different architectures. Performance models aid in procurement decisions and help to highlight areas for possible code optimisations. This paper presents a performance model for a magnetohydrodynamics physics application, Lare. We demonstrate that this model is capable of accurately predicting the run-time of Lare across multiple platforms with an accuracy of 90% (for both strong and weak scaled problems). We then utilise this model to evaluate the performance of future optimisations. The model is generated using SST/macro, the machine level component of the Structural Simulation Toolkit (SST) from Sandia National Laboratories, and is validated on both a commodity cluster located at the University of Warwick and a large scale capability resource located at Lawrence Livermore National Laboratory.
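Analytical models of this kind typically decompose runtime into a compute term scaled by the processor count and a communication term built from network latency and bandwidth. The sketch below shows a generic model of that shape with placeholder parameters; it is a textbook-style illustration, not the published Lare model or an SST/macro simulation.

/* Sketch of a generic analytical runtime model:
 * T = compute + communication. All parameters are placeholders. */
#include <stdio.h>

static double predict(long cells, int procs, int steps,
                      double t_grind,    /* s per cell per step */
                      double latency,    /* network latency (s) */
                      double msg_bytes,  /* halo bytes per step */
                      double bandwidth)  /* bytes per second    */
{
    double t_comp = (double)steps * cells / procs * t_grind;
    double t_comm = steps * (latency + msg_bytes / bandwidth);
    return t_comp + t_comm;
}

int main(void) {
    printf("predicted runtime: %.2f s\n",
           predict(4096L * 4096L, 256, 1000,
                   2.5e-8, 1.5e-6, 32768.0, 2.0e9));
    return 0;
}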


International Supercomputing Conference | 2017

Metrics for Energy-Aware Software Optimisation

Stephen Roberts; Steven A. Wright; Suhaib A. Fahmy; Stephen A. Jarvis

Energy consumption is rapidly becoming a limiting factor in scientific computing. As a result, hardware manufacturers increasingly prioritise energy efficiency in their processor designs. Performance engineers are also beginning to explore software optimisation and hardware/software co-design as a means to reduce energy consumption. Energy efficiency metrics developed by the hardware community are often re-purposed to guide these software optimisation efforts.


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

Towards the Automated Generation of Hard Disk Models through Physical Geometry Discovery

Steven A. Wright; Simon J. Pennycook; Stephen A. Jarvis

As the High Performance Computing industry moves towards the exascale era of computing, parallel scientific and engineering applications are becoming increasingly complex. The use of simulation allows us to predict how an application's performance will change with the adoption of new hardware or software, helping to inform procurement decisions. In this paper, we present a disk simulator designed to predict the performance of read and write operations to a single hard disk drive (HDD). Our simulator uses a geometry discovery benchmark (Diskovery) in order to estimate the data layout of the HDD, as well as the time spent moving the read/write head. We validate our simulator against two different HDDs, using a benchmark designed to simulate common disk read and write patterns, demonstrating accuracy to within 5% of the observed I/O time for sequential operations, and to within 10% of the observed time for seek-heavy workloads.
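At its core, a hard disk simulator reduces each request to a service-time model: seek time, rotational latency, and transfer time. The sketch below is the textbook form of that model with invented constants for a 7200 RPM drive; the paper's simulator instead derives its layout and timing parameters from the geometry discovered by Diskovery.

/* Sketch of a textbook hard-disk service-time model:
 * t = seek + rotational latency + transfer. Constants are invented.
 * Link with -lm. */
#include <math.h>
#include <stdio.h>

static double service_time(double seek_tracks, double bytes) {
    const double full_rot_s = 60.0 / 7200.0; /* one revolution (~8.3 ms) */
    const double media_bw   = 150e6;         /* media transfer, bytes/s  */

    double t = bytes / media_bw;             /* transfer time            */
    if (seek_tracks > 0) {
        /* Seek time is commonly modelled as a + b * sqrt(distance),
         * plus half a revolution of average rotational latency. */
        t += 0.002 + 0.0001 * sqrt(seek_tracks) + full_rot_s / 2.0;
    }
    return t;
}

int main(void) {
    printf("sequential 1 MiB read: %.2f ms\n",
           1e3 * service_time(0, 1 << 20));
    printf("random 4 KiB read after a 10000-track seek: %.2f ms\n",
           1e3 * service_time(10000, 4096));
    return 0;
}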

Collaboration


Dive into Steven A. Wright's collaboration.

Top Co-Authors:

Simon D. Hammond, Sandia National Laboratories
J. A. Herdman, Atomic Weapons Establishment
Satheesh Maheswaran, Atomic Weapons Establishment
I. Miller, Atomic Weapons Establishment