Tom J Deakin
University of Bristol
Publications
Featured research published by Tom J Deakin.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
Tom J Deakin; James Price; Matt J Martineau; Simon N McIntosh-Smith
Many scientific codes consist of memory bandwidth bound kernels, where the dominant factor in runtime is the speed at which data can be loaded from memory into the Arithmetic Logic Units before results are written back to memory. One major advantage of many-core devices such as General Purpose Graphics Processing Units (GPGPUs) and the Intel Xeon Phi is their focus on providing increased memory bandwidth over traditional CPU architectures. However, as with CPUs, this peak memory bandwidth is usually unachievable in practice, and so benchmarks are required to measure a practical upper bound on expected performance.
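As an illustration of the kind of measurement such a benchmark performs, the sketch below times a STREAM-style "triad" kernel and reports achieved bandwidth. This is a minimal, assumed example (array size, repetition count and OpenMP parallelisation are illustrative choices), not the paper's actual benchmark suite.

```cpp
// Minimal sketch of a STREAM-style "triad" bandwidth measurement.
// Illustrative only; sizes and timing details are assumptions.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  const std::size_t N = 1 << 24;            // ~128 MB per array, well beyond cache
  std::vector<double> a(N, 1.0), b(N, 2.0), c(N, 0.0);
  const double scalar = 3.0;

  double best_seconds = 1e30;
  for (int rep = 0; rep < 10; ++rep) {
    auto t0 = std::chrono::high_resolution_clock::now();
    #pragma omp parallel for
    for (std::size_t i = 0; i < N; ++i)
      c[i] = a[i] + scalar * b[i];           // triad: 2 loads + 1 store per element
    auto t1 = std::chrono::high_resolution_clock::now();
    best_seconds = std::min(best_seconds,
        std::chrono::duration<double>(t1 - t0).count());
  }

  // Three arrays of N doubles are moved on every pass of the kernel.
  const double bytes = 3.0 * sizeof(double) * double(N);
  std::printf("Best triad bandwidth: %.1f GB/s\n", bytes / best_seconds / 1e9);
  return 0;
}
```

The best time over several repetitions gives the practical upper bound on streaming bandwidth that real memory-bound kernels can be compared against.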
International Journal of High Performance Computing Applications | 2018
Tom J Deakin; Simon N McIntosh-Smith; Matt J Martineau; Wayne Gaudin
In this paper we demonstrate techniques for increasing the node-level parallelism of a deterministic discrete ordinates neutral particle transport algorithm on a structured mesh to exploit many-core technologies. Transport calculations form a large part of the computational workload of physical simulations and so good performance is vital for the simulations to complete in reasonable time. We will demonstrate our approach utilizing the SNAP mini-app, which gives a simplified implementation of the full transport algorithm but remains similar enough to the real algorithm to act as a useful proxy for research purposes. We present an OpenCL implementation of our improved algorithm which achieves a speedup of up to 2.5 × on a many-core GPGPU device compared to a state-of-the-art multi-core node for the transport sweep, and up to 4 × compared to the multi-core CPUs in the largest GPU enabled supercomputer; the first time this scale of speedup has been achieved for algorithms of this class. We then discuss ways to express our scheme in OpenMP 4.0 and demonstrate the performance on an Intel Knights Corner Xeon Phi compared to the original scheme.
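A hedged sketch of how extra node-level parallelism might be expressed with OpenMP 4.0 target directives is shown below. The loop structure, names and collapse strategy are assumptions for illustration, not the paper's code: the idea is simply that cells on the same sweep wavefront are independent, so the cell and angle loops can be collapsed into one large offloaded parallel region.

```cpp
// Illustrative OpenMP 4.0 offload of one wavefront step of a sweep.
// Cells on the current diagonal plane are independent of each other.
// ncells : cells on the current diagonal plane
// nang   : discrete ordinates (angles) per cell
void sweep_plane(int ncells, int nang, const double *source, double *psi)
{
  #pragma omp target teams distribute parallel for collapse(2) \
              map(to: source[0:ncells*nang])                   \
              map(tofrom: psi[0:ncells*nang])
  for (int c = 0; c < ncells; ++c) {
    for (int a = 0; a < nang; ++a) {
      // Toy update only: a real sweep combines upwind fluxes,
      // cross sections and angular weights here.
      psi[c * nang + a] = 0.5 * (psi[c * nang + a] + source[c * nang + a]);
    }
  }
}
```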
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
Tom J Deakin; Simon N McIntosh-Smith; Wayne Gaudin
Time-dependent deterministic discrete ordinates transport codes are an important class of application which present significant challenges for large, many-core systems. One such challenge is the large memory capacity needed by the solve step, which demands a scalable solution so that enough node-level memory is available to store all the data. In our previous work, we demonstrated the first implementation to show a significant performance benefit for single-node solves using GPUs. In this paper we extend our work to large problems and demonstrate the scalability of our solution on two Petascale GPU-based supercomputers: Titan at Oak Ridge and Piz Daint at CSCS. Our results show that our improved node-level parallelism scheme scales just as well across large systems as previous approaches when using the tried-and-tested KBA domain decomposition technique. We validate our results against an improved performance model which predicts the runtime of the main 'sweep' routine on different hardware, including CPUs and GPUs.
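For context, the widely used KBA sweep model takes roughly the form below. This is a generic sketch from the literature with assumed symbol names; the paper's improved model may differ in its details.

```latex
% Generic KBA sweep cost model (illustrative, not the paper's exact model).
% P_x \times P_y : 2D processor decomposition
% N_c            : spatial chunks swept per processor column
% T_{chunk}      : compute time for one chunk
% T_{comm}       : time to exchange boundary angular fluxes per stage
N_{\text{stages}} \approx N_c + P_x + P_y - 2
\qquad
T_{\text{sweep}} \approx N_{\text{octants}} \, N_{\text{stages}}
  \bigl( T_{\text{chunk}} + T_{\text{comm}} \bigr)
```

The pipeline-fill term P_x + P_y - 2 is why KBA sweeps continue to scale well as long as each processor has enough chunks of work to amortise it.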
International Conference on Cluster Computing | 2015
Tom J Deakin; Simon N McIntosh-Smith; Wayne Gaudin
In this paper we demonstrate techniques for increasing the node-level parallelism of a deterministic discrete ordinates neutral particle transport algorithm on a structured mesh, in order to exploit many-core technologies. Transport calculations form a large part of the computational workload of physical simulations, so good performance is vital for the simulations to complete in reasonable time. We demonstrate our approach using the SNAP mini-app, which gives a simplified implementation of the full transport algorithm but remains close enough to the real algorithm to act as a useful proxy for research purposes. We present an OpenCL implementation of our improved algorithm which achieves a speedup of up to 2.5× for the transport sweep on a many-core GPGPU device compared to a state-of-the-art multi-core node, the first time this scale of speedup has been achieved for algorithms of this class.
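The extra parallelism available to a structured-mesh sweep comes from the fact that, for a sweep direction with all-positive increments, every cell on the diagonal hyperplane i + j + k = constant depends only on cells from earlier planes. The host-side sketch below shows that wavefront ordering; names and the cell update are placeholders, not the paper's OpenCL implementation.

```cpp
// Wavefront ordering over an nx x ny x nz structured mesh for one octant
// whose sweep direction is +x, +y, +z. All cells with i + j + k == plane
// are mutually independent and can be processed concurrently, e.g. as one
// OpenCL NDRange kernel launch per plane.
#include <vector>

struct Cell { int i, j, k; };

std::vector<std::vector<Cell>> build_planes(int nx, int ny, int nz)
{
  std::vector<std::vector<Cell>> planes(nx + ny + nz - 2);
  for (int k = 0; k < nz; ++k)
    for (int j = 0; j < ny; ++j)
      for (int i = 0; i < nx; ++i)
        planes[i + j + k].push_back({i, j, k});
  return planes;
}

void sweep_octant(int nx, int ny, int nz)
{
  const auto planes = build_planes(nx, ny, nz);
  for (const auto &plane : planes) {   // planes must run in order...
    for (const Cell &c : plane) {      // ...but all cells within a plane
      (void)c;                         // are independent: a real code would
    }                                  // launch one device kernel per plane.
  }
}
```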
International Conference on Cluster Computing | 2017
Simon N McIntosh-Smith; Matt J Martineau; Tom J Deakin; Grzegorz Pawelczak; Wayne Gaudin; Paul Garrett; Wei Liu; Richard Smedley-Stevenson; David Beckingsale
Iterative sparse linear solvers are an important class of algorithm in high performance computing, and form a crucial component of many scientific codes. As intra- and inter-node parallelism continues to increase rapidly, the design of new, scalable solvers which can target next-generation architectures becomes increasingly important. In this work we present TeaLeaf, a recent mini-app constructed to explore design-space choices for highly scalable solvers. We then use TeaLeaf to compare the standard CG algorithm with a Chebyshev Polynomially Preconditioned Conjugate Gradient (CPPCG) iterative sparse linear solver. CPPCG is a communication-avoiding algorithm, requiring less global communication than previous approaches. TeaLeaf includes support for many-core processors such as GPUs and Xeon Phi, and we include strong-scaling results across a range of world-leading Petascale supercomputers, including Titan and Piz Daint.
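A hedged sketch of the key ingredient of a polynomially preconditioned CG follows: the preconditioner solve z ≈ M⁻¹r is replaced by a fixed number of Chebyshev iteration steps, which need only matrix-vector products and so add no global reductions to each CG iteration. The matrix interface, names and eigenvalue-bound inputs are assumptions for illustration, not TeaLeaf's implementation.

```cpp
// Chebyshev iteration used as a preconditioner apply: approximately solves
// A z = r from z = 0, given eigenvalue bounds [lmin, lmax] of SPD matrix A.
// Only mat-vecs and AXPY-style updates are needed, so no extra dot products
// (and hence no extra global reductions) are introduced inside CG.
#include <cstddef>
#include <functional>
#include <vector>

using Matvec = std::function<void(const std::vector<double> &,
                                  std::vector<double> &)>;

void chebyshev_apply(const Matvec &A, const std::vector<double> &r_in,
                     std::vector<double> &z,   // output, sized like r_in
                     int m, double lmin, double lmax)
{
  const std::size_t n = r_in.size();
  const double theta = 0.5 * (lmax + lmin);
  const double delta = 0.5 * (lmax - lmin);
  const double sigma = theta / delta;

  std::vector<double> r = r_in;                // running residual of A z = r_in
  std::vector<double> d(n), Ad(n);

  double rho = 1.0 / sigma;
  for (std::size_t i = 0; i < n; ++i) {        // first step: d = r/theta, z = d
    d[i] = r[i] / theta;
    z[i] = d[i];
  }

  for (int k = 1; k < m; ++k) {
    A(d, Ad);                                  // Ad = A * d
    for (std::size_t i = 0; i < n; ++i)
      r[i] -= Ad[i];                           // residual after adding d to z
    const double rho_new = 1.0 / (2.0 * sigma - rho);
    for (std::size_t i = 0; i < n; ++i) {
      d[i] = rho_new * rho * d[i] + (2.0 * rho_new / delta) * r[i];
      z[i] += d[i];
    }
    rho = rho_new;
  }
}
```

Inside a standard preconditioned CG loop, this routine would be called once per iteration in place of the usual preconditioner solve, trading extra local mat-vecs for fewer global communications.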
IEEE International Conference on High Performance Computing, Data, and Analytics | 2017
Tom J Deakin; Wayne Gaudin; Simon N McIntosh-Smith
Kernels with low arithmetic intensity and a memory footprint exceeding cache sizes are typically categorised as memory bandwidth bound, and are therefore limited by hardware memory bandwidth. In this work we contribute a simple memory access pattern, derived from a widely used upwinded stencil-style benchmark, which presents significant challenges for cache-based architectures. The problem appears to grow worse as CPU core counts increase, and the pattern in its initial form shows no benefit from the new high-bandwidth memory now appearing on the Intel Xeon Phi (Knights Landing) family of processors. We describe the memory access scenarios which appear to be causing lower-than-expected cache performance, before presenting optimisations to mitigate the problem. These optimisations improve effective memory bandwidth and runtime by up to 4× on cache-based architectures. Results are presented on Intel Xeon (Broadwell) and Xeon Phi (Knights Landing) processors.
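To make the notion of "low arithmetic intensity" concrete, the sketch below shows a stencil-like update in the same spirit: each inner iteration performs only a handful of floating point operations against several words of memory traffic, including a read-modify-write of a large array. This is an illustrative placeholder kernel under assumed names and sizes, not the paper's benchmark.

```cpp
// Illustrative low-arithmetic-intensity update (placeholder arithmetic).
// q is a large M x N array that is read, updated and written back every
// pass; a, b, c are smaller input streams; sum holds a per-row reduction.
// Roughly 5 flops per inner iteration against several doubles of traffic,
// so the loop is bound by memory bandwidth rather than compute.
#include <cstddef>
#include <vector>

void kernel(std::size_t M, std::size_t N,
            const std::vector<double> &a, const std::vector<double> &b,
            const std::vector<double> &c, std::vector<double> &q,
            std::vector<double> &sum)
{
  for (std::size_t m = 0; m < M; ++m) {        // outer direction/group loop
    double s = 0.0;
    for (std::size_t i = 0; i < N; ++i) {      // streamed inner loop
      const double v = q[m * N + i] + a[i] * b[i] + c[i];
      q[m * N + i] = 0.2 * v;                  // read-modify-write of q
      s += v;
    }
    sum[m] = s;
  }
}
// One common mitigation for patterns like this is to tile the loops so that
// the smaller, reused arrays stay resident in cache while many rows of q
// stream past, reducing the number of times they are re-fetched from DRAM.
```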
Conference on High Performance Computing (Supercomputing) | 2015
Tom J Deakin; Simon N McIntosh-Smith
International Journal of Computational Science and Engineering | 2017
Tom J Deakin; James Price; Matt J Martineau; Simon N McIntosh-Smith
SC17 | 2017
Tom J Deakin; James Price; Simon N McIntosh-Smith
IXPUG Spring Meeting | 2017
Tom J Deakin; John Pennycook; Andrew Mallinson; Wayne Gaudin; Simon N McIntosh-Smith