Publication


Featured research published by Francis P. Russell.


Science of Computer Programming | 2011

DESOLA: An active linear algebra library using delayed evaluation and runtime code generation

Francis P. Russell; Michael R. Mellor; Paul H. J. Kelly; Olav Beckmann

Active libraries can be defined as libraries which play an active part in the compilation, in particular, the optimisation of their client code. This paper explores the implementation of an active dense linear algebra library by delaying evaluation of expressions built using library calls, then generating code at runtime for the compositions that occur. The key optimisations in this context are loop fusion and array contraction. Our prototype C++ implementation, DESOLA, automatically fuses loops arising from different client calls, identifies unnecessary intermediate temporaries, and contracts temporary arrays to scalars. Performance is evaluated using a benchmark suite of linear solvers from ITL (Iterative Template Library), and is compared with MTL (Matrix Template Library), ATLAS (Automatically Tuned Linear Algebra Software) and IMKL (Intel Math Kernel Library). Excluding runtime compilation overheads (caching means they occur only on the first iteration), for larger matrix sizes, performance matches or exceeds MTL; when fusion of matrix operations occurs, performance exceeds that of ATLAS and IMKL.
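
As a rough illustration of the two optimisations named in this abstract (loop fusion and array contraction), the Python sketch below computes alpha*x + y twice: once through separate library-style calls that materialise a temporary vector, and once in a single fused loop where that temporary shrinks to a scalar. The function names are hypothetical and the code is not DESOLA's API or its generated output; it only shows the transformation by hand.

```python
def scale(alpha, x):
    """Library-style call: returns a new vector alpha * x."""
    return [alpha * xi for xi in x]


def add(x, y):
    """Library-style call: returns a new vector x + y."""
    return [xi + yi for xi, yi in zip(x, y)]


def axpy_unfused(alpha, x, y):
    # Two separate loops and one temporary array of len(x) elements.
    tmp = scale(alpha, x)
    return add(tmp, y)


def axpy_fused(alpha, x, y):
    # One loop: the two loop bodies are fused, and the temporary
    # array is contracted to the scalar t.
    out = []
    for xi, yi in zip(x, y):
        t = alpha * xi
        out.append(t + yi)
    return out
```

Both versions return the same vector; the fused one traverses the data once and allocates no intermediate array, which is the effect the abstract attributes to fusing loops across client calls.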


Field-Programmable Custom Computing Machines | 2015

Architectures and Precision Analysis for Modelling Atmospheric Variables with Chaotic Behaviour

Francis P. Russell; Peter D. Düben; Xinyu Niu; Wayne Luk; T. N. Palmer

The computationally intensive nature of atmospheric modelling is an ideal target for hardware acceleration. Performance of hardware designs can be improved through the use of reduced precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced precision optimisation for simulating chaotic systems, targeting atmospheric modelling in which even minor changes in arithmetic behaviour can have a significant impact on system behaviour. Hence, standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide FPGA designs of a chaotic system, and analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainty in input parameters, a single Xilinx Virtex-6 SX475T FPGA can be 13 times faster and 23 times more power efficient than a 6-core Intel Xeon X5650 processor.
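
The Hellinger distance mentioned in this abstract can be sketched for discrete distributions as follows. This is a minimal illustration that assumes the statistics of two runs have already been binned into histograms with identical bins; the binning step and the function names are assumptions, not details taken from the paper.

```python
import math


def normalise(counts):
    """Turn non-negative histogram counts into probabilities."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must contain at least one sample")
    return [c / total for c in counts]


def hellinger(p_counts, q_counts):
    """Hellinger distance between two histograms over the same bins.

    H(P, Q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))**2), in [0, 1].
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must use the same bins")
    p = normalise(p_counts)
    q = normalise(q_counts)
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(s / 2.0)
```

Identical histograms give a distance of 0 and histograms with no overlapping bins give 1, so the value compares long-run statistics rather than individual trajectories, which is why it suits chaotic systems where trajectories diverge.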


Field-Programmable Technology | 2015

Lower precision for higher accuracy: Precision and resolution exploration for shallow water equations

James Stanley Targett; Xinyu Niu; Francis P. Russell; Wayne Luk; Stephen Jeffress; Peter D. Düben

Accurate forecasts of future climate with numerical models of atmosphere and ocean are of vital importance. However, forecast quality is often limited by the available computational power. This paper investigates the acceleration of a C-grid shallow water model through the use of reduced precision targeting FPGA technology. Using a double-gyre scenario, we show that the mantissa length of variables can be reduced to 14 bits without affecting the accuracy beyond the error inherent in the model. Our reduced precision FPGA implementation runs 5.4 times faster than a double precision FPGA implementation, and 12 times faster than a multi-threaded CPU implementation. Moreover, our reduced precision FPGA implementation uses 39 times less energy than the CPU implementation and can compute a 100×100 grid for the same energy that the CPU implementation would take for a 29×29 grid.
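
One way to experiment with a shortened mantissa in software is to round each double-precision value to a given number of fraction bits. The sketch below is an assumption-laden emulation: it keeps the full double-precision exponent range, ignores subnormals, and uses a hypothetical helper name. It is not the method or tooling used in the paper.

```python
import math


def reduce_mantissa(x, bits):
    """Round x to `bits` explicit mantissa (fraction) bits.

    The exponent range of a double is left untouched, so only the
    precision of the significand is reduced.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)          # x == m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (bits + 1)     # implicit leading bit plus `bits` fraction bits
    return math.ldexp(round(m * scale) / scale, e)
```

With `bits=14`, the relative rounding error of any finite non-zero value is at most 2**-15, so applying the helper after each arithmetic operation gives a crude software model of a 14-bit mantissa.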


ACM Transactions on Mathematical Software | 2013

Optimized code generation for finite element local assembly using symbolic manipulation

Francis P. Russell; Paul H. J. Kelly

Automated code generators for finite element local assembly have facilitated exploration of alternative implementation strategies within generated code. However, even for a theoretical performance indicator such as operation count, an optimal strategy for local assembly is unknown. We explore a code generation strategy based on symbolic integration and polynomial common subexpression elimination (CSE). We present our implementation of a local assembly code generator using these techniques. We systematically evaluate the approach, measuring operation count, execution time and numerical error using a benchmark suite of synthetic variational forms, comparing against the FEniCS Form Compiler (FFC). Our benchmark forms span complexities chosen to expose the performance characteristics of different code generation approaches. We show that it is possible, with additional computational cost, to consistently achieve much of, and sometimes substantially exceed, the performance of alternative approaches without compromising precision. Although the approach of using symbolic integration and CSE for optimizing local assembly is not new, we distinguish our work through our strategies for maintaining numerical precision and detecting common subexpressions. We discuss the benefits of the symbolic approach for inferring numerical relationships, and analyze the relationship to other proposed techniques which also have greater computational complexity than those of FFC.


Computer Physics Communications | 2015

Optimised three-dimensional Fourier interpolation: An analysis of techniques and application to a linear-scaling density functional theory code

Francis P. Russell; Karl A. Wilkinson; Paul H. J. Kelly; Chris-Kriton Skylaris

The Fourier interpolation of 3D data-sets is a performance critical operation in many fields, including certain forms of image processing and density functional theory (DFT) quantum chemistry codes based on plane wave basis sets, to which this paper is targeted. In this paper we describe three different algorithms for performing this operation built from standard discrete Fourier transform operations, and derive theoretical operation counts. The algorithms compared consist of the most straightforward implementation and two that exploit techniques such as phase-shifts and knowledge of zero padding to reduce computational cost. Through a library implementation (tintl) we explore the performance characteristics of these algorithms and the performance impact of different implementation choices on actual hardware. We present comparisons within the linear-scaling DFT code ONETEP where we replace the existing interpolation implementation with our library implementation configured to choose the most efficient algorithm. Within the ONETEP Fourier interpolation stages, we demonstrate speed-ups of over 1.55×.
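
The "most straightforward" style of Fourier interpolation, zero padding in the frequency domain, can be sketched in one dimension as below. This is only an illustration under stated assumptions: it uses a naive O(n^2) DFT from the standard library instead of an FFT, works on 1D data rather than 3D grids, and the function names are hypothetical. It is not the tintl API and does not include the phase-shift or padding-aware variants the abstract mentions.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (the inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * t * k / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def fourier_interpolate(x, m):
    """Resample the periodic sequence x onto m >= len(x) points."""
    n = len(x)
    if m < n:
        raise ValueError("m must be at least len(x)")
    spectrum = dft(x)
    padded = [0j] * m
    neg = (n - 1) // 2                  # number of strictly negative frequencies
    for k in range(neg + 1):            # frequencies 0 .. neg
        padded[k] = spectrum[k]
    for k in range(1, neg + 1):         # frequencies -1 .. -neg
        padded[m - k] = spectrum[n - k]
    if n % 2 == 0:                      # split the Nyquist term symmetrically
        nyq = n // 2
        if m == n:
            padded[nyq] = spectrum[nyq]
        else:
            padded[nyq] += spectrum[nyq] / 2
            padded[m - nyq] += spectrum[nyq] / 2
    # The inverse transform divides by m; rescale so amplitudes match x.
    return [v * m / n for v in dft(padded, inverse=True)]
```

For example, interpolating five samples of one cosine period onto ten points reproduces the cosine on the finer grid up to rounding error, with the imaginary parts of the result close to zero.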


Computer Physics Communications | 2017

Exploiting the chaotic behaviour of atmospheric models with reconfigurable architectures

Francis P. Russell; Peter D. Düben; Xinyu Niu; Wayne Luk; T. N. Palmer

Reconfigurable architectures are becoming mainstream: Amazon, Microsoft and IBM are supporting such architectures in their data centres. The computationally intensive nature of atmospheric modelling is an attractive target for hardware acceleration using reconfigurable computing. Performance of hardware designs can be improved through the use of reduced-precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced-precision optimisation for simulating chaotic systems, targeting atmospheric modelling, in which even minor changes in arithmetic behaviour will cause simulations to diverge quickly. The possibility of equally valid simulations having differing outcomes means that standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide reconfigurable designs of a chaotic system, then analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainty in input parameters, the throughput and energy efficiency of a single-precision chaotic system implemented on a Xilinx Virtex-6 SX475T Field Programmable Gate Array (FPGA) can be more than doubled.


Journal of Advances in Modeling Earth Systems | 2015

On the use of programmable hardware and reduced numerical precision in earth-system modeling

Peter D. Düben; Francis P. Russell; Xinyu Niu; Wayne Luk; T. N. Palmer


Computer Physics Communications | 2016

GiMMiK—Generating bespoke matrix multiplication kernels for accelerators: Application to high-order Computational Fluid Dynamics

Bartosz Wozniak; Freddie D. Witherden; Francis P. Russell; Peter E. Vincent; Paul H. J. Kelly


Archive | 2011

An Active-Library Based Investigation into the Performance Optimisation of Linear Algebra and the Finite Element Method

Francis P. Russell


Application-Specific Systems, Architectures, and Processors | 2018

From Tensor Algebra to Hardware Accelerators: Generating Streaming Architectures for Solving Partial Differential Equations

Francis P. Russell; James Stanley Targett; Wayne Luk

Collaboration


Top Co-Authors

Wayne Luk (Imperial College London)

Xinyu Niu (Imperial College London)