Francis P. Russell
Imperial College London
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Francis P. Russell.
Science of Computer Programming | 2011
Francis P. Russell; Michael R. Mellor; Paul H. J. Kelly; Olav Beckmann
Active libraries can be defined as libraries which play an active part in the compilation, in particular, the optimisation of their client code. This paper explores the implementation of an active dense linear algebra library by delaying evaluation of expressions built using library calls, then generating code at runtime for the compositions that occur. The key optimisations in this context are loop fusion and array contraction. Our prototype C++ implementation, DESOLA, automatically fuses loops arising from different client calls, identifies unnecessary intermediate temporaries, and contracts temporary arrays to scalars. Performance is evaluated using a benchmark suite of linear solvers from ITL (Iterative Template Library), and is compared with MTL (Matrix Template Library), ATLAS (Automatically Tuned Linear Algebra) and IMKL (Intel Math Kernel Library). Excluding runtime compilation overheads (caching means they occur only on the first iteration), for larger matrix sizes, performance matches or exceeds MTL; when fusion of matrix operations occurs, performance exceeds that of ATLAS and IMKL.
field-programmable custom computing machines | 2015
Francis P. Russell; Peter D. Düben; Xinyu Niu; Wayne Luk; T. N. Palmer
The computationally intensive nature of atmospheric modelling is an ideal target for hardware acceleration. Performance of hardware designs can be improved through the use of reduced precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced precision optimisation for simulating chaotic systems, targeting atmospheric modelling in which even minor changes in arithmetic behaviour can have a significant impact on system behaviour. Hence, standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide FPGA designs of a chaotic system, and analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainly in input parameters, a single Xilinx Virtex 6 SXT475 FPGA can be 13 times faster and 23 times more power efficient than a 6-core Intel Xeon X5650 processor.
field-programmable technology | 2015
James Stanley Targett; Xinyu Niu; Francis P. Russell; Wayne Luk; Stephen Jeffress; Peter D. Düben
Accurate forecasts of future climate with numerical models of atmosphere and ocean are of vital importance. However, forecast quality is often limited by the available computational power. This paper investigates the acceleration of a C-grid shallow water model through the use of reduced precision targeting FPGA technology. Using a double-gyre scenario, we show that the mantissa length of variables can be reduced to 14 bits without affecting the accuracy beyond the error inherent in the model. Our reduced precision FPGA implementation runs 5.4 times faster than a double precision FPGA implementation, and 12 times faster than a multi-threaded CPU implementation. Moreover, our reduced precision FPGA implementation uses 39 times less energy than the CPU implementation and can compute a 100×100 grid for the same energy that the CPU implementation would take for a 29×29 grid.
ACM Transactions on Mathematical Software | 2013
Francis P. Russell; Paul H. J. Kelly
Automated code generators for finite element local assembly have facilitated exploration of alternative implementation strategies within generated code. However, even for a theoretical performance indicator such as operation count, an optimal strategy for local assembly is unknown. We explore a code generation strategy based on symbolic integration and polynomial common subexpression elimination (CSE). We present our implementation of a local assembly code generator using these techniques. We systematically evaluate the approach, measuring operation count, execution time and numerical error using a benchmark suite of synthetic variational forms, comparing against the FEniCS Form Compiler (FFC). Our benchmark forms span complexities chosen to expose the performance characteristics of different code generation approaches. We show that it is possible with additional computational cost, to consistently achieve much of, and sometimes substantially exceed, the performance of alternative approaches without compromising precision. Although the approach of using symbolic integration and CSE for optimizing local assembly is not new, we distinguish our work through our strategies for maintaining numerical precision and detecting common subexpressions. We discuss the benefits of the symbolic approach for inferring numerical relationships, and analyze the relationship to other proposed techniques which also have greater computational complexity than those of FFC.
Computer Physics Communications | 2015
Francis P. Russell; Karl A. Wilkinson; Paul H. J. Kelly; Chris-Kriton Skylaris
The Fourier interpolation of 3D data-sets is a performance critical operation in many fields, including certain forms of image processing and density functional theory (DFT) quantum chemistry codes based on plane wave basis sets, to which this paper is targeted. In this paper we describe three different algorithms for performing this operation built from standard discrete Fourier transform operations, and derive theoretical operation counts. The algorithms compared consist of the most straightforward implementation and two that exploit techniques such as phase-shifts and knowledge of zero padding to reduce computational cost. Through a library implementation (tintl) we explore the performance characteristics of these algorithms and the performance impact of different implementation choices on actual hardware. We present comparisons within the linear-scaling DFT code ONETEP where we replace the existing interpolation implementation with our library implementation configured to choose the most efficient algorithm. Within the ONETEP Fourier interpolation stages, we demonstrate speed-ups of over 1.55×.
Computer Physics Communications | 2017
Francis P. Russell; Peter D. Düben; Xinyu Niu; Wayne Luk; T. N. Palmer
Abstract Reconfigurable architectures are becoming mainstream: Amazon, Microsoft and IBM are supporting such architectures in their data centres. The computationally intensive nature of atmospheric modelling is an attractive target for hardware acceleration using reconfigurable computing. Performance of hardware designs can be improved through the use of reduced-precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced-precision optimisation for simulating chaotic systems, targeting atmospheric modelling, in which even minor changes in arithmetic behaviour will cause simulations to diverge quickly. The possibility of equally valid simulations having differing outcomes means that standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide reconfigurable designs of a chaotic system, then analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainty in input parameters, the throughput and energy efficiency of a single-precision chaotic system implemented on a Xilinx Virtex-6 SX475T Field Programmable Gate Array (FPGA) can be more than doubled.
Journal of Advances in Modeling Earth Systems | 2015
Peter D. Düben; Francis P. Russell; Xinyu Niu; Wayne Luk; T. N. Palmer
Computer Physics Communications | 2016
Bartosz Wozniak; Freddie D. Witherden; Francis P. Russell; Peter E. Vincent; Paul H. J. Kelly
Archive | 2011
Francis P. Russell
application-specific systems, architectures, and processors | 2018
Francis P. Russell; James Stanley Targett; Wayne Luk