Pavel Burovskiy
Imperial College London
Publication
Featured research published by Pavel Burovskiy.
field-programmable custom computing machines | 2015
Paul Grigoras; Pavel Burovskiy; Eddie Hung; Wayne Luk
Sparse matrix-vector multiplication (SpMV) is an important kernel in many areas of scientific computing, especially as a building block for iterative linear system solvers. We study how lossless nonzero compression can be used to overcome memory bandwidth limitations in FPGA-based SpMV implementations. We introduce a dictionary-based compression algorithm which removes redundant nonzero values to reduce memory bandwidth requirements and, by making use of spare FPGA resources, does so without reducing computational efficiency. We show how a sparse matrix in the CSR format can be converted to the proposed storage format on the CPU, and that average compression ratios of 1.14 to 1.40, and up to 2.65, can be achieved over CSR for relevant matrices in our benchmarks.
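The idea of dictionary-based nonzero compression can be illustrated with a minimal sketch. This is not the storage format from the paper: the function names (`csr_from_dense`, `compress_values`, `spmv_compressed`) are hypothetical, and the sketch only shows the general principle of replacing repeated nonzero values in a CSR matrix with small indices into a table of unique values.

```python
def csr_from_dense(rows):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def compress_values(values):
    """Replace each nonzero by an index into a dictionary of unique values."""
    dictionary = []   # unique nonzero values, in order of first appearance
    index_of = {}     # value -> position in dictionary
    codes = []        # one dictionary index per stored nonzero
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])
    return dictionary, codes


def spmv_compressed(dictionary, codes, col_idx, row_ptr, x):
    """Compute y = A x, decoding each nonzero from the dictionary on the fly."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += dictionary[codes[k]] * x[col_idx[k]]
        y.append(acc)
    return y


if __name__ == "__main__":
    dense = [[2.0, 0.0, 0.0],
             [0.0, 2.0, 1.0],
             [1.0, 0.0, 2.0]]
    values, col_idx, row_ptr = csr_from_dense(dense)
    dictionary, codes = compress_values(values)
    print(dictionary, codes)
    print(spmv_compressed(dictionary, codes, col_idx, row_ptr, [1.0, 2.0, 3.0]))
```

The saving comes from storing each code in fewer bits than a full floating-point value, so any gain depends on how many distinct nonzeros the matrix contains. The lookup step is extra work per nonzero, which is the kind of cost the abstract says can be absorbed by spare FPGA resources.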
field programmable logic and applications | 2014
Gary C.T. Chow; Paul Grigoras; Pavel Burovskiy; Wayne Luk
The conjugate gradient (CG) method is one of the most widely used iterative methods for solving systems of linear equations. However, parallelizing CG for large sparse systems is difficult due to the inherent irregularity of the memory access pattern. We propose a novel processor architecture for the sparse conjugate gradient method. The architecture consists of multiple processing elements and memory banks, and is able to compute efficiently both sparse matrix-vector multiplication and dense vector operations. A Beneš permutation network with an optimised control scheme is introduced to reduce memory bank conflicts without expensive logic. We describe a heuristic for offline scheduling, the effect of which is captured in a parametric model for estimating the performance of designs generated from our approach.
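For readers unfamiliar with the method, the textbook conjugate gradient iteration can be sketched as follows. This is a plain software sketch, not the proposed architecture: the helper names are hypothetical, the matrix is assumed to be symmetric positive definite and stored in CSR form, and the point is only to show that each iteration combines one sparse matrix-vector product with a handful of dense vector operations.

```python
def spmv(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A x for a CSR matrix."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]   # irregular access into x
        y.append(acc)
    return y


def dot(a, b):
    """Dense dot product."""
    return sum(p * q for p, q in zip(a, b))


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A (assumed, not checked)."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = spmv(values, col_idx, row_ptr, p)          # the sparse kernel
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # dense vector updates
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    # A = [[4, 1], [1, 3]] in CSR form, b = [1, 2]
    print(conjugate_gradient([4.0, 1.0, 1.0, 3.0], [0, 1, 0, 1], [0, 2, 4], [1.0, 2.0]))
```

The indexed read `x[col_idx[k]]` in `spmv` is the source of the irregular memory access mentioned in the abstract, while the remaining updates stream through vectors in order.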
international conference on cluster computing | 2013
Jeremy Cohen; David Moxey; Chris D. Cantwell; Pavel Burovskiy; John Darlington; Spencer J. Sherwin
As the capabilities and diversity of computational platforms continue to grow, scientific software is becoming ever more complex in order to target resources effectively. In the libhpc project we are developing a suite of tools and services to simplify job description and execution on heterogeneous infrastructures. This paper presents Nekkloud, a web-based software environment, built on aspects of the libhpc framework, for running the Nektar++ high-order finite element code on both cluster and cloud platforms, while improving the accessibility of the software for end-users and improving the user experience. Nektar++ provides a suite of solvers which span a range of scientific domains, ensuring that Nekkloud has a broad range of use cases. We describe the Nekkloud environment, its use and its ability to target both local campus cluster infrastructure and cloud computing resources, enabling users to make better use of the facilities available to them.
field programmable gate arrays | 2016
Paul Grigoras; Pavel Burovskiy; Wayne Luk
Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific applications. To improve the performance and applicability of FPGA-based SpMV, we propose an approach for exploiting properties of the input matrix to generate optimised custom architectures. The architectures generated by our approach are between 3.8 and 48 times faster than the worst-case architectures for each matrix, showing the benefits of instance-specific design for SpMV.
field programmable logic and applications | 2015
Pavel Burovskiy; Paul Grigoras; Spencer J. Sherwin; Wayne Luk
The Finite Element Method (FEM) is a common numerical technique used for solving Partial Differential Equations (PDEs) on complex domain geometries. Large and unstructured FEM meshes are used to represent the computation domains, which makes an efficient mapping of the Finite Element Method onto FPGAs particularly challenging. The focus of this paper is on assembly mapping, a key kernel of FEM, which induces the sparse and unstructured nature of the problem. We translate FEM vector assembly mapping into data access scheduling to perform vector assembly directly on the FPGA, as part of the hardware pipeline. We show how to efficiently partition the problem into dense and sparse sub-problems which map well onto FPGAs. The proposed approach, implemented on a single FPGA, could outperform highly optimised FEM software running on two Xeon E5-2640 processors.
application-specific systems, architectures, and processors | 2015
Andreea Ingrid Funie; Paul Grigoras; Pavel Burovskiy; Wayne Luk; Mark Salmon
Over the past years, examining financial markets has become a crucial part of both the trading and regulatory processes. Recently, genetic programs have been used to identify patterns in financial markets which may lead to more advanced trading strategies. We investigate the use of Field Programmable Gate Arrays to accelerate the evaluation of the fitness function, which is an important kernel in genetic programming. Our pipelined design makes use of the massive amounts of parallelism available on chip to evaluate the fitness of multiple genetic programs simultaneously. An evaluation of our designs on both synthetic and historical market data shows that our implementation evaluates the fitness function up to 21.56 times faster than a multi-threaded C++11 implementation running on two six-core Intel Xeon E5-2640 processors using OpenMP.
field programmable logic and applications | 2016
Paul Grigoras; Pavel Burovskiy; Wayne Luk; Spencer J. Sherwin
Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with a block diagonal matrix, which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture, combined with the effectiveness of the proposed customisation method, reduces BRAM resource utilisation by as much as 10 times, while achieving throughput identical to existing state-of-the-art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, extending the applicability of FPGAs to more interesting HPC problems.
field programmable logic and applications | 2014
Pavel Burovskiy; Stephen Girdlestone; Craig Davies; Spencer J. Sherwin; Wayne Luk
Most of the efforts in the FPGA community related to sparse linear algebra focus on increasing the degree of internal parallelism in matrix-vector multiply kernels. We propose a parametrisable dataflow architecture presenting an alternative and complementary approach to support acceleration of banded sparse linear algebra problems which benefit from building a Krylov subspace. We use the banded structure of a matrix A to overlap the computations Ax, A^2 x, ..., A^k x by building a pipeline of matrix-vector multiplication processing elements (PEs), each performing A^i x. Due to on-chip data locality, the FLOPS rate sustainable by such a pipeline scales linearly with k. Our approach enables a trade-off between the number k of overlapped matrix power actions and the level of parallelism in a PE. We illustrate our approach for Google PageRank computation by power iteration for large banded single-precision sparse matrices. Our design scales up to 32 sequential PEs with floating-point accumulation and 80 PEs with fixed-point accumulation on a Stratix V D8 FPGA. With 80 single-pipe fixed-point PEs clocked at 160 MHz, our design sustains 12.7 GFLOPS.
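The chain of matrix powers described above can be sketched in software as k matrix-vector stages applied one after another. This is an illustrative sketch only, not the dataflow design: the diagonal-based storage and the names `banded_matvec` and `matrix_powers` are assumptions made for the example, and the stages here run sequentially rather than overlapping as they would in a hardware pipeline.

```python
def banded_matvec(diags, offsets, x):
    """Compute y = A x for a banded matrix stored by diagonals.

    diags[d][i] holds the entry A[i][i + offsets[d]]; entries that would
    fall outside the matrix are ignored.
    """
    n = len(x)
    y = [0.0] * n
    for diag, off in zip(diags, offsets):
        for i in range(n):
            j = i + off
            if 0 <= j < n:
                y[i] += diag[i] * x[j]
    return y


def matrix_powers(diags, offsets, x, k):
    """Return [A x, A^2 x, ..., A^k x]; stage i feeds stage i + 1."""
    out = []
    v = list(x)
    for _ in range(k):
        v = banded_matvec(diags, offsets, v)   # one "processing element"
        out.append(v)
    return out


if __name__ == "__main__":
    # Tridiagonal 3x3 matrix with 2 on the diagonal and 1 beside it.
    diags = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 1.0]]
    offsets = [-1, 0, 1]
    for v in matrix_powers(diags, offsets, [1.0, 0.0, 0.0], 2):
        print(v)
```

Because the matrix is banded, entry i of each output depends only on nearby entries of the input, so a later stage can start consuming results before the earlier stage has finished the whole vector. That locality is what makes overlapping the stages possible.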
signal processing systems | 2018
Andreea Ingrid Funie; Paul Grigoras; Pavel Burovskiy; Wayne Luk; Mark Salmon
Genetic programming can be used to identify complex patterns in financial markets which may lead to more advanced trading strategies. However, the computationally intensive nature of genetic programming makes it difficult to apply to real-world problems, particularly in real-time constrained scenarios. In this work we propose the use of Field Programmable Gate Array technology to accelerate the fitness evaluation step, one of the most computationally demanding operations in genetic programming. We propose a fully pipelined, mixed-precision design using run-time reconfiguration to accelerate fitness evaluation. We show that run-time reconfiguration can reduce resource consumption by a factor of 2 compared to previous solutions on certain configurations. The proposed design is up to 22 times faster than an optimised, multithreaded software implementation while achieving comparable financial returns.
international parallel and distributed processing symposium | 2017
Anna Maria Nestorov; Enrico Reggiani; Hristina Palikareva; Pavel Burovskiy; Tobias Becker; Marco D. Santambrogio
Computational finance is a challenging application domain with ever-increasing performance requirements. Driven by the competition between companies, computational finance pushes High Performance Computing (HPC) technology to its limits. In this paper, we consider Asian options, which are financial derivatives whose payoff is determined by the average price of their underlying asset at predetermined observation points rather than by the single value at expiration time. Due to this path dependency, their pricing is computationally expensive and is therefore a suitable candidate for dataflow acceleration. This paper introduces an application for Asian option pricing based on Curran's approximation method that exploits a dataflow-oriented development approach, employing dedicated optimisations and replacing conventional floating-point with fixed-point formats wherever possible. The implementation targets a Maxeler server-class HPC system consisting of a CPU server node and Maxeler dataflow engines encapsulating Altera Stratix V FPGAs. The application has been evaluated on two different data sets and achieves a speed-up of 111x and 278.3x compared to a single-threaded software implementation, and 4x and 9.2x compared to a multi-threaded software implementation running on a dual-socket CPU server with 12-core Intel Xeon E5-2697 v2 CPUs with up to 48 hyper-threads in total.
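The path dependency mentioned above can be made concrete with a small sketch of the payoff itself. This is not Curran's approximation and not the pricing application from the paper: it assumes an arithmetic-average call option, and the function names are hypothetical. It only contrasts a payoff based on the average of observed prices with one based on the final price alone.

```python
def asian_call_payoff(observed_prices, strike):
    """Payoff of an arithmetic-average Asian call (assumed option type)."""
    average = sum(observed_prices) / len(observed_prices)
    return max(average - strike, 0.0)


def european_call_payoff(final_price, strike):
    """Payoff of a plain European call, which ignores the price path."""
    return max(final_price - strike, 0.0)


if __name__ == "__main__":
    path = [100.0, 110.0, 120.0]   # prices at the observation points
    print(asian_call_payoff(path, 105.0))
    print(european_call_payoff(path[-1], 105.0))
```

Pricing such an option requires the distribution of the average over many possible paths rather than of a single terminal value, which is why approximation methods and hardware acceleration are of interest.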