Paul Grigoras | Researchain

Publication

Featured research published by Paul Grigoras.

field-programmable custom computing machines | 2015

Accelerating SpMV on FPGAs by Compressing Nonzero Values

Paul Grigoras; Pavel Burovskiy; Eddie Hung; Wayne Luk

Sparse matrix vector multiplication (SpMV) is an important kernel in many areas of scientific computing, especially as a building block for iterative linear system solvers. We study how lossless nonzero compression can be used to overcome memory bandwidth limitations in FPGA-based SpMV implementations. We introduce a dictionary-based compression algorithm which reduces redundant nonzero values to improve memory bandwidth without reducing computation efficiency by making use of spare FPGA resources. We show how a sparse matrix in the CSR format can be converted to the proposed storage format on the CPU and that average compression ratios of 1.14 - 1.40 and up to 2.65 times can be achieved, over CSR, for relevant matrices in our benchmarks.
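The general idea of dictionary-coding repeated nonzero values in a CSR matrix can be illustrated with a minimal Python sketch. This is not the paper's storage format or hardware design; the function names `compress_values` and `spmv_csr_compressed`, and the small example matrix, are assumptions made purely for illustration.

```python
def compress_values(values):
    """Replace each nonzero with an index into a table of unique values.

    Hypothetical helper: a plain software dictionary, not the paper's format.
    """
    dictionary = []   # unique nonzero values, in order of first appearance
    index_of = {}     # value -> position in `dictionary`
    codes = []        # one small integer code per stored nonzero
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])
    return dictionary, codes


def spmv_csr_compressed(row_ptr, col_idx, dictionary, codes, x):
    """Compute y = A @ x for a CSR matrix whose values are dictionary-coded."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Decode the value on the fly, then multiply-accumulate.
            y[row] += dictionary[codes[k]] * x[col_idx[k]]
    return y


# Example matrix (assumed): [[2, 0, 1], [0, 2, 0], [1, 0, 2]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 2.0, 1.0, 2.0]

dictionary, codes = compress_values(values)
y = spmv_csr_compressed(row_ptr, col_idx, dictionary, codes, [1.0, 2.0, 3.0])
```

In this toy case five stored values shrink to a two-entry dictionary plus five small codes, which is where a bandwidth saving could come from when codes are narrower than the original values.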

field programmable logic and applications | 2014

An efficient sparse conjugate gradient solver using a Beneš permutation network

Gary C.T. Chow; Paul Grigoras; Pavel Burovskiy; Wayne Luk

The conjugate gradient (CG) is one of the most widely used iterative methods for solving systems of linear equations. However, parallelizing CG for large sparse systems is difficult due to the inherent irregularity in memory access pattern. We propose a novel processor architecture for the sparse conjugate gradient method. The architecture consists of multiple processing elements and memory banks, and is able to compute efficiently both sparse matrix-vector multiplication and other dense vector operations. A Beneš permutation network with an optimised control scheme is introduced to reduce memory bank conflicts without expensive logic. We describe a heuristic for offline scheduling, the effect of which is captured in a parametric model for estimating the performance of designs generated from our approach.
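For readers unfamiliar with the method, the textbook conjugate gradient iteration can be sketched in a few lines of Python. This is the standard algorithm for a symmetric positive definite system, written with dense lists for brevity; it does not model the proposed architecture, the permutation network, or the scheduling heuristic. The helper names and the 2x2 example system are assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    """Dense matrix-vector product; a sparse solver would use SpMV here."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, which equals b when x = 0
    p = list(r)          # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break        # residual norm is small enough
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small assumed example: exact solution is [1/11, 7/11].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
```

Each iteration performs one matrix-vector product plus a handful of dot products and vector updates, which matches the abstract's split between SpMV and dense vector operations.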

applied reconfigurable computing | 2014

HARNESS Project: Managing Heterogeneous Computing Resources for a Cloud Platform

José Gabriel F. Coutinho; Oliver Pell; E. O’Neill; Peter Sanders; John McGlone; Paul Grigoras; Wayne Luk; Carmelo Ragusa

Most cloud service offerings are based on homogeneous commodity resources, such as large numbers of inexpensive machines interconnected by off-the-shelf networking equipment and disk drives, to provide low-cost application hosting. However, cloud service providers have reached a limit in satisfying performance and cost requirements for important classes of applications, such as geo-exploration and real-time business analytics. The HARNESS project aims to fill this gap by developing architectural principles that enable the next generation cloud platforms to incorporate heterogeneous technologies such as reconfigurable Dataflow Engines (DFEs), programmable routers, and SSDs, and provide as a result vastly increased performance, reduced energy consumption, and lower cost profiles. In this paper we focus on three challenges for supporting heterogeneous computing resources in the context of a cloud platform, namely: (1) cross-optimisation of heterogeneous computing resources, (2) resource virtualisation and (3) programming heterogeneous platforms.

international symposium on parallel and distributed processing and applications | 2014

Elastic Management of Reconfigurable Accelerators

Paul Grigoras; Max Tottenham; Xinyu Niu; José Gabriel F. Coutinho; Wayne Luk

This paper presents a runtime system for reconfigurable accelerators that supports elastic management: it enables effective sharing of accelerator resources across multiple applications. For each application, this runtime system allocates an appropriate amount of resources to satisfy its quality-of-service requirements, while minimising the overall execution time for a collection of applications. The effectiveness of this runtime system is due to a set of scheduling algorithms and strategies customised for different types of workloads. We demonstrate our approach by implementing a dynamic Monte Carlo bond options pricing design.

field programmable gate arrays | 2016

CASK: Open-Source Custom Architectures for Sparse Kernels

Paul Grigoras; Pavel Burovskiy; Wayne Luk

Sparse matrix vector multiplication (SpMV) is an important kernel in many scientific applications. To improve the performance and applicability of FPGA-based SpMV, we propose an approach for exploiting properties of the input matrix to generate optimised custom architectures. The architectures generated by our approach are between 3.8 and 48 times faster than the worst-case architectures for each matrix, showing the benefits of instance-specific design for SpMV.

field programmable logic and applications | 2015

Efficient assembly for high order unstructured FEM meshes

Pavel Burovskiy; Paul Grigoras; Spencer J. Sherwin; Wayne Luk

The Finite Element Method (FEM) is a common numerical technique used for solving Partial Differential Equations (PDEs) on complex domain geometries. Large and unstructured FEM meshes are used to represent the computation domains, which makes an efficient mapping of the Finite Element Method onto FPGAs particularly challenging. The focus of this paper is on assembly mapping, a key kernel of FEM, which induces the sparse and unstructured nature of the problem. We translate FEM vector assembly mapping into data access scheduling to perform vector assembly directly on the FPGA, as part of the hardware pipeline. We show how to efficiently partition the problem into dense and sparse sub-problems which map well onto FPGAs. The proposed approach, implemented on a single FPGA, could outperform highly optimised FEM software running on two Xeon E5-2640 processors.
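As a rough software illustration of what vector assembly means, the sketch below scatter-adds per-element local vectors into a global vector through a local-to-global index map. It is a generic FEM idiom, not the paper's scheduling scheme; the function name `assemble_vector`, the two-element mesh, and the index map are assumptions.

```python
def assemble_vector(local_vectors, local_to_global, n_global):
    """Sum element-local contributions into a global vector (scatter-add)."""
    global_vec = [0.0] * n_global
    for elem, local in enumerate(local_vectors):
        for i, value in enumerate(local):
            # Degrees of freedom shared between elements accumulate here.
            global_vec[local_to_global[elem][i]] += value
    return global_vec


# Two assumed 1D elements sharing global node 1.
local_vectors = [[1.0, 2.0], [3.0, 4.0]]
local_to_global = [[0, 1], [1, 2]]
g = assemble_vector(local_vectors, local_to_global, 3)
```

The irregular, data-dependent writes into `global_vec` are the kind of access pattern that makes this step hard to pipeline without a precomputed schedule.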

application-specific systems, architectures, and processors | 2015

Reconfigurable acceleration of fitness evaluation in trading strategies

Andreea Ingrid Funie; Paul Grigoras; Pavel Burovskiy; Wayne Luk; Mark Salmon

Over the past years, examining financial markets has become a crucial part of both the trading and regulatory processes. Recently, genetic programs have been used to identify patterns in financial markets which may lead to more advanced trading strategies. We investigate the use of Field Programmable Gate Arrays to accelerate the evaluation of the fitness function, which is an important kernel in genetic programming. Our pipelined design makes use of the massive amounts of parallelism available on chip to evaluate the fitness of multiple genetic programs simultaneously. An evaluation of our designs on both synthetic and historical market data shows that our implementation evaluates the fitness function up to 21.56 times faster than a multi-threaded C++11 implementation running on two six-core Intel Xeon E5-2640 processors using OpenMP.

application specific systems architectures and processors | 2013

Aspect driven compilation for dataflow designs

Paul Grigoras; Xinyu Niu; José Gabriel F. Coutinho; Wayne Luk; Jacob A. Bower; Oliver Pell

This paper proposes a novel hardware compilation approach targeting dataflow designs. This approach is based on aspect-oriented programming to decouple design development from design optimisation, thus improving portability and developer productivity while enabling automated exploration of design trade-offs to enhance performance. We introduce FAST, a language for specifying dataflow designs that supports our approach. Optimisation strategies for the generated designs are specified in FAST, making use of facilities in the domain-specific aspect-oriented language, LARA. Our approach is demonstrated by implementing various seismic imaging designs for Reverse Time Migration (RTM), which have performance comparable to state-of-the-art FPGA implementations while being produced with improved developer productivity.

field programmable logic and applications | 2016

Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA

Paul Grigoras; Pavel Burovskiy; Wayne Luk; Spencer J. Sherwin

Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with a block diagonal matrix, which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture combined with the effectiveness of the proposed customisation method reduces BRAM resource utilisation by as much as 10 times, while achieving identical throughput to existing state-of-the-art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, enabling the applicability of FPGAs to more interesting HPC problems.
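To show why block diagonal structure is attractive, here is a minimal Python sketch of a matrix-vector product in which only the square diagonal blocks are stored. It is an assumed software analogue, not the proposed architecture or its customisation method; the name `block_diag_matvec` and the example blocks are hypothetical.

```python
def block_diag_matvec(blocks, x):
    """Multiply a block diagonal matrix by x, given its square dense blocks."""
    y = []
    offset = 0
    for block in blocks:
        n = len(block)                    # block is n x n (assumed square)
        segment = x[offset:offset + n]    # only this slice of x is needed
        for row in block:
            y.append(sum(a * v for a, v in zip(row, segment)))
        offset += n
    return y


# Assumed example: a 2x2 block followed by a 1x1 block.
blocks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0]]]
x = [1.0, 1.0, 2.0]
y = block_diag_matvec(blocks, x)
```

Because each block touches only its own contiguous slice of the input vector, no column indices need to be stored and the blocks can be processed independently.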

signal processing systems | 2018

Run-time Reconfigurable Acceleration for Genetic Programming Fitness Evaluation in Trading Strategies

Andreea Ingrid Funie; Paul Grigoras; Pavel Burovskiy; Wayne Luk; Mark Salmon

Genetic programming can be used to identify complex patterns in financial markets which may lead to more advanced trading strategies. However, the computationally intensive nature of genetic programming makes it difficult to apply to real world problems, particularly in real-time constrained scenarios. In this work we propose the use of Field Programmable Gate Array technology to accelerate the fitness evaluation step, one of the most computationally demanding operations in genetic programming. We propose to develop a fully-pipelined, mixed precision design using run-time reconfiguration to accelerate fitness evaluation. We show that run-time reconfiguration can reduce resource consumption by a factor of 2 compared to previous solutions on certain configurations. The proposed design is up to 22 times faster than an optimised, multithreaded software implementation while achieving comparable financial returns.
