David Boland | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where David Boland is active.

Explore More

Publication

Featured researches published by David Boland.

field-programmable logic and applications | 2008

An FPGA-based implementation of the MINRES algorithm

David Boland; George A. Constantinides

Due to continuous improvements in the resources available on FPGAs, it is becoming increasingly possible to accelerate floating point algorithms. The solution of a system of linear equations forms the basis of many problems in engineering and science, but its calculation is highly time consuming. The minimum residual algorithm (MINRES) is one method to solve this problem, and is highly effective provided the matrix exhibits certain characteristics. This paper examines an IEEE 754 single precision floating point implementation of the MINRES algorithm on an FPGA. It demonstrates that through parallelisation and heavy pipelining of all floating point components it is possible to achieve a sustained performance of up to 53 GFLOPS on the Virtex5-330T. This compares favourably to other hardware implementations of floating point matrix inversion algorithms, and corresponds to an improvement of nearly an order of magnitude compared to a software implementation.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2011

Bounding Variable Values and Round-Off Effects Using Handelman Representations

David Boland; George A. Constantinides

The precision used in an algorithm affects the error and performance of individual computations, the memory usage, and the potential parallelism for a fixed hardware budget. This paper describes a new method to determine the minimum precision required to meet a given error specification for an algorithm consisting of the basic algebraic operations. Using this approach, it is possible to significantly reduce the computational word-length in comparison to existing methods, and this can lead to superior hardware designs. We demonstrate the proposed procedure on an iteration of the conjugate gradient algorithm, achieving proofs of bounds that can translate to global word-length savings ranging from a few bits to proving the existence of ranges that must otherwise be assumed to be unbounded when using competing approaches. We also achieve comparable bounds to recent literature in a small fraction of the execution time, with greater scalability.

field-programmable custom computing machines | 2010

Automated Precision Analysis: A Polynomial Algebraic Approach

David Boland; George A. Constantinides

When migrating an algorithm onto hardware, the potential saving that can be obtained by tuning the precision used in the algorithm to meet a range or error specification is often overlooked; the major reason is that it is hard to choose a number system which can guarantee any such specification can be met. Instead, the problem is mitigated by opting to use IEEE standard single or double precision so as to be ‘no worse’ than a software implementation. However, the flexibility in the number representation is one of the key factors that can only be exploited on FPGAs, unlike GPUs and general purpose processors, and hence ignoring this potential significantly limits the performance achievable on an FPGA. To this end, this paper describes a tool which analyses algorithms with given input ranges under a finite precision to provide information that could be used to tune the hardware to the algorithm specifications. We demonstrate the proposed procedure on an iteration of the conjugate gradient algorithm, achieving a reduction in slices of over 40% when meeting the same error specification found by traditional methods. We also show it achieves comparable bounds to recent literature in a small fraction of the execution time, with greater scalability.

field programmable gate arrays | 2012

A scalable approach for automated precision analysis

David Boland; George A. Constantinides

The freedom over the choice of numerical precision is one of the key factors that can only be exploited throughout the datapath of an FPGA accelerator, providing the ability to trade the accuracy of the final computational result with the silicon area, power, operating frequency, and latency. However, in order to tune the precision used throughout hardware accelerators automatically, a tool is required to verify that the hardware will meet an error or range specification for a given precision. Existing tools to perform this task typically suffer either from a lack of tightness of bounds or require a large execution time when applied to large scale algorithms; in this work, we propose an approach that can both scale to larger examples and obtain tighter bounds, within a smaller execution time, than the existing methods. The approach we describe also provides a user with the ability to trade the quality of bounds with execution time of the procedure, making it suitable within a word-length optimization framework for both small and large-scale algorithms. We demonstrate the use of our approach on instances of iterative algorithms to solve a system of linear equations. We show that because our approach can track how the relative error decreases with increasing precision, unlike the existing methods, we can use it to create smaller hardware with guaranteed numerical properties. This results in a saving of 25% of the area in comparison to optimizing the precision using competing analytical techniques, whilst requiring a smaller execution time than the these methods, and saving almost 80% of area in comparison to adopting IEEE double precision arithmetic.

ACM Transactions on Reconfigurable Technology and Systems | 2011

Optimizing memory bandwidth use and performance for matrix-vector multiplication in iterative methods

David Boland; George A. Constantinides

Computing the solution to a system of linear equations is a fundamental problem in scientific computing, and its acceleration has drawn wide interest in the FPGA community [Morris et al. 2006; Zhang et al. 2008; Zhuo and Prasanna 2006]. One class of algorithms to solve these systems, iterative methods, has drawn particular interest, with recent literature showing large performance improvements over General-Purpose Processors (GPPs) [Lopes and Constantinides 2008]. In several iterative methods, this performance gain is largely a result of parallelization of the matrix-vector multiplication, an operation that occurs in many applications and hence has also been widely studied on FPGAs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006]. However, whilst the performance of matrix-vector multiplication on FPGAs is generally I/O bound [Zhuo and Prasanna 2005], the nature of iterative methods allows the use of on-chip memory buffers to increase the bandwidth, providing the potential for significantly more parallelism [deLorimier and DeHon 2005]. Unfortunately, existing approaches have generally only either been capable of solving large matrices with limited improvement over GPPs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006; deLorimier and DeHon 2005], or achieve high performance for relatively small matrices [Lopes and Constantinides 2008; Boland and Constantinides 2008]. This article proposes hardware designs to take advantage of symmetrical and banded matrix structure, as well as methods to optimize the RAM use, in order to both increase the performance and retain this performance for larger-order matrices.

IEEE Transactions on Multimedia | 2013

A Scalable Precision Analysis Framework

David Boland; George A. Constantinides

In embedded computing, typically some form of silicon area or power budget restricts the potential performance achievable. For algorithms with limited dynamic range, custom hardware accelerators manage to extract significant additional performance for such a budget via mapping operations in the algorithm to fixed-point. However, for complex applications requiring floating-point computation, the potential performance improvement over software is reduced. Nonetheless, custom hardware can still customize the precision of floating-point operators, unlike software which is restricted to IEEE standard single or double precision, to increase the overall performance at the cost of increasing the error observed in the final computational result. Unfortunately, because it is difficult to determine if this error increase is tolerable, this task is rarely performed. We present a new analytical technique to calculate bounds on the range or relative error of output variables, enabling custom hardware accelerators to be tolerant of floating point errors by design. In contrast to existing tools that perform this task, our approach scales to larger examples and obtains tighter bounds, within a smaller execution time. Furthermore, it allows a user to trade the quality of bounds with execution time of the procedure, making it suitable for both small and large-scale algorithms.

field-programmable custom computing machines | 2013

Accuracy-Performance Tradeoffs on an FPGA through Overclocking

Kan Shi; David Boland; George A. Constantinides

Embedded applications can often demand stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantization error into the design. In this paper, we demonstrate that for many applications it would be preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they will cause less noise than quantization errors. Through the use of analytical models and empirical results on a Xilinx Virtex-6 FPGA, we show that a geometric mean reduction of 67.9% to 98.8% in error expectation or a geometric mean improvement of 3.1% to 27.6% in operating frequency can be obtained using this alternative design methodology.

design automation conference | 2014

Datapath Synthesis for Overclocking: Online Arithmetic for Latency-Accuracy Trade-offs

Kan Shi; David Boland; Edward A. Stott; Samuel Bayliss; George A. Constantinides

Digital circuits are currently designed to ensure timing closure. Releasing this constraint by allowing timing violations could lead to significant performance improvements, but conventional forms of computer arithmetic do not fail gracefully when pushed beyond deterministic operation. In this paper we take a fresh look at Online Arithmetic, originally proposed for digit serial operation, and synthesize unrolled digit parallel online operators to allow for graceful degradation. We quantify the impact of timing violation on key arithmetic primitives, and show that substantial performance benefits can be obtained in comparison to binary arithmetic. Since timing errors are caused by long carry chains, these result in errors in least significant digits with online arithmetic, causing less impact than conventional implementations. Using analytical models and empirical FPGA results from an image processing application, we demonstrate an error reduction over 89% and an improvement in SNR of over 20dB for the same clock rate.

field programmable gate arrays | 2013

Word-length optimization beyond straight line code

David Boland; George A. Constantinides

The silicon area benefits that result from word-length optimization have been widely reported by the FPGA community. However, to date, most approaches are restricted to straight line code, or code that can be converted into straight line code using techniques such as loop-unrolling. In this paper, we take the first steps towards creating analytical techniques to optimize the precision used throughout custom FPGA accelerators for algorithms that contain loops with data dependent exit conditions. To achieve this, we build on ideas emanating from the software verification community to prove program termination. Our idea is to apply word-length optimization techniques to find the minimum precision required to guarantee that a loop with data dependent exit conditions will terminate. Without techniques to analyze algorithms containing these types of loops, a hardware designer may elect to implement every arithmetic operator throughout a custom FPGA-based accelerator using IEEE-754 standard single or double precision arithmetic. With this approach, the FPGA accelerator would have comparable accuracy to a software implementation. However, we show that using our new technique to create custom fixed and floating point designs, we can obtain silicon area savings of up to 50% over IEEE standard single precision arithmetic, or 80% over IEEE standard double precision arithmetic, at the same time as providing guarantees that the created hardware designs will work in practice.

applied reconfigurable computing | 2010

Optimising memory bandwidth use for matrix-vector multiplication in iterative methods

David Boland; George A. Constantinides

Computing the solution to a system of linear equations is a fundamental problem in scientific computing, and its acceleration has drawn wide interest in the FPGA community [1, 2, 3]. One class of algorithms to solve these systems, iterative methods, has drawn particular interest, with recent literature showing large performance improvements over general purpose processors (GPPs). In several iterative methods, this performance gain is largely a result of parallelisation of the matrixvector multiplication, an operation that occurs in many applications and hence has also been widely studied on FPGAs [4, 5]. However, whilst the performance of matrix-vector multiplication on FPGAs is generally I/O bound [4], the nature of iterative methods allows the use of onchip memory buffers to increase the bandwidth, providing the potential for significantly more parallelism [6]. Unfortunately, existing approaches have generally only either been capable of solving large matrices with limited improvement over GPPs [4,5,6], or achieve high performance for relatively small matrices [2,3]. This paper proposes hardware designs to take advantage of symmetrical and banded matrix structure, as well as methods to optimise the RAM use, in order to both increase the performance and retain this performance for larger order matrices.

Explore More