
Publication


Featured research published by Nicholas J. Fraser.


Field-Programmable Gate Arrays | 2017

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

Yaman Umuroglu; Nicholas J. Fraser; Giulio Gambardella; Michaela Blott; Philip Heng Wai Leong; Magnus Jahre; Kees A. Vissers

Research has shown that convolutional neural networks contain significant redundancy, and that high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 μs latency on the MNIST dataset at 95.8% accuracy, and 21,906 image classifications per second with 283 μs latency on the CIFAR-10 and SVHN datasets at 80.1% and 94.9% accuracy, respectively. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.
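
As a rough illustration of why binarization maps so well to hardware, the sketch below shows the XNOR-popcount arithmetic that replaces multiply-accumulate once weights and activations are constrained to {-1, +1}; the bit-packing convention and function name are illustrative, not FINN's actual API.

```python
# Minimal sketch (not FINN itself): with {-1, +1} values packed as bits
# (1 -> +1, 0 -> -1), a dot product becomes an XNOR followed by a popcount.
def binary_dot(w_bits: int, x_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors packed as n-bit integers."""
    matches = bin(~(w_bits ^ x_bits) & ((1 << n) - 1)).count("1")  # XNOR + popcount
    return 2 * matches - n  # each matching bit contributes +1, each mismatch -1

# w = [+1, -1, +1, +1] -> 0b1011, x = [+1, +1, -1, +1] -> 0b1101
assert binary_dot(0b1011, 0b1101, 4) == 0  # (+1) + (-1) + (-1) + (+1) = 0
```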


High-Performance Embedded Architectures and Compilers | 2017

Scaling Binarized Neural Networks on Reconfigurable Logic

Nicholas J. Fraser; Yaman Umuroglu; Giulio Gambardella; Michaela Blott; Philip Heng Wai Leong; Magnus Jahre; Kees A. Vissers

Binarized neural networks (BNNs) are gaining interest in the deep learning community due to their significantly lower computational and memory cost. They are particularly well suited to reconfigurable logic devices, which contain an abundance of fine-grained compute resources and can yield smaller, lower-power implementations, or conversely higher classification rates. Towards this end, the FINN framework was recently proposed for building fast and flexible field programmable gate array (FPGA) accelerators for BNNs. FINN utilized a novel set of optimizations that enable efficient mapping of BNNs to hardware and implemented fully connected, non-padded convolutional and pooling layers, with per-layer compute resources tailored to user-provided throughput requirements. However, FINN was not evaluated on larger topologies due to the size of the chosen FPGA, and exhibited decreased accuracy due to the lack of padding. In this paper, we improve upon FINN to show how padding can be employed on BNNs while still maintaining a 1-bit datapath and high accuracy. Based on this technique, we present numerous experiments illustrating the flexibility and scalability of the approach. In particular, we show that a large BNN requiring 1.2 billion operations per frame running on an ADM-PCIE-8K5 platform can classify images at 12 kFPS with 671 μs latency while drawing less than 41 W board power and classifying CIFAR-10 images at 88.7% accuracy. Our implementation of this network achieves 14.8 trillion operations per second. We believe this is the fastest classification rate reported to date on this benchmark at this level of accuracy.
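
A minimal sketch of the padding idea, assuming it amounts to padding binarized feature maps with one of the two representable values (here -1) rather than 0, so the convolution datapath stays one bit wide; the exact scheme used in the paper may differ in detail.

```python
import numpy as np

def pad_binary(fmap: np.ndarray, pad: int = 1) -> np.ndarray:
    """Pad a {-1, +1} feature map with -1 instead of 0, keeping a 1-bit encoding."""
    return np.pad(fmap, pad, mode="constant", constant_values=-1)

fmap = np.array([[1, -1],
                 [-1, 1]])
print(pad_binary(fmap))  # 4x4 map whose border is entirely -1
```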


Field-Programmable Technology | 2013

A low latency kernel recursive least squares processor using FPGA technology

Yeyong Pang; Shaojun Wang; Yu Peng; Nicholas J. Fraser; Philip Heng Wai Leong

The kernel recursive least squares (KRLS) algorithm performs non-linear regression in an online manner, with similar computational requirements to linear techniques. In this paper, an implementation of the KRLS algorithm is described that utilises pipelining and vectorisation for performance, and microcoding for reusability. The design can be scaled to allow tradeoffs between capacity, performance and area. Compared with a central processing unit (CPU) and digital signal processor (DSP), the processor improves on execution time, latency and energy consumption by factors of 5, 5 and 12, respectively.
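
For context, the sketch below shows a plain (non-sparsified) kernel RLS update in the spirit of the algorithm the processor implements: the inverse regularised kernel matrix and the weights are grown by a block-inverse update per sample, so no system is re-solved from scratch. Dictionary sparsification and the hardware-specific pipelining and vectorisation are omitted, and all names and parameters are illustrative.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

class KRLS:
    def __init__(self, gamma=1.0, reg=1e-3):
        self.gamma, self.reg = gamma, reg
        self.X, self.alpha, self.P = [], None, None

    def predict(self, x):
        if not self.X:
            return 0.0
        k = np.array([rbf(x, xi, self.gamma) for xi in self.X])
        return float(k @ self.alpha)

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        if not self.X:
            ktt = rbf(x, x, self.gamma) + self.reg
            self.X, self.P, self.alpha = [x], np.array([[1.0 / ktt]]), np.array([y / ktt])
            return
        k = np.array([rbf(x, xi, self.gamma) for xi in self.X])
        z = self.P @ k
        r = rbf(x, x, self.gamma) + self.reg - k @ z   # Schur complement
        e = y - k @ self.alpha                          # a-priori error
        # block-inverse growth of P and the matching alpha update
        self.P = np.block([[self.P + np.outer(z, z) / r, -z[:, None] / r],
                           [-z[None, :] / r, np.array([[1.0 / r]])]])
        self.alpha = np.append(self.alpha - z * e / r, e / r)
        self.X.append(x)
```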


Field-Programmable Logic and Applications | 2015

A fully pipelined kernel normalised least mean squares processor for accelerated parameter optimisation

Nicholas J. Fraser; Duncan J. M. Moss; JunKyu Lee; Stephen Tridgell; Craig Jin; Philip Heng Wai Leong

Kernel adaptive filters (KAFs) are online machine learning algorithms which are amenable to highly efficient streaming implementations. They require only a single pass through the data during training and can act as universal approximators, i.e. approximate any continuous function with arbitrary accuracy. KAFs are members of a family of kernel methods which apply an implicit nonlinear mapping of input data to a high dimensional feature space, permitting learning algorithms to be expressed entirely as inner products. Such an approach avoids explicit projection into the feature space, enabling computational efficiency. In this paper, we propose the first fully pipelined floating point implementation of the kernel normalised least mean squares algorithm for regression. Independent training tasks necessary for parameter optimisation fill L cycles of latency ensuring the pipeline does not stall. Together with other optimisations to reduce resource utilisation and latency, our core achieves 160 GFLOPS on a Virtex 7 XC7VX485T FPGA, and the PCI-based system implementation is 70× faster than an optimised software implementation on a desktop processor.
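
As a point of reference for the algorithm being accelerated, below is a minimal software sketch of a kernel normalised LMS update with a simple coherence rule for growing the dictionary; the parameter names (eta, eps, mu0) and the dictionary policy are illustrative assumptions, not the configuration of the FPGA core.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

class KNLMS:
    def __init__(self, gamma=1.0, eta=0.5, eps=1e-2, mu0=0.9):
        self.gamma, self.eta, self.eps, self.mu0 = gamma, eta, eps, mu0
        self.dict, self.alpha = [], np.zeros(0)

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        if not self.dict:
            self.dict, self.alpha = [x], np.zeros(1)
        k = np.array([rbf(x, c, self.gamma) for c in self.dict])
        if k.max() <= self.mu0:                      # coherence test: grow the dictionary
            self.dict.append(x)
            self.alpha = np.append(self.alpha, 0.0)
            k = np.append(k, rbf(x, x, self.gamma))
        err = y - k @ self.alpha                     # prediction error
        self.alpha += self.eta * err / (self.eps + k @ k) * k  # normalised step
        return err
```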


Field-Programmable Technology | 2015

Braiding: A scheme for resolving hazards in kernel adaptive filters

Stephen Tridgell; Duncan J. M. Moss; Nicholas J. Fraser; Philip Heng Wai Leong

Computational cost presents a barrier in the application of machine learning algorithms to large-scale real-time learning problems. Kernel adaptive filters (KAFs) have low computational cost with the ability to learn online and are hence favoured for such applications. Unfortunately, dependencies of the outputs on the weight updates prohibit pipelining. This paper introduces a combination of parallel execution and conditional forwarding, called braiding, which overcomes dependencies by expressing the output as a combination of the earlier state and other examples in the pipeline. To demonstrate its utility, braiding is applied to the implementation of classification, regression and novelty detection algorithms based on the Naive Online regularised Risk Minimization Algorithm (NORMA). Fixed point, open source implementations are described which can achieve data rates of around 130 MSamples/s with a latency of 10 to 13 clock cycles. This constitutes a two orders of magnitude increase in throughput and one order of magnitude decrease in latency compared to a single core CPU implementation.
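
Braiding itself is a hardware hazard-resolution scheme and does not reduce to a few lines, but the sketch below shows a NORMA-style online kernel regression update of the kind these cores implement: old expansion coefficients decay, a new term is appended per sample, and the expansion is truncated to a fixed budget. The loss choice, step size, and budget are illustrative.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

class NORMARegressor:
    def __init__(self, gamma=1.0, eta=0.1, lam=0.01, budget=64):
        self.gamma, self.eta, self.lam, self.budget = gamma, eta, lam, budget
        self.centres, self.alpha = [], []

    def predict(self, x):
        return sum(a * rbf(x, c, self.gamma) for a, c in zip(self.alpha, self.centres))

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        err = y - self.predict(x)
        self.alpha = [(1.0 - self.eta * self.lam) * a for a in self.alpha]  # shrink old terms
        self.alpha.append(self.eta * err)                                   # add a new term
        self.centres.append(x)
        if len(self.centres) > self.budget:   # truncate to a fixed budget
            self.centres.pop(0); self.alpha.pop(0)
        return err
```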


International Conference on Neural Information Processing | 2017

Compressing Low Precision Deep Neural Networks Using Sparsity-Induced Regularization in Ternary Networks.

Julian Faraone; Nicholas J. Fraser; Giulio Gambardella; Michaela Blott; Philip Heng Wai Leong

A low precision deep neural network training technique for producing sparse, ternary neural networks is presented. The technique incorporates hardware implementation costs during training to achieve significant model compression for inference. Training involves three stages: network training using L2 regularization and a quantization threshold regularizer, quantization pruning, and finally retraining. The resulting networks achieve improved accuracy, reduced memory footprint and reduced computational complexity compared with conventional methods, on the MNIST and CIFAR-10 datasets. Our networks are up to 98% sparse and 5 and 11 times smaller than equivalent binary and ternary models, translating to significant resource and speed benefits for hardware implementations.
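
A minimal sketch of threshold-based ternarisation: weights within ±delta of zero are pruned to 0 and the rest map to ±1 (optionally times a scale), which is the kind of sparse, ternary representation the training technique targets. The threshold value and the training-time regularisers are not reproduced here.

```python
import numpy as np

def ternarize(w: np.ndarray, delta: float = 0.05, scale: float = 1.0) -> np.ndarray:
    """Map weights to {-scale, 0, +scale}; values within +/- delta are pruned to 0."""
    t = np.zeros_like(w)
    t[w > delta] = scale
    t[w < -delta] = -scale
    return t

w = np.array([0.2, -0.01, -0.3, 0.04])
print(ternarize(w))                 # [ 1.  0. -1.  0.]
print((ternarize(w) == 0).mean())   # fraction of pruned (sparse) weights
```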


ACM Transactions on Reconfigurable Technology and Systems | 2016

A Microcoded Kernel Recursive Least Squares Processor Using FPGA Technology

Yeyong Pang; Shaojun Wang; Yu Peng; Xiyuan Peng; Nicholas J. Fraser; Philip Heng Wai Leong

Kernel methods utilize linear methods in a nonlinear feature space and combine the advantages of both. Online kernel methods, such as kernel recursive least squares (KRLS) and kernel normalized least mean squares (KNLMS), perform nonlinear regression in a recursive manner, with similar computational requirements to linear techniques. In this article, an architecture for a microcoded kernel method accelerator is described, and high-performance implementations of sliding-window KRLS, fixed-budget KRLS, and KNLMS are presented. The architecture utilizes pipelining and vectorization for performance, and microcoding for reusability. The design can be scaled to allow tradeoffs between capacity, performance, and area. The design is compared with a central processing unit (CPU), digital signal processor (DSP), and Altera OpenCL implementations. In different configurations on an Altera Arria 10 device, our SW-KRLS implementation delivers floating-point throughput of approximately 16 GFLOPS, latency of 5.5 μs, and energy consumption of 10⁻⁴ J, these being improvements over a CPU by factors of 12, 17, and 24, respectively.
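
To illustrate the sliding-window variant (SW-KRLS) named in the results, the toy sketch below restricts the kernel expansion to the most recent samples; it re-solves the small regularised system naively each step, whereas the processor performs the equivalent update recursively. All parameters are illustrative.

```python
import numpy as np
from collections import deque

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

class SlidingWindowKRLS:
    def __init__(self, window=32, gamma=1.0, reg=1e-3):
        self.X, self.y = deque(maxlen=window), deque(maxlen=window)  # oldest sample evicted automatically
        self.gamma, self.reg, self.alpha = gamma, reg, None

    def update(self, x, y):
        self.X.append(np.asarray(x, dtype=float)); self.y.append(float(y))
        K = np.array([[rbf(a, b, self.gamma) for b in self.X] for a in self.X])
        self.alpha = np.linalg.solve(K + self.reg * np.eye(len(self.X)), np.array(self.y))

    def predict(self, x):
        k = np.array([rbf(x, c, self.gamma) for c in self.X])
        return float(k @ self.alpha)
```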


Applied Reconfigurable Computing | 2018

Accuracy to Throughput Trade-Offs for Reduced Precision Neural Networks on Reconfigurable Logic

Jiang Su; Nicholas J. Fraser; Giulio Gambardella; Michaela Blott; Gianluca Durelli; David B. Thomas; Philip Heng Wai Leong; Peter Y. K. Cheung

Modern Convolutional Neural Networks (CNNs) are typically based on floating point linear algebra implementations. Recently, reduced precision Neural Networks (NNs) have been gaining popularity as they require significantly less memory and computational resources compared to floating point. This is particularly important in power constrained compute environments. However, in many cases a reduction in precision comes at a small cost to the accuracy of the resultant network. In this work, we investigate the accuracy-throughput trade-off for various parameter precisions applied to different types of NN models. We first propose a quantization training strategy that allows reduced precision NN inference with a lower memory footprint and competitive model accuracy. We then quantitatively formulate the relationship between data representation and hardware efficiency. Finally, our experiments provide insightful observations. For example, one of our tests shows that 32-bit floating point is more hardware efficient than 1-bit parameters when targeting 99% MNIST accuracy. In general, 2-bit and 4-bit fixed point parameters offer a better hardware trade-off on small-scale datasets like MNIST and CIFAR-10, while 4-bit provides the best trade-off on large-scale tasks like AlexNet on ImageNet, within our tested problem domain.
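
The precision knob explored in this study can be pictured with a simple uniform fixed-point quantiser like the sketch below; the bit allocation and rounding scheme are illustrative, not the exact quantization training strategy proposed in the paper.

```python
import numpy as np

def quantize_fixed(w: np.ndarray, bits: int, frac_bits: int) -> np.ndarray:
    """Round-to-nearest uniform fixed-point quantisation with saturation."""
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.clip(np.round(w * scale), qmin, qmax)
    return q / scale

w = np.array([0.73, -0.12, 1.9, -2.2])
print(quantize_fixed(w, bits=4, frac_bits=1))  # 4-bit values on a 0.5 grid, clipped to [-4, 3.5]
```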


ACM Transactions on Reconfigurable Technology and Systems | 2017

FPGA Implementations of Kernel Normalised Least Mean Squares Processors

Nicholas J. Fraser; JunKyu Lee; Duncan J. M. Moss; Julian Faraone; Stephen Tridgell; Craig Jin; Philip Heng Wai Leong

Kernel adaptive filters (KAFs) are online machine learning algorithms which are amenable to highly efficient streaming implementations. They require only a single pass through the data and can act as universal approximators, i.e. approximate any continuous function with arbitrary accuracy. KAFs are members of a family of kernel methods which apply an implicit non-linear mapping of input data to a high dimensional feature space, permitting learning algorithms to be expressed entirely as inner products. Such an approach avoids explicit projection into the feature space, enabling computational efficiency. In this paper, we propose the first fully pipelined implementation of the kernel normalised least mean squares algorithm for regression. Independent training tasks necessary for hyperparameter optimisation fill pipeline stages, so no stall cycles to resolve dependencies are required. Together with other optimisations to reduce resource utilisation and latency, our core achieves 161 GFLOPS on a Virtex 7 XC7VX485T FPGA for a floating point implementation and 211 GOPS for fixed point. Our PCI Express based floating-point system implementation achieves 80% of the core’s speed, this being a speedup of 10× over an optimised implementation on a desktop processor and 2.66× over a GPU.
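
The stall-free scheduling idea can be pictured with the toy model below: if an update's result is only available L cycles later, issuing updates for L independent training tasks (e.g. different hyperparameter candidates) in round-robin order keeps the pipeline full, since consecutive cycles never depend on each other. The numbers are illustrative, not the core's actual latency.

```python
# Toy model of filling an L-cycle update latency with L independent tasks.
L = 4                                   # assumed update latency in cycles
tasks = [f"task{i}" for i in range(L)]  # L independent filters being trained
schedule = [tasks[cycle % L] for cycle in range(12)]
print(schedule)  # task0, task1, task2, task3, task0, ... - each task's own updates are L cycles apart
```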


International Conference on Acoustics, Speech, and Signal Processing | 2015

Distributed kernel learning using Kernel Recursive Least Squares

Nicholas J. Fraser; Duncan J. M. Moss; Nicolas Epain; Philip Heng Wai Leong

Constructing accurate models that represent the underlying structure of Big Data is a costly process that usually constitutes a compromise between computation time and model accuracy. Methods addressing these issues often employ parallelisation to handle processing. Many of these methods target the Support Vector Machine (SVM) and provide a significant speed up over batch approaches. However, the convergence of these methods often relies on multiple passes through the data. In this paper, we present a parallelised algorithm that constructs a model equivalent to a serial approach, whilst requiring only a single pass through the data. We first employ the Kernel Recursive Least Squares (KRLS) algorithm to construct several models from subsets of the overall data. We then show that these models can be combined using KRLS to create a single compact model. Our parallelised KRLS methodology significantly improves execution time and achieves comparable accuracy to both parallel and serial SVM approaches.
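
A minimal sketch of the combine step, assuming each worker returns its kernel model as a pair (centres, alpha): the merged model is fit to the pooled centres, with the workers' own predictions at those points as targets. This mirrors the idea of combining sub-models with a second KRLS pass; the exact merge rule in the paper may differ, and the naive solve here stands in for the recursive update.

```python
import numpy as np

def rbf_matrix(A, B, gamma=1.0):
    """Pairwise RBF kernel matrix between rows of A and rows of B."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def combine(models, gamma=1.0, reg=1e-3):
    """Merge per-worker kernel models (centres, alpha) into one compact model."""
    centres = np.vstack([c for c, _ in models])
    # each sub-model predicts on the pooled centres; average the predictions as targets
    preds = np.mean([rbf_matrix(centres, c, gamma) @ a for c, a in models], axis=0)
    K = rbf_matrix(centres, centres, gamma)
    alpha = np.linalg.solve(K + reg * np.eye(len(centres)), preds)
    return centres, alpha

# e.g. two workers with 1-D inputs (hypothetical values)
m1 = (np.array([[0.0], [1.0]]), np.array([0.5, -0.2]))
m2 = (np.array([[2.0], [3.0]]), np.array([0.1, 0.4]))
centres, alpha = combine([m1, m2])
```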

Collaboration


Dive into Nicholas J. Fraser's collaborations.

Top Co-Authors

Yaman Umuroglu

Norwegian University of Science and Technology
