Shane T. Fleming
Imperial College London
Publications
Featured research published by Shane T. Fleming.
design, automation, and test in europe | 2015
David B. Thomas; Shane T. Fleming; George A. Constantinides; Dan R. Ghica
Modern heterogeneous devices contain tightly coupled CPU and FPGA logic, allowing low latency access to accelerators. However, designers of the system need to treat accelerated functions specially, with device specific code for instantiating, configuring, and executing accelerators. We present a system level linker, which allows functions in hardware and software to be linked together to create heterogeneous systems. The linker works with post-compilation and post-synthesis components, allowing the designer to transparently move functions between devices simply by linking in either hardware or software object files. The linker places no special emphasis on the software, allowing computation to be initiated from within hardware, with function calls to software to provide services such as file access. A strong type-system ensures that individual code artifacts can be written using the conventions of that domain (C, HLS, VHDL), while allowing direct and transparent linking.
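The strong typing the abstract describes can be pictured with a small sketch. The registry below is purely illustrative (the `SystemLinker`, `require`, and `provide` names are hypothetical; the actual linker operates on compiled and synthesised object files, not Python callables): it refuses to bind a required function to a provided one unless their declared signatures match exactly.

```python
class SystemLinker:
    """Toy illustration of typed link-time binding between components."""

    def __init__(self):
        self._required = {}   # symbol name -> expected signature string
        self._provided = {}   # symbol name -> (signature, implementation)

    def require(self, name, signature):
        """Declare that some component calls `name` with this signature."""
        self._required[name] = signature

    def provide(self, name, signature, impl):
        """Declare that some component implements `name`."""
        self._provided[name] = (signature, impl)

    def link(self):
        """Resolve every required symbol, enforcing signature equality."""
        bound = {}
        for name, want in self._required.items():
            if name not in self._provided:
                raise LookupError(f"unresolved symbol: {name}")
            have, impl = self._provided[name]
            if have != want:   # strong typing: reject mismatched interfaces
                raise TypeError(f"{name}: expected {want}, got {have}")
            bound[name] = impl
        return bound
```

Because only the declared signature is checked, either side of the binding could equally be a software object file or a synthesised hardware block, which is the property that lets functions move between devices transparently.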
field-programmable logic and applications | 2013
Shane T. Fleming; David B. Thomas
Real-time systems must guarantee that tasks complete by a given deadline, so they are typically designed around the worst-case execution time (WCET) of the tasks in the system. In general this creates systems with excess slack compared to the average case. Since real-time systems are often battery-operated embedded devices, large amounts of slack are undesirable: more slack means more energy usage. Scheduling methods exist that try to adapt to the current environment, for example adaptive reservation scheduling, which assigns a dynamic fraction of the computational resources to each process. However, these and more advanced scheduling techniques are rarely adopted in practice due to their high computational overhead. My research hypothesis is that the overheads of complex scheduling and power-saving techniques in real-time systems can be reduced by developing a coprocessor in the FPGA fabric. Recent developments in reconfigurable device technology include new hybrid FPGA/CPU chips, such as the Xilinx Zynq extensible processing platform, where an ARM core is coupled to an FPGA fabric over an AXI bus. Locating the FPGA and CPU on the same die makes low-latency, power-efficient communication possible between user logic in the FPGA and tasks on the CPU. This paper presents my preliminary work, which takes a software application with real-time deadlines and creates a controller in the FPGA fabric that dynamically scales the operating frequency of the CPUs while still guaranteeing that the deadlines can be met. The coprocessor is entirely agnostic to the software running on the CPU: provided some measure of slack can be obtained, i.e. the execution time subtracted from the deadline period, it can scale the frequency safely and reduce dynamic power consumption.
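The control rule behind such a coprocessor can be sketched in a few lines. This is a minimal illustration, not the paper's controller: the thresholds, scaling factors, and frequency bounds are all assumptions chosen for readability, and a real implementation would run in FPGA fabric, not Python.

```python
def scale_frequency(freq_hz, slack_s, deadline_s,
                    f_min=100e6, f_max=1e9, margin=0.1):
    """Toy slack-based DVFS rule (illustrative parameters, not the paper's).

    slack_s = deadline_s - execution_time_s. If the task finishes well
    before its deadline we can afford to slow the clock and save dynamic
    power; if slack shrinks towards the safety margin, speed back up so
    the deadline is never at risk.
    """
    if slack_s > 2 * margin * deadline_s:     # plenty of slack: slow down
        return max(f_min, freq_hz * 0.9)
    if slack_s < margin * deadline_s:         # too close to the deadline
        return min(f_max, freq_hz * 1.25)
    return freq_hz                            # inside the comfort band
```

Because the rule only consumes a slack measurement, it stays agnostic to what the CPU software is actually doing, which is the property the abstract emphasises.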
design automation conference | 2016
Shane T. Fleming; David B. Thomas
Soft-error detection in FPGAs typically requires replication, doubling the required area. We propose an approach which distinguishes between tolerable errors in data-flow, such as arithmetic, and intolerable errors in control-flow, such as branches and their data dependencies. This approach is demonstrated in a new high-level synthesis compiler pass called StitchUp, which precisely identifies the control-critical parts of the design and then automatically replicates only that part. We applied StitchUp to the CHStone benchmark suite and performed exhaustive hardware fault injection in each case, finding that all control-flow errors were detected while requiring only 1% circuit area overhead in the best case.
field programmable gate arrays | 2015
Shane T. Fleming; David B. Thomas; George A. Constantinides; Dan R. Ghica
Devices with tightly coupled CPUs and FPGA logic allow for the implementation of heterogeneous applications which combine multiple components written in hardware and software languages, including first-party source code and third-party IP. Flexibility in component relationships is important, so that the system designer can move components between software and hardware as the application design evolves. This paper presents a system-level type system and linker, which allows functions in software and hardware components to be directly linked at link time, without requiring any modification or recompilation of the components. The type system is designed to be language agnostic and exhibits higher-order features, enabling design patterns such as notifications and callbacks to software from within hardware functions. We demonstrate the system through a number of case studies which link compiled software against synthesised hardware on the Xilinx Zynq platform.
Archive | 2015
Gordon Inggs; Shane T. Fleming; David B. Thomas; Wayne Luk
High-Level Synthesis (HLS) tools for Field Programmable Gate Arrays (FPGAs) have made considerable progress in recent years, and are now ready for deployment in an industrial setting. This claim is supported by a case study of pricing a benchmark of Black-Scholes and Heston model-based options using a Monte Carlo simulation approach. Using an HLS tool such as Xilinx's Vivado HLS, Altera's OpenCL SDK, or Maxeler's MaxCompiler, a functionally correct FPGA implementation can be developed in a short time from a high-level description based upon the MapReduce programming model. This direct source-code implementation is, however, unlikely to meet performance expectations, so a series of optimisations can be applied to use the target FPGA's resources more efficiently. When a combination of task and pipeline parallelism as well as C-slowing optimisations is applied to the problem in this case study, the Vivado HLS implementation is 9.5 times faster than a sequential CPU implementation, the Altera OpenCL implementation 221 times faster, and the Maxeler implementation 204 times faster, the sort of acceleration expected of custom architectures. Compared to the 31 times improvement shown by an optimised multicore CPU implementation, the 60 times improvement by a GPU, and the 207 times improvement by a Xeon Phi, these results suggest that HLS is indeed ready for business.
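The MapReduce structure of the pricing kernel is straightforward to sketch in software: the "map" stage generates one discounted payoff per random path, and the "reduce" stage sums them. The sketch below is a reference model for the Black-Scholes case only; the at-the-money parameters (S0 = K = 100, r = 5%, sigma = 20%, T = 1 year) are our illustrative assumption, not the benchmark's actual inputs.

```python
import math
import random
from functools import reduce

def bs_path_payoff(s0, k, r, sigma, t, rng):
    """'Map' stage: simulate one terminal GBM price, return the call payoff."""
    z = rng.gauss(0.0, 1.0)
    st = s0 * math.exp((r - 0.5 * sigma**2) * t + sigma * math.sqrt(t) * z)
    return max(st - k, 0.0)

def price_call_mc(s0, k, r, sigma, t, n_paths, seed=42):
    """Monte Carlo European call price: map over paths, reduce by summing."""
    rng = random.Random(seed)
    payoffs = (bs_path_payoff(s0, k, r, sigma, t, rng)
               for _ in range(n_paths))
    total = reduce(lambda a, b: a + b, payoffs)   # 'reduce' stage
    return math.exp(-r * t) * total / n_paths     # discounted average

price = price_call_mc(100.0, 100.0, 0.05, 0.2, 1.0, 100_000)
```

On an FPGA, each map invocation becomes an independent pipelined path simulator and the reduction an accumulator tree, which is why the pattern maps so naturally to the HLS tools named above.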
field programmable logic and applications | 2015
Shane T. Fleming; Ivan Beretta; David B. Thomas; George A. Constantinides; Dan R. Ghica
FPGA systems are moving towards a system-on-chip model, both at the architectural level and in the development tools. Developers are able to design and implement IP using a mixture of HLS, RTL, and software, then integrate them with third-party IP cores and hardened CPUs using one or more shared memory buses. This allows functionality to be easily connected together at the bus level, but accessing IP core functionality requires designers to support each component's protocol and coordinate hardware from a CPU. This paper presents a protocol called PushPush, which allows HLS, RTL, and software components to expose functionality as strongly typed functions, and allows any component to access functions exposed by any other component in the system. The protocol is designed for maximum efficiency in memory buses such as AXI and Avalon, reducing each function call to two burst writes delivering both data and control, minimising bus traffic and eliminating the need for global polling or interrupt delivery. We demonstrate this approach in a Zynq environment, using components written in C++ (ARM/Linux), C (Microblaze), Vivado HLS (Logic), and Verity (Logic). We show that any component can call functions exposed by any other component, without knowing where or how that function is located. Performance is at least 1 million function calls/sec between any pair of components, and rises to 4 million function calls/sec between pairs of Vivado HLS components.
field programmable custom computing machines | 2017
Shane T. Fleming; David B. Thomas
Reads and writes to global data in off-chip RAM can limit the performance achieved with HLS tools, as each access takes multiple cycles and usually blocks progress in the application state machine. This can be combated with data prefetchers, which hide access time by predicting the next memory access and loading it into a cache before it is required. Unfortunately, current prefetchers are only useful for memory accesses with known regular patterns, such as walking over arrays, and are ineffective for irregular patterns over application-specific data structures. In this work, we demonstrate prefetchers that are tailor-made for applications, even those with irregular memory accesses. This is achieved through program slicing, a static analysis technique that extracts the memory structure of the input code, from which an application-specific prefetcher is automatically constructed. Both our analysis and tool are fully automated and implemented as a new compiler flag in LegUp, an open-source HLS tool. We develop a theoretical model showing that the speedup must lie between 1x and 2x, and we evaluate five benchmarks, achieving an average speedup of 1.38x with an average resource overhead of 1.15x.
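The 1x-2x bound has a simple intuition, which the sketch below captures (our reading of the claim, not the paper's exact model): without a prefetcher each loop iteration pays compute time plus memory latency in sequence, while a perfect prefetcher overlaps the fetch with the compute, leaving only the longer of the two on the critical path.

```python
def prefetch_speedup(t_compute, t_mem):
    """Idealised overlap model behind the 1x-2x speedup bound.

    Baseline: blocking memory access, so per-iteration cost is the sum.
    Prefetched: the next access is fetched while the current iteration
    computes, so per-iteration cost is the maximum of the two.
    """
    baseline = t_compute + t_mem           # costs add up
    overlapped = max(t_compute, t_mem)     # costs overlap
    return baseline / overlapped

# The ratio (a + b) / max(a, b) is at most 2, with equality only when
# compute and memory time are perfectly balanced; it tends to 1 as
# either term dominates, matching the measured 1.38x average.
```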
Archive | 2016
Shane T. Fleming; David B. Thomas; Felix Winterstein
Reconfigurable field-programmable gate arrays (FPGAs) offer high processing rates at low power consumption, together with flexibility through reconfiguration, which makes them widely used devices in embedded systems today. Spacecraft are highly constrained embedded systems with an increasing demand for high processing throughput. Hence, leveraging the power/energy efficiency and flexibility of reprogrammable FPGAs in space-borne processors is of great interest to the space sector. However, SRAM-based FPGAs in space applications are particularly susceptible to radiation effects, as single event upsets (SEUs) in the configuration memory can cause the reconfiguration of the chip and an undesired modification of the circuit. Traditionally, this problem is addressed by fault detection and scrubbing, i.e. repeated reprogramming of the configuration bitstream. A major disadvantage of this technique is the considerable downtime of the processing system during reprogramming, which can lead to the loss of payload data or even affect critical onboard control tasks. This work proposes a novel fault detection, isolation and recovery (FDIR) framework that optimises the worst-case response, power consumption, and availability of the processing system together. Our FDIR scheme and fault handling are transparent to the payload application, as the system autonomously ensures nearly full availability of the payload processor at all times. A key feature of our technique is the explicit use of commercial-off-the-shelf heterogeneous systems such as Xilinx's Zynq or Altera's Cyclone V system-on-chip devices, which tightly couple FPGA fabric with embedded hard processor cores. This chapter describes the current implementation of our FDIR framework. We present experimental results obtained under fault injection and demonstrate that our framework ensures nearly full availability, whereas the conventional scrubbing approach can degrade to 20% availability for high fault rates.
An in-orbit demonstration and validation of the proposed technique will follow during an experiment campaign onboard OPS-SAT, a European Space Agency satellite mission set to launch in 2016.
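Why conventional scrubbing degrades at high fault rates can be seen from a back-of-the-envelope model. This is purely our illustrative assumption (the chapter's 20% figure comes from fault-injection experiments, not from this formula): if every fault triggers a full bitstream reprogram during which the payload is down, the downtime fraction grows linearly with the fault rate.

```python
def scrub_availability(fault_rate_hz, reprogram_time_s):
    """Toy availability model for conventional scrubbing (illustrative).

    Assumes each configuration-memory fault costs one full reprogram of
    reprogram_time_s during which no payload processing happens, so the
    available fraction is whatever time is left over.
    """
    downtime_fraction = fault_rate_hz * reprogram_time_s
    return max(0.0, 1.0 - downtime_fraction)
```

Under this model, availability collapses once faults arrive faster than they can be scrubbed away, which is the regime where an FDIR scheme that avoids full reprogramming pays off.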
field programmable logic and applications | 2014
Shane T. Fleming; David B. Thomas
Modern computer systems are formed from many interacting subsystems and heterogeneous components that face increasing constraints on performance, power consumption, and temperature. Such systems have complex run-time dynamics which cannot easily be predicted or modelled at design time, creating a need for online dynamic systems management. The Heartbeats API is a popular open-source project which provides a standardised way for applications to monitor and publish their progress in multi-core CPU systems, but it does not allow hardware components to be monitored or to observe the progress of other components of the system. This paper presents work which extends the capabilities of the Heartbeats API across the whole system while maintaining backwards compatibility with the legacy software API. To demonstrate the framework's capabilities, an Autonomous Underwater Vehicle (AUV) case study is explored, in which a power-aware HW/SW image processing application is implemented on a reconfigurable SoC and an approximate energy saving of 30% is observed for an example input video. Current progress is also discussed on applications which build upon the framework, including a CubeSat experiment for an adaptive heterogeneous FDIR system to be launched in 2016 by the European Space Agency.
applied reconfigurable computing | 2013
Shane T. Fleming; David B. Thomas
Dense matrix-matrix multiplication over small finite fields is a common operation in many application domains, such as cryptography, random number generation, and error-correcting codes. This paper shows that FPGAs have the potential to greatly accelerate this time-consuming operation, and in particular that systolic-array-based approaches are both practical and efficient on large modern devices. A number of finite-field-specific architectural optimisations are introduced, allowing n×n matrices to be processed in O(n) cycles for matrix sizes up to n=350. Comparison with optimised software implementations on a single-core CPU shows that an FPGA accelerator on a Virtex-7 XC7V200T can achieve between 80x and 700x speed-up for GF(2^k), and for GF(3) and larger finite fields can provide practical speed-ups of 1000x or more.
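The operation being accelerated is ordinary matrix multiplication, except that every addition and multiplication is performed in the finite field. A minimal software reference model for prime fields GF(p) (our illustration of the arithmetic, not the systolic hardware design) looks like this:

```python
def gf_matmul(a, b, p):
    """Dense matrix product over the prime field GF(p).

    Same triple-loop structure as integer matrix multiply, but every
    inner product is reduced mod p. In GF(2) this makes addition an XOR
    and multiplication an AND, which is why such designs map so cheaply
    onto FPGA logic.
    """
    n, m, k = len(a), len(b[0]), len(b)
    return [[sum(a[i][t] * b[t][j] for t in range(k)) % p
             for j in range(m)]
            for i in range(n)]

# GF(2) example: note 1 + 1 = 0, so the top-left entry cancels.
a = [[1, 1], [0, 1]]
b = [[1, 0], [1, 1]]
c = gf_matmul(a, b, 2)   # [[0, 1], [1, 1]]
```

A systolic implementation streams these inner products through a grid of multiply-accumulate cells so that, after the pipeline fills, one result column emerges per cycle, giving the O(n) cycle count quoted above.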