Publication


Featured research published by Samuel Bayliss.


field-programmable technology | 2013

High-level synthesis of dynamic data structures: A case study using Vivado HLS

Felix Winterstein; Samuel Bayliss; George A. Constantinides

High-level synthesis promises a significant shortening of the FPGA design cycle when compared with design entry using register transfer level (RTL) languages. Recent evaluations report that C-to-RTL flows can produce results with a quality close to hand-crafted designs [1]. However, algorithms that use dynamic, pointer-based data structures, common in software, remain difficult to implement well. In this paper, we describe a comparative case study using Xilinx Vivado HLS as an exemplary state-of-the-art high-level synthesis tool. Our test cases are two alternative algorithms for the same compute-intensive machine learning technique (clustering) with significantly different computational properties. We compare a data-flow-centric implementation to a recursive tree traversal implementation that incorporates complex data-dependent control flow and makes use of pointer-linked data structures and dynamic memory allocation. The outcome of this case study is twofold: we confirm similar performance between the hand-written and automatically generated RTL designs for the first test case. The second case reveals a degradation in latency by a factor greater than 30× if the source code is not altered prior to high-level synthesis. We identify the reasons for this shortcoming and present code transformations that narrow the performance gap to a factor of four. We generalise our source-to-source transformations, whose automation motivates research directions for improving high-level synthesis of dynamic data structures in the future.
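
As a hedged illustration of the kind of source-level rewriting involved (not the paper's actual transformations), the sketch below replaces a recursive, pointer-chasing tree traversal with an iterative version over a statically sized node pool and an explicit stack, a form that HLS tools such as Vivado HLS handle far more readily; all names and sizes are assumptions for exposition.

// Nodes live in a fixed pool indexed by integers instead of being allocated
// with malloc/new, so an HLS tool can map them onto on-chip block RAM.
struct Node {
    int value;
    int left;   // index of left child in the pool, -1 if none
    int right;  // index of right child in the pool, -1 if none
};

constexpr int MAX_NODES = 1024;

// Iterative traversal with an explicit, statically sized stack in place of
// recursion and pointer chasing.
int sum_tree(const Node pool[MAX_NODES], int root) {
    int stack[MAX_NODES];
    int sp = 0;
    int sum = 0;
    if (root >= 0) stack[sp++] = root;
    while (sp > 0) {
        const Node &n = pool[stack[--sp]];
        sum += n.value;
        if (n.left >= 0)  stack[sp++] = n.left;
        if (n.right >= 0) stack[sp++] = n.right;
    }
    return sum;
}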


field-programmable logic and applications | 2013

FPGA-based K-means clustering using tree-based data structures

Felix Winterstein; Samuel Bayliss; George A. Constantinides

K-means clustering is a popular technique for partitioning a data set into subsets of similar features. Due to their simple control flow and inherent fine-grain parallelism, K-means algorithms are well suited for hardware implementations, such as on field programmable gate arrays (FPGAs), to accelerate the computationally intensive calculation. However, the available hardware resources in massively parallel implementations are easily exhausted for large problem sizes. This paper presents an FPGA implementation of an efficient variant of K-means clustering which prunes the search space using a binary kd-tree data structure to reduce the computational burden. Our implementation uses on-chip dynamic memory allocation to ensure efficient use of memory resources. We describe the trade-off between data-level parallelism and search space reduction at the expense of increased control overhead. A data-sensitive analysis shows that our approach requires up to five times fewer computational FPGA resources than a conventional massively parallel implementation for the same throughput constraint.
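
The pruning idea behind such kd-tree ("filtering") K-means variants can be sketched in software as follows; this is an illustrative rendering with an assumed, simple pruning rule and invented names, not the paper's hardware design. A candidate centre is discarded for a whole kd-tree cell when its closest possible distance to the cell's bounding box already exceeds the farthest possible distance of some other candidate.

#include <cmath>
#include <limits>
#include <vector>

constexpr int D = 2;                  // dimensionality of the feature space

struct Box { double lo[D], hi[D]; };  // bounding box of a kd-tree cell

// Squared distance from centre c to the nearest point of the box.
static double min_dist2(const Box &b, const double c[D]) {
    double d2 = 0.0;
    for (int i = 0; i < D; ++i) {
        double d = 0.0;
        if (c[i] < b.lo[i]) d = b.lo[i] - c[i];
        else if (c[i] > b.hi[i]) d = c[i] - b.hi[i];
        d2 += d * d;
    }
    return d2;
}

// Squared distance from centre c to the farthest corner of the box.
static double max_dist2(const Box &b, const double c[D]) {
    double d2 = 0.0;
    for (int i = 0; i < D; ++i) {
        double d = std::fmax(std::fabs(c[i] - b.lo[i]), std::fabs(c[i] - b.hi[i]));
        d2 += d * d;
    }
    return d2;
}

// Returns the centres that may still be nearest to some point in the cell;
// the rest need not be considered anywhere in this subtree.
std::vector<int> filter_candidates(const Box &cell, const double centres[][D], int k) {
    double best_max = std::numeric_limits<double>::max();
    for (int j = 0; j < k; ++j)
        best_max = std::fmin(best_max, max_dist2(cell, centres[j]));
    std::vector<int> survivors;
    for (int j = 0; j < k; ++j)
        if (min_dist2(cell, centres[j]) <= best_max)
            survivors.push_back(j);
    return survivors;
}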


field programmable gate arrays | 2015

MATCHUP: Memory Abstractions for Heap Manipulating Programs

Felix Winterstein; Kermin Fleming; Hsin-Jung Yang; Samuel Bayliss; George A. Constantinides

Memory-intensive implementations often require access to an external, off-chip memory, which can substantially slow down an FPGA accelerator due to memory bandwidth limitations. Buffering frequently reused data on chip is a common approach to address this problem, and optimizing the cache architecture introduces yet another complex design space. This paper presents a high-level synthesis (HLS) design aid that generates parallel application-specific multi-scratchpad architectures including on-chip caches. Our program analysis identifies non-overlapping memory regions, which are supported by private scratchpads, and regions shared by parallel units after parallelization, which are supported by coherent scratchpads and synchronization primitives. It also decides whether the parallelization is legal with respect to data dependencies. The novelty of this work is its focus on programs using dynamic, pointer-based data structures and dynamic memory allocation, which, while common in software engineering, remain difficult to analyze and are beyond the scope of the overwhelming majority of HLS techniques to date. We demonstrate our technique with three case studies of applications using dynamically allocated data structures, using Xilinx Vivado HLS as an exemplary HLS tool. We show up to 10x speed-up after parallelization of the HLS implementations and insertion of the application-specific distributed hybrid scratchpad architecture.
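
A minimal, hypothetical source fragment of the kind such an analysis must reason about is sketched below; the function and pool names are invented for illustration and are not taken from the paper. The two filter calls touch disjoint node pools, so each could be backed by a private scratchpad, while the read-only coefficient table would be shared.

struct Node { float data; int next; };   // pointer-linked list held in a pool

constexpr int N = 4096;

// Walks one list and applies a toy first-order update to every element.
void filter(Node pool[], int head, const float coeff[2]) {
    for (int i = head; i != -1; i = pool[i].next)
        pool[i].data = coeff[0] * pool[i].data + coeff[1];
}

void top(Node pool_a[N], Node pool_b[N], const float coeff[2]) {
    // After parallelization these two calls run concurrently; the analysis
    // must prove that pool_a and pool_b never alias, so that each can be
    // served by its own on-chip scratchpad.
    filter(pool_a, 0, coeff);
    filter(pool_b, 0, coeff);
}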


field programmable gate arrays | 2012

Optimizing SDRAM bandwidth for custom FPGA loop accelerators

Samuel Bayliss; George A. Constantinides

Memory bandwidth is critical to achieving high performance in many FPGA applications. The bandwidth of SDRAM memories is, however, highly dependent upon the order in which addresses are presented on the SDRAM interface. We present an automated tool for constructing an application-specific on-chip memory address sequencer that presents requests to the external memory in an ordering that optimizes off-chip memory bandwidth for a fixed on-chip memory resource budget. Within a class of algorithms described by affine loop nests, this approach can be shown to reduce both the number of requests made to external memory and the overhead associated with those requests. The data presented show a trade-off between the use of on-chip resources and achievable off-chip memory bandwidth: efficiency gains of 3.6x to 4x on the external memory interface can be obtained at the cost of up to a 1.4x increase in the ALUTs dedicated to address-generation circuits in an Altera Stratix III device.
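
A toy software model of why request ordering matters, assuming a simple open-row SDRAM with a placeholder row size, is sketched below; it is not the paper's tool, only an illustration of the activate/precharge overhead that address reordering removes. For this access pattern the column-wise order opens a new row every few requests, while the reordered row-wise stream opens each row only once.

#include <cstdio>

constexpr int ROWS = 64, COLS = 64;
constexpr int SDRAM_ROW_WORDS = 256;   // assumed words per SDRAM row

// Counts how often consecutive addresses fall in different SDRAM rows,
// i.e. how many row activations an open-row policy would need.
long count_activations(const long *addr, int n) {
    long acts = 0, open_row = -1;
    for (int i = 0; i < n; ++i) {
        long row = addr[i] / SDRAM_ROW_WORDS;
        if (row != open_row) { ++acts; open_row = row; }
    }
    return acts;
}

int main() {
    static long col_order[ROWS * COLS], row_order[ROWS * COLS];
    int k = 0;
    for (int i = 0; i < COLS; ++i)       // original affine loop nest: column-wise walk of A[j][i]
        for (int j = 0; j < ROWS; ++j)
            col_order[k++] = static_cast<long>(j) * COLS + i;
    k = 0;
    for (int j = 0; j < ROWS; ++j)       // reordered request stream: row-wise walk
        for (int i = 0; i < COLS; ++i)
            row_order[k++] = static_cast<long>(j) * COLS + i;
    std::printf("row activations: column order %ld, row order %ld\n",
                count_activations(col_order, ROWS * COLS),
                count_activations(row_order, ROWS * COLS));
    return 0;
}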


field-programmable technology | 2009

Methodology for designing statically scheduled application-specific SDRAM controllers using constrained local search

Samuel Bayliss; George A. Constantinides

This paper presents an automatic method for generating, from a set of memory references, valid SDRAM command schedules that obey the timing restrictions of DDR2 memory. These generated schedules can be implemented using a static memory controller. Complete knowledge of the sequence of memory references in an application enables the scheduling algorithm to reorder memory commands effectively to reduce latency and improve throughput. While statically scheduled command sequences might be considered too inflexible to be useful in mask-defined devices, they are well suited to implementation within an FPGA, where new applications can be targeted by recompilation and reconfiguration. Static SDRAM schedules generated using our approach show a median 4x reduction in the number of memory stall cycles incurred across a selection of benchmarks when compared to schedules produced dynamically by the Altera High Performance Memory Controller.
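
A minimal sketch of the kind of timing-legality check such a static schedule must pass is shown below; the command set and timing values are placeholders rather than a specific DDR2 speed grade, and the exhaustive pairwise check stands in for whatever the paper's constrained local search actually performs.

#include <cstddef>
#include <vector>

enum class Cmd { ACTIVATE, READ, WRITE, PRECHARGE };
struct ScheduledCmd { Cmd cmd; int bank; int cycle; };   // commands listed in cycle order

constexpr int tRCD = 4;   // min cycles from ACTIVATE to READ/WRITE, same bank (assumed value)
constexpr int tRP  = 4;   // min cycles from PRECHARGE to ACTIVATE, same bank (assumed value)

// True if every same-bank command pair respects the minimum spacings above.
bool schedule_is_legal(const std::vector<ScheduledCmd> &sched) {
    for (std::size_t i = 0; i < sched.size(); ++i)
        for (std::size_t j = i + 1; j < sched.size(); ++j) {
            const ScheduledCmd &a = sched[i], &b = sched[j];
            if (a.bank != b.bank) continue;
            int gap = b.cycle - a.cycle;
            if (a.cmd == Cmd::ACTIVATE &&
                (b.cmd == Cmd::READ || b.cmd == Cmd::WRITE) && gap < tRCD)
                return false;
            if (a.cmd == Cmd::PRECHARGE && b.cmd == Cmd::ACTIVATE && gap < tRP)
                return false;
        }
    return true;
}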


field-programmable technology | 2006

An FPGA implementation of the simplex algorithm

Samuel Bayliss; Christos-Savvas Bouganis; George A. Constantinides; Wayne Luk

Linear programming is applied to a large variety of scientific computing applications and industrial optimization problems. The Simplex algorithm is widely used for solving linear programs due to its robustness and scalability properties. However, current software implementations of the Simplex algorithm are time consuming when applied to real-life optimization problems, particularly when used as the bounding engine within an integer linear programming framework. This work aims to accelerate the Simplex algorithm by proposing a novel parameterizable hardware implementation of the algorithm on an FPGA. Evaluation of the proposed design using real problems demonstrates a speedup of up to 20 times over a highly optimized commercial software implementation running on a 3.4GHz Pentium 4 processor, which is itself 100 times faster than one of the main public domain solvers.
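
For illustration, the tableau pivot at the core of the Simplex method is sketched below in plain C++; the FPGA design parallelises work of this kind across the tableau, but the code is a generic textbook step, not the paper's implementation.

#include <vector>

// One pivot on a dense (m+1) x (n+1) Simplex tableau: m constraint rows plus
// the objective row, n variables plus the right-hand-side column.
void pivot(std::vector<std::vector<double>> &t, int pivot_row, int pivot_col) {
    const int rows = static_cast<int>(t.size());
    const int cols = static_cast<int>(t[0].size());
    const double p = t[pivot_row][pivot_col];
    for (int c = 0; c < cols; ++c)            // normalise the pivot row
        t[pivot_row][c] /= p;
    for (int r = 0; r < rows; ++r) {          // eliminate the pivot column elsewhere
        if (r == pivot_row) continue;
        const double f = t[r][pivot_col];
        for (int c = 0; c < cols; ++c)
            t[r][c] -= f * t[pivot_row][c];
        // each row update is independent of the others, which is what makes
        // this step attractive for a parallel hardware implementation
    }
}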


field-programmable technology | 2013

SOAP: Structural optimization of arithmetic expressions for high-level synthesis

Xitong Gao; Samuel Bayliss; George A. Constantinides

This paper introduces SOAP, a new tool to automatically optimize the structure of arithmetic expressions for FPGA implementation as part of a high level synthesis flow, taking into account axiomatic rules derived from real arithmetic, such as distributivity, associativity and others. We explicitly target an optimized area/accuracy trade-off, allowing arithmetic expressions to be automatically re-written for this purpose. For the first time, we bring rigorous approaches from software static analysis, specifically formal semantics and abstract interpretation, to bear on source-to-source transformation for high-level synthesis. New abstract semantics are developed to generate a computable subset of equivalent expressions from an original expression. Using formal semantics, we calculate two objectives, the accuracy of computation and an estimate of resource utilization in FPGA. The optimization of these objectives produces a Pareto frontier consisting of a set of expressions. This gives the synthesis tool the flexibility to choose an implementation satisfying constraints on both accuracy and resource usage. We thus go beyond existing literature by not only optimizing the precision requirements of an implementation, but changing the structure of the implementation itself. Using our tool to optimize the structure of a variety of real world and artificially generated examples in single precision, we improve either their accuracy or the resource utilization by up to 60%.
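
A tiny illustration of the kind of axiomatic rewrite involved, applying distributivity to a small expression tree, is sketched below; the data types and the single rewrite rule are assumptions for exposition, not the SOAP tool, which enumerates many such equivalences and scores each variant for accuracy and estimated FPGA resource use.

#include <memory>
#include <string>
#include <utility>

struct Expr;
using ExprPtr = std::shared_ptr<Expr>;

struct Expr {
    char op;            // '+', '*', or 0 for a leaf variable
    std::string var;    // variable name when op == 0
    ExprPtr l, r;
};

ExprPtr leaf(std::string v) { return std::make_shared<Expr>(Expr{0, std::move(v), nullptr, nullptr}); }
ExprPtr node(char op, ExprPtr l, ExprPtr r) { return std::make_shared<Expr>(Expr{op, "", std::move(l), std::move(r)}); }

// Distributivity: (x + y) * z  ==>  x*z + y*z when the pattern matches,
// otherwise the expression is returned unchanged.
ExprPtr distribute(const ExprPtr &e) {
    if (e && e->op == '*' && e->l && e->l->op == '+')
        return node('+', node('*', e->l->l, e->r), node('*', e->l->r, e->r));
    return e;
}

int main() {
    ExprPtr original  = node('*', node('+', leaf("a"), leaf("b")), leaf("c"));   // (a + b) * c
    ExprPtr rewritten = distribute(original);                                    // a*c + b*c
    (void)rewritten;   // in SOAP, both forms would be scored and kept on a Pareto frontier
    return 0;
}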


field-programmable custom computing machines | 2015

Offline Synthesis of Online Dependence Testing: Parametric Loop Pipelining for HLS

Junyi Liu; Samuel Bayliss; George A. Constantinides

Loop pipelining is probably the most important optimization method in high-level synthesis (HLS), allowing multiple loop iterations to execute in a pipeline. In this paper, we extend the capability of loop pipelining in HLS to handle loops with uncertain memory behaviours. We extend polyhedral synthesis techniques to the parametric case, offloading the uncertainty to parameter values determined at run time. Our technique then synthesizes lightweight runtime checks to detect the case where a low initiation interval (II) is achievable, resulting in a run-time switch between aggressive (fast) and conservative (slow) execution modes. This optimization is implemented in an automated source-to-source code transformation framework with Xilinx Vivado HLS as one RTL generation backend. Over a suite of benchmarks, experiments show that our optimization implements the transformed pipelines at almost the same clock frequency as designs generated directly with Vivado HLS, but with an initiation interval approximately 10× faster in the fast case, while consuming approximately 60% more resources.
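
The run-time switch between fast and conservative execution can be pictured with the hedged sketch below; the function, the threshold of eight iterations, and the pragma placement are illustrative assumptions rather than output of the paper's transformation framework. When the offset parameter guarantees a dependence distance of at least the assumed pipeline depth, the aggressively pipelined loop is selected; otherwise a conservative version runs.

// Caller is assumed to guarantee that 0 <= i + offset < 1024 for all i < n.
void update(float a[1024], int n, int offset) {
    if (offset >= 8 || offset <= -8) {
        // Parametric check: the loop-carried dependence distance is at least
        // 8 iterations, assumed here to cover the pipeline depth, so II=1 is safe.
        fast_loop: for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
            a[i] = a[i + offset] * 0.5f + 1.0f;
        }
    } else {
        // Conservative fallback preserving the possibly close dependence.
        slow_loop: for (int i = 0; i < n; ++i) {
            a[i] = a[i + offset] * 0.5f + 1.0f;
        }
    }
}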


design automation conference | 2014

Datapath Synthesis for Overclocking: Online Arithmetic for Latency-Accuracy Trade-offs

Kan Shi; David Boland; Edward A. Stott; Samuel Bayliss; George A. Constantinides

Digital circuits are currently designed to ensure timing closure. Releasing this constraint by allowing timing violations could lead to significant performance improvements, but conventional forms of computer arithmetic do not fail gracefully when pushed beyond deterministic operation. In this paper we take a fresh look at Online Arithmetic, originally proposed for digit-serial operation, and synthesize unrolled digit-parallel online operators to allow for graceful degradation. We quantify the impact of timing violations on key arithmetic primitives, and show that substantial performance benefits can be obtained in comparison to binary arithmetic. Since timing errors are caused by long carry chains, with online arithmetic they manifest as errors in the least significant digits and therefore have less impact than in conventional implementations. Using analytical models and empirical FPGA results from an image processing application, we demonstrate an error reduction of over 89% and an improvement in SNR of over 20 dB for the same clock rate.
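
The underlying redundant-arithmetic idea, that avoiding long carry chains confines the effect of a timing failure to low-significance positions, can be illustrated with a simplified carry-save step; this is a related but deliberately simpler scheme than the paper's digit-parallel online operators, and the operand values are arbitrary.

#include <cstdint>
#include <cstdio>

// One carry-save (3:2 compressor) step: a + b + c -> (sum, carry) with no
// carry propagation between bit positions; each output bit depends only on
// the input bits of its own position, so there is no long carry chain.
void carry_save_add(std::uint32_t a, std::uint32_t b, std::uint32_t c,
                    std::uint32_t &sum, std::uint32_t &carry) {
    sum   = a ^ b ^ c;                              // per-bit sum
    carry = ((a & b) | (a & c) | (b & c)) << 1;     // per-bit carry, deferred
}

int main() {
    std::uint32_t s = 0, cy = 0;
    carry_save_add(13, 7, 25, s, cy);
    std::printf("%u\n", s + cy);   // one final carry-propagating add resolves the result: 45
    return 0;
}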


field-programmable logic and applications | 2014

Area implications of memory partitioning for high-level synthesis on FPGAs

Luca Gallo; Alessandro Cilardo; David B. Thomas; Samuel Bayliss; George A. Constantinides

FPGAs normally have numerous independent memory banks that can be accessed simultaneously, potentially offering a very large memory bandwidth. Adopting a suitable application-based memory partitioning strategy is thus vital to take full advantage of the memory architecture. In addition to improving the potential memory bandwidth, partitioning also affects the area complexity of the generated system because the required steering logic depends on the partitioning scheme. This work describes the area implications of a lattice-based memory partitioning technique in the context of high-level synthesis for FPGAs. Experimental results with a commercial HLS tool show that the proposed partitioning technique improves area efficiency compared to alternative approaches.
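
As a minimal sketch of what a partitioning scheme implies for the generated hardware, the fragment below splits a single logical array cyclically across a handful of banks; the bank count and the cyclic scheme are assumptions for illustration, not the lattice-based technique of the paper. The modulo and divide in the address mapping are exactly the steering logic whose area cost the paper quantifies; with a power-of-two bank count they reduce to cheap bit selection.

#include <cstdint>

constexpr int BANKS = 4;      // number of independent on-chip memory banks (assumed)
constexpr int DEPTH = 256;    // words per bank

static float bank[BANKS][DEPTH];

// Cyclic partitioning: logical element i lives in bank (i % BANKS) at offset i / BANKS.
float read_elem(std::uint32_t i) {
    return bank[i % BANKS][i / BANKS];
}

void write_elem(std::uint32_t i, float v) {
    bank[i % BANKS][i / BANKS] = v;
}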

Collaboration


Dive into Samuel Bayliss's collaborations.

Top Co-Authors

David Boland, Imperial College London
Junyi Liu, Imperial College London
Nachiket Kapre, Nanyang Technological University
Alex I. Smith, University of Birmingham
Dan R. Ghica, University of Birmingham