Publications


Featured research published by Felix Winterstein.


Field-Programmable Technology | 2013

High-level synthesis of dynamic data structures: A case study using Vivado HLS

Felix Winterstein; Samuel Bayliss; George A. Constantinides

High-level synthesis promises a significant shortening of the FPGA design cycle when compared with design entry using register transfer level (RTL) languages. Recent evaluations report that C-to-RTL flows can produce results with a quality close to hand-crafted designs [1]. However, algorithms that use dynamic, pointer-based data structures, although common in software, remain difficult to implement well. In this paper, we describe a comparative case study using Xilinx Vivado HLS as an exemplary state-of-the-art high-level synthesis tool. Our test cases are two alternative algorithms for the same compute-intensive machine learning technique (clustering) with significantly different computational properties. We compare a data-flow-centric implementation to a recursive tree traversal implementation which incorporates complex data-dependent control flow and makes use of pointer-linked data structures and dynamic memory allocation. The outcome of this case study is twofold: We confirm similar performance between the hand-written and automatically generated RTL designs for the first test case. The second case reveals a degradation in latency by a factor greater than 30× if the source code is not altered prior to high-level synthesis. We identify the reasons for this shortcoming and present code transformations that narrow the performance gap to a factor of four. We generalise our source-to-source transformations, whose automation motivates future research directions for improving high-level synthesis of dynamic data structures.


Field-Programmable Logic and Applications | 2013

FPGA-based K-means clustering using tree-based data structures

Felix Winterstein; Samuel Bayliss; George A. Constantinides

K-means clustering is a popular technique for partitioning a data set into subsets of similar features. Due to their simple control flow and inherent fine-grain parallelism, K-means algorithms are well suited for hardware implementations, such as on field programmable gate arrays (FPGAs), to accelerate the computationally intensive calculation. However, the available hardware resources in massively parallel implementations are easily exhausted for large problem sizes. This paper presents an FPGA implementation of an efficient variant of K-means clustering which prunes the search space using a binary kd-tree data structure to reduce the computational burden. Our implementation uses on-chip dynamic memory allocation to ensure efficient use of memory resources. We describe the trade-off between data-level parallelism and search space reduction at the expense of increased control overhead. A data-sensitive analysis shows that our approach requires up to five times fewer computational FPGA resources than a conventional massively parallel implementation for the same throughput constraint.


Field-Programmable Gate Arrays | 2015

MATCHUP: Memory Abstractions for Heap Manipulating Programs

Felix Winterstein; Kermin Fleming; Hsin-Jung Yang; Samuel Bayliss; George A. Constantinides

Memory-intensive implementations often require access to an external, off-chip memory which can substantially slow down an FPGA accelerator due to memory bandwidth limitations. Buffering frequently reused data on chip is a common approach to address this problem and the optimization of the cache architecture introduces yet another complex design space. This paper presents a high-level synthesis (HLS) design aid that generates parallel application-specific multi-scratchpad architectures including on-chip caches. Our program analysis identifies non-overlapping memory regions, supported by private scratchpads, and regions which are shared by parallel units after parallelization and which are supported by coherent scratchpads and synchronization primitives. It also decides whether the parallelization is legal with respect to data dependencies. The novelty of this work is the focus on programs using dynamic, pointer-based data structures and dynamic memory allocation which, while common in software engineering, remain difficult to analyze and are beyond the scope of the overwhelming majority of HLS techniques to date. We demonstrate our technique with three case studies of applications using dynamically allocated data structures and use Xilinx Vivado HLS as an exemplary HLS tool. We show up to 10× speed-up after parallelization of the HLS implementations and the insertion of the application-specific distributed hybrid scratchpad architecture.


Field-Programmable Gate Arrays | 2016

LMC: Automatic Resource-Aware Program-Optimized Memory Partitioning

Hsin-Jung Yang; Kermin Fleming; Michael Adler; Felix Winterstein; Joel S. Emer

As FPGAs have grown in size and capacity, FPGA memory systems have become both richer and more diverse in order to support the increased computational capacity of FPGA fabrics. Using these resources, and using them well, has become commensurately more difficult, especially in the context of legacy designs ported from smaller, simpler FPGA systems. This growing complexity necessitates resource-aware compilers that can make good use of memory resources on behalf of the programmer. In this work, we introduce the LEAP Memory Compiler (LMC), which can synthesize application-optimized cache networks for systems with multiple memory resources, enabling user programs to automatically take advantage of the expanded memory capabilities of modern FPGA systems. In our experiments, the optimized cache network achieves up to 49% performance gains for throughput-oriented applications and 15% performance gains for latency-oriented applications, while increasing design area by less than 6% of the total chip area.


ACM Transactions on Reconfigurable Technology and Systems | 2016

Separation Logic for High-Level Synthesis

Felix Winterstein; Samuel Bayliss; George A. Constantinides

High-Level Synthesis (HLS) promises a significant shortening of the FPGA design cycle by raising the abstraction level of the design entry to high-level languages such as C/C++. However, applications using dynamic, pointer-based data structures and dynamic memory allocation remain difficult to implement well, yet such constructs are widely used in software. Automated optimizations that leverage the memory bandwidth of FPGAs by distributing the application data over separate banks of on-chip memory are often ineffective in the presence of dynamic data structures due to the lack of an automated analysis of pointer-based memory accesses. In this work, we take a step toward closing this gap. We present a static analysis for pointer-manipulating programs that automatically splits heap-allocated data structures into disjoint, independent regions. The analysis leverages recent advances in separation logic, a theoretical framework for reasoning about heap-allocated data that has been successfully applied in recent software verification tools. Our algorithm focuses on dynamic data structures accessed in loops and is accompanied by automated source-to-source transformations that enable automatic loop parallelization and memory partitioning by off-the-shelf HLS tools. We demonstrate the successful loop parallelization and memory partitioning by our tool flow using three real-life applications that build, traverse, update, and dispose of dynamically allocated data structures. Our case studies, comparing the automatically parallelized to the direct HLS implementations, show an average latency reduction by a factor of 2× across our benchmarks.


Field-Programmable Technology | 2015

Custom-sized caches in application-specific memory hierarchies

Felix Winterstein; Kermin Fleming; Hsin-Jung Yang; John Wickerson; George A. Constantinides

Developing FPGA implementations with an input specification in a high-level programming language such as C/C++ or OpenCL allows for a substantially shortened design cycle compared to a design entry at register transfer level. This work targets high-level synthesis (HLS) implementations that process large amounts of data and therefore require access to an off-chip memory. We leverage the customizability of the FPGA on-chip memory to automatically construct a multi-cache architecture in order to enhance the performance of the interface between parallel functional units of the HLS core and an external memory. Our focus is on automatic cache sizing. Firstly, our technique identifies unused, left-over block RAM resources and allocates them to the construction of on-chip caches. Secondly, we devise a high-level cache performance estimate based on the program's memory access trace. We use this trace to find a heterogeneous configuration of cache sizes, tailored to the application's memory access characteristics, that maximizes the performance of the multi-cache system subject to an on-chip memory resource constraint. We evaluate our technique with three benchmark implementations on an FPGA board and obtain a reduction in execution latency of up to 2× (1.5× on average) when compared to a one-size-fits-all cache sizing. We also quantify the impact of our automatically generated cache system on the overall energy consumption of the implementation.


Field-Programmable Logic and Applications | 2015

Scavenger: Automating the Construction of Application-Optimized Memory Hierarchies

Hsin-Jung Yang; Kermin Fleming; Felix Winterstein; Michael Adler; Joel S. Emer

High-level abstractions separate algorithm design from platform implementation, allowing programmers to focus on algorithms while building increasingly complex systems. This separation also provides system programmers and compilers an opportunity to optimize platform services for each application. In FPGAs, this platform-level malleability extends to the memory system: unlike general-purpose processors, in which memory hardware is fixed at design time, the capacity, associativity, and topology of FPGA memory systems may all be tuned to improve application performance. Since application kernels often use few memory resources, substantial memory capacity may be available to the platform for use on behalf of the user program. In this work, we perform an initial exploration of methods for automating the construction of these application-specific memory hierarchies. Although exploiting spare resources can be beneficial, naïvely consuming all memory resources may cause frequency degradation. To relieve timing pressure in large BRAM structures, we provide microarchitectural techniques to trade memory latency for design frequency. We demonstrate, by examining both hand-assembled and HLS-compiled benchmarks, that our application-optimized memory system can improve pre-existing application runtime by 25% on average.


Field-Programmable Custom Computing Machines | 2013

Accelerating De Bruijn Graph-Based Genome Assembly for High-Throughput Short Read Data

Eduardo Aguilar Pelaez; Samuel Bayliss; Alex I. Smith; Felix Winterstein; Dan R. Ghica; David B. Thomas; George A. Constantinides

Emerging next-generation sequencing technologies have opened up exciting new opportunities for genome sequencing by generating read data with a massive throughput. However, the generated reads are significantly shorter compared to the traditional Sanger shotgun sequencing method. This poses challenges for de novo assembly algorithms in terms of both accuracy and efficiency. Due to the continuing explosive growth of short read databases, there is a high demand to accelerate the often repeated, long-runtime assembly task. In this paper, we present a scalable parallel algorithm to accelerate the de Bruijn graph-based genome assembly for high-throughput short read data. This work demonstrates the capabilities of a high-level synthesis tool-chain that allows the compilation of higher order functional programs to gate-level hardware descriptions. Higher order programming allows functions to take functions as parameters. In a hardware context, the latency-insensitive interfaces generated between compiled modules enable late-binding with libraries of pre-existing functions at the place-and-route compilation stage. We demonstrate the completeness and utility of our approach using a case study, a recursive k-means clustering algorithm. The algorithm features complex data-dependent control flow and opportunities to exploit both coarse and fine-grained parallelism.


Field-Programmable Gate Arrays | 2017

Automatic Construction of Program-Optimized FPGA Memory Networks

Hsin-Jung Yang; Kermin Fleming; Felix Winterstein; Annie I. Chen; Michael Adler; Joel S. Emer

Memory systems play a key role in the performance of FPGA applications. As FPGA deployments move towards design entry points that are more serial, memory latency has become a serious design consideration. For these applications, memory network optimization is essential in improving performance. In this paper, we examine the automatic, program-optimized construction of low-latency memory networks. We design a feedback-driven network compiler, which constructs an optimized memory network based on the target program's memory access behavior, measured via a newly designed network profiler. In our test applications, the compiler-optimized networks provide a 45% performance gain on average over baseline memory networks by minimizing the impact of network latency on program performance.


Book Chapter | 2017

High-Level Synthesis of Dynamic Data Structures

Felix Winterstein

High-level synthesis promises significant shortening of the design cycle compared to a design entry at RTL. However, many high-level synthesis implementations require extensive code alterations to ensure synthesisability and to achieve a quality of results comparable to handwritten RTL designs. Such alterations are especially important for programs with ‘irregular control flow’ and ‘complicated data dependencies’. In this chapter, we describe these terms in detail and elaborate on their implications for efficient high-level synthesis within the scope of a case study.

Collaboration


An overview of Felix Winterstein's collaborations.

Top Co-Authors

Hsin-Jung Yang
Massachusetts Institute of Technology

Joel S. Emer
Massachusetts Institute of Technology

Alex I. Smith
University of Birmingham

Dan R. Ghica
University of Birmingham