Harald Devos | Researchain

Publication

Featured research published by Harald Devos.

high performance embedded architectures and compilers | 2007

Finding and Applying Loop Transformations for Generating Optimized FPGA Implementations

Harald Devos; Kristof Beyls; Mark Christiaens; Jan Van Campenhout; Erik H. D'Hollander; Dirk Stroobandt

When implementing multimedia applications, solutions in dedicated hardware are chosen only when the required performance or energy-efficiency cannot be met with a software solution. The performance of a hardware design critically depends upon having high levels of parallelism and data locality. Often a long sequence of high-level transformations is needed to sufficiently increase the locality and parallelism. The effect of the transformations is known only after translating the high-level code into a specific design at the circuit level. When the constraints are not met, hardware designers need to redo the high-level loop transformations, and repeat all subsequent translation steps, which leads to long design times. We propose a method to reduce design time through the synergistic combination of techniques (a) to quickly pinpoint the loop transformations that increase locality; (b) to refactor loops in a polyhedral model and check whether a sequence of refactorings is legal; (c) to generate efficient structural VHDL from the optimized refactored algorithm. The implementation of these techniques in a tool suite results in a far shorter design time of hours instead of days or weeks. A 2D-inverse discrete wavelet transform was taken as a case study. The results outperform those of a commercial C-to-VHDL compiler, and compare favorably with existing published approaches.
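Step (b) in the abstract above, checking whether a sequence of loop refactorings is legal, can be illustrated with the classic distance-vector test. This is a minimal sketch and not the tool suite's polyhedral implementation: it assumes dependences are already summarized as constant distance vectors, and the names `lex_nonnegative` and `permutation_is_legal` are hypothetical. A loop permutation is accepted only if every permuted dependence vector stays lexicographically non-negative, so that no iteration runs before one it depends on.

```python
def lex_nonnegative(vec):
    """Return True if vec is lexicographically >= 0 (first nonzero entry is positive)."""
    for d in vec:
        if d > 0:
            return True
        if d < 0:
            return False
    return True  # all-zero vector: loop-independent dependence, unaffected by reordering


def permutation_is_legal(distance_vectors, perm):
    """Check a loop permutation against dependence distance vectors.

    distance_vectors: iterable of tuples, one entry per loop level (outermost first).
    perm: tuple where perm[p] is the ORIGINAL loop level placed at new position p.
    (Both conventions are assumptions made for this sketch.)
    """
    for vec in distance_vectors:
        permuted = tuple(vec[p] for p in perm)
        if not lex_nonnegative(permuted):
            return False
    return True


if __name__ == "__main__":
    interchange = (1, 0)  # swap the two loops of a 2-deep nest
    print(permutation_is_legal([(1, 0)], interchange))   # dependence carried by outer loop only
    print(permutation_is_legal([(1, -1)], interchange))  # interchange would reverse this dependence
```

A polyhedral framework generalizes this check to dependences that are not constant distances, which is why the sketch should be read only as an illustration of the legality condition.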

IEEE Transactions on Multimedia | 2007

Scalable, Wavelet-Based Video: From Server to Hardware-Accelerated Client

Hendrik Eeckhaut; Harald Devos; Peter Lambert; Davy De Schrijver; W. Van Lancker; Vincent Nollet; Prabhat Avasare; Tom Clerckx; Fabio Verdicchio; Mark Christiaens; Peter Schelkens; R. Van de Walle; Dirk Stroobandt

Video source, carrier and client diversification have led the video coding community to develop scalable video codecs supporting efficient decoding at varying resolution, frame rate and quality. Scalable video has several advantages over a nonscalable approach, but a large scale deployment is far from trivial and a lot of open questions remain. To resolve these, we developed a complete video delivery chain for scalable wavelet-based video. This includes a video server, a negotiation framework, a video scaling infrastructure and two scalable video clients, one pure software client and one real-time, hardware accelerated client. This paper describes the complete chain and identifies and quantifies the impact of using scalable video in every link of this chain.

design, automation, and test in europe | 2005

A Hardware-Friendly Wavelet Entropy Codec for Scalable Video

Hendrik Eeckhaut; Harald Devos; Benjamin Schrauwen; Mark Christiaens; Dirk Stroobandt

A scalable video codec provides the ability to produce a smaller video stream with reduced frame rate, resolution or image quality starting from the original encoded video stream with almost no additional computation. This is important for portable devices that have different quality of service (QoS) requirements and power restrictions. Conventional video codecs do not possess this property; reduced quality is obtained through the arduous process of decoding the encoded video stream and recoding it at a lower quality. Producing such a smaller stream has therefore a very high computational cost. In this article, we present the results of our investigation into the hardware implementation of such a scalable video codec. In particular, we found that the implementation of the entropy codec is a significant bottleneck. We present an alternative, hardware friendly algorithm for entropy coding with superior data locality (both temporal and spatial), with a smaller memory footprint and superior compression while maintaining all required scalability properties.

international conference / workshop on embedded computer systems: architectures, modeling and simulation | 2004

Reconfigurable Hardware for a Scalable Wavelet Video Decoder and Its Performance Requirements

Dirk Stroobandt; Hendrik Eeckhaut; Harald Devos; Mark Christiaens; Fabio Verdicchio; Peter Schelkens

Multimedia applications emerge on portable devices everywhere. These applications typically have a number of stringent requirements: (i) a high amount of computational power together with real-time performance and (ii) the flexibility to modify the application or the characteristics of the application at will. The performance requirements often drive the design towards a hardware implementation while the flexibility requirement is better served by a software implementation. In this paper we try to reconcile these two requirements by using an FPGA to implement the performance critical parts of a scalable wavelet video decoder. Through analytical means we first explore the performance and resource requirements. We find that modern FPGAs offer enough computational power to obtain real-time performance of the decoder, but that reaching the necessary memory bandwidth will be a challenge during this design.

field programmable gate arrays | 2010

Automatic tool flow for shift-register-LUT reconfiguration: making run-time reconfiguration fast and easy (abstract only)

Brahim Al Farisi; Karel Bruneel; Harald Devos; Dirk Stroobandt

The Shift-Register-LUT (SRL) functionality is a powerful extension of Xilinx FPGA architectures and has been used successfully in many applications. If routing is kept fixed, these SRLs can also be used for run-time reconfiguration. So far, this technique has mainly been used to reconfigure specialized functions. In contrast, we propose a generic tool flow that uses SRLs for fast run-time reconfiguration of general data folding applications. We show that, in such an automatic tool flow, SRL reconfiguration is over two orders of magnitude faster than run-time reconfiguration using the ICAP. It thus makes run-time reconfiguration viable for applications with a more dynamic behaviour. Our generic tool flow is also very easy to use since the designer only has to annotate slowly varying signals in an RTL HDL description, while the tool flow takes care of all the rest.

high performance embedded architectures and compilers | 2011

Constructing application-specific memory hierarchies on FPGAs

Harald Devos; Jan Van Campenhout; Ingrid Verbauwhede; Dirk Stroobandt

The high performance potential of an FPGA is not fully exploited if a design suffers a memory bottleneck. Therefore, a memory hierarchy is needed to reuse data in on-chip buffer memories and minimize the number of accesses to off-chip memory. Buffer memories not only hide the external memory latency, but can also be used to remap data and augment the on-chip bandwidth through parallel access of multiple buffers. This paper discusses the differences and similarities of memory hierarchies on processor- and on FPGA-based systems and presents a step-by-step methodology to construct a memory hierarchy on an FPGA.

applied reconfigurable computing | 2010

Towards a tighter integration of generated and custom-made hardware

Harald Devos; Wim Meeus; Dirk Stroobandt

Most of today’s high-level synthesis tools offer a fixed set of interfaces to communicate with the outer world. A direct integration of custom IP in the datapath would often be more beneficial than an integration using such communication interfaces. If a certain interface protocol is not offered by the tool, either translation blocks (wrappers) are needed or the code should be written at a lower level. The former solution may hurt the performance, while the latter one is often impossible using an untimed high-level description. In this paper interface protocols or sets of IP core accesses are first described at a low level as sets of operations with scheduling information (macros). During the synthesis process, corresponding function calls are mapped to these macros. This facilitates the integration of custom-made hardware and hardware generated by high-level synthesis tools.

symposium/workshop on electronic design, test and applications | 2008

Embedding Smart Buffers for Window Operations in a Stream-Oriented C-to-VHDL Compiler

Fabian Diet; Erik H. D'Hollander; Kristof Beyls; Harald Devos

Important classes of algorithms which can benefit from the advantages of C-to-VHDL compiling are window operations. These execute a number of instructions on a large amount of array data. Since arrays are usually translated into FPGA block memory structures, it is important to minimize the required number of block memory accesses. Recently, a smart buffer has been introduced, in which a number of past and present array elements can be temporarily stored to be reused over a number of different loop nest iterations. In this paper, the smart buffer approach is analysed for use in the stream-oriented Impulse-C compiler. Experimental automatic generation of VHDL code for this buffer is described. The smart buffer is then linked with the VHDL code generated by the Impulse-C compiler to obtain data efficient designs.
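The reuse idea behind a smart buffer can be sketched in software. The example below is an illustration only, not the VHDL buffer or the Impulse-C integration described in the abstract: it assumes a one-dimensional sliding-window sum, and the function names are hypothetical. It counts how often the input array (standing in for block memory) is read with and without a small window buffer.

```python
from collections import deque


def window_sum_naive(data, width):
    """Sum each sliding window by re-reading every element from `data`."""
    reads = 0
    out = []
    for i in range(len(data) - width + 1):
        total = 0
        for j in range(width):
            total += data[i + j]  # one memory read per window element
            reads += 1
        out.append(total)
    return out, reads


def window_sum_buffered(data, width):
    """Same result, but each element is read once and kept in a small buffer."""
    reads = 0
    out = []
    buf = deque(maxlen=width)  # holds the most recent `width` elements
    for value in data:         # each array element is fetched exactly once
        buf.append(value)
        reads += 1
        if len(buf) == width:
            out.append(sum(buf))
    return out, reads


if __name__ == "__main__":
    data = list(range(10))
    naive_out, naive_reads = window_sum_naive(data, 3)
    buf_out, buf_reads = window_sum_buffered(data, 3)
    print(naive_out == buf_out, naive_reads, buf_reads)
```

For an input of length n and window width w, the naive version performs w(n - w + 1) reads while the buffered version performs n, which is the kind of memory-access reduction the abstract refers to.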

digital systems design | 2010

A Parallel for Loop Memory Template for a High Level Synthesis Compiler

Craig Truett Moore; Wim Meeus; Harald Devos; Dirk Stroobandt

We propose a parametrized memory template for applications with parallel for loops. The template's parameters reflect important trade-offs made during system design. The template is incorporated in our high level synthesis (HLS) compiler, where the template's parameters are adjusted to the application. The template fits parallel for loops with no loop dependencies and sequential bodies. We found two alternative template implementations using our compiler. In the future, we will develop templates for other types of for loops. These will be added to the compiler and it will identify the template that works best for the application it is compiling. Once a template is selected, the compiler will use design space exploration to select the best combination of template parameters for the targeted hardware and application.

reconfigurable computing and fpgas | 2008

Loop Transformations to Reduce the Dynamic FPGA Reconfiguration Overhead

Tom Degryse; Karel Bruneel; Harald Devos; Dirk Stroobandt

Dynamic hardware generation reduces the number of FPGA resources needed and speeds up an application by optimizing the FPGA configuration at run-time for the exact problem at hand. Because of the large overhead associated with dynamic hardware generation, it is important to minimize the number of reconfigurations. In this work, we present a technique to maximize the reuse of a configuration by means of loop transformations. Our approach builds on similar work on temporal data locality optimization. Our experiments on a matrix multiplication benchmark show that we can reduce the number of reconfigurations by an order of magnitude, making dynamic hardware generation techniques much more useful in practice. When we combine our approach with a dynamic hardware generation tool with a very low overhead, so called parameterizable configurations, we can obtain a significant speed up over generic counterparts.
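How loop order affects the number of reconfigurations can be illustrated with a small counting model. This is a sketch under stated assumptions and not the authors' experiment: it assumes the hardware is specialized for one coefficient `a[i][k]` at a time, that a reconfiguration is needed whenever the index pair `(i, k)` changes, and the function name `matmul_count_reconfigs` is hypothetical.

```python
def matmul_count_reconfigs(a, b, order):
    """Multiply square matrices and count how often the specialized coefficient changes.

    order: "ijk" (k innermost) or "ikj" (j innermost, after loop interchange).
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    if order == "ijk":
        indices = ((i, j, k) for i in range(n) for j in range(n) for k in range(n))
    elif order == "ikj":
        indices = ((i, j, k) for i in range(n) for k in range(n) for j in range(n))
    else:
        raise ValueError("order must be 'ijk' or 'ikj'")

    reconfigs = 0
    current = None  # index pair the (modelled) hardware is currently specialized for
    for i, j, k in indices:
        if current != (i, k):
            current = (i, k)
            reconfigs += 1  # modelled reconfiguration for a new coefficient
        c[i][j] += a[i][k] * b[k][j]
    return c, reconfigs


if __name__ == "__main__":
    n = 4
    a = [[i + j for j in range(n)] for i in range(n)]
    b = [[i * j + 1 for j in range(n)] for i in range(n)]
    c1, r1 = matmul_count_reconfigs(a, b, "ijk")
    c2, r2 = matmul_count_reconfigs(a, b, "ikj")
    print(c1 == c2, r1, r2)
```

In this model, for n of at least 2, the original order changes the coefficient on every iteration (n^3 reconfigurations), while the interchanged order keeps it fixed across the whole inner loop (n^2 reconfigurations), with the same result matrix.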