Is this you? Create Your Porfile

Oliver Reiche

University of Erlangen-Nuremberg

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Oliver Reiche is active.

Explore More

Publication

Featured researches published by Oliver Reiche.

IEEE Transactions on Parallel and Distributed Systems | 2016

HIPA cc : A Domain-Specific Language and Compiler for Image Processing

Richard Membarth; Oliver Reiche; Frank Hannig; Jürgen Teich; Mario Körner; Wieland Eckert

Domain-specific languages (DSLs) provide high-level and domain-specific abstractions that allow expressive and concise algorithm descriptions. Since the description in a DSL hides also the properties of the target hardware, DSLs are a promising path to target different parallel and heterogeneous hardware from the same algorithm description. In theory, the DSL description can capture all characteristics of the algorithm that are required to generate highly efficient parallel implementations. However, most frameworks do not make use of this knowledge and the performance cannot reach that of optimized library implementations. In this article, we present the HIPAcc framework, a DSL and source-to-source compiler for image processing. We show that domain knowledge can be captured in the language and that this knowledge enables us to generate tailored implementations for a given target architecture. Back ends for CUDA, OpenCL, and Renderscript allow us to target discrete graphics processing units (GPUs) as well as mobile, embedded GPUs. Exploiting the captured domain knowledge, we can generate specialized algorithm variants that reach the maximal achievable performance due to the peak memory bandwidth. These implementations outperform state-of-the-art domain-specific languages and libraries significantly.

international conference on hardware/software codesign and system synthesis | 2014

Code generation from a domain-specific language for C-based HLS of hardware accelerators

Oliver Reiche; Moritz Schmid; Frank Hannig; Richard Membarth; Jürgen Teich

As todays computer architectures are becoming more and more heterogeneous, a plethora of options including CPUs, GPUs, DSPs, reconfigurable logic (FPGAs), and other application-specific processors come into consideration for close-to-sensor processing. Especially, in the domain of image processing on mobile devices, among numerous design challenges, a very stringent energy budget is of utmost importance, making embedded GPUs and FPGAs ideal targets for implementation. Recently, the HIPAcc framework was proposed as a means for automatic code generation of image processing algorithms for embedded GPUs, based on a Domain-Specific Language (DSL). Despite of huge advancements in High-Level Synthesis (HLS) for FPGAs, designers are still required to have detailed knowledge about coding techniques and the targeted architecture to achieve efficient solutions. As a remedy, in this work, we propose code generation techniques for C-based HLS from a common high-level DSL description targeting FPGAs. Our approach includes FPGA-specific memory architectures for handling point and local operators, numerous high-level transformations, and automatic test bench generation. We evaluate our approach by comparing the resulting hardware accelerators to existing frameworks in terms of performance and resource requirements. Moreover, we assess the achieved energy efficiency in contrast to software implementations, generated by HIPAcc from the same code base, executed on GPUs.

design, automation, and test in europe | 2014

Code generation for embedded heterogeneous architectures on android

Richard Membarth; Oliver Reiche; Frank Hannig; Jürgen Teich

The success of Android is based on its unified Java programming model that allows to write platform-independent programs for a variety of different target platforms. However, this comes at the cost of performance. As a consequence, Google introduced APIs that allow to write native applications and to exploit multiple cores as well as embedded GPUs for compute-intensive parts. This paper proposes code generation techniques in order to target the Renderscript and Filterscript APIs. Renderscript harnesses multi-core CPUs and unified shader GPUs, while the more restricted Filterscript also supports GPUs with earlier shader models. Our techniques focus on image processing applications and allow to target these APIs and OpenCL from a common description. We further supersede memory transfers by sharing the same memory region among different processing elements on HSA platforms. As reference, we use an embedded platform hosting a multi-core ARM CPU and an ARM Mali GPU. We show that our generated source code is faster than native implementations in OpenCV as well as the pre-implemented script intrinsics provided by Google for acceleration on the embedded GPU.

Journal of Parallel and Distributed Computing | 2014

Towards a performance-portable description of geometric multigrid algorithms using a domain-specific language

Richard Membarth; Oliver Reiche; Christian Schmitt; Frank Hannig; Jürgen Teich; Markus Stürmer; Harald Köstler

Abstract High Performance Computing (HPC) systems are nowadays more and more heterogeneous. Different processor types can be found on a single node including accelerators such as Graphics Processing Units (GPUs). To cope with the challenge of programming such complex systems, this work presents a domain-specific approach to automatically generate code tailored to different processor types. Low-level CUDA and OpenCL code is generated from a high-level description of an algorithm specified in a Domain-Specific Language (DSL) instead of writing hand-tuned code for GPU accelerators. The DSL is part of the Heterogeneous Image Processing Acceleration ( HIPA cc ) framework and was extended in this work to handle grid hierarchies in order to model different cycle types. Language constructs are introduced to process and represent data at different resolutions. This allows to describe image processing algorithms that work on image pyramids as well as multigrid methods in the stencil domain. By decoupling the algorithm from its schedule, the proposed approach allows to generate efficient stencil code implementations. Our results show that similar performance compared to hand-tuned codes can be achieved.

field programmable logic and applications | 2016

FPGA-based accelerator design from a domain-specific language

M. Akif Ozkan; Oliver Reiche; Frank Hannig; Jürgen Teich

A large portion of image processing applications often come with stringent requirements regarding performance, energy efficiency, and power. FPGAs have proven to be among the most suitable architectures for algorithms that can be processed in a streaming pipeline. Yet, designing imaging systems for FPGAs remains a very time consuming task. High-Level Synthesis, which has significantly improved due to recent advancements, promises to overcome this obstacle. In particular, Altera OpenCL is a handy solution for employing an FPGA in a heterogeneous system as it covers all device communication. However, to obtain efficient hardware implementations, extreme code modifications, contradicting OpenCLs data-parallel programming paradigm, are necessary. In this work, we explore the programming methodology that yields significantly better hardware implementations for the Altera Offline Compiler. We furthermore designed a compiler back end for a domain-specific source-to-source compiler to leverage the algorithm description to a higher level and generate highly optimized OpenCL code. Moreover, we advanced the compiler to support arbitrary bit width operations, which are fundamental to hardware designs. We evaluate our approach by discussing the resulting implementations throughout an extensive application set and comparing them with example designs, provided by Altera. In addition, as we can derive multiple implementations for completely different target platforms from the same domain-specific language source code, we present a comparison of the achieved implementations in contrast to GPU implementations.

application-specific systems, architectures, and processors | 2015

Loop coarsening in C-based High-Level Synthesis

Moritz Schmid; Oliver Reiche; Frank Hannig; Jürgen Teich

Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP), the support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is in contrast very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines, consisting of point and local operators. In addition to well known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows to process multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc by loop coarsening and compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from the exact same code base. Moreover, we demonstrate the advantages of code generation for algorithm development by outlining how design space exploration enabled by HIPAcc can yield a more efficient implementation than hand-coded VHDL.

Journal of Systems Architecture | 2015

Synthesis and optimization of image processing accelerators using domain knowledge

Oliver Reiche; Konrad Häublein; Marc Reichenbach; Moritz Schmid; Frank Hannig; Jürgen Teich; Dietmar Fey

In the domain of image processing, often real-time constraints are required. In particular, in safety-critical applications, timing is of utmost importance. A common approach to maintain real-time capabilities is to offload computations to dedicated hardware accelerators, such as Field Programmable Gate Arrays (FPGAs). Designing such architectures is per se already a challenging task, but finding the right design point between achieving as much throughput as necessary while spending as few resources as possible is an even bigger challenge.To address this design challenge in the domain of image processing, several approaches have been presented that introduce an additional layer of abstraction between the developer and the actual target hardware. One approach is to use a Domain-Specific Language (DSL) to generate highly optimized code for synthesis by general purpose High-Level Synthesis (HLS) frameworks. Another approach is to instantiate a generic VHDL IP-Core library for local imaging operators. Elevating the description of image algorithms to such a higher abstraction level can significantly reduce the complexity for designing hardware accelerators targeting FPGAs. We provide a comparison of results for both approaches, a non-expert algorithm developer can achieve. Furthermore, we present an automatic optimization process to give the algorithm developer even more control over trading execution time for resource usage, that could be applied on top of both approaches. To evaluate our optimization procedure, we compare the resulting FPGA accelerators to highly optimized Graphics Processing Unit (GPU) implementations of several image filters relevant for close-to-sensor image and video processing with stringent real-time constraints, such as in the automotive domain.

signal processing systems | 2018

Loop Parallelization Techniques for FPGA Accelerator Synthesis

Oliver Reiche; M. Akif Ozkan; Frank Hannig; Jürgen Teich; Moritz Schmid

Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). The support for Data-Level Parallelism (DLP), one of the key advantages of Field programmable Gate Arrays (FPGAs), is in contrast very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows processing multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL. Furthermore, we present a comparison of our implementations to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPU), all generated from exactly the same code base.

languages, compilers, and tools for embedded systems | 2017

Auto-vectorization for image processing DSLs

Oliver Reiche; Christof Kobylko; Frank Hannig; Jürgen Teich

The parallelization of programs and distributing their workloads to multiple threads can be a challenging task. In addition to multi-threading, harnessing vector units in CPUs proves highly desirable. However, employing vector units to speed up programs can be quite tedious. Either a program developer solely relies on the auto-vectorization capabilities of the compiler or he manually applies vector intrinsics, which is extremely error-prone, difficult to maintain, and not portable at all. Based on whole-function vectorization, a method to replace control flow with data flow, we propose auto-vectorization techniques for image processing DSLs in the context of source-to-source compilation. The approach does not require the input to be available in SSA form. Moreover, we formulate constraints under which the vectorization analysis and code transformations may be greatly simplified in the context of image processing DSLs. As part of our methodology, we present control flow to data flow transformation as a source-to-source translation. Moreover, we propose a method to efficiently analyze algorithms with mixed bit-width data types to determine the optimal SIMD width, independently of the target instruction set. The techniques are integrated into an open source DSL framework. Subsequently, the vectorization capabilities are compared to a variety of existing state-of-the-art C/C++ compilers. A geometric mean speedup of up to 3.14 is observed for benchmarks taken from ISPC and image processing, compared to non-vectorized executions.

application-specific systems, architectures, and processors | 2017

Hardware design and analysis of efficient loop coarsening and border handling for image processing

M. Akif Ozkan; Oliver Reiche; Frank Hannig; Jürgen Teich

Field Programmable Gate Arrays (FPGAs) excel at the implementation of local operators in terms of throughput per energy since the off-chip communication can be reduced with an application-specific on-chip memory configuration. Furthermore, data-level parallelism can efficiently be exploited through socalled loop coarsening, which processes multiple horizontal pixels simultaneously. Moreover, existing solutions for proper border handling in hardware show considerable resource overheads. In this paper, we first propose novel architectures for image border handling and loop coarsening, which can significantly reduce area. Second, we present a systematic analysis of these architectures including the formulation of analytical models for their area usage. Based on these models, we provide an algorithm for suggesting the most efficient hardware architecture for a given specification. Finally, we evaluate several implementations of our proposed architectures obtained through Vivado High-Level Synthesis (HLS). The synthesis results show that the proposed coarsening architecture uses 32% less registers for a 5-by-5 convolution with a 64 coarsening factor compared to previous works, whereas the proposed border handling architectures facilitate a decrease in the Look-up Table (LUT) usage by 36 %.

Explore More