Publication


Featured research published by Heiner Giefers.


Application-Specific Systems, Architectures, and Processors | 2014

Analyzing the energy-efficiency of dense linear algebra kernels by power-profiling a hybrid CPU/FPGA system

Heiner Giefers; Raphael Polig; Christoph Hagleitner

It has been shown that FPGA accelerators can outperform pure CPU systems for highly parallel applications, and they are considered a power-efficient alternative to software-programmable processors. However, when using FPGA accelerator cards in a server environment, multiple sources of power consumption have to be taken into account in order to rate the system's energy-efficiency. In this paper, we study the energy-efficiency of a hybrid CPU/FPGA system for a dense linear algebra kernel. We present an FPGA GEMM accelerator architecture that can be tailored to various data types. Its performance and energy consumption are compared against tuned, multi-threaded GEMM functions running on the host CPU. We measure the power consumption with internal current/voltage sensors and break down the power draw to the system's components in order to classify the energy consumed by the processor cores, the memory, the I/O bus system, and the FPGA card. Our experimental results show that the FPGA-accelerated DGEMM is less energy-efficient than a multi-threaded software implementation with respect to the full system's power consumption, but the most efficient choice when only the dynamic parts of the power are factored in.
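
The efficiency conclusion in this abstract hinges on which power figure the FLOP rate is divided by. The sketch below illustrates that accounting choice on a host DGEMM; the timing is real, but the idle and load wattages are hypothetical placeholders, not the paper's sensor readings.

    import time
    import numpy as np

    n = 2048
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    t0 = time.perf_counter()
    c = a @ b                          # DGEMM via the BLAS backing numpy
    elapsed = time.perf_counter() - t0

    flops = 2.0 * n ** 3               # multiply-add count of a square GEMM
    gflops = flops / elapsed / 1e9

    idle_w = 60.0                      # hypothetical static (idle) system power
    load_w = 95.0                      # hypothetical full-system power under load
    print(f"{gflops:.1f} GFLOP/s")
    print(f"full-system:  {gflops / load_w:.2f} GFLOP/s per watt")
    print(f"dynamic-only: {gflops / (load_w - idle_w):.2f} GFLOP/s per watt")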


arXiv: Emerging Technologies | 2018

Mixed-precision in-memory computing

Manuel Le Gallo; Abu Sebastian; Roland Mathis; Matteo Manica; Heiner Giefers; Tomas Tuma; Costas Bekas; Alessandro Curioni; Evangelos Eleftheriou

As complementary metal–oxide–semiconductor (CMOS) scaling reaches its technological limits, a radical departure from traditional von Neumann systems, which involve separate processing and memory units, is needed in order to extend the performance of today's computers substantially. In-memory computing is a promising approach in which nanoscale resistive memory devices, organized in a computational memory unit, are used for both processing and memory. However, to reach the numerical accuracy typically required for data analytics and scientific computing, limitations arising from device variability and non-ideal device characteristics need to be addressed. Here we introduce the concept of mixed-precision in-memory computing, which combines a von Neumann machine with a computational memory unit. In this hybrid system, the computational memory unit performs the bulk of a computational task, while the von Neumann machine implements a backward method to iteratively improve the accuracy of the solution. The system therefore benefits from both the high precision of digital computing and the energy/areal efficiency of in-memory computing. We experimentally demonstrate the efficacy of the approach by accurately solving systems of linear equations, in particular a system of 5,000 equations using 998,752 phase-change memory devices.
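
The scheme described here is a form of iterative refinement: an inexact but cheap solver produces corrections, and a precise residual computation steers them toward the exact solution. A minimal sketch, with the computational memory unit stood in for by a deliberately noisy float32 solve (the 1% noise level is an assumption for illustration, not a device model):

    import numpy as np

    def noisy_lowprec_solve(A, r, rng):
        # Stand-in for the computational memory unit: an inexact solve,
        # emulated here by a float32 solve plus 1% relative noise.
        z = np.linalg.solve(A.astype(np.float32), r.astype(np.float32))
        return z + 0.01 * np.linalg.norm(z) * rng.standard_normal(z.shape)

    def mixed_precision_solve(A, b, iters=50, tol=1e-12, seed=0):
        rng = np.random.default_rng(seed)
        x = np.zeros_like(b)
        for _ in range(iters):
            r = b - A @ x                   # precise residual (von Neumann side)
            if np.linalg.norm(r) < tol * np.linalg.norm(b):
                break
            x = x + noisy_lowprec_solve(A, r, rng)  # inexact correction
        return x

    rng = np.random.default_rng(1)
    A = rng.standard_normal((100, 100)) + 100 * np.eye(100)  # well-conditioned
    b = rng.standard_normal(100)
    x = mixed_precision_solve(A, b)
    print(np.linalg.norm(b - A @ x))        # residual near float64 precision

Because each correction only needs to be roughly right, the error contracts geometrically even though the inner solver is cheap and noisy, which is exactly why the bulk of the work can move into the low-precision unit.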


Field Programmable Logic and Applications | 2014

Compiling text analytics queries to FPGAs

Raphael Polig; Kubilay Atasu; Heiner Giefers; Laura Chiticariu

Extracting information from unstructured text data is a compute-intensive task, and the performance of general-purpose processors cannot keep up with the rapid growth of textual data. We therefore discuss the use of FPGAs to perform large-scale text analytics. We present a framework consisting of a compiler and an operator library capable of generating a Verilog processing pipeline from a text analytics query specified in the annotation query language AQL. The operator library comprises a set of configurable modules capable of performing relational and extraction tasks, which can be assembled by the compiler to represent a full annotation operator graph. Leveraging the nature of text processing, we show that most tasks can be performed in an efficient streaming fashion. We evaluate the performance, power consumption, and hardware utilization of our approach for a set of different queries compiled to a Stratix IV FPGA. Measurements show up to a 79× improvement in document throughput over a 64-threaded software implementation on a POWER7 server. Moreover, the accelerated system's energy efficiency is up to 85× better.
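
The compiled pipeline streams annotations between operators instead of materializing intermediate tables. As a loose software analogue of that operator graph (Python generators instead of Verilog modules, with operator names invented for illustration), extraction and a simple relational join over annotation streams might look like this:

    import re

    def extract(docs, pattern, label):
        # Extraction operator: emit (doc_id, label, span) tuples as a stream.
        rx = re.compile(pattern)
        for doc_id, text in docs:
            for m in rx.finditer(text):
                yield (doc_id, label, m.span())

    def follows(left, right, max_gap):
        # Relational operator: join spans where `right` starts at most
        # max_gap characters after `left` ends, within the same document.
        rights = list(right)                 # small side is buffered
        for doc_id, _, (s1, e1) in left:
            for d2, _, (s2, e2) in rights:
                if d2 == doc_id and 0 <= s2 - e1 <= max_gap:
                    yield (doc_id, (s1, e2))

    docs = [(0, "Dr. Alice Smith visited Zurich."), (1, "Dr. Bob spoke.")]
    titles = extract(docs, r"Dr\.", "title")
    names = extract(docs, r"[A-Z][a-z]+", "cap")
    for hit in follows(titles, names, max_gap=1):
        print(hit)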


ACM SIGARCH Computer Architecture News | 2013

Accelerating finite difference time domain simulations with reconfigurable dataflow computers

Heiner Giefers; Christian Plessl; Jens Förstner

Finite difference methods are widely used, highly parallel algorithms for solving differential equations. However, the algorithms are memory bound and thus difficult to implement efficiently on CPUs or GPUs. In this work, we study the implementation of the finite difference time domain (FDTD) method for solving Maxwell's equations on an FPGA-based Maxeler dataflow computer. We evaluate our work with actual problems from the domain of computational nanophotonics. The use of realistic simulations requires us to pay special attention to boundary conditions (Dirichlet, periodic, absorbing), which are critical for the correctness of results but detrimental to performance and thus frequently neglected. We discuss and evaluate the design of two different FDTD implementations, which outperform CPU and GPU implementations. To our knowledge, our implementation is the fastest FPGA-based FDTD solver.
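
For intuition, the FDTD update is a tiny stencil applied alternately to staggered electric and magnetic fields; nearly all the work is streaming the field arrays through memory, which is why the method is memory bound on CPUs/GPUs and maps well to a dataflow machine. A minimal 1D sketch in normalized units with Dirichlet walls (the paper targets larger 2D/3D nanophotonics problems):

    import numpy as np

    nx, steps = 400, 1000
    ez = np.zeros(nx)        # electric field
    hy = np.zeros(nx - 1)    # magnetic field, staggered half a cell

    for t in range(steps):
        hy += np.diff(ez)                # H update from the curl of E
        ez[1:-1] += np.diff(hy)          # E update from the curl of H
        ez[0] = ez[-1] = 0.0             # Dirichlet (perfectly conducting) walls
        ez[nx // 2] += np.exp(-((t - 30) / 10.0) ** 2)   # soft Gaussian source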


Design, Automation, and Test in Europe | 2015

Accelerating arithmetic kernels with coherent attached FPGA coprocessors

Heiner Giefers; Raphael Polig; Christoph Hagleitner

The energy efficiency of computer systems can be increased by migrating computational kernels that are known to under-utilize the CPU to an FPGA-based coprocessor. In contrast to traditional I/O-based coprocessors that require explicit data movement, coherently attached accelerators can operate on the same virtual address space as the host CPU. A shared memory organization enables widely accepted programming models and helps to deploy energy-efficient accelerators in general-purpose computing systems. In this paper, we study an FPGA-based FFT accelerator attached via the Coherent Accelerator Processor Interface (CAPI) to a POWER8 processor. Our results show that the coherently attached accelerator outperforms device-driver-based approaches in terms of latency. Hardware acceleration delivers a 5× gain in energy efficiency compared to an optimized parallel software FFT running on a 12-core CPU and improves single-thread performance by more than 2×. We conclude that the integration of CAPI into heterogeneous programming frameworks such as OpenCL will facilitate latency-critical operations and further enhance the programmability of hybrid systems.
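
The latency argument is about data movement: a coherently attached accelerator reads the host's memory in place, while a driver-managed device needs explicit copies in and out. The sketch below is only a loose software analogue of that distinction (a thread sharing the caller's address space versus a worker process behind a pipe), not CAPI itself:

    import time
    from multiprocessing import Pipe, Process
    from threading import Thread

    import numpy as np

    def fft_thread(buf, out):
        out.append(np.fft.rfft(buf))      # operates on the caller's array directly

    def fft_worker(conn):
        buf = conn.recv()                 # explicit copy into the worker
        conn.send(np.fft.rfft(buf))       # explicit copy back

    if __name__ == "__main__":
        x = np.random.rand(1 << 22)

        t0 = time.perf_counter()
        out = []
        th = Thread(target=fft_thread, args=(x, out))
        th.start(); th.join()
        print("shared address space:", time.perf_counter() - t0)

        t0 = time.perf_counter()
        parent, child = Pipe()
        p = Process(target=fft_worker, args=(child,))
        p.start()
        parent.send(x)
        _ = parent.recv()
        p.join()
        print("explicit data movement:", time.perf_counter() - t0)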


International Symposium on Performance Analysis of Systems and Software | 2016

Analyzing the energy-efficiency of sparse matrix multiplication on heterogeneous systems: A comparative study of GPU, Xeon Phi and FPGA

Heiner Giefers; Peter W. J. Staar; Costas Bekas; Christoph Hagleitner

Hardware accelerators have evolved into the most prominent vehicle for meeting the demanding performance and energy-efficiency constraints of modern computer systems. The prevalent type of hardware accelerator in the high-performance computing domain is the PCIe-attached co-processor to which the CPU can offload compute-intensive tasks. In this paper, we analyze the performance, power, and energy-efficiency of such accelerators for sparse matrix multiplication kernels. Improving the efficiency of sparse matrix operations is of eminent importance since they work at the core of graph analytics algorithms, which are in turn key to many big-data knowledge-discovery workloads. Our study involves GPU, Xeon Phi, and FPGA co-processors, covering the vast majority of hardware accelerators applied in modern HPC systems. In order to compare the devices at the same level of implementation quality, we apply vendor-optimized libraries for which published results exist. From our experiments we deduce that none of the compared devices generally dominates in terms of energy-efficiency and that the optimal solution depends on the actual sparse matrix data, the data transfer requirements, and the applied efficiency metric. We also show that a combined use of multiple accelerators can further improve the system's performance and efficiency by up to 11% and 18%, respectively.
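
The efficiency metric in such a comparison is the FLOP rate per watt for SpMV, where the useful work is two floating-point operations per stored nonzero. A sketch of that bookkeeping on a random CSR matrix; the wattage is a hypothetical placeholder, since a real comparison requires per-device power measurements:

    import time
    import numpy as np
    from scipy import sparse

    n = 100_000
    A = sparse.random(n, n, density=1e-4, format="csr", dtype=np.float64,
                      random_state=0)
    x = np.random.rand(n)

    reps = 50
    t0 = time.perf_counter()
    for _ in range(reps):
        y = A @ x                          # sparse matrix-vector product (SpMV)
    elapsed = (time.perf_counter() - t0) / reps

    gflops = 2.0 * A.nnz / elapsed / 1e9   # one multiply and one add per nonzero
    watts = 90.0                           # hypothetical average power draw
    print(f"{gflops:.3f} GFLOP/s, {gflops / watts:.4f} GFLOP/s per watt")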


IEEE Transactions on Computers | 2014

An FPGA-Based Reconfigurable Mesh Many-Core

Heiner Giefers; Marco Platzner

The reconfigurable mesh is a parallel model of computation that exploits a massive number of rather simple processing elements connected through a reconfigurable interconnection network. During the last decades, the model received strong interest and many researchers have devised algorithms for it. However, most of this work focuses on theoretical aspects; due to some idealistic modeling assumptions, only a few attempts have been made to implement the model and to study the practical use of reconfigurable meshes. In this paper, we leverage the reconfigurable mesh model to study potential architectures and programming models for future many-cores. We design a reconfigurable mesh in the form of a scalable soft-core array with a reconfigurable interconnect and implement it on FPGA technology in order to create a prototype platform. We present an overall hardware/software tool flow for generating and programming reconfigurable mesh prototypes. The new language ARMLang and a corresponding compiler facilitate the programming of the massively parallel processor arrays. To our knowledge, this work is the first practical study of word-level reconfigurable meshes. To analyze the performance of our implementation, we study four algorithmic kernels from the application domains of arithmetic, sorting, graph algorithms, and imaging. For each kernel, we devise a reconfigurable mesh program in ARMLang, compile it to our soft-core array, and measure its runtime depending on the mesh size. We then compare the runtimes to two sequential implementations of the algorithms, executed on two single-core systems. The results show that many-cores leveraging the reconfigurable mesh model can efficiently use a vast number of processing elements and that, for the chosen algorithms, they come close to optimally parallelized programs.
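
A reconfigurable mesh PE computes by locally fusing or splitting bus segments, so information can cross the whole array in a single model step. Below is a small Python simulation of one classic one-row primitive, finding the leftmost PE that holds a 1; the paper programs such algorithms in ARMLang on the FPGA prototype, and this sequential simulation is only illustrative:

    def rmesh_leftmost_one(bits):
        # Each PE holding 1 splits the row bus between its ports and writes
        # a token on its right port; tokens propagate right through fused
        # (0-holding) PEs. The unique PE that holds 1 but reads nothing on
        # its left port is the leftmost one -- a constant-time step in the
        # model, simulated here with a single left-to-right sweep.
        n = len(bits)
        token_on_left_port = [False] * n
        carrying = False                  # token travelling along the bus
        for i in range(n):
            token_on_left_port[i] = carrying
            if bits[i] == 1:
                carrying = True           # split the bus, write a fresh token
            # 0-PEs fuse their ports, so the token passes through unchanged
        for i in range(n):
            if bits[i] == 1 and not token_on_left_port[i]:
                return i
        return None

    print(rmesh_leftmost_one([0, 0, 1, 0, 1, 1]))   # -> 2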


International Parallel and Distributed Processing Symposium | 2016

Stochastic Matrix-Function Estimators: Scalable Big-Data Kernels with High Performance

Peter W. J. Staar; Panagiotis Kl. Barkoutsos; Roxana Istrate; A. Cristiano I. Malossi; Ivano Tavernelli; Nikolaj Moll; Heiner Giefers; Christoph Hagleitner; Costas Bekas; Alessandro Curioni

In this era of Big Data, large graphs appear in many scientific domains. To extract the hidden knowledge and correlations in these graphs, novel methods need to be developed to analyse them quickly. In this paper, we present a unified framework of stochastic matrix-function estimators, which allows one to compute a subset of the elements of the matrix f(A), where f is an arbitrary function and A is the adjacency matrix of the graph. The new framework has a computational cost proportional to the size of the subset: to obtain the diagonal of f(A) for a matrix of size N, the computational cost is proportional to N, in contrast to the O(N^3) cost of traditional diagonalization. Furthermore, we show that the new framework allows implementations of the algorithm that scale naturally with the number of compute nodes and are easily ported to accelerators, where the kernels perform very well.
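
For the special case f = exp and the diagonal of f(A), a stochastic estimator of this kind reduces to averaging v ⊙ (exp(A)v) over random ±1 probe vectors, so the cost is a number of matrix-vector products rather than a diagonalization. A minimal sketch of that special case (the paper's framework is more general, covering arbitrary f and arbitrary subsets of elements):

    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import expm_multiply

    def stochastic_diag_expm(A, num_samples=100, seed=0):
        # Hutchinson-style estimate of diag(exp(A)): for Rademacher vectors
        # v, E[v * (exp(A) v)] equals the diagonal of exp(A).
        rng = np.random.default_rng(seed)
        acc = np.zeros(A.shape[0])
        for _ in range(num_samples):
            v = rng.choice([-1.0, 1.0], size=A.shape[0])
            acc += v * expm_multiply(A, v)   # one matrix-function-vector product
        return acc / num_samples

    A = sparse.random(500, 500, density=0.01, format="csr", random_state=1)
    A = (A + A.T) / 2                        # symmetric adjacency-like matrix
    est = stochastic_diag_expm(A)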


Field Programmable Logic and Applications | 2016

Energy-efficient stochastic matrix function estimator for graph analytics on FPGA

Heiner Giefers; Peter W. J. Staar; Raphael Polig

Big Data applications require efficient processing of large graphs to unveil information that is hidden in the structural relationships among objects. In order to cope with the growing complexity of data sets, many graph algorithms can be expressed in terms of linear algebra operations, for which highly efficient implementations exist. In this paper, we present an FPGA implementation of a stochastic matrix-function estimator, a powerful framework for the statistical approximation of general matrix functions. We apply the accelerator to the subgraph centrality method for ranking nodes in complex networks. Performance and energy consumption results are based on actual measurements of a POWER8 hybrid compute platform. A single FPGA co-processor improves the runtime by more than 50% compared to multi-threaded software while delivering the same estimation quality. In terms of energy consumption, the FPGA outperforms CPU and GPU solutions by factors of 13× and 3×, respectively. Our results show that FPGA co-processors can provide significant gains for graph analytics applications and are a promising solution for energy-efficient computing in the data center.
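
Subgraph centrality ranks node i by [exp(A)]_ii, so the estimator sketched after the previous abstract applies directly. A hypothetical usage example on a random stand-in graph (not data from the paper), reusing stochastic_diag_expm from the sketch above:

    import numpy as np
    from scipy import sparse

    A = sparse.random(1000, 1000, density=0.005, format="csr", random_state=3)
    A = ((A + A.T) > 0).astype(float)        # symmetric 0/1 adjacency matrix
    centrality = stochastic_diag_expm(A)     # estimated diag(exp(A))
    top10 = np.argsort(centrality)[::-1][:10]
    print("ten most central nodes:", top10)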


Application-Specific Systems, Architectures, and Processors | 2015

A soft-core processor array for relational operators

Raphael Polig; Heiner Giefers; Walter Stechele

Despite the performance and power-efficiency gains achieved by FPGAs for text analytics queries, analysis shows a low utilization of the custom hardware operator modules. Furthermore, long synthesis times limit the accelerator's use in enterprise systems to static queries. To overcome these limitations, we propose the use of an overlay architecture to share area resources among multiple operators and to reduce compilation times. In this paper, we present a novel soft-core architecture tailored to efficiently perform the relational operations of text analytics queries on multiple virtual streams. It combines efficient streaming-based operation with the flexibility of an instruction-programmable core. It is used as a processing element in an array of cores to execute large query graphs and has access to shared co-processors for string- and context-based operations. We evaluate the core architecture in terms of area and performance compared to the custom hardware modules, and show how a minimum number of cores can be calculated to avoid stalling the document processing.
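
The overlay idea replaces per-query synthesized operators with a fixed array of programmable cores, so a new query only needs a new instruction stream instead of a new bitstream. A toy software model of such a core follows; the opcodes and their semantics are invented for illustration, and the paper's core is a hardware overlay, not an interpreter:

    def run_core(program, streams):
        # Execute a small program of relational operations over virtual
        # tuple streams of (doc_id, annotation) pairs.
        regs = dict(streams)               # input streams become registers
        for instr in program:
            op, dst, a, b = instr
            if op == "SELECT":             # filter a stream by a predicate
                regs[dst] = [t for t in regs[a] if b(t)]
            elif op == "JOIN":             # join two streams on document id
                regs[dst] = [(x, y) for x in regs[a] for y in regs[b]
                             if x[0] == y[0]]
        return regs[program[-1][1]]

    streams = {
        "persons": [(0, "Alice"), (1, "Bob")],
        "cities":  [(0, "Zurich"), (2, "Basel")],
    }
    program = [
        ("SELECT", "p", "persons", lambda t: t[0] == 0),   # keep doc 0 only
        ("JOIN", "out", "p", "cities"),
    ]
    print(run_core(program, streams))      # -> [((0, 'Alice'), (0, 'Zurich'))]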
