Fabian Nowak

Karlsruhe Institute of Technology

Publication

Featured research published by Fabian Nowak.

intelligent technologies for interactive entertainment | 2005

Telepresence techniques for controlling avatar motion in first person games

Henning Groenda; Fabian Nowak; Patrick Rößler; Uwe D. Hanebeck

First person games are computer games in which the user experiences the virtual game world from an avatar's view. This avatar is the user's alter ego in the game. In this paper, we present a telepresence interface for the first person game Quake III Arena, which gives the user the impression of presence in the game and thus leads to identification with the avatar. This is achieved by tracking the user's motion and using this motion data as control input for the avatar. Because the user wears a head-mounted display and perceives his actions affecting the virtual environment, he is fully immersed in the target environment. Without further processing of the user's motion data, the virtual environment would be limited to the size of the user's real environment, which is not desirable. The use of Motion Compression, however, allows exploring an arbitrarily large virtual environment while the user is actually moving in an environment of limited size.

high performance embedded architectures and compilers | 2012

Seamlessly portable applications: Managing the diversity of modern heterogeneous systems

Mario Kicherer; Fabian Nowak; Rainer Buchty; Wolfgang Karl

Nowadays, many possible configurations of heterogeneous systems exist, posing several new challenges to application development: different types of processing units usually require individual programming models with dedicated runtime systems and accompanying libraries. If these are absent on an end-user system, e.g. because the respective hardware is not present, an application linked against them will break. This hampers the portability of applications that are developed on one system and executed on other, differently configured heterogeneous systems. Moreover, the individual benefit of different processing units is normally not known in advance. In this work, we propose a technique to effectively decouple applications from their accelerator-specific code. These parts are only linked on demand, and thereby an application can be made portable across systems with different accelerators. As there are usually multiple hardware-specific implementations for a certain task, e.g., a CPU and a GPU version, a method is required to determine which are usable at all and which one is most suitable for execution on the current system. With our approach, application and hardware programmers can express the requirements and the abilities of the application and the hardware-specific implementations in a simplified manner. During runtime, the requirements and abilities are compared with regard to the present hardware in order to determine the usable implementations of a task. If multiple implementations are usable, an online-learning history-based selector is employed to determine the most efficient one. We show that our approach chooses the fastest usable implementation dynamically on several systems while introducing only a negligible overhead itself. Applied to an MPI application, our mechanism enables exploitation of local accelerators on different heterogeneous hosts without preliminary knowledge or modification of the application.
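
The selection step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `HistorySelector`, the capability sets, and the "try each candidate once, then pick the lowest mean runtime" policy are all assumptions made for illustration.

```python
from collections import defaultdict


class HistorySelector:
    """Hypothetical selector: filter by capabilities, then use runtime history."""

    def __init__(self, requirements):
        # requirements: dict mapping implementation name -> set of required
        # capabilities (names and capability labels are illustrative only).
        self.requirements = requirements
        self.history = defaultdict(list)  # name -> list of observed runtimes

    def usable(self, available):
        """Return implementations whose requirements the present hardware meets."""
        return [name for name, needs in self.requirements.items()
                if needs <= available]

    def select(self, available):
        """Choose an implementation for the next call."""
        candidates = self.usable(available)
        if not candidates:
            raise RuntimeError("no usable implementation on this system")
        # Run every candidate once before trusting the history.
        for name in candidates:
            if not self.history[name]:
                return name
        # Afterwards, pick the candidate with the lowest mean runtime so far.
        return min(candidates,
                   key=lambda n: sum(self.history[n]) / len(self.history[n]))

    def record(self, name, seconds):
        """Store a measured runtime so later selections can learn from it."""
        self.history[name].append(seconds)
```

A caller would invoke `select` with the set of capabilities detected on the host, run the chosen implementation, and feed the measured time back through `record`.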

International Journal of Parallel Programming | 2008

Performance advantage of reconfigurable cache design on multicore processor systems

Jie Tao; M. Kunze; Fabian Nowak; Rainer Buchty; Wolfgang Karl

With the trend of microprocessor design towards multicore, cache performance becomes more important because off-chip accesses grow increasingly expensive due to the competition across the processor cores. A question arises: how should the cache architecture be designed to prevent a performance bottleneck caused by data accesses? This work studies a reconfigurable cache architecture that can be dynamically configured to meet the individual demands of running applications. Using a self-developed cache simulator, we first examined how different cache organizations and configurations influence the parallel execution of OpenMP applications. The experimental results show that applications benefit from a flexible cache with reconfigurability. This motivated us to go a step further and develop a hardware prototype of this novel architecture.
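
To make the idea of comparing cache configurations concrete, the following is a toy sketch of a set-associative cache with LRU replacement. It is not the simulator used in the paper; the function name `hit_rate`, the LRU policy, and the 64-byte line size are assumptions chosen for illustration.

```python
from collections import OrderedDict


def hit_rate(addresses, num_sets, ways, line_size=64):
    """Replay a byte-address trace through an LRU set-associative cache."""
    if not addresses:
        return 0.0
    # One ordered dict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in addresses:
        line = addr // line_size
        cache_set = sets[line % num_sets]
        tag = line // num_sets
        if tag in cache_set:
            hits += 1
            cache_set.move_to_end(tag)         # mark as most recently used
        else:
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[tag] = True
    return hits / len(addresses)


# Two lines that conflict in a direct-mapped cache of four lines.
trace = [0, 256] * 4
direct_mapped = hit_rate(trace, num_sets=4, ways=1)
two_way = hit_rate(trace, num_sets=2, ways=2)
```

With equal capacity, the two-way configuration keeps both conflicting lines resident while the direct-mapped one evicts on every access, which is the kind of application-dependent difference a reconfigurable cache could exploit.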

Proceedings of the 20th European MPI Users' Group Meeting on | 2013

Multi-parallel prefiltering on the Convey HC-1 for supporting homology detection

Fabian Nowak; Michael Bromberger; Martin Schindewolf; Wolfgang Karl

Gene databases used in research are huge and still growing at a fast pace. Many comparisons are needed when searching these databases for sequences similar (homologous) to a given query sequence. Therefore, highly parallel architectures and high bandwidth are required for processing and transferring massive amounts of data. The Convey HC-1, with four FPGAs and a high memory bandwidth of up to 76.8 GB/s, seems very suitable for supporting this task, as other bioinformatics applications have already been greatly accelerated by the HC-1. We investigate accelerating an application for searching homologous sequences. Limited by FPGA size only, we present a design that calculates 3 prefiltering scores per FPGA concurrently, i.e. 12 calculations in total. This score calculation for database sequences against the query profile is done by a modified Smith-Waterman scheme that is internally parallelized 16*8=128 times, in contrast to the SSE implementation, where only 16-fold parallelism can be exploited and where memory bandwidth is the limiting factor. By preloading the query profile, we are able to transform the memory-bound SSE implementation into a compute-bound FPGA design that is limited only by FPGA size. Despite much lower clock rates, the FPGAs outperform SSE for the calculation of the prefiltering scores by a factor of 4.46. We achieve an application speedup of 1.79 against the original, unmodified state-of-the-art SSE-based implementation because the score calculation accounts for less than 63% of the application runtime.
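
The relation between the kernel speedup (4.46), the runtime share (less than 63%), and the application speedup (1.79) in the abstract above can be checked with Amdahl's law. This reading is an interpretation, not something the abstract states, and the function names below are hypothetical.

```python
def overall_speedup(fraction, kernel_speedup):
    """Amdahl's law: speedup when `fraction` of runtime is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def runtime_fraction(overall, kernel_speedup):
    """Invert Amdahl's law: accelerated fraction implied by an overall speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


# Upper bound if the score calculation took exactly 63% of the runtime.
bound = overall_speedup(0.63, 4.46)      # roughly 1.96

# Fraction implied by the reported application speedup of 1.79.
implied = runtime_fraction(1.79, 4.46)   # roughly 0.57
```

Under this model, the reported 1.79 sits below the bound of roughly 1.96, which fits the statement that the score calculation accounts for less than 63% of the runtime.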

applied reconfigurable computing | 2009

A Seamless Virtualization Approach for Transparent Dynamical Function Mapping Targeting Heterogeneous and Reconfigurable Systems

Rainer Buchty; David Kramer; Fabian Nowak; Wolfgang Karl

Future systems are not only heading towards increased parallelism, but also embrace heterogeneity and reconfigurability. We therefore present an approach targeting comfortable program development and execution, enabling full exploitation of the underlying hardware without burdening the application programmer with the details of the underlying hardware infrastructure. The approach employs lightweight resource virtualization by means of on-demand function resolution. By carefully extending the existing system infrastructure, the approach comes at virtually no cost and with highest compatibility to existing legacy code. The approach is suitable for a wide range of architectures from embedded systems to high-performance computing platforms.

automation, robotics and control systems | 2013

A data-driven approach for executing the CG method on reconfigurable high-performance systems

Fabian Nowak; Ingo Besenfelder; Wolfgang Karl; Mareike Schmidtobreick; Vincent Heuveline

Employing reconfigurable computing systems for numerical applications is an interesting and promising approach toward increased performance. We study the applicability of the Convey HC-1 for numerical applications by decomposing a preconditioned conjugate gradient (CG) method into several independent kernels that can operate concurrently. To allow overlapped execution and to minimize data transfers, we stream the data between the kernel units using a central buffer set. A microprogrammable control unit orchestrates memory accesses, buffer writes/reads and kernel execution, and allows for further algorithms to be executed on the available kernel units. Solving the Poisson problem can thereby be accelerated up to 10 times compared to a single-threaded software version on the HC-1 and up to 1.2 times compared to a 2-socket hex-core Intel Xeon Westmere system with 24 hardware threads for large problem sizes with only a single application engine.
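
For readers unfamiliar with the method, here is a minimal software sketch of a preconditioned CG solver for a 1D Poisson system. It only illustrates the algorithm and the kind of kernels (matrix-vector product, dot product, vector update) it is built from. The Jacobi preconditioner, the 1D discretization, and all function names are assumptions; the abstract does not specify the preconditioner or the kernel decomposition used on the HC-1.

```python
def matvec(x):
    """Apply the 1D Poisson matrix tridiag(-1, 2, -1) to the vector x."""
    n = len(x)
    y = []
    for i in range(n):
        v = 2.0 * x[i]
        if i > 0:
            v -= x[i - 1]
        if i < n - 1:
            v -= x[i + 1]
        y.append(v)
    return y


def dot(a, b):
    """Dot product kernel."""
    return sum(p * q for p, q in zip(a, b))


def pcg(b, tol=1e-10, max_iter=1000):
    """Jacobi-preconditioned CG; returns (solution, iterations used)."""
    x = [0.0] * len(b)
    r = list(b)                    # residual b - A*x for the zero start vector
    z = [ri / 2.0 for ri in r]     # Jacobi step: divide by the diagonal (2)
    p = list(z)
    rz = dot(r, z)
    for it in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, it
        ap = matvec(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [ri / 2.0 for ri in r]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Build a right-hand side from a known solution and solve for it again.
expected = [float(i) for i in range(1, 9)]
solution, iterations = pcg(matvec(expected))
```

Each loop iteration touches the same few vector operations, which is what makes a streaming, kernel-based hardware mapping plausible.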

PARS-Mitteilungen | 2013

Parallel Prefiltering for Accelerating HHblits on the Convey HC-1

Michael Bromberger; Fabian Nowak

HHblits is a bioinformatics application for finding proteins with common ancestors. To achieve more sensitivity, the protein sequences of the query are not compared directly against the database protein sequences, but rather their Hidden Markov Models are compared. Thus, HHblits is very time-consuming and therefore needs to be accelerated. A multi-FPGA system such as the Convey HC-1 is a promising candidate to achieve acceleration. After analyzing the application for acceleration candidates, we present the design and implementation of a parallel coprocessor on the Convey HC-1 to accelerate HHblits. We achieve a speedup of 117.5× against a sequential implementation for FPGA-suitable data sizes per kernel and negligible speedup for the entire uniprot20 protein database against an optimized SSE implementation.

automation, robotics and control systems | 2010

A tightly coupled accelerator infrastructure for exact arithmetics

Fabian Nowak; Rainer Buchty

Processor speed and available computing power constantly increase, enabling computation of more and more complex problems such as numerical simulations of physical processes. In this domain, however, the problem of accuracy arises due to rounding of intermediate results. One solution is to avoid intermediate rounding by using exact arithmetic. The use of FPGAs as application-specific accelerators can speed up such operations compared to their software implementation. In this paper, we present a system approach employing state-of-the-art FPGA and interconnection technology for exact arithmetic with double-precision operands, delivering up to 400M exact MACs/s in total and providing a speedup of up to 88 times over competing software implementations in the case of matrix multiplication.
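
The effect of intermediate rounding on a multiply-accumulate (MAC) chain can be demonstrated in software. The sketch below uses Python's `fractions.Fraction` as a stand-in for an exact accumulator; it is an illustration of the principle only and says nothing about how the FPGA design in the paper represents its accumulator. The function names are hypothetical.

```python
from fractions import Fraction


def naive_dot(a, b):
    """Dot product with a double-precision accumulator (rounds every step)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def exact_dot(a, b):
    """Dot product with an exact rational accumulator, rounded once at the end."""
    acc = Fraction(0)
    for x, y in zip(a, b):
        # Every double is an exact rational, so this MAC introduces no error.
        acc += Fraction(x) * Fraction(y)
    return float(acc)


# The small term is absorbed by the large one in double precision.
a = [1e16, 1.0, -1e16]
b = [1.0, 1.0, 1.0]
print(naive_dot(a, b))   # 0.0
print(exact_dot(a, b))   # 1.0
```

The naive version loses the contribution of the middle term entirely, while the exact accumulation recovers it, which is the accuracy problem exact arithmetic is meant to avoid.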

parallel computing | 2007