Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Frank Feinbube is active.

Publication


Featured research published by Frank Feinbube.


International Symposium on Parallel and Distributed Computing | 2010

NQueens on CUDA: Optimization Issues

Frank Feinbube; Bernhard Rabe; Martin von Löwis; Andreas Polze

Today's commercial off-the-shelf computer systems are multicore computing systems that combine CPUs, graphics processors (GPUs), and custom devices. In comparison with CPU cores, graphics cards are capable of executing hundreds to thousands of compute units in parallel. To benefit from these GPU computing resources, applications have to be parallelized and adapted to the target architecture. In this paper we share our experience in implementing a solver for the NQueens puzzle on GPUs using Nvidia's CUDA (Compute Unified Device Architecture) technology. Using the example of memory usage and memory access, we demonstrate that optimizations of CUDA programs may have contrary effects on different CUDA architectures. Evaluation results point out that it is not sufficient to use new programming languages or compilers to achieve the best results with emerging graphics card computing.


Pervasive Technologies Related to Assistive Environments | 2008

Predictable interactive control of experiments in a service-based remote laboratory

Andreas Rasche; Frank Feinbube; Peter Tröger; Bernhard Rabe; Andreas Polze

Remote and virtual laboratories are commonly used in electronic engineering and computer science to provide hands-on experience for students. Web services have lately emerged as standardized interfaces to remote laboratory experiments and simulators. One drawback of direct Web service interfaces to experiments is that the connected hardware could be damaged due to missed deadlines of the remotely executed control applications. In this paper, we suggest an architecture for predictable and interactive control of remote laboratory experiments accessed over Web service protocols. We present this concept as an extension of our existing Distributed Control Lab infrastructure. Using our architecture, students can remotely conduct complex control experiments on physical installations without harming the hardware.
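The core predictability idea can be sketched as a deadline guard between the Web service front end and the hardware. The names below are purely illustrative, not the Distributed Control Lab API: a remote command drives the hardware only if it arrives within its deadline; otherwise a local safety controller takes over.

```cpp
#include <chrono>

// Hypothetical sketch: commands arriving over the Web service interface are
// admitted only while the control loop meets its deadline; otherwise the
// guard switches to a local safety controller that keeps the hardware in a
// safe state until the remote client catches up.
struct ExperimentGuard {
    std::chrono::milliseconds deadline;
    bool safeMode = false;

    // Returns true if the remote command may drive the hardware directly.
    bool admit(std::chrono::milliseconds sinceLastCommand) {
        safeMode = sinceLastCommand > deadline;  // deadline missed: fall back
        return !safeMode;                        // to the local controller
    }
};
```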


International Symposium on Computing and Networking | 2015

Using Dynamic Parallelism for Fine-Grained, Irregular Workloads: A Case Study of the N-Queens Problem

Max Plauth; Frank Feinbube; Frank Schlegel; Andreas Polze

GPU compute devices have become very popular for general-purpose computation. However, the SIMD-like hardware of graphics processors is currently not well suited for irregular workloads, such as searching unbalanced trees. To mitigate this drawback, NVIDIA introduced an extension to GPU programming models called dynamic parallelism. This extension enables GPU programs to spawn new units of work directly on the GPU, allowing the refinement of subsequent work items based on intermediate results without any involvement of the main CPU. This work investigates methods for employing dynamic parallelism with the goal of improved workload distribution for tree search algorithms on modern GPU hardware. For the evaluation of the proposed approaches, a case study is conducted on the n-queens problem. Extensive benchmarks indicate that the benefits of improved resource utilization fail to outweigh the high management overhead and runtime limitations caused by the very fine granularity of the investigated problem. However, novel memory management concepts for passing parameters to child grids are presented. These general concepts are applicable to other, more coarse-grained problems that benefit from the use of dynamic parallelism.
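The pattern, and the granularity trade-off the benchmarks expose, can be illustrated with a CPU analogue: a parent task inspects an intermediate result and spawns child tasks for the refined sub-problems, but only near the root of the tree, since per-task overhead dominates at deeper levels. On the GPU, a parent kernel would launch child grids instead of `std::async`; the n-queens enumeration here is only illustrative.

```cpp
#include <cstdint>
#include <future>
#include <vector>

// CPU analogue of dynamic parallelism for tree search: refine sub-problems
// into new units of work, spawning asynchronously only above spawnDepth.
uint64_t solutions(int n, int row, uint32_t cols, uint32_t d1, uint32_t d2,
                   int spawnDepth) {
    if (row == n) return 1;
    uint32_t free = ~(cols | d1 | d2) & ((1u << n) - 1);
    std::vector<std::future<uint64_t>> children;
    uint64_t count = 0;
    while (free) {
        uint32_t bit = free & (~free + 1);         // lowest free column
        free ^= bit;
        auto child = [=] {
            return solutions(n, row + 1, cols | bit,
                             (d1 | bit) << 1, (d2 | bit) >> 1, spawnDepth);
        };
        if (row < spawnDepth)                      // near the root: spawn
            children.push_back(std::async(std::launch::async, child));
        else                                       // deep levels run inline;
            count += child();                      // task overhead would
    }                                              // otherwise dominate
    for (auto& f : children) count += f.get();
    return count;
}
```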


International Parallel and Distributed Processing Symposium | 2016

Parallel Implementation Strategies for Hierarchical Non-uniform Memory Access Systems by Example of the Scale-Invariant Feature Transform Algorithm

Max Plauth; Wieland Hagen; Frank Feinbube; Felix Eberhardt; Lena Feinbube; Andreas Polze

The domains of parallel and distributed computing have been converging continuously, up to the degree that state-of-the-art server computer systems incorporate characteristics from both domains: They comprise a hierarchy of enclosures, where each enclosure houses multiple processor sockets and each socket again contains multiple memory controllers. A global address space and cache coherency are facilitated using multiple layers of fast interconnection technologies, even across enclosures. The growing popularity of such systems creates a need for efficient mappings of cardinal algorithms onto such hierarchical architectures. However, the growing complexity of such systems and the inconsistencies between implementation strategies of different hardware vendors make it increasingly hard to find mapping strategies that are universally valid. In this paper, we present scalable optimization and mapping strategies in a case study of the popular Scale-Invariant Feature Transform (SIFT) computer vision algorithm. Our approaches are evaluated using a state-of-the-art hierarchical Non-Uniform Memory Access (NUMA) system with 240 physical cores and 12 terabytes of memory, apportioned across 16 NUMA nodes (sockets). SIFT is particularly interesting since the algorithm utilizes a variety of common data access patterns, thus allowing us to discuss the scaling properties of optimization strategies from the distributed and parallel computing domains and their applicability on emerging server systems.
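A basic building block of any such hierarchical mapping is recursive decomposition: work (e.g. image rows) is first split across enclosures, then each enclosure's share is split across its sockets, so threads mostly touch memory within their own level of the hierarchy. A minimal sketch of that index arithmetic (illustrative only, not the paper's implementation):

```cpp
#include <algorithm>

// Half-open row range [begin, end).
struct Range { int begin, end; };

// Split r into `parts` near-equal chunks and return chunk `index`,
// distributing the remainder over the first chunks.
Range chunk(Range r, int parts, int index) {
    int len = r.end - r.begin;
    int base = len / parts, extra = len % parts;
    int begin = r.begin + index * base + std::min(index, extra);
    return { begin, begin + base + (index < extra ? 1 : 0) };
}

// Rows owned by socket `s` inside enclosure `e` of a system with
// E enclosures, each holding S sockets: decompose hierarchically.
Range rowsFor(int totalRows, int E, int S, int e, int s) {
    return chunk(chunk({0, totalRows}, E, e), S, s);
}
```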


Field-Programmable Custom Computing Machines | 2013

Leveraging Hybrid Hardware in New Ways - The GPU Paging Cache

Frank Feinbube; Peter Tröger; Johannes Henning; Andreas Polze

Modern server and desktop systems combine multiple computational cores and accelerator devices into a hybrid architecture. GPUs, as one class of such devices, provide dedicated processing power and memory capacities for data-parallel computation of 2D and 3D graphics. Although these cards have demonstrated their applicability in a variety of areas, they are almost exclusively used by special-purpose software. If such software is not running, the accelerator resources of the hybrid system remain unused. In this paper, we present an operating system extension that leverages GPU accelerator memory for operating system purposes. Our approach utilizes graphics card memory as a cache for virtual memory pages, which can improve overall system responsiveness, especially under heavy load. Our prototypical implementation for Windows demonstrates the potential of such an approach, but also identifies significant preconditions for widespread adoption in desktop systems.
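The cache policy can be sketched independently of the OS integration: evicted RAM pages are staged in GPU memory, and the least-recently-used page is dropped when the GPU store fills up. This is only a sketch of the policy under an assumed LRU discipline; real DMA transfers to the graphics card and the Windows memory manager hooks are out of scope.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>

// LRU cache of page identifiers, standing in for pages staged in GPU memory.
class GpuPageCache {
    std::size_t capacity_;
    std::list<int> lru_;                                    // front = most recent
    std::unordered_map<int, std::list<int>::iterator> map_;
public:
    explicit GpuPageCache(std::size_t capacity) : capacity_(capacity) {}

    void store(int pageId) {
        auto it = map_.find(pageId);
        if (it != map_.end()) {
            lru_.erase(it->second);          // refresh recency of a known page
        } else if (lru_.size() == capacity_) {
            map_.erase(lru_.back());         // evict the LRU victim; it would
            lru_.pop_back();                 // fall back to disk swap
        }
        lru_.push_front(pageId);
        map_[pageId] = lru_.begin();
    }

    bool contains(int pageId) const { return map_.count(pageId) != 0; }
};
```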


International Symposium on Computing and Networking | 2016

PGASUS: A Framework for C++ Application Development on NUMA Architectures

Wieland Hagen; Max Plauth; Felix Eberhardt; Frank Feinbube; Andreas Polze

For the implementation of data-intensive C++ applications for cache-coherent Non-Uniform Memory Access (NUMA) systems, both massive parallelism and data locality have to be considered. While massive parallelism has been largely understood, the shared memory paradigm is still deeply entrenched in the mindset of many C++ software developers. Hence, data locality aspects of NUMA systems have been widely neglected thus far. At first sight, applying shared-nothing approaches might seem like a viable workaround to address locality. However, we argue that developers should be enabled to address locality without having to surrender the advantages of the shared address space of cache-coherent NUMA systems. Based on an extensive review of parallel programming languages and frameworks, we propose a programming model specialized for NUMA-aware C++ development that incorporates essential mechanisms for parallelism and data locality. We suggest that these mechanisms should be used to implement specialized data structures and algorithm templates which encapsulate locality, data distribution, and implicit data parallelism. We present an implementation of the proposed programming model in the form of a C++ framework. To demonstrate the applicability of our programming model, we implement a prototypical application on top of this framework and evaluate its performance.
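In standard C++, the locality idea of binding containers to a node's memory can be expressed with polymorphic allocators. The `NodeLocalResource` below is a hypothetical illustration, not the PGASUS API: it merely tags and counts allocations while forwarding to the default heap, where a real implementation would place memory on the node, e.g. via `numa_alloc_onnode()` or first-touch.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical node-bound memory resource (C++17 std::pmr).
class NodeLocalResource : public std::pmr::memory_resource {
    int node_;
    std::size_t allocated_ = 0;

    void* do_allocate(std::size_t bytes, std::size_t align) override {
        allocated_ += bytes;   // a real version would allocate on node_
        return std::pmr::new_delete_resource()->allocate(bytes, align);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
    }
    bool do_is_equal(const std::pmr::memory_resource& o) const noexcept override {
        return this == &o;
    }
public:
    explicit NodeLocalResource(int node) : node_(node) {}
    int node() const { return node_; }
    std::size_t allocated() const { return allocated_; }
};
```

A container built on such a resource, e.g. `std::pmr::vector<int> v(&node0);`, keeps all of its element storage on the chosen node while remaining an ordinary shared-address-space vector.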


Parallel and Distributed Computing: Applications and Technologies | 2014

Fast ICA on Modern GPU Architectures

Max Plauth; Frank Feinbube; Peter Tröger; Andreas Polze

Blind Signal Separation is an algorithmic problem class that deals with the restoration of original signal data from a signal mixture. Implementations such as FastICA are optimized for parallelization on CPU or first-generation GPU hardware. With the advent of modern, compute-centered GPU hardware with powerful features such as dynamic parallelism support, these solutions no longer leverage the available hardware performance in the best-possible way. We present an optimized implementation of the FastICA algorithm, which is specifically tailored for next-generation GPU architectures such as Nvidia Kepler. Our proposal achieves a two-digit speedup factor with the prototype implementation, compared to a multithreaded CPU implementation. Our custom matrix multiplication kernels, tailored specifically for the use case, contribute to the speedup by delivering better performance than the state-of-the-art CUBLAS library.
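What makes a use-case-specific kernel worthwhile here is the matrix shape: FastICA repeatedly multiplies a small unmixing matrix W (c x c, with c the component count, typically small) by a wide data matrix X (c x n, with n samples). A tailored kernel keeps all of W in fast memory (registers or shared memory on a GPU) and streams X once. This CPU sketch mirrors that loop structure; it is illustrative, not the paper's kernel.

```cpp
#include <vector>

// Y = W * X, row-major: W is c x c (small), X and Y are c x n (wide).
std::vector<double> multiplySmallTimesWide(const std::vector<double>& W,
                                           const std::vector<double>& X,
                                           int c, int n) {
    std::vector<double> Y(static_cast<std::size_t>(c) * n, 0.0);
    for (int col = 0; col < n; ++col)       // stream the long dimension once
        for (int i = 0; i < c; ++i) {
            double acc = 0.0;
            for (int k = 0; k < c; ++k)     // the whole of W stays resident
                acc += W[i * c + k] * X[k * n + col];
            Y[i * n + col] = acc;
        }
    return Y;
}
```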


European Conference on Parallel Processing | 2014

Scalable SIFT with Scala on NUMA

Frank Feinbube; Lena Herscheid; Christoph Neijenhuis; Peter Tröger

Scale-invariant feature transform (SIFT) is an algorithm to identify and track objects in a series of digital images. The algorithm can handle objects that change their location, scale, rotation, or illumination in subsequent images. This makes SIFT an ideal candidate for object tracking (typically denoted as feature detection) problems in computer imaging applications. The complexity of the SIFT approach often forces developers and system architects to rely on less efficient heuristic approaches for object detection when streaming video data. This makes the algorithm a promising candidate for new parallelization strategies in heterogeneous parallel environments. In this article, we describe our thorough performance analysis of various SIFT implementation strategies in the Scala programming language. Scala supports the development of mixed-paradigm parallel code that targets shared memory systems as well as distributed environments. Our proposed SIFT implementation strategy takes both caching and non-uniform memory architectures (NUMA) into account, and therefore achieves a higher speedup factor than existing work. We also discuss how scalability for larger video workloads can be achieved by leveraging the actor programming model as part of a distributed SIFT implementation in Scala.


International Parallel and Distributed Processing Symposium | 2017

Assessing NUMA Performance Based on Hardware Event Counters

Max Plauth; Christoph Sterz; Felix Eberhardt; Frank Feinbube; Andreas Polze

Cost models play an important role for the efficient implementation of software systems. These models can be embedded in operating systems and execution environments to optimize execution at run time. Even though non-uniform memory access (NUMA) architectures are dominating today's server landscape, there is still a lack of parallel cost models that represent NUMA systems sufficiently. Therefore, the existing NUMA models are analyzed, and a two-step performance assessment strategy is proposed that incorporates low-level hardware counters as performance indicators. To support the two-step strategy, multiple tools are developed, all accumulating and enriching specific hardware event counter information, to explore, measure, and visualize these low-overhead performance indicators. The tools are showcased and discussed alongside specific experiments in the realm of performance assessment.
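A typical counter-derived indicator for NUMA assessment is the fraction of memory accesses served by a remote node. The sketch below is illustrative: the field names stand in for whatever local/remote DRAM access events the target CPU's performance monitoring unit exposes, and are not a specific PMU interface.

```cpp
// Raw event counts read from hypothetical hardware counters.
struct NumaCounters {
    unsigned long long localDramAccesses;
    unsigned long long remoteDramAccesses;
};

// Fraction of DRAM accesses that crossed a NUMA interconnect; values close
// to 0 indicate good data locality, values close to 1 indicate remote-heavy
// placement that a cost model should penalize.
double remoteAccessRatio(const NumaCounters& c) {
    unsigned long long total = c.localDramAccesses + c.remoteDramAccesses;
    return total == 0 ? 0.0
                      : static_cast<double>(c.remoteDramAccesses) / total;
}
```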


European Conference on Parallel Processing | 2017

Data Partitioning Strategies for Stencil Computations on NUMA Systems

Frank Feinbube; Max Plauth; Marius Knaust; Andreas Polze

Many scientific problems rely on the efficient execution of stencil computations, which are usually memory-bound. In this paper, stencils on two-dimensional data are executed on NUMA architectures. Each node of a NUMA system processes a distinct partition of the input data independently of other nodes. However, processors may need access to the memory of other nodes at the edges of the partitions. This paper demonstrates two techniques based on machine learning for identifying partitioning strategies that reduce the occurrence of remote memory access. One approach is generally applicable and is based on an uninformed search. The second approach caps the search space by employing geometric decomposition. The partitioning strategies obtained with these techniques are analyzed theoretically. Finally, an evaluation on a real NUMA machine is conducted, which demonstrates that the expected reduction of the remote memory accesses can be achieved.
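Why partition geometry matters can be shown with simple border arithmetic: for a 5-point stencil, cells adjacent to partition borders approximate the remote accesses of one sweep. Comparing a 1D stripe decomposition with a 2D block decomposition of the same grid (a standard argument, not the paper's learned strategies):

```cpp
// Cells adjacent to internal borders when a W x H grid is cut into
// P horizontal stripes: P-1 borders of width W, counted on both sides.
long stripeBorder(long W, long P) {
    return 2 * W * (P - 1);
}

// Same grid cut into px x py rectangular blocks: px-1 vertical borders of
// height H plus py-1 horizontal borders of width W, both sides each.
long blockBorder(long W, long H, long px, long py) {
    return 2 * H * (px - 1) + 2 * W * (py - 1);
}
```

For a 1024 x 1024 grid on 16 nodes, stripes touch 2*1024*15 = 30720 border cells per sweep, while a 4 x 4 block layout touches only 12288, which is why geometric decomposition is a natural way to cap the search space.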

Collaboration


Dive into Frank Feinbube's collaborations.

Top Co-Authors

Andreas Rasche

Hasso Plattner Institute

Daniel Janusz

Humboldt University of Berlin

Peter Tröger

Blekinge Institute of Technology
