Heiko Schick | Researchain

Publication

Featured research published by Heiko Schick.

Computing in Science and Engineering | 2008

QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine

Gottfried Goldrian; Thomas Huth; Benjamin Krill; J. Lauritsen; Heiko Schick; Ibrahim A. Ouda; Simon Heybrock; Dieter Hierl; T. Maurer; Nils Meyer; A. Schäfer; Stefan Solbrig; Thomas Streuer; Tilo Wettig; Dirk Pleiter; Karl-Heinz Sulanke; Frank Winter; H. Simma; Sebastiano Fabio Schifano; R. Tripiccione

Application-driven computers for lattice gauge theory simulations have often been based on system-on-chip designs, but the development costs can be prohibitive for academic project budgets. An alternative approach uses compute nodes based on a commercial processor tightly coupled to a custom-designed network processor. Preliminary analysis shows that this solution offers good performance, but it also entails several challenges, including those arising from the processor's multicore structure and from implementing the network processor on a field-programmable gate array.

arXiv: High Energy Physics - Lattice | 2010

QPACE - a QCD parallel computer based on Cell processors

H. Baier; Hans Boettiger; C. Gomez; Dirk Pleiter; Nils Meyer; A. Nobile; Zoltan Fodor; Joerg-Stephan Vogt; K.-H. Sulanke; Simon Heybrock; Frank Winter; U. Fischer; T. Maurer; Thomas Huth; Ibrahim A. Ouda; M. Drochner; Heiko Schick; F. Schifano; A. Schäfer; H. Simma; J. Lauritsen; Norbert Eicker; Marcello Pivanti; Matthias Husken; Thomas Streuer; Gottfried Goldrian; Tilo Wettig; Thomas Lippert; Dieter Hierl; Benjamin Krill

QPACE is a novel parallel computer developed primarily for lattice QCD simulations. The compute power is provided by the IBM PowerXCell 8i processor, an enhanced version of the Cell processor that is used in the PlayStation 3. The QPACE nodes are interconnected by a custom, application-optimized 3-dimensional torus network implemented on an FPGA. To achieve the very high packaging density of 26 TFlops per rack, a new water cooling concept has been developed and successfully realized. In this paper we give an overview of the architecture and highlight some important technical details of the system. Furthermore, we provide initial performance results and report on the installation of 8 QPACE racks providing an aggregate peak performance of 200 TFlops.

International Supercomputing Conference | 2013

Using GPFS to Manage NVRAM-Based Storage Cache

Salem El Sayed; Stephan Graf; Michael Hennecke; D. Pleiter; Georg Schwarz; Heiko Schick; Michael Stephan

I/O performance of large-scale HPC systems grows at a significantly slower rate than compute performance. In this article we investigate architectural options and technologies for a tiered storage system to mitigate this problem. Using GPFS and flash memory cards, a prototype is implemented and evaluated. We compare performance numbers obtained by running synthetic benchmarks on a petascale BlueGene/Q system connected to our prototype. Based on these results, an assessment of the architecture and technology is performed.

International Parallel and Distributed Processing Symposium | 2009

Impact of run-time reconfiguration on design and speed - A case study based on a grid of run-time reconfigurable modules inside a FPGA

Jochen Strunk; Toni Volkmer; Klaus Stephan; Wolfgang Rehm; Heiko Schick

This paper examines the feasibility of utilizing a grid of run-time reconfigurable (RTR) modules on a dynamically and partially reconfigurable (DPR) FPGA. The aim is to create a homogeneous array of RTR regions on an FPGA, which can be reconfigured on demand during run-time. We study its setup, implementation and performance in comparison with its static counterpart. Such a grid of partially reconfigurable regions (PRRs) on an FPGA could be used as an accelerator for computers to offload compute kernels, or to enhance functionality in the embedded market, which uses FPGAs. An in-depth look at the methodology of creating run-time reconfigurable modules and its tools is given. Because the existing tools lack support for handling hundreds of dynamically reconfigurable regions, a framework is presented which supports the user in the creation process of the design. A case study which uses state-of-the-art Xilinx Virtex-5 FPGAs compares the run-time reconfigurable implementation and achievable clock speeds of a grid with up to 47 reconfigurable module regions with its static counterpart. For this examination a high-performance module is used, which finds patterns in a bit stream (pattern matcher). This module is replicated for each partially reconfigurable region. In particular, design considerations for the controller, which manages the modules, are introduced. Beyond this, the paper also addresses further challenges of the implementation of such an RTR grid and limitations of the reconfigurability of Xilinx FPGAs.

arXiv: High Energy Physics - Lattice | 2009

Status of the QPACE Project

H. Baier; Hans Boettiger; Stefan Solbrig; Dirk Pleiter; Nils Meyer; A. Nobile; Zoltan Fodor; K.-H. Sulanke; Simon Heybrock; Frank Winter; U. Fischer; T. Maurer; Thomas Huth; Ibrahim A. Ouda; M. Drochner; Heiko Schick; F. Schifano; H. Simma; J. Lauritsen; Norbert Eicker; Marcello Pivanti; A. Schäfer; Thomas Streuer; Gottfried Goldrian; Tilo Wettig; Thomas Lippert; Dieter Hierl; Benjamin Krill; R. Tripiccione; J. McFadden

We give an overview of the QPACE project, which is pursuing the development of a massively parallel, scalable supercomputer for LQCD. The machine is a three-dimensional torus of identical processing nodes, based on the PowerXCell 8i processor. The nodes are connected by an FPGA-based, application-optimized network processor attached to the PowerXCell 8i processor. We present a performance analysis of lattice QCD codes on QPACE and corresponding hardware benchmarks.

Applied Reconfigurable Computing | 2009

ACCFS --- Operating System Integration of Computational Accelerators Using a VFS Approach

Andreas Heinig; Jochen Strunk; Wolfgang Rehm; Heiko Schick

For a number of applications, integrating specialized computational accelerators into a general-purpose computing environment yields more performance per watt and per dollar than a pure multi-core approach. In contrast to fully application-specific hybrid solutions, we offer the advantage of maintaining traditional programming models and development environments to a certain extent. In this paper we introduce an open, generic operating system interface concept, which we call Accelerator File System (ACCFS), for integrating application accelerators into Linux-based platforms. By describing the proposed concepts and interface we contribute to a broader discussion of this challenging topic.

IBM Journal of Research and Development | 2009

directCell: hybrid systems with tightly coupled accelerators

Hartmut Penner; Utz Bacher; Jan Kunigk; C. Rund; Heiko Schick

The Cell Broadband Engine® (Cell/B.E.) processor is a hybrid IBM PowerPC® processor. In blade servers and PCI Express® card systems, it has been used primarily in a server context, with Linux® as the operating system. Because neither Linux as an operating system nor a PowerPC processor-based architecture is the preferred choice for all applications, some installations use the Cell/B.E. processor in a coupled hybrid environment, which has implications for the complexity of systems management, the programming model, and performance. In the directCell approach, we use the Cell/B.E. processor as a processing device connected to a host via a PCI Express link using direct memory access and memory-mapped I/O (input/output). The Cell/B.E. processor functions as a processor and is perceived by the host like a device while maintaining the native Cell/B.E. processor programming approach. We describe the problems with the current practice that led us to the directCell approach. We explain the challenge in programming, execution control, and operation on the accelerators that were faced during the design and implementation of a prototype and present solutions to overcome them. We also provide an outlook on where the directCell approach promises to better solve customer problems.

Digital Systems Design | 2009

An on Chip Network inside a FPGA for Run-Time Reconfigurable Low Latency Grid Communication

Jochen Strunk; Toni Volkmer; Wolfgang Rehm; Heiko Schick

In this paper a low-latency, on-chip communication network (NoC) for a run-time reconfigurable (RTR) grid inside dynamically and partially reconfigurable (DPR) FPGAs is proposed, which supports the arbitrary placement of run-time reconfigurable modules (RTRMs) inside the grid. The dedicated, fully meshed silicon network should support the arrangement of communication channels between the RTRMs within the different partially reconfigurable regions (PRRs) on the FPGA. The design of the network guarantees low-latency communication between RTRMs without mutual interference. In comparison with an implementation using FPGA resources, the dedicated silicon network could save a huge amount of resources in terms of transistors. The new degree of parallel communication provided for an RTR grid with arbitrarily placeable RTRMs opens new application fields for DPR-capable FPGAs. Multiple user applications with inter-communicating offload compute kernels can be loaded on a host-coupled FPGA accelerator, real-time (RT) systems with concurrent communication tasks are possible, and enhancing the functionality of embedded systems on demand is conceivable. A case study was conducted as a proof of concept and to verify the run-time environment system, which manages the configurable network.

Reconfigurable Computing and FPGAs | 2010

Communication Architectures for Run-Time Reconfigurable Modules in a 2-D Mesh on FPGAs

Jochen Strunk; Johannes Hiltscher; Wolfgang Rehm; Heiko Schick

This paper examines the feasibility of utilizing a 2-dimensional (2-D) mesh of run-time reconfigurable modules (RTRMs) on a dynamically and partially reconfigurable (DPR) FPGA for throughput- and real-time-driven tasks. To utilize a 2-D mesh of RTRMs, efficient communication architectures (CAs) are required, which are presented in this work. Such a 2-D mesh of RTRMs on a DPR-capable FPGA can be utilized for throughput-driven tasks to dynamically offload compute functions on a host-coupled system, providing multi-user and multi-context execution on behalf of user demands. For embedded systems, it can be utilized as a highly dynamic platform by providing functional enhancement through module replacement during run-time. The exploration also includes a CA for real-time communication between RTRMs in a 2-D mesh. The presented CA design is based on a novel methodology that applies run-time reconfiguration to increase the performance. The design, the implementation, the performance and the resource utilization are shown for throughput- and real-time-driven CAs. As a proof of concept, a case study is conducted for the presented CAs on state-of-the-art Virtex-5 FPGAs.

Reconfigurable Computing and FPGAs | 2009

Design and Performance of a Grid of Asynchronously Clocked Run-Time Reconfigurable Modules on a FPGA

Jochen Strunk; Toni Volkmer; Wolfgang Rehm; Heiko Schick

This paper examines the feasibility of utilizing a grid of asynchronously clocked run-time reconfigurable modules (RTRMs) on a dynamically and partially reconfigurable (DPR) FPGA. In contrast to the synchronously clocked grids studied in prior research, the design, the implementation, the performance and the resource utilization of an asynchronously clocked grid are shown. Such a run-time reconfigurable (RTR) grid on an FPGA can be utilized to dynamically offload compute functions on a host-coupled system, providing multi-user and multi-context execution on behalf of user demands. For embedded systems, it can be utilized as a highly dynamic platform by providing functional enhancement through module replacement during run-time. The presented platform leverages synthesis and development constraints and is able to increase the overall throughput by allowing multiple clock domains within the grid. The performance and the additional resource utilization of handling multiple clock domains are compared to synchronously clocked grids. As a proof of concept, a case study with a grid of 47 RTRMs is conducted on state-of-the-art Virtex-5 FPGAs.
