Publication


Featured research published by D. Rohr.


Journal of Physics: Conference Series | 2012

ALICE HLT TPC Tracking of Pb-Pb Events on GPUs

D. Rohr; A. Szostak; M. Kretz; T. Kollegger; T. Breitner; T. Alt; S. Gorbunov

The online event reconstruction for the ALICE experiment at CERN must process central Pb-Pb collisions at a rate of more than 200 Hz, corresponding to an input data rate of about 25 GB/s. The reconstruction of particle trajectories in the Time Projection Chamber (TPC) is the most compute-intensive step. The TPC online tracker implementation combines the principles of the cellular automaton and the Kalman filter, and it has been accelerated with graphics cards (GPUs). Pipelined processing allows the tracking on the GPU, the data transfer, and the preprocessing on the CPU to run in parallel. The pipeline uses multiple threads so that CPU pre- and postprocessing can keep pace with the GPU. Splitting the tracking into multiple phases that first search for short local track segments improves data locality and makes the algorithm well suited for GPUs; thanks to dedicated optimizations, this approach is not inferior to a global one. Because floating-point arithmetic is non-associative, a bitwise comparison of the GPU and CPU trackers is infeasible, but a track-by-track and cluster-by-cluster comparison shows a concordance of 99.999%. On current hardware, the GPU tracker outperforms the CPU version by about a factor of three while leaving the processor available for other tasks.
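
The pipeline described above can be sketched in a few lines. The following is a minimal illustration only, not the actual ALICE HLT tracker code: the Event type, the stage bodies, and the hand-off queue are invented for the example, and the GPU stage is a placeholder where the kernel launches would go.

// Minimal pipeline sketch (hypothetical types and stages, see note above):
// CPU preprocessing, GPU tracking, and CPU postprocessing overlap on
// different events, handed over through thread-safe queues.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>

struct Event { int id; };

// Thread-safe queue used to pass events between pipeline stages.
template <typename T>
class Channel {
  std::queue<T> q_;
  std::mutex m_;
  std::condition_variable cv_;
  bool closed_ = false;
public:
  void push(T v) {
    { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); }
    cv_.notify_one();
  }
  void close() {
    { std::lock_guard<std::mutex> l(m_); closed_ = true; }
    cv_.notify_all();
  }
  std::optional<T> pop() {  // blocks until an event arrives or the channel closes
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [&] { return !q_.empty() || closed_; });
    if (q_.empty()) return std::nullopt;
    T v = std::move(q_.front());
    q_.pop();
    return v;
  }
};

Event cpuPreprocess(Event e) { return e; }     // cluster preparation would go here
Event gpuTrackSegments(Event e) { return e; }  // GPU kernel launches would go here
void cpuPostprocess(const Event& e) { std::printf("event %d tracked\n", e.id); }

int main() {
  Channel<Event> toGpu, toPost;
  std::thread gpuStage([&] {
    while (auto e = toGpu.pop()) toPost.push(gpuTrackSegments(*e));
    toPost.close();
  });
  std::thread postStage([&] {
    while (auto e = toPost.pop()) cpuPostprocess(*e);
  });
  for (int i = 0; i < 8; ++i) toGpu.push(cpuPreprocess(Event{i}));  // overlaps with tracking
  toGpu.close();
  gpuStage.join();
  postStage.join();
}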


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Node variability in large-scale power measurements: perspectives from the Green500, Top500 and EEHPCWG

Tom Scogland; Jonathan J. Azose; D. Rohr; Suzanne Rivoire; Natalie J. Bates; Daniel Hackenberg

The last decade has seen power consumption move from an afterthought to the foremost design constraint of new supercomputers. Measuring the power of a supercomputer can be a daunting proposition, and as a result, many published measurements are extrapolated. This paper explores the validity of these extrapolations in the context of inter-node power variability and power variations over time within a run. We characterize power variability across nodes in systems at eight supercomputer centers across the globe. This characterization shows that the current requirement for measurements submitted to the Green500 and others is insufficient, allowing variations of up to 20% due to measurement timing and a further 10-15% due to insufficient sample sizes. This paper proposes new power and energy measurement requirements for supercomputers, some of which have been accepted for use by the Green500 and Top500, to ensure consistent accuracy.
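
The effect of measurement timing is easy to see in a small example. The sketch below is only an illustration of time-weighted averaging, not the Green500 methodology, and the power trace in it is entirely fictitious.

#include <cstddef>
#include <cstdio>
#include <vector>

struct Sample { double t_s, watts; };  // timestamp in seconds, instantaneous power

// Trapezoidal integration of the samples yields energy; dividing by the
// run duration gives a time-weighted average power for the whole run.
double averagePower(const std::vector<Sample>& s) {
  double energy_J = 0.0;
  for (std::size_t i = 1; i < s.size(); ++i)
    energy_J += 0.5 * (s[i].watts + s[i - 1].watts) * (s[i].t_s - s[i - 1].t_s);
  return energy_J / (s.back().t_s - s.front().t_s);
}

int main() {
  // Fictitious trace: power ramps up during the compute-heavy middle phase.
  std::vector<Sample> trace = {{0, 300}, {10, 900}, {50, 950}, {90, 900}, {100, 400}};
  std::printf("time-weighted average: %.0f W\n", averagePower(trace));  // prints 865 W
  // A single sample taken at the peak would report 950 W instead, about 10% high.
}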


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Lattice-CSC: Optimizing and Building an Efficient Supercomputer for Lattice-QCD and to Achieve First Place in Green500

D. Rohr; M. Bach; Gvozden Neskovic; Volker Lindenstruth; Christopher Pinke; Owe Philipsen

In recent decades, supercomputers have become a necessity in science and industry. Huge data centers consume enormous amounts of electricity, and we are at a point where newer, faster computers must not draw more power than their predecessors. Since user demand for compute capability has not declined in any way, this has led to studies of the feasibility of exaflop systems. Heterogeneous clusters with highly efficient accelerators such as GPUs are one approach to higher efficiency. We present the new L-CSC cluster, a commodity-hardware compute cluster dedicated to Lattice QCD simulations at the GSI research facility. L-CSC features a multi-GPU design with four FirePro S9150 GPUs per node, each providing 320 GB/s memory bandwidth and 2.6 TFLOPS peak performance. The high bandwidth makes it ideally suited for memory-bound LQCD computations, while the multi-GPU design ensures superior power efficiency. The November 2014 Green500 list ranked L-CSC as the most power-efficient supercomputer in the world, with 5270 MFLOPS/W in the Linpack benchmark. This paper presents optimizations to our Linpack implementation HPL-GPU and other power-efficiency improvements that helped L-CSC reach this result. It describes our approach for an accurate Green500 power measurement and reveals some problems with the current measurement methodology. Finally, it gives an overview of the Lattice QCD application on L-CSC.
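
As a quick sanity check on the headline figure (plain arithmetic on the quoted number, nothing more): 5270 MFLOPS/W is 5.27 GFLOPS/W, i.e. an energy cost of about 1/(5.27 x 10^9) J, roughly 0.19 nJ per floating-point operation; equivalently, a hypothetical 1 MW installation at this efficiency would sustain about 5.27 PFLOPS in Linpack.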


Parallel, Distributed and Network-Based Processing | 2013

A Comprehensive Approach for a Power Efficient General Purpose Supercomputer

M. Bach; J. de Cuveland; H. Ebermann; D. Eschweiler; J. Gerhard; S. Kalcher; M. Kretz; Volker Lindenstruth; H J Ludde; M. Pollok; D. Rohr

Computers are essential in research and industry, but they are also significant contributors to worldwide power consumption. The LOEWE-CSC supercomputer addresses this problem by setting new standards in environmental compatibility as well as energy and cooling efficiency for high-performance and general-purpose computing. Designing a pervasively energy-efficient compute center requires improvements in multiple fields. The hosting low-loss compute center operates at a cooling overhead below 8% of the computer power. General-purpose graphics processing units provide more compute performance per watt than standard processors. A balanced hardware configuration ensures that most of the compute power is available to users when they run optimized applications. Clever algorithms enable users to fully exploit the computational potential and avoid wasting power on idling processors, which is often a consequence of inefficient programming. Using commodity servers, LOEWE-CSC achieved 740 MFLOPS/W during a Linpack benchmark run and ranked 8th on the Green500 list of November 2010. These innovations provide a fundamental step toward cost-effective, environmentally friendly exascale computing and IT operation.
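
Read as a ratio (simple arithmetic on the quoted figure, not a number from the paper): a cooling overhead below 8% of the computer power means the total draw stays below 1.08 times the power consumed by the computers themselves, counting cooling alone.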


Journal of Physics: Conference Series | 2014

O2: A novel combined online and offline computing system for the ALICE Experiment after 2018

Ananya; A. Alarcon Do Passo Suaide; C. Alves Garcia Prado; T. Alt; L. Aphecetche; N. Agrawal; A. Avasthi; M. Bach; R. Bala; G. G. Barnaföldi; A. Bhasin; J. Belikov; F. Bellini; L. Betev; T. Breitner; P. Buncic; F. Carena; S. Chapeland; V. Chibante Barroso; F. Cliff; F. Costa; L. Cunqueiro Mendez; Sadhana Dash; C. Delort; E. Dénes; R. Divià; B. Doenigus; H. Engel; D. Eschweiler; U. Fuchs

ALICE (A Large Ion Collider Experiment) is a detector dedicated to the study of heavy-ion collisions, exploring the physics of strongly interacting nuclear matter and the quark-gluon plasma at the CERN LHC (Large Hadron Collider). After the second long shutdown of the LHC, the ALICE experiment will be upgraded to make high-precision measurements of rare probes at low pT, which cannot be selected with a trigger and therefore require a very large sample of events recorded on tape. The online computing system will be completely redesigned to address the major challenge of sampling the full 50 kHz Pb-Pb interaction rate, increasing the present limit by a factor of 100. This upgrade will also include the continuous, un-triggered read-out of two detectors, the ITS (Inner Tracking System) and the TPC (Time Projection Chamber), producing a sustained throughput of 1 TB/s. This unprecedented data rate will be reduced by adopting an entirely new strategy in which calibration and reconstruction are performed online, and only the reconstruction results are stored while the raw data are discarded. This system, already demonstrated in production on TPC data since 2011, will be optimized for the online usage of reconstruction algorithms. This implies a much tighter coupling between the online and offline computing systems. An R&D program has been set up to meet this huge challenge. The purpose of this paper is to present this program and its first results.
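
Dividing the two quoted figures gives a sense of scale (straightforward arithmetic, not a number stated in the abstract): a sustained throughput of 1 TB/s at a 50 kHz interaction rate corresponds to (10^12 B/s) / (5 x 10^4 /s) = 2 x 10^7 B, i.e. an average of about 20 MB of detector data per interaction.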


Journal of Instrumentation | 2016

The ALICE high-level trigger read-out upgrade for LHC Run 2

H. Engel; T. Alt; T. Breitner; A. Gomez Ramirez; T. Kollegger; Mikolaj Krzewicki; J. Lehrbach; D. Rohr; U. Kebschull

The ALICE experiment uses an optical read-out protocol called Detector Data Link (DDL) to connect the detectors with the computing clusters of Data Acquisition (DAQ) and the High-Level Trigger (HLT). The interfaces of the clusters to these optical links are realized with FPGA-based PCI-Express boards. The High-Level Trigger is a computing cluster dedicated to the online reconstruction and compression of experimental data. It uses a combination of CPU, GPU and FPGA processing. For Run 2, the HLT has replaced all of its previous interface boards with the Common Read-Out Receiver Card (C-RORC) to enable read-out of detectors at high link rates and to extend the pre-processing capabilities of the cluster. The new hardware also comes with an increased link density that reduces the number of boards required. A modular firmware approach allows different processing and transport tasks to be built from the same source tree. A hardware pre-processing core performs cluster finding directly in the C-RORC firmware. State-of-the-art interfaces and memory allocation schemes enable a transparent integration of the C-RORC into the existing HLT software infrastructure. Common cluster management and monitoring frameworks are used to handle C-RORC metrics as well. The C-RORC has been in use in the ALICE DAQ and HLT clusters since the start of LHC Run 2.


High Performance Computing and Communications | 2015

A Load-Distributed Linpack Implementation for Heterogeneous Clusters

D. Rohr; Volker Lindenstruth

In recent years, heterogeneous HPC systems, which combine traditional processors with accelerator cards such as GPUs, have been shown to deliver superior performance and power efficiency. Since different scientific problems pose different demands on the computer architecture, some general-purpose supercomputers consist of different types of nodes, where each type is best suited for certain applications. Such clusters with inter-node heterogeneity (different types of nodes) on top of intra-node heterogeneity (different processors inside one node) consist of compute nodes with different compute performance. The standard implementation of the Linpack benchmark, HPL, distributes the workload evenly among all processes and thus cannot exploit the cluster's full potential if the nodes have unequal performance. This paper presents a new feature of our HPL-GPU implementation that allows a balanced, fine-tuned workload distribution among all compute nodes, taking into account their individual compute capabilities. We present results on nodes of different speed grades of the LOEWE-CSC cluster and demonstrate that our implementation can utilize all nodes of a heterogeneous configuration efficiently, showing only about 3% granularity loss.
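
Conceptually, the feature replaces HPL's even distribution with a performance-proportional one. The sketch below illustrates only that idea; the function name, the largest-remainder rounding, and the node speeds are all invented here, and the actual HPL-GPU mechanism is considerably more involved.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

// Assign `blocks` Linpack block columns to nodes in proportion to their
// measured performance, using largest-remainder rounding so that every
// block is assigned exactly once.
std::vector<int> proportionalShare(const std::vector<double>& gflops, int blocks) {
  double total = 0.0;
  for (double g : gflops) total += g;
  std::vector<int> share(gflops.size(), 0);
  std::vector<std::pair<double, std::size_t>> remainder;
  int assigned = 0;
  for (std::size_t i = 0; i < gflops.size(); ++i) {
    double exact = blocks * gflops[i] / total;
    share[i] = static_cast<int>(exact);          // floor of the exact share
    assigned += share[i];
    remainder.push_back({exact - share[i], i});
  }
  std::sort(remainder.begin(), remainder.end(), std::greater<>());
  for (int i = 0; i < blocks - assigned; ++i)    // hand leftovers to largest remainders
    ++share[remainder[i].second];
  return share;
}

int main() {
  std::vector<double> nodes = {2600, 2600, 1800, 900};  // hypothetical GFLOPS per node
  auto share = proportionalShare(nodes, 64);
  for (std::size_t i = 0; i < share.size(); ++i)
    std::printf("node %zu: %d block columns\n", i, share[i]);
}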


International Conference on Cluster Computing | 2017

Fast Failure Erasure Encoding Using Just in Time Compilation for CPUs, GPUs, and FPGAs

D. Rohr; Volker Lindenstruth

Failure-tolerant data encoding and storage is of paramount importance for data centers, supercomputers, data transfers, and many aspects of information technology. Reed-Solomon failure erasure codes and their variants are the basis for many applications in this field. Efficient implementation of these codes is challenging because they require computations in Galois fields, which processors do not support natively. Improved variants such as the Cauchy-Reed-Solomon code enable a better mapping of the required calculations to computer instructions. However, this works best when the source code of the application is tuned for fixed encoding parameters, which deteriorates flexibility. We present an approach to overcoming these limitations by just-in-time compiling optimized code for arbitrary encoding settings. Our open-source library is optimized for x86 processors using the SSE and AVX extensions, and we present prototypes for GPUs and FPGAs as well. For a significant range of encoding parameters, our implementation encodes at the maximum bandwidth at which the processor can read the data from memory. In more complicated settings with additional redundancy data to compensate for the failure of multiple data stores, the algorithm becomes compute-limited. The optimized JIT code leverages the full potential of modern processors, reaching a throughput of more than three SIMD instructions per compute cycle, and encodes up to 19 gigabytes of data per second on a Skylake system.
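
The Galois-field arithmetic at the heart of such codes is compact. The following sketch shows a plain scalar GF(2^8) multiply and the encoding of one parity block as a coefficient-weighted XOR sum of data blocks; the coefficients here are arbitrary placeholders rather than a valid Cauchy matrix row, and the paper's library instead JIT-compiles SSE/AVX code specialized for the chosen parameters.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Carry-less "Russian peasant" multiplication in GF(2^8) with reduction
// polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D); production code uses
// lookup tables or SIMD instead of this loop.
std::uint8_t gfMul(std::uint8_t a, std::uint8_t b) {
  std::uint8_t p = 0;
  while (b) {
    if (b & 1) p ^= a;
    bool overflow = a & 0x80;
    a <<= 1;
    if (overflow) a ^= 0x1D;  // reduce modulo the field polynomial
    b >>= 1;
  }
  return p;
}

// parity[j] ^= coeff[d] * data[d][j] over GF(2^8), for every data block d.
void encodeParity(const std::vector<std::vector<std::uint8_t>>& data,
                  const std::vector<std::uint8_t>& coeff,
                  std::vector<std::uint8_t>& parity) {
  for (std::size_t d = 0; d < data.size(); ++d)
    for (std::size_t j = 0; j < parity.size(); ++j)
      parity[j] ^= gfMul(coeff[d], data[d][j]);
}

int main() {
  std::vector<std::vector<std::uint8_t>> data = {{1, 2, 3, 4}, {5, 6, 7, 8}};
  std::vector<std::uint8_t> coeff = {0x01, 0x8E};  // placeholder coefficients
  std::vector<std::uint8_t> parity(4, 0);
  encodeParity(data, coeff, parity);
  for (std::uint8_t b : parity) std::printf("%02x ", b);
  std::printf("\n");
}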


arXiv: Instrumentation and Detectors | 2017

Improvements of the ALICE HLT data transport framework for LHC Run 2

D. Rohr; Mikolaj Krzewicki; Heiko Engel; Johannes Lehrbach; Volker Lindenstruth

The ALICE HLT uses a data transport framework based on the publisher-subscriber message principle, which transparently handles the communication between processing components over the network and, via shared memory with a zero-copy approach, between processing components on the same node. We present an analysis of the performance in terms of maximum achievable data rates and event rates as well as processing capabilities during Run 1 and Run 2. Based on this analysis, we present new optimizations we have developed for ALICE in Run 2. These include support for asynchronous transport via ZeroMQ, which enables loops in the reconstruction chain graph and which is used to ship QA histograms to DQM. We have added asynchronous processing capabilities in order to support long-running tasks besides the event-synchronous reconstruction tasks of normal HLT operation. These asynchronous components run in an isolated process so that the HLT as a whole remains resilient even to fatal errors in them; in this way, we can ensure that new developments cannot break data taking. On top of that, we have tuned the processing chain to cope with the higher event and data rates expected from the new TPC readout electronics (RCU2), and we have improved the configuration procedure and the startup time in order to increase the time during which ALICE can take physics data. We analyze the maximum achievable data processing rates, taking into account the processing capabilities of CPUs and GPUs, buffer sizes, network bandwidth, the incoming links from the detectors, and the outgoing links to data acquisition.
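
The zero-copy idea can be conveyed with an in-process analogue: subscribers receive a pointer to one shared buffer instead of copies. Everything below (names, types) is invented for illustration; the real framework dispatches across processes via shared memory and across nodes via the network.

#include <cstdio>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

struct EventBuffer { std::vector<char> payload; };

// Toy publisher: every subscriber is handed a shared pointer to the same
// underlying buffer, so the event payload itself is never copied.
class Publisher {
  std::vector<std::function<void(std::shared_ptr<const EventBuffer>)>> subs_;
public:
  void subscribe(std::function<void(std::shared_ptr<const EventBuffer>)> cb) {
    subs_.push_back(std::move(cb));
  }
  void publish(std::shared_ptr<const EventBuffer> ev) {
    for (auto& s : subs_) s(ev);
  }
};

int main() {
  Publisher pub;
  pub.subscribe([](std::shared_ptr<const EventBuffer> ev) {
    std::printf("reconstruction got %zu bytes\n", ev->payload.size());
  });
  pub.subscribe([](std::shared_ptr<const EventBuffer> ev) {
    std::printf("QA got %zu bytes (same buffer, no copy)\n", ev->payload.size());
  });
  pub.publish(std::make_shared<const EventBuffer>(EventBuffer{std::vector<char>(1024)}));
}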


international conference on cluster computing | 2016

A Model for Weak Scaling to Many GPUs at the Basis of the Linpack Benchmark

D. Rohr; Jan De Cuveland; Volker Lindenstruth

Today, accelerator cards like GPUs are an important constituent of HPC clusters. For certain GPU-intense applications, the trend is shifting toward multi-GPU systems with four or more GPUs per compute node, which can increase the performance per dollar and the performance per watt. The Linpack benchmark is the standard tool for measuring the compute performance of supercomputers. Its standard implementation, HPL, cannot make use of GPUs on its own, and other GPU-enabled versions of HPL show reduced efficiency on multi-GPU systems with four or more GPUs per node. In previous efforts we have developed a GPU-accelerated Linpack implementation, in particular for AMD GPUs, which we have employed on several clusters of a couple of hundred nodes each. Gustafson's law predicts perfect weak scaling with the number of processor cores, but it is not directly applicable to the number of GPUs in a system because of shared resources. In this paper we develop a model for the general relation between the number of GPUs and the minimum problem size required to use them efficiently, taking into account the limited PCIe bandwidth, memory bandwidth, and CPU resources. Based on this, we present an approach to scale our Linpack implementation to eight GPUs and achieve good GPU utilization in Linpack on future systems. Finally, we examine how energy efficiency, which has improved significantly with dual- and quad-GPU servers, scales to many-GPU systems.
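
A deliberately simplified back-of-the-envelope version of such a relation (an illustration only, not the paper's actual model): in one HPL iteration the trailing-matrix update on a panel of width $n_b$ costs about $2 n^2 n_b$ flops when $n \times n$ entries remain, while shipping the panel to the GPUs and the results back moves on the order of $16\, n\, n_b$ bytes of 8-byte doubles over PCIe. Keeping $G$ GPUs of peak $F$ flop/s each busy behind a shared PCIe bandwidth of $B$ bytes/s then requires

$$ \frac{2 n^2 n_b}{G F} \;\gtrsim\; \frac{16\, n\, n_b}{B} \quad\Longleftrightarrow\quad n \;\gtrsim\; \frac{8\, G F}{B}, $$

so the minimum problem size that hides the transfers grows linearly with the aggregate GPU performance, which is why many-GPU nodes demand large matrices.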

Collaboration


Dive into D. Rohr's collaborations.

Top Co-Authors

H. Engel (Goethe University Frankfurt)
M. Bach (Goethe University Frankfurt)
T. Alt (Goethe University Frankfurt)
T. Breitner (Goethe University Frankfurt)
D. Eschweiler (Goethe University Frankfurt)
J. Lehrbach (Goethe University Frankfurt)
M. Kretz (Goethe University Frankfurt)
Mikolaj Krzewicki (Goethe University Frankfurt)
T. Kollegger (Goethe University Frankfurt)