Ulrich Bruening
Heidelberg University
Publication
Featured research published by Ulrich Bruening.
International Conference on Parallel Processing | 2008
Heiner Litz; Holger Froening; Mondrian Nuessle; Ulrich Bruening
This paper presents a novel stateless, virtualized communication engine for sub-microsecond latency. Using a field-programmable gate array (FPGA) based prototype, we show a latency of 970 ns between two machines with our virtualized engine for low overhead (VELO). The FPGA device is directly connected to the CPUs by a HyperTransport link. The described hardware architecture is optimized for small messages and avoids the overhead typically found with direct memory access (DMA) controlled transfers. The stateless approach allows the hardware unit to be used directly by many threads and processes simultaneously. It provides secure user-level communication with an extremely optimized start-up phase. Microbenchmark results are reported for both a proprietary API and OpenMPI.
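The key contrast is between DMA-controlled transfers, which need descriptor setup before any data moves, and programmed I/O into a stateless engine, where the CPU's stores are themselves the send. The toy model below sketches that idea; the class, register layout, and method names are invented for illustration and are not the paper's actual interface.

```python
# Hypothetical sketch of a stateless PIO send in the spirit of VELO:
# the CPU writes the small payload directly into a memory-mapped region
# of the NIC instead of building a DMA descriptor first.

MMIO_SIZE = 64  # bytes per send "slot" in this toy model (assumed)

class ToyVeloEngine:
    """Models the NIC side: each completed slot write becomes a message."""
    def __init__(self):
        self.delivered = []

    def pio_send(self, dest: int, payload: bytes) -> None:
        if len(payload) > MMIO_SIZE:
            raise ValueError("PIO path is for small messages only")
        # A real engine would observe the CPU stores on the I/O bus;
        # there is no per-connection state to look up (stateless design),
        # so many threads can use the unit concurrently.
        self.delivered.append((dest, bytes(payload)))

engine = ToyVeloEngine()
engine.pio_send(dest=3, payload=b"hello")
print(engine.delivered)  # [(3, b'hello')]
```

Because no descriptor ring or per-process context is consulted, the start-up cost of a send collapses to the stores themselves, which is what makes sub-microsecond small-message latency plausible.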
Radio Frequency Integrated Circuits Symposium | 2012
Bastian Mohr; Niklas Zimmermann; Bjoern Thorsten Thiel; Jan Henning Mueller; Yifan Wang; Ye Zhang; Frank Lemke; Richard Leys; Sven Schenk; Ulrich Bruening; Renato Negra; Stefan Heinen
This paper presents an RFDAC-based transmitter for wireless mobile and connectivity applications in a 65 nm CMOS technology. The transmitter RFDAC has a segmented architecture employing 4 LSB and 16 MSB unit cells for each I and Q path, thus providing a resolution of 8 bit + signum. Switchable LO drivers and unit cells with current shutdown are used to reduce power dissipation when transmitting signals with high PAPR, such as IEEE 802.11 (WLAN) or 3GPP Long Term Evolution (LTE). The frontend is capable of transmitting a 64-QAM OFDM WLAN signal at a center frequency of 1 GHz with an output power of -8 dBm and an EVM of 4.66%. Analog power dissipation is 34 mW, the clock and LO divider use less than 10 mW, and the digital block consumes about 87 mW. The area of the frontend is about 0.4 mm².
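One plausible reading of the segmentation is that the 16 MSB unit cells are thermometer-coded (covering the upper four magnitude bits) while the 4 LSB cells are binary weighted (the lower four bits), giving 8 magnitude bits plus a signum bit. The sketch below illustrates that code mapping; it is an assumption for illustration, not the verified circuit.

```python
# Hypothetical segmented code mapping for an "8 bit + signum" RFDAC:
# upper 4 magnitude bits select how many of 16 thermometer MSB cells
# conduct, lower 4 bits drive binary-weighted LSB cells.

def segment(code: int):
    """Split a signed code (-255..255) into (sign, msb_cells_on, lsb_bits)."""
    assert -255 <= code <= 255
    sign = 0 if code >= 0 else 1
    mag = abs(code)
    msb_on = mag >> 4       # 0..15 of the 16 MSB unit cells enabled
    lsb_bits = mag & 0xF    # binary code for the 4 LSB cells
    return sign, msb_on, lsb_bits

print(segment(-0x9C))  # (1, 9, 12): negative, 9 MSB cells on, LSB code 0b1100
```

Thermometer-coding the MSB segment keeps the DAC monotonic at major code transitions, while the binary LSB segment keeps the cell count (and area) small, which is the usual motivation for segmented DAC architectures.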
IEEE Transactions on Nuclear Science | 2010
Frank Lemke; David Slogsnat; Niels Burkhardt; Ulrich Bruening
This paper focuses on the interconnection network used as part of the Data Acquisition System of the Compressed Baryonic Matter (CBM) experiment at the Facility for Antiproton and Ion Research in Darmstadt, Germany. This experiment places special demands on the Data Acquisition System, such as limited space for hardware, radiation tolerance, flexibility for different types of network traffic, and support for synchronization mechanisms. What sets the CBM network apart is that it uses only a single bidirectional fiber link for all network functions and provides deterministic-latency messages for precise time synchronization. This led to the development of a new network and protocol.
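The value of a deterministic-latency message for synchronization can be seen with a one-line calculation: if the link delay is a known constant by design, a single timestamped message pins down the receiver's clock offset exactly, with no round-trip estimation. The numbers and names below are invented for illustration, not taken from the CBM protocol.

```python
# Toy illustration of deterministic-latency time synchronisation:
# with a fixed, known link delay, one timestamped message suffices.

LINK_LATENCY_NS = 120  # deterministic by design (assumed value)

def clock_offset(master_send_ts: int, slave_recv_ts: int) -> int:
    """Offset of the slave clock relative to the master, in ns."""
    return slave_recv_ts - (master_send_ts + LINK_LATENCY_NS)

# Master sends at t=1000 ns (its clock); slave stamps arrival at 1180 ns.
print(clock_offset(1000, 1180))  # 60 -> slave clock runs 60 ns ahead
```

With a non-deterministic link, the same calculation would carry the full latency jitter as error, which is why protocols over ordinary links need round-trip schemes instead.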
IEEE-NPSS Real-Time Conference | 2009
Frank Lemke; David Slogsnat; Niels Burkhardt; Ulrich Bruening
This paper focuses on the interconnection network used as part of the Data Acquisition System of the Compressed Baryonic Matter (CBM) experiment at the Facility for Antiproton and Ion Research in Darmstadt. This experiment places special demands on the Data Acquisition System, such as limited space for hardware, radiation tolerance, flexibility for all required operation modes of the detectors, and support for synchronization mechanisms. The distinguishing features of the CBM network are the use of only a single bidirectional fiber link for all network functions and the provision of deterministic-latency messages for precise time synchronization. This led to the development of a new network and protocol.
High-Performance Computer Architecture | 2015
Sarah Neuwirth; Dirk Frey; Mondrian Nuessle; Ulrich Bruening
On the road to Exascale computing, novel communication architectures are required to overcome the limitations of host-centric accelerators. Typically, accelerator devices require a local host CPU to configure and operate them, which limits the number of accelerators per host system. Network-attached accelerators are a new architectural approach for scaling the number of accelerators and host CPUs independently. In this paper, the communication architecture for network-attached accelerators is described, which enables remote initialization and control of the accelerator devices. Furthermore, an operative prototype implementation is presented. The prototype accelerator node consists of an Intel Xeon Phi coprocessor and an EXTOLL NIC. The EXTOLL interconnect provides new features to enable direct accelerator-to-accelerator communication without a local host. Workloads can be dynamically assigned to CPUs and accelerators at run time in an N-to-M ratio. The latency, bandwidth, and performance of the low-level implementation and the MPI communication layer are presented. The LAMMPS molecular dynamics simulator is used to evaluate the communication architecture. The internode communication time is improved by up to 47%.
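The N-to-M idea decouples the two device pools: hosts and network-attached accelerators are independent resources, and work units bind to either kind at run time rather than each accelerator being owned by one host. The toy round-robin scheduler below only illustrates that pooling; it is not the paper's assignment mechanism, and all names are invented.

```python
# Hedged sketch of N-to-M workload assignment: CPUs and network-attached
# accelerators form one flat device pool of arbitrary ratio, and work is
# bound to devices dynamically at run time (toy round-robin policy).

from itertools import cycle

def assign(workloads, cpus, accelerators):
    devices = cycle(cpus + accelerators)   # any N:M ratio, no pairing
    return {w: next(devices) for w in workloads}

# One host CPU serving two accelerators (a 1:2 ratio):
plan = assign(["md-step", "fft", "reduce"], ["cpu0"], ["phi0", "phi1"])
print(plan)  # {'md-step': 'cpu0', 'fft': 'phi0', 'reduce': 'phi1'}
```

In a host-centric design the accelerator list could never be longer than the CPU list; here the two grow independently, which is the scaling property the architecture is after.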
IEEE Transactions on Nuclear Science | 2013
Frank Lemke; Ulrich Bruening
This paper focuses on the design and concepts of the hierarchically structured Data Acquisition (DAQ) network for the Compressed Baryonic Matter (CBM) experiment at the Facility for Antiproton and Ion Research (FAIR) in Darmstadt. Due to the limited physical space and the requirement for a multi-TB/s DAQ within the system, this experiment requires compact read-out hardware that provides high bandwidth. Additionally, different detector types must be supported, which requires some flexibility regarding read-out structures and a common synchronization mechanism. Another important goal is to provide hardware that supports an identical interconnection protocol within the complete network to avoid protocol conversions. This led to the design and implementation of modular and unified read-out hardware and firmware.
IEEE-NPSS Real-Time Conference | 2012
Frank Lemke; Ulrich Bruening
This paper focuses on the design and concepts of the hierarchically structured Data Acquisition (DAQ) network for the Compressed Baryonic Matter (CBM) experiment at the Facility for Antiproton and Ion Research (FAIR) in Darmstadt. Due to the limited physical space and the requirement for a multi-TB/s DAQ within the system, this experiment requires compact read-out hardware that provides high bandwidth. Additionally, different detector types must be supported, which requires some flexibility regarding read-out structures and a common synchronization mechanism. Another important goal is to provide hardware supporting an identical interconnection protocol within the complete network to avoid protocol conversions. This led to the design and implementation of new, general, and unified read-out hardware and firmware.
Conference on Ph.D. Research in Microelectronics and Electronics | 2013
Bastian Mohr; Jan Henning Mueller; Ye Zhang; Richard Leys; Sven Schenk; Ulrich Bruening; Stefan Heinen
This paper presents a high-speed serial PLL-less interface suitable for use in mobile transmitters. The interface uses current-mode signaling to reduce both ground bouncing and the crosstalk impact on the mobile frontend. A digitally controlled delay line is employed to adjust the sampling point of the high-speed serial clock. The data is 8b/10b encoded for word recovery and signaling of configuration packets. The interface is self-initializing and distinguishes between signal and configuration data. It consists of three lanes from the FPGA to the ASIC and one lane in the backward direction to debug the internal ASIC signals. The interface is able to transfer up to 1.6 Gbit/s per lane and consumes 3 mA from a 1.2 V supply.
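A useful property of 8b/10b here is that its control ("K") characters are symbols that can never appear as encoded data, so the receiver can tell configuration packets from signal data in the same serial stream without a separate control line. The framing sketch below illustrates that idea with placeholder symbol values; the actual interface's symbol assignments and packet format are not given in the abstract.

```python
# Illustrative framing sketch: a reserved control symbol (stand-in for an
# 8b/10b K-character) marks the next word as configuration data, letting
# one serial stream carry both signal and configuration traffic.

K_CONFIG = 0x1BC   # placeholder 10-bit control symbol (assumed)

def parse(stream):
    """Split a symbol stream into (config_words, data_words)."""
    cfg, data = [], []
    it = iter(stream)
    for sym in it:
        if sym == K_CONFIG:
            cfg.append(next(it))   # word following the marker is config
        else:
            data.append(sym)
    return cfg, data

cfg, data = parse([0x0AA, K_CONFIG, 0x055, 0x0F0])
print(cfg, data)  # [85] [170, 240]
```

This in-band signaling is also what makes self-initialization possible: link bring-up and configuration can run over the same lanes that later carry payload.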
Proceedings of SPIE | 2011
Denis Wohlfeld; Frank Lemke; Holger Froening; Sven Schenk; Ulrich Bruening
Evolution in high-performance computing (HPC) leads to increasing demands on bandwidth, connectivity, and flexibility. Active optical cables (AOCs) are of special interest, combining the benefits of electrical connectors and optical transmission. Optimization and development of AOC solutions require overcoming several technology barriers. The area and volume occupied by connectors are of special interest within HPC networks. This led to the development of a 12x AOC for the mini-HT connector, creating the densest AOC available. In order to integrate electrical-optical conversion into a module no higher than 3 mm, a new concept for coupling fibers to VCSELs or photodiodes had to be developed. This unique concept is based on a direct replication process of an integrated fiber coupler consisting of a 90° light-deflecting and focusing mirror, a fiber guiding structure, and a fiber funnel. The integrated fiber coupler is replicated directly on top of the active components, reducing the distance between active components and fibers to a minimum and thus providing highly efficient light coupling. As an AOC prototype, multi-chip modules (MCMs) including the complete electrical-to-optical conversion for send and receive, connected by two 12x fiber ribbons, have been developed. The paper presents the integrated fiber coupling technique as well as design and measurement data of the prototype.
International Conference on Cluster Computing | 2010
Heiner Litz; Maximilian Thuermer; Ulrich Bruening
So far, large computing clusters consisting of several thousand machines have been constructed by connecting nodes using interconnect technologies such as Ethernet, InfiniBand, or Myrinet. We propose an entirely new architecture called Tightly Coupled Cluster (TCCluster) that instead uses the native host interface of the processors as a direct network interconnect. This approach offers higher bandwidth and much lower communication latencies than the traditional approaches by virtually integrating the network interface adapter into the processor. Our approach is purely software based: it neither modifies the processors nor requires any additional hardware. Instead, we use commodity off-the-shelf AMD processors and exploit the HyperTransport host interface as a cluster interconnect. In this paper, we explain the addressing of nodes in such a cluster, the routing within such a system, and the programming model that can be applied. We present a detailed description of the tasks that need to be addressed and provide a proof-of-concept implementation. For the evaluation of our technique, a two-node TCCluster prototype is presented, for which the BIOS firmware, a custom Linux kernel, and a small message library have been developed. We present microbenchmarks that show a sustained bandwidth of up to 2500 MB/s for messages as small as 64 bytes and a communication latency of 227 ns between two nodes, outperforming other high-performance networks by an order of magnitude.
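The addressing idea can be modeled in a few lines: because HyperTransport exposes one flat physical address space across the coupled nodes, a plain store into another node's address window is the message send, with no NIC, driver, or DMA engine in the path. The address partitioning and class below are invented for illustration and are not the paper's actual memory map.

```python
# Toy model of the TCCluster addressing idea: each node owns a window of
# a shared flat "physical" address space, and routing is simply address
# decoding, so an ordinary store to a remote window is a message send.

NODE_WINDOW = 1 << 20   # 1 MiB window per node (assumed size)

class FlatAddressSpace:
    def __init__(self, nodes: int):
        self.mem = bytearray(nodes * NODE_WINDOW)

    def remote_addr(self, node: int, offset: int) -> int:
        return node * NODE_WINDOW + offset   # routing = address range

    def store(self, addr: int, data: bytes) -> None:
        self.mem[addr:addr + len(data)] = data   # looks like local memory

space = FlatAddressSpace(nodes=2)
# Node 0 "sends" to node 1 by storing into node 1's window:
space.store(space.remote_addr(1, 0), b"ping")
print(bytes(space.mem[NODE_WINDOW:NODE_WINDOW + 4]))  # b'ping'
```

Because the send path is just cacheable or write-combining stores, the software's main jobs are exactly the ones the paper lists: partitioning the address map (BIOS), keeping the OS inside its own window (custom kernel), and layering messaging on top (the small message library).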