Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ulrich Brüning is active.

Publication


Featured research published by Ulrich Brüning.


ACM Transactions on Reconfigurable Technology and Systems | 2008

An open-source HyperTransport core

David Slogsnat; Alexander Giese; Mondrian Nüssle; Ulrich Brüning

This article presents the design of a generic HyperTransport (HT) core. HyperTransport is a packet-based interconnect technology for low-latency, high-bandwidth point-to-point connections, and the core is specifically optimized for very low latency. The core has been verified in-system using an FPGA. This exhaustive verification and the generic design allow the core to be mapped to both ASICs and FPGAs. The implementation described in this work supports a 16-bit link width, as used by Opteron processors. On a Xilinx Virtex-4 FX60, the core supports a link frequency of 400 MHz DDR and offers a maximum bidirectional bandwidth of 3.2 GB/s. The in-system verification has been performed using a custom FPGA board plugged into the HyperTransport extension connector (HTX) of a standard Opteron-based motherboard. HTX slots in Opteron-based motherboards allow very high-bandwidth, low-latency communication, since the HTX device is directly connected to one of the HyperTransport links of the processor. Performance analysis shows a unidirectional payload bandwidth of 1.4 GB/s and a read latency of 180 ns. The HT core in combination with the HTX board is an ideal base for prototyping systems and implementing FPGA coprocessors. The HT core is available as open source.
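
As a quick sanity check (not taken from the article), the quoted peak follows directly from the link parameters: a 16-bit link transferring data on both clock edges at 400 MHz moves

    \[ 16\,\text{bit} \times 2 \times 400\,\text{MHz} = 12.8\ \text{Gbit/s} = 1.6\ \text{GB/s} \]

per direction, i.e. 3.2 GB/s bidirectional; the measured 1.4 GB/s payload bandwidth is lower, presumably because part of the raw bandwidth is consumed by packet headers and control traffic.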


International Conference on Parallel Processing | 2009

A Resource Optimized Remote-Memory-Access Architecture for Low-latency Communication

Mondrian Nüssle; Martin Scherer; Ulrich Brüning

This paper introduces a new, highly optimized architecture for remote memory access (RMA). RMA, using put and get operations, is a one-sided communication function that is important, among other settings, in current and upcoming Partitioned Global Address Space (PGAS) systems. In this work, a virtualized hardware unit is described that is resource optimized, exhibits a high degree of overlap and processor offload, and has very good latency characteristics. To start an RMA operation, a single HyperTransport packet triggered by one CPU instruction is sufficient, reducing latency to an absolute minimum. In addition to the basic architecture, an implementation in FPGA technology is presented together with an evaluation of the target ASIC implementation. The current system can sustain more than 4.9 million transactions per second on the FPGA and exhibits an end-to-end latency of 1.2 μs for an 8-byte put operation. Both values are limited by the FPGA technology used for the prototype implementation. An estimation of the performance reachable with ASIC technology suggests that application-to-application latencies of less than 500 ns are feasible.
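
A minimal sketch of what such a single-instruction trigger could look like from software, assuming a memory-mapped requester register and a made-up descriptor layout (the field packing and the function name rma_put are illustrative, not the actual hardware interface of the RMA unit):

    #include <stdint.h>

    /* Hypothetical memory-mapped view of the RMA unit. */
    struct rma_unit {
        volatile uint64_t put_trigger;   /* requester register, mapped into user space */
    };

    /* Pack a small put request into one 64-bit word so that a single store
     * (one CPU instruction) results in a single HyperTransport write packet
     * arriving at the RMA unit.  Field widths are assumptions. */
    static inline void rma_put(struct rma_unit *u, uint16_t dest_node,
                               uint32_t remote_offset, uint16_t length)
    {
        uint64_t cmd = ((uint64_t)dest_node     << 48)
                     | ((uint64_t)remote_offset << 16)
                     | (uint64_t)length;
        u->put_trigger = cmd;            /* one store -> one HT packet */
    }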


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2013

On Achieving High Message Rates

Holger Fröning; Mondrian Nüssle; Heiner Litz; Christian Leber; Ulrich Brüning

Computer systems continue to increase in parallelism in all areas. Stagnating single-thread performance as well as power constraints prevent a reversal of this trend; on the contrary, current projections show that the trend towards parallelism will accelerate. In cluster computing, scalability, and therefore the degree of parallelism, is limited by the network interconnect and more specifically by the message rate it provides. We designed an interconnection network specifically for high message rates. Among other things, it reduces the burden on the software stack by relying on communication engines that perform a large fraction of the send and receive functionality in hardware. It also supports multi-core environments very efficiently through hardware-level virtualization of the communication engines. We provide details on the overall architecture, the thin software stack, performance results for a set of MPI-based benchmarks, and an in-depth analysis of how application performance depends on the message rate. We vary the message rate with software and hardware techniques and measure the application-level impact of different message rates. We also use this analysis to extrapolate performance for technologies with wider data paths and higher line rates.
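
The message-rate metric discussed here can be observed with a very simple MPI microbenchmark; the sketch below is an illustration, not one of the paper's benchmarks. It streams windows of 8-byte non-blocking sends between two ranks and reports the achieved messages per second:

    #include <mpi.h>
    #include <stdio.h>

    /* Two-rank message-rate microbenchmark: rank 0 streams windows of small
     * non-blocking sends to rank 1 and reports messages per second. */
    #define WINDOW 64
    #define ITERS  10000
    #define MSGSZ  8

    int main(int argc, char **argv)
    {
        static char buf[WINDOW][MSGSZ];
        MPI_Request req[WINDOW];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            for (int w = 0; w < WINDOW; w++) {
                if (rank == 0)
                    MPI_Isend(buf[w], MSGSZ, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
                else if (rank == 1)
                    MPI_Irecv(buf[w], MSGSZ, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
            }
            if (rank <= 1)
                MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("message rate: %.2f million messages/s\n",
                   (double)ITERS * WINDOW / (t1 - t0) / 1.0e6);

        MPI_Finalize();
        return 0;
    }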


International Conference on Networks | 2010

A Case for FPGA Based Accelerated Communication

Holger Fröning; Mondrian Nüssle; Heiner Litz; Ulrich Brüning

The use of Field Programmable Gate Arrays (FPGAs) in the area of High Performance Computing (HPC) to accelerate computations is well known. We present here a case where FPGAs can be used to speed up communication instead of computation. Current interconnects for HPC in particular lack support for fine-grained communication, which is increasingly found in various applications. To overcome this situation we developed a novel custom network. Because it is built solely from FPGAs, it can easily be reconfigured to custom needs. The main drawback of FPGAs is their limited performance, which is about one to two orders of magnitude lower than that of commercial (specialized) solutions. However, an architecture optimized for small packet sizes results in performance superior even to commercial high-performance solutions. This excellent communication performance is verified by results from several popular benchmarks. In summary, we present a case where FPGAs can be used to accelerate communication and outperform commercial interconnection networks for HPC.
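
One way to see why support for fine-grained communication matters more than raw link speed is the usual latency/bandwidth cost model (a textbook approximation, not taken from the paper): with per-message overhead L and peak bandwidth B, a message of s bytes takes

    \[ T(s) = L + \frac{s}{B}, \qquad B_{\mathrm{eff}}(s) = \frac{s}{L + s/B} \approx \frac{s}{L} \quad \text{for small } s, \]

so for small packets performance is governed by the per-message overhead (and hence by the achievable message rate 1/L) rather than by the peak bandwidth, which is exactly where a design tailored to small packet sizes can beat faster but overhead-heavier commercial interconnects.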


Reconfigurable Computing and FPGAs | 2009

An FPGA-Based Custom High Performance Interconnection Network

Mondrian Nüssle; Benjamin Geib; Holger Fröning; Ulrich Brüning

An FPGA-based prototype of custom high-performance network hardware has been implemented, integrating both a switch and a network interface in one FPGA. The network interfaces to the host processor over HyperTransport. About 85% of the slices of a Virtex-4 FX100 FPGA are occupied and 10 individual clock domains are used. Six of the MGT blocks of the device implement high-speed links to other nodes. Together with the integrated switch it is thus possible to build topologies with a node degree of up to 6, e.g. a 3D torus or a 6D hypercube. The target clock rate is 156 MHz, with the links running at 6.24 Gbit/s, and 200 MHz for the HyperTransport core. This goal was reached with a 32-bit wide data path in the network-switch and link blocks. The integrated switch reaches an aggregate bandwidth of more than 45 Gbit/s. The resulting interconnection network features a very low node-to-node latency, including switching, of close to 1 µs.
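
With a node degree of six, each node can attach its links to the ±x, ±y and ±z neighbours of a 3D torus; the fragment below only illustrates this neighbour relation in generic terms, it is not the routing logic of the presented design:

    /* Six links per node map naturally onto a 3D torus: one neighbour in
     * each direction of each dimension, with wrap-around at the edges. */
    typedef struct { int x, y, z; } coord_t;

    static coord_t torus_neighbor(coord_t c, int dim, int dir,
                                  int dx, int dy, int dz)
    {
        /* dim: 0 = x, 1 = y, 2 = z;  dir: +1 or -1;  dx/dy/dz: torus extents */
        coord_t n = c;
        if (dim == 0) n.x = (c.x + dir + dx) % dx;
        if (dim == 1) n.y = (c.y + dir + dy) % dy;
        if (dim == 2) n.z = (c.z + dir + dz) % dz;
        return n;
    }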


Archive | 2013

Accelerate Communication, not Computation!

Mondrian Nüssle; Holger Fröning; Sven Kapferer; Ulrich Brüning

Computer systems show a continuously increasing degree of parallelism in all areas. Stagnating single-thread performance as well as power constraints prevent a reversal of this trend; on the contrary, current projections show that the trend towards parallelism will accelerate. In cluster computing, scalability, and therefore the degree of parallelism, is limited by the network interconnect and its characteristics such as latency, message rate, overlap and bandwidth. While most interconnection networks focus on improving bandwidth, many applications are also very sensitive to latency, message rate and overlap. We present an interconnection network called EXTOLL, which is specifically designed to improve characteristics like latency, message rate and overlap, rather than focusing solely on bandwidth. Key techniques to achieve this are designing EXTOLL as an integral part of the HPC system, providing dedicated support for multi-core environments, and designing and optimizing EXTOLL from scratch for the needs of high-performance computing. The most important parts of EXTOLL are the network interface and the network switch, which is a crucial resource when scaling the network. EXTOLL's network interface provides dedicated support for small messages in the form of eager communication, and for bulk transfers in the form of rendezvous communication. Support for small messages is optimized mainly for high message rates and low latencies, while for bulk transfers the possible overlap between communication and computation is maximized. EXTOLL is completely based on FPGA technology, both for the network interface and for the switching. In this work we present a case for accelerated communication: FPGAs are not used to speed up computational processes; rather, we employ FPGAs to speed up communication. We show that in spite of the inferior performance characteristics of FPGAs compared to ASIC solutions, we can dramatically accelerate communication tasks and thus reduce the overall execution time.
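
The split between eager transfers for small messages and rendezvous transfers for bulk data can be summarized in a few lines; the threshold and function names below are invented for illustration and are not the EXTOLL software interface:

    #include <stdio.h>
    #include <stddef.h>

    #define EAGER_LIMIT 512   /* bytes; assumed cut-off, not a documented value */

    /* Stand-ins for the hardware-backed primitives described in the abstract. */
    static void eager_send(int dest, const void *buf, size_t len)
    {
        printf("eager send of %zu bytes to node %d\n", len, dest);
    }

    static void rendezvous_request(int dest, const void *buf, size_t len)
    {
        printf("rendezvous request for %zu bytes to node %d\n", len, dest);
    }

    /* Hypothetical send-side protocol selection: small messages go out
     * eagerly (optimized for message rate and latency), large ones use a
     * rendezvous handshake so the bulk transfer can overlap computation. */
    void send_message(int dest, const void *buf, size_t len)
    {
        if (len <= EAGER_LIMIT)
            eager_send(dest, buf, len);
        else
            rendezvous_request(dest, buf, len);
    }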


Proceedings of the Second International Symposium on Memory Systems | 2016

Exploring Time and Energy for Complex Accesses to a Hybrid Memory Cube

Juri Schmidt; Holger Fröning; Ulrich Brüning

Through-Silicon Vias (TSVs) and three-dimensional die-stacking technologies enable the combination of DRAM and CMOS die layers within a single stack, leading to stacked memory. Functionality that was previously associated with the microprocessor, e.g. memory controllers, can now be integrated into the memory cube, allowing the interface to be packetized for improved performance and reduced energy consumption per bit. Complex memory networks become feasible, as the logic layer can include routing functionality. The massive amount of connectivity among the different die layers provided by TSVs, in combination with the packetized interface, leads to a substantial improvement in memory access bandwidth. However, leveraging this vast bandwidth increase from an application point of view is not as simple as it seems. In this paper, we point out multiple pitfalls when accessing a stacked memory, namely the Hybrid Memory Cube (HMC) in combination with the publicly available openHMC host controller. The HMC's internal architecture still has many similarities with traditional DRAM chips, such as page-based access, but it is internally partitioned into multiple vaults. Each vault comprises a memory controller and multiple DRAM banks. Pages are rather small and are managed with a closed-page policy. Also, the ratio of read and write operations has an optimum of which the application should be aware. The built-in support for atomic operations sounds like a great opportunity for off-loading, but the impact of contention cannot be neglected. Besides exploring such performance pitfalls, we also start exploring the energy efficiency of memory accesses to stacked memory.
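
One of the pitfalls above stems from how addresses are spread over vaults and banks; the sketch below shows such an interleaving in principle only (the bit positions and the 32-vault/16-bank split are assumptions for illustration; the actual HMC address map depends on the configured block size and device):

    #include <stdint.h>

    /* Illustrative decomposition of a physical address into vault and bank
     * indices.  Bit positions are assumptions, not the HMC's actual map. */
    #define BLOCK_BITS 5   /* 32-byte maximum block size assumed */
    #define VAULT_BITS 5   /* 32 vaults assumed */
    #define BANK_BITS  4   /* 16 banks per vault assumed */

    static inline unsigned vault_of(uint64_t addr)
    {
        return (addr >> BLOCK_BITS) & ((1u << VAULT_BITS) - 1);
    }

    static inline unsigned bank_of(uint64_t addr)
    {
        return (addr >> (BLOCK_BITS + VAULT_BITS)) & ((1u << BANK_BITS) - 1);
    }

    /* Consecutive blocks land in different vaults, so streaming accesses
     * spread across the cube, while accesses that alias to the same vault
     * and bank pay the closed-page activation cost on every access. */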


Reconfigurable Computing and FPGAs | 2015

openHMC - a configurable open-source hybrid memory cube controller

Juri Schmidt; Ulrich Brüning

The link between the processor and memory is one of the last remaining parallel buses and a major performance bottleneck in computer systems. The Hybrid Memory Cube (HMC) was developed with the goal of helping overcome this memory wall. In contrast to DDRx memory interfaces, the HMC host interface is serial and packetized. This paper presents a vendor-agnostic, open-source implementation of an HMC host controller that can be configured for different datapath widths and HMC link variations. Due to its modular design, the controller can be integrated in many different system environments. In-system verification was performed using a Xilinx Ultrascale VU095 FPGA. Overall, the presented controller is a mature, freely available solution for experimenting with the HMC and evaluating its capabilities.
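
To make the contrast with a parallel DDRx bus concrete: every HMC access is a small request packet carrying command, address, length and tag fields, and the response is matched back via the tag. The struct below is only a conceptual illustration of such a header; the real layout is a packed bit field defined by the HMC specification, not the wire format produced by openHMC:

    #include <stdint.h>

    /* Conceptual HMC request-packet header; widths and packing simplified. */
    struct hmc_request {
        uint8_t  cmd;    /* command, e.g. read or (posted) write */
        uint64_t adrs;   /* target address inside the cube */
        uint8_t  lng;    /* packet length in 128-bit FLITs */
        uint16_t tag;    /* matches the later response packet */
    };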


International Conference on Networks | 2009

Efficient Virtualization of High-Performance Network Interfaces

Holger Fröning; Heiner Litz; Ulrich Brüning

The architecture of modern computing systems is becoming more and more parallel, in order to exploit the parallelism offered by applications and to increase the system's overall performance. This includes multiple cores per processor module, multi-threading techniques and the resurgence of interest in virtual machines. In spite of this amount of parallelism, the network interface is typically available only once. If the network interface is not able to exploit the offered parallelism, it becomes a bottleneck that limits the system's overall performance. To overcome this situation, a new virtualization method for network interfaces is proposed, relying on a speculative mechanism to enqueue work requests. It offers unconstrained and parallel access without any involvement of software instances, allowing any available parallelism to be exploited. The I/O interface is not modified, so its use remains unrestricted. Modern parallel computing systems and virtual machine environments can be significantly improved with this new virtualization technique.
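
The core idea is that every process (or virtual machine) gets its own memory-mapped slice of the network interface, so work requests can be enqueued in parallel and without a system call. Below is a hypothetical sketch of such a per-context enqueue path; names and layout are invented for illustration, and the speculative enqueue logic itself lives in the hardware, not in this code:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical per-context view of the network interface: each process
     * maps one such page, so work requests can be enqueued without any
     * kernel or hypervisor involvement and by many contexts in parallel. */
    #define WR_SLOTS 64

    struct work_request {
        uint64_t words[8];              /* opaque 64-byte work request */
    };

    struct nic_context {                /* one mmap()ed page per process/VM */
        struct work_request slot[WR_SLOTS];
        volatile uint64_t doorbell;     /* written last to hand a slot to HW */
    };

    static void enqueue(struct nic_context *ctx, uint64_t idx,
                        const struct work_request *wr)
    {
        memcpy(&ctx->slot[idx % WR_SLOTS], wr, sizeof(*wr));
        ctx->doorbell = idx;            /* single store notifies the hardware */
    }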


Field-Programmable Logic and Applications | 2009

An FPGA based verification platform for HyperTransport 3.x

Heiner Litz; Holger Fröning; Maximilian Thürmer; Ulrich Brüning

In this paper we present a verification platform designed for HyperTransport 3.x (HT3) applications. HyperTransport 3.x is a very low-latency, high-bandwidth chip-to-chip interconnect which is used in particular in AMD's Opteron processor series. As it is an open protocol, a broad application range exists, from southbridge chips and closely coupled accelerators to add-in cards. Its main advantage over PCI Express is that it allows a direct connection to the CPU, resulting in significantly improved latency. To enable the development of new HyperTransport products, we herein present the very first FPGA-based prototyping platform for HT3.x. Such a platform is enormously valuable, as new designs can be tested in real-world systems before producing a costly application-specific integrated circuit (ASIC). Due to the high operating frequencies of HT3.x, an FPGA-based solution is extremely challenging, as we describe in this paper. Our presented architecture is evaluated and implemented in the form of a printed circuit board (PCB). This add-in card represents the world's first available HyperTransport 3 device. Early adopters of HT3 benefit from the results of this work for rapid prototyping and hardware/software co-verification of new HT3 designs and products.

Collaboration


Dive into Ulrich Brüning's collaborations.
