Bernard Metzler
IBM
Publications
Featured research published by Bernard Metzler.
IEEE Network | 2003
Robert Haas; Lukas Kencl; Andreas Kind; Bernard Metzler; Roman A. Pletka; Marcel Waldvogel; Laurent Frelechoux; Patrick Droz; Clark Jeffries
In this article we present five case studies of advanced networking functions that detail how a network processor (NP) can provide high performance along with the necessary flexibility compared with an ASIC. We first review the basic NP system architectures, and describe the IBM PowerNP architecture from the data-plane as well as the control-plane point of view. We introduce models for the programmer's view of NPs that facilitate a global understanding of NP software programming. Then, for each case study, we present results from prototypes as well as general considerations that apply to a wider range of system architectures. Specifically, we investigate the suitability of NPs for QoS (active queue management and traffic engineering), header processing (GPRS tunneling protocol), intelligent forwarding (load balancing without flow disruption), payload processing (code interpretation and just-in-time compilation in active networks), and protocol stack termination (SCTP). Finally, we summarize the key features revealed by each case study, and conclude with remarks on the future of NPs.
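The QoS case study covers active queue management. As a generic illustration of that class of algorithm (not the PowerNP implementation, which the abstract does not detail), a minimal Random Early Detection sketch in C:

```c
/* Minimal sketch of Random Early Detection (RED), the classic active
 * queue management scheme of the kind evaluated in the QoS case study.
 * Textbook RED, not the PowerNP code; thresholds are illustrative. */
#include <stdbool.h>
#include <stdlib.h>

#define MIN_TH   20.0   /* start probabilistic dropping here */
#define MAX_TH   80.0   /* drop everything above this level  */
#define MAX_P     0.1   /* drop probability at MAX_TH        */
#define WEIGHT    0.002 /* EWMA weight for the average queue */

static double avg_qlen;  /* exponentially weighted average queue length */

/* Returns true if the arriving packet should be dropped. */
bool red_should_drop(int instant_qlen)
{
    avg_qlen = (1.0 - WEIGHT) * avg_qlen + WEIGHT * instant_qlen;

    if (avg_qlen < MIN_TH)
        return false;                    /* queue short: always enqueue */
    if (avg_qlen >= MAX_TH)
        return true;                     /* queue long: always drop     */

    /* In between: drop with probability rising linearly to MAX_P. */
    double p = MAX_P * (avg_qlen - MIN_TH) / (MAX_TH - MIN_TH);
    return ((double)rand() / RAND_MAX) < p;
}
```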
IEEE Second Conference on Open Architectures and Network Programming (OPENARCH '99) | 1999
Bernard Metzler; Till Harbaum; Ralph Wittmann; Martina Zitterbart
AMnet flexibly provides communication services inside the network. It is based on active networking and on a hardware/software codesign in order to improve efficiency. Group communication is explicitly addressed since it is an important paradigm for existing and emerging networked applications. The goal of the AMnet approach is the provision of scalable quality-based support for heterogeneous group communication. It uses so-called service modules for efficient and flexible service support within intermediate systems. This paper gives an overview of AMnet. The design of an AMnode as an active intermediate system with hardware-supported service capabilities is presented. Furthermore, a simple control and signalling suite for heterogeneous multicast services is proposed.
Asia Pacific Workshop on Systems | 2011
Animesh Trivedi; Bernard Metzler; Patrick Stuedi
Modern cloud computing infrastructures are steadily pushing the performance of their network stacks. At the hardware level, some cloud providers have already upgraded parts of their network to 10GbE. At the same time there is a continuous effort within the cloud community to improve the network performance inside the virtualization layers. The low-latency/high-throughput properties of those network interfaces are not only opening the cloud for HPC applications; they will also be well received by traditional large-scale web applications and data processing frameworks. However, as commodity networks get faster, the burden on the end hosts increases. Inefficient memory copying in socket-based networking takes up a significant fraction of the end-to-end latency and also creates serious CPU load on the host machine. Years ago, the supercomputing community developed RDMA network stacks such as InfiniBand that offer both low end-to-end latency and a low CPU footprint. While adapting RDMA to the commodity cloud environment is difficult (it is costly and requires special hardware), we argue in this paper that most of the benefits of RDMA can in fact be provided in software. To demonstrate our findings we have implemented and evaluated a prototype of a software-based RDMA stack. Compared to a socket/TCP approach (with TCP receive copy offload), our results show a significant reduction in end-to-end latency for messages larger than a modest 64 kB, and a reduction in CPU load (without TCP receive copy offload) for better efficiency while saturating the 10 Gbit/s link.
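To make the verbs-style data path concrete, here is a minimal sketch using libibverbs of the one-time buffer registration that lets an RDMA stack, in hardware or in software, send without per-message copies. Queue-pair connection setup via rdma_cm is elided for brevity, so the program only prepares resources:

```c
/* Sketch of the user-visible RDMA resource setup (link with -libverbs).
 * A software RDMA stack, such as the in-kernel SoftiWARP driver (siw),
 * shows up in the same device list as a hardware RNIC would. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA device found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* One-time registration pins the buffer and hands the NIC (or the
     * software stack) a direct reference: this is what removes the
     * per-message copy that socket send()/recv() pays. */
    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    printf("registered 4 kB at %p, lkey=0x%x\n", (void *)buf, mr->lkey);

    /* ... create and connect a queue pair, then ibv_post_send() work
     * requests referencing mr->lkey and reap them with ibv_poll_cq() ... */

    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```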
International Conference on Supercomputing | 2014
Felix Schürmann; Fabien Delalondre; Pramod S. Kumbhar; John Biddiscombe; Miguel Gila; Davide Tacchella; Alessandro Curioni; Bernard Metzler; Peter Morjan; Joachim Fenkes; Michele M. Franceschini; Robert S. Germain; Lars Schneidenbach; T. J. C. Ward; Blake G. Fitch
Storage class memory is receiving increasing attention for use in HPC systems for the acceleration of intensive IO operations. We report a particular instance using SLC FLASH memory integrated with an IBM BlueGene/Q supercomputer at scale (Blue Gene Active Storage, BGAS). We describe two principal modes of operation of the non-volatile memory: 1) a block device; 2) direct storage access (DSA). The block device layer, built on the DSA layer, provides compatibility with IO layers common to existing HPC IO systems (POSIX, MPIO, HDF5) and is expected to provide high performance in bandwidth-critical use cases. The novel DSA strategy enables a low-overhead, byte-addressable, asynchronous, kernel-bypass access method for very high user-space IOPS in multithreaded application environments. Here, we expose DSA through HDF5 using a custom file driver. Benchmark results for the different modes are presented, and scale-out to full system size showcases the capabilities of this technology.
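A sketch contrasting the two access modes; the DSA calls are hypothetical stand-ins, since the abstract does not show the actual API, and the device path is made up:

```c
/* Mode 1 keeps POSIX compatibility; mode 2 (DSA) trades that for
 * byte-granular, asynchronous, kernel-bypass access. */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* hypothetical DSA-style interface, for illustration only */
typedef int dsa_handle_t;
int dsa_read_async(dsa_handle_t h, uint64_t byte_off, void *dst,
                   size_t len, void (*done)(void *), void *arg);

int main(void)
{
    char buf[512];

    /* Mode 1: block device layer, ordinary POSIX I/O in block units. */
    int fd = open("/dev/bgas0", O_RDONLY);     /* illustrative path */
    if (fd >= 0) {
        pread(fd, buf, sizeof(buf), 0);
        close(fd);
    }

    /* Mode 2 (sketch): byte-addressable and asynchronous, with no
     * syscall per operation:
     * dsa_read_async(h, 12345, buf, 17, completion_cb, NULL); */
    (void)buf;
    return 0;
}
```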
Symposium on Cloud Computing | 2014
Patrick Stuedi; Animesh Trivedi; Bernard Metzler; Jonas Pfefferle
Remote Procedure Call (RPC) has been a cornerstone of distributed systems since the early 1980s. Recently, new classes of large-scale distributed systems running in data centers have posed extra challenges for RPC systems in terms of scaling and latency. We find that existing RPC systems make poor use of resources (CPU, memory, network) and are not ready to handle these upcoming workloads. In this paper we present DaRPC, an RPC framework which uses RDMA to implement a tight integration between RPC message processing and network processing in user space. DaRPC efficiently distributes computation, network resources, and RPC resources across cores and memory to achieve a high aggregate throughput (2-3M ops/sec) at a very low per-request latency (10μs with iWARP). In the evaluation we show that DaRPC can boost the RPC performance of existing distributed systems in the cloud by more than an order of magnitude in both throughput and latency.
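One way to read "distributes computation, network resources, and RPC resources across cores" is: steer each connection's RPC processing to one core-local queue, so no lock or cross-core transfer sits on the critical path. The sketch below is a hypothetical illustration of that principle, not the DaRPC API:

```c
/* Per-core RPC steering sketch; rpc_ring, steer(), and NCORES are
 * made-up names illustrating the design idea, not DaRPC itself. */
#include <stdint.h>
#include <stdio.h>

#define NCORES 4
#define RING   256

struct rpc_msg { uint64_t conn_id; uint32_t op; uint32_t len; };

/* One ring per core: completions for a given connection always land
 * on the same ring, so a core never synchronizes with its siblings. */
struct rpc_ring { struct rpc_msg slot[RING]; unsigned head, tail; };
static struct rpc_ring rings[NCORES];

static unsigned steer(uint64_t conn_id) { return conn_id % NCORES; }

/* Called from the polling loop when the RNIC delivers a request. */
static void on_receive(struct rpc_msg m)
{
    struct rpc_ring *r = &rings[steer(m.conn_id)];
    r->slot[r->tail++ % RING] = m;   /* enqueue on the owning core */
}

int main(void)
{
    on_receive((struct rpc_msg){ .conn_id = 7, .op = 1, .len = 64 });
    printf("conn 7 steered to core %u\n", steer(7));
    return 0;
}
```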
Symposium on Cloud Computing | 2013
Patrick Stuedi; Bernard Metzler; Animesh Trivedi
Network latency has become increasingly important for data center applications. Accordingly, several efforts at both the hardware and software level have been made to reduce latency in data centers. Limited attention, however, has been paid to the network latencies of distributed systems running inside an application container such as the Java Virtual Machine (JVM) or the .NET runtime. In this paper, we first highlight the latency overheads observed in several well-known Java-based distributed systems. We then present jVerbs, a networking framework for the JVM which achieves bare-metal latencies on the order of single-digit microseconds using methods of Remote Direct Memory Access (RDMA). With jVerbs, applications map the network device directly into the JVM, cutting through both the application virtual machine and the operating system. In the paper, we discuss the design and implementation of jVerbs and demonstrate how it can be used to improve latencies in some of the popular distributed systems running in data centers.
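jVerbs itself is a Java API; the underlying mechanism, mapping device queues and doorbell registers straight into the application's address space, can be sketched in C. The device path and register offset below are made up for illustration:

```c
/* Sketch of the kernel-bypass mechanism jVerbs relies on: device
 * registers are mmap()ed into user space once, and every subsequent
 * posting is a plain store, with no syscall on the data path.
 * "/dev/hypothetical_rnic" and the doorbell offset are fictitious. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/hypothetical_rnic", O_RDWR);  /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }

    /* Map one page of device registers into user space once ... */
    volatile uint32_t *regs =
        mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... then sending is a store; in jVerbs the same mapping is
     * reached from Java through direct byte buffers. */
    regs[0] = 1;   /* ring the (made-up) doorbell for queue 0 */

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}
```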
IBM Journal of Research and Development | 2010
Fredy D. Neeser; Bernard Metzler; Philip Werner Frey
Remote direct memory access (RDMA) allows for the minimization of CPU and memory bus loads associated with network I/O. The Transmission Control Protocol/Internet Protocol (TCP/IP)-based Internet Wide Area RDMA Protocol (iWARP) stack now makes RDMA available for Ethernet local area networks and wide area networks. As 10-Gb/s Ethernet becomes deployed in data centers and Ethernet link speeds continue to increase faster than the memory bus bandwidth, the capability of RDMA to eliminate all intrahost data copy operations related to network I/O makes it attractive for accelerating TCP/IP. Whereas RDMA network adapters offload iWARP/TCP functionality to dedicated hardware, we have designed an onloaded iWARP software implementation, called SoftRDMA, which runs on the host CPU, closely integrated with Linux® TCP kernel sockets. SoftRDMA offers asynchronous nonblocking user-space I/O. It enables iWARP with conventional Ethernet adapters, as well as in mixed iWARP hardware and software environments, facilitating RDMA system integration. Furthermore, SoftRDMA benefits client-server applications with asymmetric loads and high aggregate throughput on the server. With iWARP checksumming turned off, SoftRDMA delivers a throughput exceeding 5 Gb/s using a single CPU core. We suggest zero-copy transmission in software and hardware acceleration for the iWARP framing layer for even higher performance.
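For orientation, a condensed sketch of the iWARP framing the last sentence refers to, paraphrased from RFCs 5040/5041/5044 and simplified (no markers or padding handling); it illustrates where the CRC and direct data placement live in the stack SoftRDMA implements:

```c
/* How an iWARP RDMA Write travels over TCP: RDMAP/DDP headers ride
 * inside an MPA framed PDU (FPDU). Field layout is condensed and
 * approximate; consult the RFCs for the authoritative wire format. */
#include <stdint.h>

/* MPA framing (RFC 5044): length up front so a receiver on a byte
 * stream can find FPDU boundaries; a CRC32c trails the payload. */
struct mpa_fpdu_hdr {
    uint16_t ulpdu_len;    /* length of the DDP segment that follows */
};

/* DDP tagged-buffer header (RFC 5041), carrying RDMAP control bits:
 * the steering tag and offset let the receiver place the payload
 * directly into the registered buffer, with no intermediate copy. */
struct ddp_tagged_hdr {
    uint8_t  ctrl;         /* T/L flags, DDP version              */
    uint8_t  rsvd_ulp;     /* RDMAP version + opcode live here    */
    uint32_t stag;         /* remote memory region to write into  */
    uint64_t to;           /* byte offset within that region      */
} __attribute__((packed));
```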
International Parallel and Distributed Processing Symposium | 2016
Stefan Eilemann; Fabien Delalondre; Jon Bernard; Judit Planas; Felix Schuermann; John Biddiscombe; Costas Bekas; Alessandro Curioni; Bernard Metzler; Peter Kaltstein; Peter Morjan; Joachim Fenkes; Ralph Bellofatto; Lars Schneidenbach; T. J. Christopher Ward; Blake G. Fitch
Scientific workflows are often composed of compute-intensive simulations and data-intensive analysis and visualization, both equally important for productivity. High-performance computers run the compute-intensive phases efficiently, but data-intensive processing still receives less attention. Dense non-volatile memory integrated into supercomputers can help address this problem. In addition to density, it offers significantly finer-grained I/O than disk-based I/O systems. We present a way to exploit the fundamental capabilities of Storage-Class Memory (SCM), such as Flash, by using scalable key-value (KV) I/O methods instead of the traditional file I/O calls commonly used in HPC systems. Our objective is to enable higher performance for on-line and near-line storage for analysis and visualization of very high resolution, but correspondingly transient, simulation results. In this paper, we describe 1) the adaptation of a scalable key-value store to a BlueGene/Q system with integrated Flash memory, 2) a novel key-value aggregation module which implements coalesced, function-shipped calls between the clients and the servers, and 3) the refactoring of a scientific workflow to use application-relevant keys for fine-grained data subsets. The resulting implementation is analogous to function-shipping of POSIX I/O calls but shows an order-of-magnitude increase in read IOPS and a 2.5x increase in write IOPS (11 million read IOPS, 2.5 million write IOPS from 4096 compute nodes) when compared to a classical file system on the same system. It represents an innovative approach to the integration of SCM within an HPC system at scale.
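A sketch of the workflow refactoring described in point 3: each analysis-relevant subset is written under an application-level key, instead of at a byte offset inside one large shared file. kv_put() and the key scheme are hypothetical stand-ins for the store's client API:

```c
/* Key-per-subset checkpointing sketch; the real client API is not
 * shown in the abstract, so kv_put() is stubbed to keep this runnable. */
#include <stdio.h>

static int kv_put(const char *key, const void *val, size_t len)
{
    printf("PUT %s (%zu bytes)\n", key, len);   /* stand-in for the store */
    (void)val;
    return 0;
}

static void checkpoint_timestep(int rank, int step, const float *v, size_t n)
{
    char key[64];
    /* The key names an application-relevant subset (rank + timestep),
     * so a reader later fetches exactly that piece instead of
     * computing byte offsets into a shared file. */
    snprintf(key, sizeof(key), "sim/voltage/rank%04d/step%06d", rank, step);
    kv_put(key, v, n * sizeof(float));
}

int main(void)
{
    float voltages[8] = { 0 };
    checkpoint_timestep(3, 120, voltages, 8);
    return 0;
}
```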
Virtual Execution Environments | 2015
Jonas Pfefferle; Patrick Stuedi; Animesh Trivedi; Bernard Metzler; Ioannis Koltsidas; Thomas R. Gross
DMA-capable interconnects, providing ultra-low latency and high bandwidth, are increasingly being used in the context of distributed storage and data processing systems. However, the deployment of such systems in virtualized data centers is currently inhibited by the lack of a flexible and high-performance virtualization solution for RDMA network interfaces. In this work, we present a hybrid virtualization architecture which builds upon the concept of separation of paths for control and data operations available in RDMA. With hybrid virtualization, RDMA control operations are virtualized using hypervisor involvement, while data operations are set up to bypass the hypervisor completely. We describe HyV (Hybrid Virtualization), a virtualization framework for RDMA devices implementing such a hybrid architecture. In the paper, we provide a detailed evaluation of HyV for different RDMA technologies and operations. We further demonstrate the advantages of HyV in the context of a real distributed system by running RAMCloud on a set of HyV-enabled virtual machines deployed across a 6-node RDMA cluster. All of the performance results we obtained illustrate that hybrid virtualization enables bare-metal RDMA performance inside virtual machines while retaining the flexibility typically associated with paravirtualization.
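The control/data split can be sketched as follows; all names are hypothetical, since the abstract does not expose HyV's interfaces:

```c
/* Hybrid virtualization sketch: control verbs are rare and trap to the
 * hypervisor; data-path postings are plain stores into queues the
 * hypervisor has already mapped into the guest. */
#include <stdint.h>
#include <stdio.h>

/* --- control path: rare, paravirtualized ------------------------- */
static int hypercall_create_qp(int guest_id)
{
    /* Stand-in for a trap into the hypervisor, which validates the
     * request, creates the queue pair, and maps its rings and
     * doorbell into this guest's address space. */
    printf("hypervisor: QP created for guest %d, rings mapped\n", guest_id);
    return 42;   /* made-up QP number */
}

/* --- data path: per-message, hypervisor-free --------------------- */
static volatile uint32_t fake_doorbell;   /* stands in for mapped MMIO */

static void post_send(int qpn)
{
    /* On real hardware this store goes straight to the device; the
     * hypervisor never sees it, which is why per-message latency can
     * match bare metal. */
    fake_doorbell = (uint32_t)qpn;
}

int main(void)
{
    int qpn = hypercall_create_qp(1);   /* slow path, once */
    for (int i = 0; i < 3; i++)
        post_send(qpn);                 /* fast path, every message */
    return 0;
}
```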
Conference on Emerging Networking Experiments and Technologies | 2013
Animesh Trivedi; Bernard Metzler; Patrick Stuedi; Thomas R. Gross
The performance of large-scale data-intensive applications running on thousands of machines depends considerably on the performance of the network. To deliver better application performance on rapidly evolving high-bandwidth, low-latency interconnects, researchers have proposed the use of network accelerator devices. However, despite the initial enthusiasm, translating the capabilities of network accelerators into high application performance remains a challenging issue. In this paper, we describe our experience and discuss issues that we uncovered with network acceleration using Remote Direct Memory Access (RDMA) capable network controllers (RNICs). RNICs offload the complete packet processing onto the network controller and provide direct user-space access to the networking hardware. Our analysis shows that multiple (un)related factors significantly influence the performance gains for the end application. We identify factors that span the whole stack, ranging from low-level architectural issues (cache and DMA interaction, hardware prefetching) to high-level application parameters (buffer size, access pattern). We discuss the implications of our findings for application performance and for the future integration of network acceleration technology within systems.
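Buffer size is one of the application-level parameters named above. A generic host-side sweep of the kind used to separate cache-resident from DRAM-resident behaviour (an illustration, not the paper's benchmark harness):

```c
/* Copy the same total byte count with different working-set sizes:
 * once the buffer outgrows the caches, measured bandwidth drops,
 * which is exactly the kind of whole-stack effect the paper studies. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t total = 1UL << 28;               /* 256 MB per run */
    for (size_t buf = 4096; buf <= (1UL << 24); buf <<= 2) {
        char *src = malloc(buf), *dst = malloc(buf);
        memset(src, 1, buf);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t done = 0; done < total; done += buf)
            memcpy(dst, src, buf);                /* reuse same buffer */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* touch the destination so the copies cannot be optimized away */
        if (dst[buf - 1] == 2) puts("unreachable");

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("buf=%8zu B  ->  %6.2f GB/s\n", buf, total / s / 1e9);
        free(src); free(dst);
    }
    return 0;
}
```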