Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Michael Kagan is active.

Publication


Featured research published by Michael Kagan.


High Performance Interconnects | 2005

Zero copy sockets direct protocol over InfiniBand - preliminary implementation and performance analysis

Dror Goldenberg; Michael Kagan; Ran Ravid; Michael S. Tsirkin

Sockets Direct Protocol (SDP) is a byte-stream transport protocol that implements the TCP SOCK_STREAM semantics using the transport offload capabilities of the InfiniBand fabric. Under the hood, SDP supports a zero-copy (ZCopy) operation mode, using the InfiniBand RDMA capability to transfer data directly between application buffers. Alternatively, in buffer copy (BCopy) mode, data is copied to and from transport buffers. In the initial open-source SDP implementation, ZCopy mode was restricted to asynchronous I/O operations. We added prototype ZCopy support for synchronous send()/recv() socket calls. This paper presents the major architectural aspects of the SDP protocol, the ZCopy implementation, and a preliminary performance evaluation. We show substantial benefits of ZCopy when multiple connections run in parallel on the same host. For example, with 8 connections simultaneously active, enabling ZCopy raises bandwidth from 500 MB/s to 700 MB/s while reducing CPU utilization by a factor of eight.
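
As a minimal illustration of the transparency claim, the sketch below is an ordinary SOCK_STREAM client in C; because SDP preserves TCP socket semantics, code of this shape can be redirected over the InfiniBand fabric (for example via the libsdp preload library used with the open-source stack) without modification. The address, port, and message size are placeholders, and the BCopy/ZCopy decision happens inside SDP, not in application code.

    /* Plain SOCK_STREAM client (sketch). SDP keeps TCP socket semantics,
     * so code like this can run unchanged over InfiniBand, e.g. via the
     * libsdp preload library; the transport decides between BCopy and
     * ZCopy internally. Address and port below are placeholders. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);        /* ordinary stream socket */

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(5001);                   /* placeholder port */
        inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr); /* placeholder address */

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }

        /* Large sends are where ZCopy pays off: the application buffer is
         * pinned and moved by RDMA instead of being copied into transport
         * buffers. */
        static char buf[1 << 20];
        memset(buf, 'x', sizeof(buf));
        ssize_t n = send(fd, buf, sizeof(buf), 0);
        printf("sent %zd bytes\n", n);

        close(fd);
        return 0;
    }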


Computer Science - Research and Development | 2011

The development of Mellanox/NVIDIA GPUDirect over InfiniBand--a new model for GPU to GPU communications

Gilad Shainer; Ali Ayoub; Pak Lui; Tong Liu; Michael Kagan; Christian R. Trott; Greg Scantlen; Paul S. Crozier

The usage and adoption of general-purpose GPUs (GPGPU) in HPC systems is increasing due to the unparalleled performance advantage of GPUs and their ability to satisfy the ever-increasing demand for floating-point operations. While the GPU can offload many of an application's parallel computations, the system architecture of a GPU-CPU-InfiniBand server requires the CPU to initiate and manage memory transfers between remote GPUs over the high-speed InfiniBand network. In this paper we introduce GPUDirect, a new technology that enables Tesla GPUs to transfer data over InfiniBand without CPU involvement or buffer copies, dramatically reducing GPU communication time and increasing overall system performance and efficiency. We also present a first exploration of the performance benefits of GPUDirect using the Amber and LAMMPS applications.
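
The data path this work targets can be pictured with the sketch below: a CUDA-aware MPI exchange in which device pointers are handed directly to the MPI library, so GPU-resident data reaches the InfiniBand HCA without CPU-managed staging copies in the application. Passing device pointers to MPI_Send/MPI_Recv is a convention of later CUDA-aware MPI libraries and is shown only as an assumption to illustrate the idea, not as the exact interface described in the paper.

    /* Sketch: two ranks exchange a GPU-resident buffer through MPI.
     * With a CUDA-aware MPI built on GPUDirect-style transports, the
     * device pointer is passed straight to the library, avoiding
     * explicit host staging copies in the application. */
    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Message buffer allocated directly in GPU memory. */
        const size_t count = 1 << 20;
        float *dev_buf;
        cudaMalloc((void **)&dev_buf, count * sizeof(float));

        if (rank == 0)
            MPI_Send(dev_buf, (int)count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(dev_buf, (int)count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(dev_buf);
        MPI_Finalize();
        return 0;
    }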


Grid Computing | 2010

ConnectX-2 InfiniBand Management Queues: First Investigation of the New Support for Network Offloaded Collective Operations

Richard L. Graham; Stephen W. Poole; Pavel Shamis; Gil Bloch; Noam Bloch; Hillel Chapman; Michael Kagan; Ariel Shahar; Ishai Rabinovitz; Gilad Shainer

This paper introduces the newly developed InfiniBand (IB) Management Queue capability, used by the Host Channel Adapter (HCA) to manage network task data-flow dependencies and to progress the communications associated with such flows. These tasks include sends, receives, and the newly supported wait task, and are scheduled by the HCA based on a data-dependency description provided by the user. This functionality is supported by the ConnectX-2 HCA and provides the means for delegating collective communication management and progress to the HCA, also known as collective communication offload. It allows collective communications managed by the HCA to overlap with computation on the Central Processing Unit (CPU), making it possible to reduce the impact of system noise on parallel applications that use collective operations. The paper further describes how this capability can be used to implement scalable Message Passing Interface (MPI) collective operations, detailing at a high level how it is used to implement the MPI Barrier collective operation and focusing on its latency-sensitive performance aspects. The paper concludes with small-scale benchmark experiments comparing barrier implementations that use the new network offload capabilities with established point-to-point implementations of the same algorithms, which manage the data flow using the CPU. These early results demonstrate the promise this capability holds for improving the scalability of high-performance applications that use collective communications. The latency of the HCA-based barrier implementation is similar to that of the best-performing CPU-managed point-to-point implementation, and begins to outperform it as the number of processes involved in the collective operation increases.
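
For reference, the sketch below is a point-to-point recursive-doubling barrier of the kind used here as the CPU-managed baseline; each round's exchange cannot start before the previous one completes, and it is exactly this chain of send, receive, and wait dependencies that the management queues allow to be described to and progressed by the HCA instead. The code assumes a power-of-two process count for brevity and is an illustration, not the paper's implementation.

    /* Recursive-doubling barrier (power-of-two process count assumed).
     * In round k each rank exchanges a zero-byte message with the rank
     * whose ID differs in bit k, and cannot enter round k+1 until that
     * exchange completes -- the send/receive/wait dependency chain that
     * the management queues let the HCA schedule on its own. */
    #include <stdio.h>
    #include <mpi.h>

    static void rd_barrier(MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int mask = 1; mask < size; mask <<= 1) {
            int peer = rank ^ mask;
            MPI_Sendrecv(NULL, 0, MPI_BYTE, peer, 0,
                         NULL, 0, MPI_BYTE, peer, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        rd_barrier(MPI_COMM_WORLD);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d passed the barrier\n", rank);
        MPI_Finalize();
        return 0;
    }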


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2010

Overlapping computation and communication: Barrier algorithms and ConnectX-2 CORE-Direct capabilities

Richard L. Graham; Stephen W. Poole; Pavel Shamis; Gil Bloch; Noam Bloch; Hillel Chapman; Michael Kagan; Ariel Shahar; Ishai Rabinovitz; Gilad Shainer

This paper explores the computation/communication overlap capabilities enabled by the new CORE-Direct hardware features introduced in the ConnectX-2 InfiniBand Network Interface Card (NIC). Using the latency-dominated nonblocking barrier algorithm in this study, we find that with 64 processes a contiguous time slot of about 80% of the nonblocking barrier time is available for computation, and that this slot grows as the number of participating processes increases. In contrast, Central Processing Unit (CPU) based implementations provide a time slot of up to 30% of the nonblocking barrier time. This bodes well for the scalability of simulations employing offloaded collective operations. These capabilities can be used to reduce the effects of system noise and, when used with nonblocking collective operations, may also hide the effects of application load imbalance.
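
The measured pattern is easy to express with MPI's nonblocking barrier, which was standardized (MPI-3) after this work; the sketch below illustrates the overlap window under that assumption and is not the interface used in the paper: start the barrier, compute while the NIC progresses it, and complete it only when its result is needed.

    /* Sketch: overlapping local computation with a nonblocking barrier.
     * With CORE-Direct-style offload the NIC progresses the collective
     * while the CPU fills the overlap window with useful work. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Start the barrier without waiting for it. */
        MPI_Request req;
        MPI_Ibarrier(MPI_COMM_WORLD, &req);

        /* Application work filling the overlap window. */
        double acc = 0.0;
        for (long i = 1; i <= 10000000L; i++)
            acc += 1.0 / (double)i;

        /* Complete the collective only when it is actually needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("overlapped work result: %f\n", acc);

        MPI_Finalize();
        return 0;
    }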


International Conference on Cluster Computing | 2005

Transparently Achieving Superior Socket Performance Using Zero Copy Socket Direct Protocol over 20Gb/s InfiniBand Links

Dror Goldenberg; Michael Kagan; Ran Ravid; Michael S. Tsirkin

Sockets Direct Protocol (SDP) is a byte-stream protocol that uses the capabilities of the InfiniBand fabric to transparently achieve performance gains for existing socket-based networked applications. In this paper we discuss an implementation of Zero Copy support for synchronous send()/recv() socket calls that uses the remote DMA capability of InfiniBand for SDP data transfers. We added this support to the open-source implementation of SDP over InfiniBand and evaluate it over a 20 Gb/s InfiniBand link. We demonstrate the scalability of Zero Copy and show its benefits for systems that use multiple socket connections in parallel. For example, enabling Zero Copy with 8 active connections raises bandwidth from 630 MB/s to 1360 MB/s while reducing CPU utilization by a factor of ten.
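
The step that makes the zero-copy path possible is registering (pinning) the application buffer with the HCA so it can be the direct source or target of RDMA. The sketch below shows that step using the standard libibverbs API; it is only an illustration of the mechanism the SDP implementation performs internally on the Zero Copy path, with connection setup and the actual RDMA operations omitted.

    /* Sketch: registering an application buffer with the HCA, the
     * prerequisite for zero-copy RDMA transfers. Connection setup and
     * the RDMA operations themselves are omitted. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no InfiniBand devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* Pin a 1 MB application buffer so the HCA can DMA directly
         * to and from it -- no intermediate transport buffer copies. */
        size_t len = 1 << 20;
        void *buf = malloc(len);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE |
                                       IBV_ACCESS_REMOTE_READ);
        if (!mr) {
            perror("ibv_reg_mr");
            return 1;
        }
        printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
               len, mr->lkey, mr->rkey);

        /* An RDMA write or read posted against this region moves data
         * directly between application buffers on the two hosts. */
        ibv_dereg_mr(mr);
        free(buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }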


Computer Science - Research and Development | 2013

The co-design architecture for exascale systems, a novel approach for scalable designs

Gilad Shainer; Todd Wilde; Pak Lui; Tong Liu; Michael Kagan; Mike Dubman; Yiftah Shahar; Richard L. Graham; Pavel Shamis; Steve Poole

High performance computing (HPC) has begun scaling beyond the Petaflop range towards the Exaflop (1000 Petaflops) mark. One of the major concerns throughout the development toward such performance capability is scalability, both at the system level and at the application layer. In this paper we present a new design concept, the co-design approach, which enables tighter joint development of the application communication libraries and the underlying hardware interconnect in order to overcome scalability issues and to enable a more efficient design path towards Exascale computing. We propose a new application programming interface and demonstrate a 50x improvement in performance and scalability.


High Performance Interconnects | 2009

Optics for Enabling Future HPC Systems

Gilad Shainer; Eyal Gutkind; Bill Lee; Michael Kagan; Yevgeny Kliteynik

Increasing demand for computing power in scientific and engineering applications has spurred the deployment of high-performance computing clusters. According to the TOP500 list, an industry-respected report on the most powerful computer systems, the high-performance computing market entered the Teraflop era in 2005 (the entry point on the list exceeded 1 Teraflop) and is anticipated to enter the Petaflop era in 2015. Future HPC systems capable of running large-scale parallel applications will span thousands to tens of thousands of nodes, all connected by high-speed interconnect solutions to form Peta- to multi-Petaflop clusters. There are several architectural approaches to interconnecting the nodes of an HPC system, such as a Fat-Tree (CBB) or a 3-D Torus, and the overall number of communication links grows with the size of the system. The physical medium for those links has become a growing concern for large-scale platforms, as it affects system architecture, reliability, and cost. In this paper we review the cabling requirements of HPC systems, in particular optical cables. Given the CPU, memory, and interconnect roadmaps of the next few years, optical cable capabilities will become the main limitation on building systems at any scale.


Archive | 2002

Sharing a network interface card among multiple hosts

Dror Goldenberg; Gil Bloch; Gil Stoler; Diego Crupnicoff; Michael Kagan


Archive | 2010

System and method for accelerating input/output access operation on a virtual machine

Michael Kagan; Dror Goldenberg; Benny Koren; Michael S. Tsirkin


Archive | 2001

Packet communication buffering with dynamic flow control

Noam Bloch; Freddy Gabbay; Michael Kagan; Alon Webman; Diego Crupnicoff

Collaboration


Dive into Michael Kagan's collaboration.

Top Co-Authors

Gil Bloch (Mellanox Technologies)